linuxppc-dev.lists.ozlabs.org archive mirror
 help / color / mirror / Atom feed
From: David Gibson <david@gibson.dropbear.id.au>
To: Daniel Henrique Barboza <danielhb413@gmail.com>
Cc: linuxppc-dev@lists.ozlabs.org
Subject: Re: [PATCH 0/2] pseries: UNISOLATE DRCs to signal device removal error
Date: Mon, 19 Apr 2021 13:59:03 +1000	[thread overview]
Message-ID: <YH0ABwVgA/+iA+Fd@yekko.fritz.box> (raw)
In-Reply-To: <20210416210216.380291-1-danielhb413@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 4958 bytes --]

On Fri, Apr 16, 2021 at 06:02:14PM -0300, Daniel Henrique Barboza wrote:
> At this moment, PAPR [1] does not have a way to report errors during a device
> removal operation. This puts a strain in the hypervisor, which needs extra
> mechanisms to try to fallback and recover from an error that might have
> happened during the removal. The QEMU community has dealt with it during these
> years by either trying to preempt the error before sending the HP event or, in
> case of a guest side failure, reboot the guest to complete the removal process.
> 
> This started to change with QEMU commit fe1831eff8a4 ("spapr_drc.c: use DRC
> reconfiguration to cleanup DIMM unplug state"), where a way to fallback from a
> memory removal error was introduced. In this case, when QEMU detects that the
> kernel is reconfiguring LMBs DRCs that were marked as pending removal, the
> entire process is reverted from the QEMU side as well. Around the same time,
> other discussions in the QEMU mailing discussed an alternative for other device
> as well.
> 
> In [2] the idea of using RTAS set-indicator for this role was first introduced.
> The RTAS set-indicator call, when attempting to UNISOLATE a DRC that is already
> UNISOLATED or CONFIGURED, returns RTAS_OK and does nothing else for both QEMU
> and phyp. This gives us an opportunity to use this behavior to signal the
> hypervisor layer when a device removal happens, allowing it to do a
> proper

Nit: it's not when a device removal happens, but when it *fails* to happen.

> error handling knowing for sure that the removal failed in the kernel. Using
> set-indicator to report HP errors isn't strange to PAPR, as per R1-13.5.3.4-4.
> of table 13.7 of [1]:

> "For all DR options: If this is a DR operation that involves the user insert-
> ing a DR entity, then if the firmware can determine that the inserted entity
> would cause a system disturbance, then the set-indicator RTAS call must not
> unisolate the entity and must return an error status which is unique to the
> particular error."
> 
> PAPR does not make any restrictions or considerations about setting an already
> Unisolated/Configured DRC to 'unisolate', meaning we have a chance to use it
> for this purpose - signal an OS side error when attempting to remove a DR
> entity.  To validate the design, this is being implemented only for CPUs.
> 
> QEMU will use this mechanism to rollback the device removal (hotunplug) state,
> allowing for a better error handling mechanism. A implementation of how QEMU
> can do it is in [3]. When using a kernel with this series applied, together
> with this QEMU build, this is what happens in a common CPU removal/hotunplug
> error scenario (trying to remove the last online CPU):
> 
> ( QEMU command line: qemu-system-ppc64 -machine pseries,accel=kvm,usb=off
> -smp 1,maxcpus=2,threads=1,cores=2,sockets=1 ... )
> 
> [root@localhost ~]# QEMU 5.2.92 monitor - type 'help' for more information
> (qemu) device_add host-spapr-cpu-core,core-id=1,id=core1
> (qemu) 
> 
> [root@localhost ~]# echo 0 > /sys/devices/system/cpu/cpu0/online
> [   77.548442][   T13] IRQ 19: no longer affine to CPU0
> [   77.548452][   T13] IRQ 20: no longer affine to CPU0
> [   77.548458][   T13] IRQ 256: no longer affine to CPU0
> [   77.548465][   T13] IRQ 258: no longer affine to CPU0
> [   77.548472][   T13] IRQ 259: no longer affine to CPU0
> [   77.548479][   T13] IRQ 260: no longer affine to CPU0
> [   77.548485][   T13] IRQ 261: no longer affine to CPU0
> [   77.548590][    T0] cpu 0 (hwid 0) Ready to die...
> [root@localhost ~]# (qemu) 
> (qemu) device_del core1
> (qemu) [   83.214073][  T100] pseries-hotplug-cpu: Failed to offline CPU PowerPC,POWER9, rc: -16
> qemu-system-ppc64: Device hotunplug rejected by the guest for device core1
> 
> (qemu) 
> 
> As soon as the CPU removal fails in dlpar_cpu(), QEMU becames aware of
> it and is able to do error recovery.
> 
> If this solution is well received, I'll push for an architecture change
> request internally at IBM to make this mechanism PAPR official.
> 
> 
> [1] https://openpowerfoundation.org/wp-content/uploads/2020/07/LoPAR-20200611.pdf
> [2] https://lists.gnu.org/archive/html/qemu-devel/2021-02/msg06395.html
> [3] https://github.com/danielhb/qemu/tree/unisolate_drc_callback_v1
> 
> Daniel Henrique Barboza (2):
>   dlpar.c: introduce dlpar_unisolate_drc()
>   hotplug-cpu.c: set UNISOLATE on dlpar_cpu_remove() failure
> 
>  arch/powerpc/platforms/pseries/dlpar.c       | 14 ++++++++++++++
>  arch/powerpc/platforms/pseries/hotplug-cpu.c |  9 ++++++++-
>  arch/powerpc/platforms/pseries/pseries.h     |  1 +
>  3 files changed, 23 insertions(+), 1 deletion(-)
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

  parent reply	other threads:[~2021-04-19  4:02 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-16 21:02 [PATCH 0/2] pseries: UNISOLATE DRCs to signal device removal error Daniel Henrique Barboza
2021-04-16 21:02 ` [PATCH 1/2] dlpar.c: introduce dlpar_unisolate_drc() Daniel Henrique Barboza
2021-04-19  4:01   ` David Gibson
2021-04-16 21:02 ` [PATCH 2/2] hotplug-cpu.c: set UNISOLATE on dlpar_cpu_remove() failure Daniel Henrique Barboza
2021-04-19  4:02   ` David Gibson
2021-04-19 12:48   ` Michael Ellerman
2021-04-19 13:13     ` Daniel Henrique Barboza
2021-04-20  3:57       ` Michael Ellerman
2021-04-19  3:59 ` David Gibson [this message]
2021-04-21 13:08 ` [PATCH 0/2] pseries: UNISOLATE DRCs to signal device removal error Michael Ellerman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YH0ABwVgA/+iA+Fd@yekko.fritz.box \
    --to=david@gibson.dropbear.id.au \
    --cc=danielhb413@gmail.com \
    --cc=linuxppc-dev@lists.ozlabs.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).