qemu-devel.nongnu.org archive mirror
* [Qemu-devel] [PULL 0/2] ppc-for-4.1 queue 20190813
@ 2019-08-13  6:59 David Gibson
  2019-08-13  6:59 ` [Qemu-devel] [PULL 1/2] spapr: Reset CAS & IRQ subsystem after devices David Gibson
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: David Gibson @ 2019-08-13  6:59 UTC (permalink / raw)
  To: peter.maydell; +Cc: clg, David Gibson, qemu-ppc, qemu-devel, groug

The following changes since commit 5e7bcdcfe69ce0fad66012b2cfb2035003c37eef:

  display/bochs: fix pcie support (2019-08-12 16:36:41 +0100)

are available in the Git repository at:

  git://github.com/dgibson/qemu.git tags/ppc-for-4.1-20190813

for you to fetch changes up to 310cda5b5e9df642b19a0e9c504368ffba3b3ab9:

  spapr/xive: Fix migration of hot-plugged CPUs (2019-08-13 16:50:30 +1000)

----------------------------------------------------------------
ppc patch queue 2019-08-13 (last minute qemu-4.1 fixes)

Here's a very, very last minute pull request for qemu-4.1.  This fixes
two nasty bugs with the XIVE interrupt controller in "dual" mode
(where the guest decides which interrupt controller it wants to use).
One occurs when resetting the guest while I/O is active, and the other
with migration of hotplugged CPUs.

The timing here is very unfortunate.  Alas, we only spotted these bugs
very late, and I was sick last week, delaying analysis and fix even
further.

This series hasn't had nearly as much testing as I'd really like, but
I'd still like to squeeze it into qemu-4.1 if possible, since
definitely fixing two bad bugs seems like an acceptable tradeoff for
the risk of introducing different bugs.

----------------------------------------------------------------
Cédric Le Goater (1):
      spapr/xive: Fix migration of hot-plugged CPUs

David Gibson (1):
      spapr: Reset CAS & IRQ subsystem after devices

 hw/intc/spapr_xive_kvm.c | 19 +++++++++++++++++--
 hw/intc/xive.c           | 21 ++++++++++++++++++++-
 hw/ppc/spapr.c           | 24 ++++++++++++------------
 include/hw/ppc/xive.h    |  1 +
 4 files changed, 50 insertions(+), 15 deletions(-)



* [Qemu-devel] [PULL 1/2] spapr: Reset CAS & IRQ subsystem after devices
  2019-08-13  6:59 [Qemu-devel] [PULL 0/2] ppc-for-4.1 queue 20190813 David Gibson
@ 2019-08-13  6:59 ` David Gibson
  2019-08-22 20:07   ` [Qemu-devel] [Qemu-ppc] " Laurent Vivier
  2019-08-13  6:59 ` [Qemu-devel] [PULL 2/2] spapr/xive: Fix migration of hot-plugged CPUs David Gibson
  2019-08-13  9:23 ` [Qemu-devel] [PULL 0/2] ppc-for-4.1 queue 20190813 Peter Maydell
  2 siblings, 1 reply; 11+ messages in thread
From: David Gibson @ 2019-08-13  6:59 UTC (permalink / raw)
  To: peter.maydell; +Cc: clg, David Gibson, qemu-ppc, qemu-devel, groug

This fixes a nasty regression in qemu-4.1 for the 'pseries' machine,
caused by the new "dual" interrupt controller model.  Specifically,
qemu can crash when used with KVM if a 'system_reset' is requested
while there's active I/O in the guest.

The problem is that in spapr_machine_reset() we:

1. Reset the CAS vector state
	spapr_ovec_cleanup(spapr->ov5_cas);

2. Reset all devices
	qemu_devices_reset()

3. Reset the irq subsystem
	spapr_irq_reset();

However (1) implicitly changes the interrupt delivery mode, because
whether we're using XICS or XIVE depends on the CAS state.  We don't
properly initialize the new irq mode until (3) though - in particular
setting up the KVM devices.

During (2), we can temporarily drop the BQL allowing some irqs to be
delivered which will go to an irq system that's not properly set up.

Specifically, if the previous guest was in (KVM) XIVE mode, the CAS
reset will put us back in XICS mode.  kvm_kernel_irqchip() still
returns true, because XIVE was using KVM, however XICS doesn't have
its KVM components initialized and kernel_xics_fd == -1.  When the irq
is delivered it goes via ics_kvm_set_irq() which assert()s that
kernel_xics_fd != -1.

This change addresses the problem by delaying the CAS reset until
after the devices reset.  The device reset should quiesce all the
devices so we won't get irqs delivered while we mess around with the
IRQ subsystem.  The CAS reset and irq re-initialization should also
now be under the same BQL critical section, so nothing else should be
able to interrupt it either.

We also move the spapr_irq_msi_reset() used in one of the legacy irq
modes, since it logically belongs at the same point as the
spapr_irq_reset() (it's essentially an equivalent operation for older
machine types).  Since those old machine types don't need to switch
between different interrupt controllers, they shouldn't actually have
been broken in those cases, though.

Cc: Cédric Le Goater <clg@kaod.org>

Fixes: b2e22477 "spapr: add a 'reset' method to the sPAPR IRQ backend"
Fixes: 13db0cd9 "spapr: introduce a new sPAPR IRQ backend supporting
                 XIVE and XICS"
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/ppc/spapr.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 821f0d4a49..12ed4b065c 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1726,6 +1726,18 @@ static void spapr_machine_reset(MachineState *machine)
         spapr_setup_hpt_and_vrma(spapr);
     }
 
+    /*
+     * NVLink2-connected GPU RAM needs to be placed on a separate NUMA node.
+     * We assign a new numa ID per GPU in spapr_pci_collect_nvgpu() which is
+     * called from vPHB reset handler so we initialize the counter here.
+     * If no NUMA is configured from the QEMU side, we start from 1 as GPU RAM
+     * must be equally distant from any other node.
+     * The final value of spapr->gpu_numa_id is going to be written to
+     * max-associativity-domains in spapr_build_fdt().
+     */
+    spapr->gpu_numa_id = MAX(1, nb_numa_nodes);
+    qemu_devices_reset();
+
     /*
      * If this reset wasn't generated by CAS, we should reset our
      * negotiated options and start from scratch
@@ -1741,18 +1753,6 @@ static void spapr_machine_reset(MachineState *machine)
         spapr_irq_msi_reset(spapr);
     }
 
-    /*
-     * NVLink2-connected GPU RAM needs to be placed on a separate NUMA node.
-     * We assign a new numa ID per GPU in spapr_pci_collect_nvgpu() which is
-     * called from vPHB reset handler so we initialize the counter here.
-     * If no NUMA is configured from the QEMU side, we start from 1 as GPU RAM
-     * must be equally distant from any other node.
-     * The final value of spapr->gpu_numa_id is going to be written to
-     * max-associativity-domains in spapr_build_fdt().
-     */
-    spapr->gpu_numa_id = MAX(1, nb_numa_nodes);
-    qemu_devices_reset();
-
     /*
      * This is fixing some of the default configuration of the XIVE
      * devices. To be called after the reset of the machine devices.
-- 
2.21.0




* [Qemu-devel] [PULL 2/2] spapr/xive: Fix migration of hot-plugged CPUs
  2019-08-13  6:59 [Qemu-devel] [PULL 0/2] ppc-for-4.1 queue 20190813 David Gibson
  2019-08-13  6:59 ` [Qemu-devel] [PULL 1/2] spapr: Reset CAS & IRQ subsystem after devices David Gibson
@ 2019-08-13  6:59 ` David Gibson
  2019-08-13  9:23 ` [Qemu-devel] [PULL 0/2] ppc-for-4.1 queue 20190813 Peter Maydell
  2 siblings, 0 replies; 11+ messages in thread
From: David Gibson @ 2019-08-13  6:59 UTC (permalink / raw)
  To: peter.maydell; +Cc: clg, David Gibson, qemu-ppc, qemu-devel, groug

From: Cédric Le Goater <clg@kaod.org>

The migration sequence of a guest using the XIVE exploitation mode
relies on the fact that the states of all devices are restored before
the machine's is. This is not true for hot-plugged devices such as
CPUs, whose state comes after the machine's. This breaks migration
because the thread interrupt context registers are not correctly set.

Fix migration of hotplugged CPUs by restoring their context in the
'post_load' handler of the XiveTCTX model.

Fixes: 277dd3d7712a ("spapr/xive: add migration support for KVM")
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20190813064853.29310-1-clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/intc/spapr_xive_kvm.c | 19 +++++++++++++++++--
 hw/intc/xive.c           | 21 ++++++++++++++++++++-
 include/hw/ppc/xive.h    |  1 +
 3 files changed, 38 insertions(+), 3 deletions(-)

diff --git a/hw/intc/spapr_xive_kvm.c b/hw/intc/spapr_xive_kvm.c
index 3bf8e7a20e..8898615c69 100644
--- a/hw/intc/spapr_xive_kvm.c
+++ b/hw/intc/spapr_xive_kvm.c
@@ -72,11 +72,17 @@ static void kvm_cpu_disable_all(void)
  * XIVE Thread Interrupt Management context (KVM)
  */
 
-static void kvmppc_xive_cpu_set_state(XiveTCTX *tctx, Error **errp)
+void kvmppc_xive_cpu_set_state(XiveTCTX *tctx, Error **errp)
 {
+    SpaprXive *xive = SPAPR_MACHINE(qdev_get_machine())->xive;
     uint64_t state[2];
     int ret;
 
+    /* The KVM XIVE device is not in use yet */
+    if (xive->fd == -1) {
+        return;
+    }
+
     /* word0 and word1 of the OS ring. */
     state[0] = *((uint64_t *) &tctx->regs[TM_QW1_OS]);
 
@@ -655,7 +661,16 @@ int kvmppc_xive_post_load(SpaprXive *xive, int version_id)
         }
     }
 
-    /* Restore the thread interrupt contexts */
+    /*
+     * Restore the thread interrupt contexts of initial CPUs.
+     *
+     * The context of hotplugged CPUs is restored later, by the
+     * 'post_load' handler of the XiveTCTX model because they are not
+     * available at the time the SpaprXive 'post_load' method is
+     * called. We can not restore the context of all CPUs in the
+     * 'post_load' handler of XiveTCTX because the machine is not
+     * necessarily connected to the KVM device at that time.
+     */
     CPU_FOREACH(cs) {
         PowerPCCPU *cpu = POWERPC_CPU(cs);
 
diff --git a/hw/intc/xive.c b/hw/intc/xive.c
index cf77bdb7d3..da148e9f6f 100644
--- a/hw/intc/xive.c
+++ b/hw/intc/xive.c
@@ -615,12 +615,31 @@ static int vmstate_xive_tctx_pre_save(void *opaque)
     return 0;
 }
 
+static int vmstate_xive_tctx_post_load(void *opaque, int version_id)
+{
+    Error *local_err = NULL;
+
+    if (kvm_irqchip_in_kernel()) {
+        /*
+         * Required for hotplugged CPU, for which the state comes
+         * after all states of the machine.
+         */
+        kvmppc_xive_cpu_set_state(XIVE_TCTX(opaque), &local_err);
+        if (local_err) {
+            error_report_err(local_err);
+            return -1;
+        }
+    }
+
+    return 0;
+}
+
 static const VMStateDescription vmstate_xive_tctx = {
     .name = TYPE_XIVE_TCTX,
     .version_id = 1,
     .minimum_version_id = 1,
     .pre_save = vmstate_xive_tctx_pre_save,
-    .post_load = NULL, /* handled by the sPAPRxive model */
+    .post_load = vmstate_xive_tctx_post_load,
     .fields = (VMStateField[]) {
         VMSTATE_BUFFER(regs, XiveTCTX),
         VMSTATE_END_OF_LIST()
diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
index 55c53c7417..736335174a 100644
--- a/include/hw/ppc/xive.h
+++ b/include/hw/ppc/xive.h
@@ -438,5 +438,6 @@ void kvmppc_xive_source_set_irq(void *opaque, int srcno, int val);
 void kvmppc_xive_cpu_connect(XiveTCTX *tctx, Error **errp);
 void kvmppc_xive_cpu_synchronize_state(XiveTCTX *tctx, Error **errp);
 void kvmppc_xive_cpu_get_state(XiveTCTX *tctx, Error **errp);
+void kvmppc_xive_cpu_set_state(XiveTCTX *tctx, Error **errp);
 
 #endif /* PPC_XIVE_H */
-- 
2.21.0




* Re: [Qemu-devel] [PULL 0/2] ppc-for-4.1 queue 20190813
  2019-08-13  6:59 [Qemu-devel] [PULL 0/2] ppc-for-4.1 queue 20190813 David Gibson
  2019-08-13  6:59 ` [Qemu-devel] [PULL 1/2] spapr: Reset CAS & IRQ subsystem after devices David Gibson
  2019-08-13  6:59 ` [Qemu-devel] [PULL 2/2] spapr/xive: Fix migration of hot-plugged CPUs David Gibson
@ 2019-08-13  9:23 ` Peter Maydell
  2019-08-13 11:45   ` Peter Maydell
  2 siblings, 1 reply; 11+ messages in thread
From: Peter Maydell @ 2019-08-13  9:23 UTC (permalink / raw)
  To: David Gibson; +Cc: Cédric Le Goater, qemu-ppc, QEMU Developers, Greg Kurz

On Tue, 13 Aug 2019 at 07:59, David Gibson <david@gibson.dropbear.id.au> wrote:
>
> The following changes since commit 5e7bcdcfe69ce0fad66012b2cfb2035003c37eef:
>
>   display/bochs: fix pcie support (2019-08-12 16:36:41 +0100)
>
> are available in the Git repository at:
>
>   git://github.com/dgibson/qemu.git tags/ppc-for-4.1-20190813
>
> for you to fetch changes up to 310cda5b5e9df642b19a0e9c504368ffba3b3ab9:
>
>   spapr/xive: Fix migration of hot-plugged CPUs (2019-08-13 16:50:30 +1000)
>
> ----------------------------------------------------------------
> ppc patch queue 2019-08-13 (last minute qemu-4.1 fixes)
>
> Here's a very, very last minute pull request for qemu-4.1.  This fixes
> two nasty bugs with the XIVE interrupt controller in "dual" mode
> (where the guest decides which interrupt controller it wants to use).
> One occurs when resetting the guest while I/O is active, and the other
> with migration of hotplugged CPUs.
>
> The timing here is very unfortunate.  Alas, we only spotted these bugs
> very late, and I was sick last week, delaying analysis and fix even
> further.
>
> This series hasn't had nearly as much testing as I'd really like, but
> I'd still like to squeeze it into qemu-4.1 if possible, since
> definitely fixing two bad bugs seems like an acceptable tradeoff for
> the risk of introducing different bugs.

Are these regressions? Are they security issues?

We are going to have an rc5 today, but my intention was to only put in
the security fix for the bochs display device, and then have
a final release on Thursday.

thanks
-- PMM



* Re: [Qemu-devel] [PULL 0/2] ppc-for-4.1 queue 20190813
  2019-08-13  9:23 ` [Qemu-devel] [PULL 0/2] ppc-for-4.1 queue 20190813 Peter Maydell
@ 2019-08-13 11:45   ` Peter Maydell
  2019-08-13 12:04     ` Cédric Le Goater
  2019-08-13 14:16     ` David Gibson
  0 siblings, 2 replies; 11+ messages in thread
From: Peter Maydell @ 2019-08-13 11:45 UTC (permalink / raw)
  To: David Gibson; +Cc: Cédric Le Goater, qemu-ppc, QEMU Developers, Greg Kurz

On Tue, 13 Aug 2019 at 10:23, Peter Maydell <peter.maydell@linaro.org> wrote:
>
> On Tue, 13 Aug 2019 at 07:59, David Gibson <david@gibson.dropbear.id.au> wrote:
> >
> > The following changes since commit 5e7bcdcfe69ce0fad66012b2cfb2035003c37eef:
> >
> >   display/bochs: fix pcie support (2019-08-12 16:36:41 +0100)
> >
> > are available in the Git repository at:
> >
> >   git://github.com/dgibson/qemu.git tags/ppc-for-4.1-20190813
> >
> > for you to fetch changes up to 310cda5b5e9df642b19a0e9c504368ffba3b3ab9:
> >
> >   spapr/xive: Fix migration of hot-plugged CPUs (2019-08-13 16:50:30 +1000)
> >
> > ----------------------------------------------------------------
> > ppc patch queue 2019-08-13 (last minute qemu-4.1 fixes)
> >
> > Here's a very, very last minute pull request for qemu-4.1.  This fixes
> > two nasty bugs with the XIVE interrupt controller in "dual" mode
> > (where the guest decides which interrupt controller it wants to use).
> > One occurs when resetting the guest while I/O is active, and the other
> > with migration of hotplugged CPUs.
> >
> > The timing here is very unfortunate.  Alas, we only spotted these bugs
> > very late, and I was sick last week, delaying analysis and fix even
> > further.
> >
> > This series hasn't had nearly as much testing as I'd really like, but
> > I'd still like to squeeze it into qemu-4.1 if possible, since
> > definitely fixing two bad bugs seems like an acceptable tradeoff for
> > the risk of introducing different bugs.
>
> Are these regressions? Are they security issues?
>
> We are going to have an rc5 today, but my intention was to only put in
> the security-fix bug in the bochs display device, and then have
> a final release Thursday.

After thinking about this and reading the commit messages I've
applied this pullreq, since it clearly only affects spapr and you're
in a better position to judge the significance of the fixes than me,
but it was really really borderline...

thanks
-- PMM



* Re: [Qemu-devel] [PULL 0/2] ppc-for-4.1 queue 20190813
  2019-08-13 11:45   ` Peter Maydell
@ 2019-08-13 12:04     ` Cédric Le Goater
  2019-08-13 14:16     ` David Gibson
  1 sibling, 0 replies; 11+ messages in thread
From: Cédric Le Goater @ 2019-08-13 12:04 UTC (permalink / raw)
  To: Peter Maydell, David Gibson; +Cc: qemu-ppc, QEMU Developers, Greg Kurz

On 13/08/2019 13:45, Peter Maydell wrote:
> On Tue, 13 Aug 2019 at 10:23, Peter Maydell <peter.maydell@linaro.org> wrote:
>>
>> On Tue, 13 Aug 2019 at 07:59, David Gibson <david@gibson.dropbear.id.au> wrote:
>>>
>>> The following changes since commit 5e7bcdcfe69ce0fad66012b2cfb2035003c37eef:
>>>
>>>   display/bochs: fix pcie support (2019-08-12 16:36:41 +0100)
>>>
>>> are available in the Git repository at:
>>>
>>>   git://github.com/dgibson/qemu.git tags/ppc-for-4.1-20190813
>>>
>>> for you to fetch changes up to 310cda5b5e9df642b19a0e9c504368ffba3b3ab9:
>>>
>>>   spapr/xive: Fix migration of hot-plugged CPUs (2019-08-13 16:50:30 +1000)
>>>
>>> ----------------------------------------------------------------
>>> ppc patch queue 2019-08-13 (last minute qemu-4.1 fixes)
>>>
>>> Here's a very, very last minute pull request for qemu-4.1.  This fixes
>>> two nasty bugs with the XIVE interrupt controller in "dual" mode
>>> (where the guest decides which interrupt controller it wants to use).
>>> One occurs when resetting the guest while I/O is active, and the other
>>> with migration of hotplugged CPUs.
>>>
>>> The timing here is very unfortunate.  Alas, we only spotted these bugs
>>> very late, and I was sick last week, delaying analysis and fix even
>>> further.
>>>
>>> This series hasn't had nearly as much testing as I'd really like, but
>>> I'd still like to squeeze it into qemu-4.1 if possible, since
>>> definitely fixing two bad bugs seems like an acceptable tradeoff for
>>> the risk of introducing different bugs.
>>
>> Are these regressions? Are they security issues?
>>
>> We are going to have an rc5 today, but my intention was to only put in
>> the security-fix bug in the bochs display device, and then have
>> a final release Thursday.
> 
> After thinking about this and reading the commit messages I've
> applied this pullreq, since it clearly only affects spapr and you're
> in a better position to judge the significance of the fixes than me,
> but it was really really borderline...

I was going to reply but you were faster to apply. Here is some more
context.

The XIVE interrupt mode is activated by default in 4.1, so these are
regressions w.r.t. the previous mode spapr was using, referred to as
XICS. Especially the first patch.

The second patch is a fix for the restoration of hot-plugged CPUs.
The restoration of the spapr machine became more complex with the
XIVE interrupt controller because we need the machine state to be
loaded to know which KVM IRQ device to activate, XICS or XIVE. From
there we can restore the KVM state and HW contexts of the different
models in use: sources, controllers and presenters.

The post_load handler of the spapr machine relies on the fact that
it is called last and does the work for all models. I realized last
evening that this is not true for hot-plugged CPUs, whose state comes
after the machine's. I was under the assumption/impression this was
not the case but I might be mistaken. It took me a while to get the
save/restore sequences correct.


Thanks,

C.




* Re: [Qemu-devel] [PULL 0/2] ppc-for-4.1 queue 20190813
  2019-08-13 11:45   ` Peter Maydell
  2019-08-13 12:04     ` Cédric Le Goater
@ 2019-08-13 14:16     ` David Gibson
  1 sibling, 0 replies; 11+ messages in thread
From: David Gibson @ 2019-08-13 14:16 UTC (permalink / raw)
  To: Peter Maydell; +Cc: Cédric Le Goater, qemu-ppc, QEMU Developers, Greg Kurz


On Tue, Aug 13, 2019 at 12:45:51PM +0100, Peter Maydell wrote:
> On Tue, 13 Aug 2019 at 10:23, Peter Maydell <peter.maydell@linaro.org> wrote:
> >
> > On Tue, 13 Aug 2019 at 07:59, David Gibson <david@gibson.dropbear.id.au> wrote:
> > >
> > > The following changes since commit 5e7bcdcfe69ce0fad66012b2cfb2035003c37eef:
> > >
> > >   display/bochs: fix pcie support (2019-08-12 16:36:41 +0100)
> > >
> > > are available in the Git repository at:
> > >
> > >   git://github.com/dgibson/qemu.git tags/ppc-for-4.1-20190813
> > >
> > > for you to fetch changes up to 310cda5b5e9df642b19a0e9c504368ffba3b3ab9:
> > >
> > >   spapr/xive: Fix migration of hot-plugged CPUs (2019-08-13 16:50:30 +1000)
> > >
> > > ----------------------------------------------------------------
> > > ppc patch queue 2019-08-13 (last minute qemu-4.1 fixes)
> > >
> > > Here's a very, very last minute pull request for qemu-4.1.  This fixes
> > > two nasty bugs with the XIVE interrupt controller in "dual" mode
> > > (where the guest decides which interrupt controller it wants to use).
> > > One occurs when resetting the guest while I/O is active, and the other
> > > with migration of hotplugged CPUs.
> > >
> > > The timing here is very unfortunate.  Alas, we only spotted these bugs
> > > very late, and I was sick last week, delaying analysis and fix even
> > > further.
> > >
> > > This series hasn't had nearly as much testing as I'd really like, but
> > > I'd still like to squeeze it into qemu-4.1 if possible, since
> > > definitely fixing two bad bugs seems like an acceptable tradeoff for
> > > the risk of introducing different bugs.
> >
> > Are these regressions? Are they security issues?

They're effectively regressions.  Pedantically, they're bugs in a new
feature, but since the new feature is enabled by default in the new
machine type (and it's the interrupt controller, so you can't do
without it), it means a normal setup will be broken where the same
setup in the old version wasn't.

> > We are going to have an rc5 today, but my intention was to only put in
> > the security-fix bug in the bochs display device, and then have
> > a final release Thursday.
> 
> After thinking about this and reading the commit messages I've
> applied this pullreq, since it clearly only affects spapr and you're
> in a better position to judge the significance of the fixes than me,
> but it was really really borderline...

Fair enough.  As I said, the timing sucked, but there's not really
anything I could do about that.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [Qemu-devel] [Qemu-ppc] [PULL 1/2] spapr: Reset CAS & IRQ subsystem after devices
  2019-08-13  6:59 ` [Qemu-devel] [PULL 1/2] spapr: Reset CAS & IRQ subsystem after devices David Gibson
@ 2019-08-22 20:07   ` Laurent Vivier
  2019-08-23  5:39     ` David Gibson
  0 siblings, 1 reply; 11+ messages in thread
From: Laurent Vivier @ 2019-08-22 20:07 UTC (permalink / raw)
  To: David Gibson, peter.maydell; +Cc: groug, qemu-ppc, clg, qemu-devel

On 13/08/2019 08:59, David Gibson wrote:
> This fixes a nasty regression in qemu-4.1 for the 'pseries' machine,
> caused by the new "dual" interrupt controller model.  Specifically,
> qemu can crash when used with KVM if a 'system_reset' is requested
> while there's active I/O in the guest.
> 
> The problem is that in spapr_machine_reset() we:
> 
> 1. Reset the CAS vector state
> 	spapr_ovec_cleanup(spapr->ov5_cas);
> 
> 2. Reset all devices
> 	qemu_devices_reset()
> 
> 3. Reset the irq subsystem
> 	spapr_irq_reset();
> 
> However (1) implicitly changes the interrupt delivery mode, because
> whether we're using XICS or XIVE depends on the CAS state.  We don't
> properly initialize the new irq mode until (3) though - in particular
> setting up the KVM devices.
> 
> During (2), we can temporarily drop the BQL allowing some irqs to be
> delivered which will go to an irq system that's not properly set up.
> 
> Specifically, if the previous guest was in (KVM) XIVE mode, the CAS
> reset will put us back in XICS mode.  kvm_kernel_irqchip() still
> returns true, because XIVE was using KVM, however XICS doesn't have
> its KVM components initialized and kernel_xics_fd == -1.  When the irq
> is delivered it goes via ics_kvm_set_irq() which assert()s that
> kernel_xics_fd != -1.
> 
> This change addresses the problem by delaying the CAS reset until
> after the devices reset.  The device reset should quiesce all the
> devices so we won't get irqs delivered while we mess around with the
> IRQ.  The CAS reset and irq re-initialize should also now be under the
> same BQL critical section so nothing else should be able to interrupt
> it either.
> 
> We also move the spapr_irq_msi_reset() used in one of the legacy irq
> modes, since it logically makes sense at the same point as the
> spapr_irq_reset() (it's essentially an equivalent operation for older
> machine types).  Since we don't need to switch between different
> interrupt controllers for those old machine types it shouldn't
> actually be broken in those cases though.
> 
> Cc: Cédric Le Goater <clg@kaod.org>
> 
> Fixes: b2e22477 "spapr: add a 'reset' method to the sPAPR IRQ backend"
> Fixes: 13db0cd9 "spapr: introduce a new sPAPR IRQ backend supporting
>                  XIVE and XICS"
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  hw/ppc/spapr.c | 24 ++++++++++++------------
>  1 file changed, 12 insertions(+), 12 deletions(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 821f0d4a49..12ed4b065c 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -1726,6 +1726,18 @@ static void spapr_machine_reset(MachineState *machine)
>          spapr_setup_hpt_and_vrma(spapr);
>      }
>  
> +    /*
> +     * NVLink2-connected GPU RAM needs to be placed on a separate NUMA node.
> +     * We assign a new numa ID per GPU in spapr_pci_collect_nvgpu() which is
> +     * called from vPHB reset handler so we initialize the counter here.
> +     * If no NUMA is configured from the QEMU side, we start from 1 as GPU RAM
> +     * must be equally distant from any other node.
> +     * The final value of spapr->gpu_numa_id is going to be written to
> +     * max-associativity-domains in spapr_build_fdt().
> +     */
> +    spapr->gpu_numa_id = MAX(1, nb_numa_nodes);
> +    qemu_devices_reset();
> +
>      /*
>       * If this reset wasn't generated by CAS, we should reset our
>       * negotiated options and start from scratch
> @@ -1741,18 +1753,6 @@ static void spapr_machine_reset(MachineState *machine)
>          spapr_irq_msi_reset(spapr);
>      }
>  
> -    /*
> -     * NVLink2-connected GPU RAM needs to be placed on a separate NUMA node.
> -     * We assign a new numa ID per GPU in spapr_pci_collect_nvgpu() which is
> -     * called from vPHB reset handler so we initialize the counter here.
> -     * If no NUMA is configured from the QEMU side, we start from 1 as GPU RAM
> -     * must be equally distant from any other node.
> -     * The final value of spapr->gpu_numa_id is going to be written to
> -     * max-associativity-domains in spapr_build_fdt().
> -     */
> -    spapr->gpu_numa_id = MAX(1, nb_numa_nodes);
> -    qemu_devices_reset();
> -
>      /*
>       * This is fixing some of the default configuration of the XIVE
>       * devices. To be called after the reset of the machine devices.
> 

This commit breaks migration between POWER8 <-> POWER9 hosts:

qemu-system-ppc64: error while loading state for instance 0x1 of device 'cpu'
qemu-system-ppc64: load of migration failed: Operation not permitted

Using a guest with a running 4.18 kernel (RHEL 8) and "-M pseries,max-cpu-compat=power8" on both sides.

There is no problem if both hosts are of the same kind ( P8 <-> P8 or P9 <-> P9).

Thanks,
Laurent



* Re: [Qemu-devel] [Qemu-ppc] [PULL 1/2] spapr: Reset CAS & IRQ subsystem after devices
  2019-08-22 20:07   ` [Qemu-devel] [Qemu-ppc] " Laurent Vivier
@ 2019-08-23  5:39     ` David Gibson
  2019-08-23 13:47       ` Laurent Vivier
  0 siblings, 1 reply; 11+ messages in thread
From: David Gibson @ 2019-08-23  5:39 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: peter.maydell, groug, qemu-ppc, clg, qemu-devel


On Thu, Aug 22, 2019 at 10:07:09PM +0200, Laurent Vivier wrote:
> On 13/08/2019 08:59, David Gibson wrote:
> > This fixes a nasty regression in qemu-4.1 for the 'pseries' machine,
> > caused by the new "dual" interrupt controller model.  Specifically,
> > qemu can crash when used with KVM if a 'system_reset' is requested
> > while there's active I/O in the guest.
> > 
> > The problem is that in spapr_machine_reset() we:
> > 
> > 1. Reset the CAS vector state
> > 	spapr_ovec_cleanup(spapr->ov5_cas);
> > 
> > 2. Reset all devices
> > 	qemu_devices_reset()
> > 
> > 3. Reset the irq subsystem
> > 	spapr_irq_reset();
> > 
> > However (1) implicitly changes the interrupt delivery mode, because
> > whether we're using XICS or XIVE depends on the CAS state.  We don't
> > properly initialize the new irq mode until (3) though - in particular
> > setting up the KVM devices.
> > 
> > During (2), we can temporarily drop the BQL allowing some irqs to be
> > delivered which will go to an irq system that's not properly set up.
> > 
> > Specifically, if the previous guest was in (KVM) XIVE mode, the CAS
> > reset will put us back in XICS mode.  kvm_kernel_irqchip() still
> > returns true, because XIVE was using KVM, however XICS doesn't have
> > its KVM components initialized and kernel_xics_fd == -1.  When the irq
> > is delivered it goes via ics_kvm_set_irq() which assert()s that
> > kernel_xics_fd != -1.
> > 
> > This change addresses the problem by delaying the CAS reset until
> > after the devices reset.  The device reset should quiesce all the
> > devices so we won't get irqs delivered while we mess around with the
> > IRQ.  The CAS reset and irq re-initialize should also now be under the
> > same BQL critical section so nothing else should be able to interrupt
> > it either.
> > 
> > We also move the spapr_irq_msi_reset() used in one of the legacy irq
> > modes, since it logically makes sense at the same point as the
> > spapr_irq_reset() (it's essentially an equivalent operation for older
> > machine types).  Since we don't need to switch between different
> > interrupt controllers for those old machine types it shouldn't
> > actually be broken in those cases though.
> > 
> > Cc: Cédric Le Goater <clg@kaod.org>
> > 
> > Fixes: b2e22477 "spapr: add a 'reset' method to the sPAPR IRQ backend"
> > Fixes: 13db0cd9 "spapr: introduce a new sPAPR IRQ backend supporting
> >                  XIVE and XICS"
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  hw/ppc/spapr.c | 24 ++++++++++++------------
> >  1 file changed, 12 insertions(+), 12 deletions(-)
> > 
> > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > index 821f0d4a49..12ed4b065c 100644
> > --- a/hw/ppc/spapr.c
> > +++ b/hw/ppc/spapr.c
> > @@ -1726,6 +1726,18 @@ static void spapr_machine_reset(MachineState *machine)
> >          spapr_setup_hpt_and_vrma(spapr);
> >      }
> >  
> > +    /*
> > +     * NVLink2-connected GPU RAM needs to be placed on a separate NUMA node.
> > +     * We assign a new numa ID per GPU in spapr_pci_collect_nvgpu() which is
> > +     * called from vPHB reset handler so we initialize the counter here.
> > +     * If no NUMA is configured from the QEMU side, we start from 1 as GPU RAM
> > +     * must be equally distant from any other node.
> > +     * The final value of spapr->gpu_numa_id is going to be written to
> > +     * max-associativity-domains in spapr_build_fdt().
> > +     */
> > +    spapr->gpu_numa_id = MAX(1, nb_numa_nodes);
> > +    qemu_devices_reset();
> > +
> >      /*
> >       * If this reset wasn't generated by CAS, we should reset our
> >       * negotiated options and start from scratch
> > @@ -1741,18 +1753,6 @@ static void spapr_machine_reset(MachineState *machine)
> >          spapr_irq_msi_reset(spapr);
> >      }
> >  
> > -    /*
> > -     * NVLink2-connected GPU RAM needs to be placed on a separate NUMA node.
> > -     * We assign a new numa ID per GPU in spapr_pci_collect_nvgpu() which is
> > -     * called from vPHB reset handler so we initialize the counter here.
> > -     * If no NUMA is configured from the QEMU side, we start from 1 as GPU RAM
> > -     * must be equally distant from any other node.
> > -     * The final value of spapr->gpu_numa_id is going to be written to
> > -     * max-associativity-domains in spapr_build_fdt().
> > -     */
> > -    spapr->gpu_numa_id = MAX(1, nb_numa_nodes);
> > -    qemu_devices_reset();
> > -
> >      /*
> >       * This is fixing some of the default configuration of the XIVE
> >       * devices. To be called after the reset of the machine devices.
> > 
> 
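[Editor's note: the reset-ordering race described in the quoted commit message can be sketched as a toy Python model. This is not QEMU code; all names and the fd value are illustrative stand-ins for the real spapr/XIVE/XICS machinery.]

```python
# Toy model of the spapr reset ordering bug (illustrative, not QEMU code).

class Machine:
    def __init__(self):
        self.mode = "XIVE"        # guest previously negotiated XIVE via CAS
        self.kernel_xics_fd = -1  # XICS KVM side not yet initialized
        self.devices_active = True

    def cas_reset(self):
        # Resetting CAS state implicitly falls back to XICS mode,
        # but does NOT set up the XICS KVM components.
        self.mode = "XICS"

    def devices_reset(self):
        # Device reset can temporarily drop the BQL; while devices are
        # still active an irq may be delivered mid-reset.
        if self.devices_active:
            self.deliver_irq()
        self.devices_active = False  # devices quiesced from here on

    def irq_reset(self):
        # Re-initializes the KVM side of whichever mode is selected.
        if self.mode == "XICS":
            self.kernel_xics_fd = 42  # stand-in for a real fd

    def deliver_irq(self):
        if self.mode == "XICS":
            # mirrors the assert() in ics_kvm_set_irq()
            assert self.kernel_xics_fd != -1, "irq into uninitialized XICS"

def buggy_order():
    m = Machine()
    m.cas_reset()      # 1. switches to XICS, fd still -1
    m.devices_reset()  # 2. irq delivered -> assertion fires
    m.irq_reset()      # 3. too late

def fixed_order():
    m = Machine()
    m.devices_reset()  # devices quiesced first (mode still XIVE, no assert)
    m.cas_reset()      # then the CAS reset...
    m.irq_reset()      # ...and irq re-init, with no irq window in between
    return m
```

Running `fixed_order()` completes cleanly, while `buggy_order()` trips the assertion, matching the crash signature in the report below.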
> This commit breaks migration between POWER8 <-> POWER9 hosts:
> 
> qemu-system-ppc64: error while loading state for instance 0x1 of device 'cpu'
> qemu-system-ppc64: load of migration failed: Operation not permitted
> 
> Using a guest with a running 4.18 kernel (RHEL 8) and "-M pseries,max-cpu-compat=power8" on both sides.
> 
> There is no problem if both hosts are of the same kind (P8 <-> P8 or P9 <-> P9).

Crud, I was afraid there might be some subtle dependency on the
reverse order.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [PULL 1/2] spapr: Reset CAS & IRQ subsystem after devices
  2019-08-23  5:39     ` David Gibson
@ 2019-08-23 13:47       ` Laurent Vivier
  2019-08-26  7:47         ` David Gibson
  0 siblings, 1 reply; 11+ messages in thread
From: Laurent Vivier @ 2019-08-23 13:47 UTC (permalink / raw)
  To: David Gibson; +Cc: peter.maydell, groug, qemu-ppc, clg, qemu-devel

On 23/08/2019 07:39, David Gibson wrote:
> On Thu, Aug 22, 2019 at 10:07:09PM +0200, Laurent Vivier wrote:
>> On 13/08/2019 08:59, David Gibson wrote:
>>> This fixes a nasty regression in qemu-4.1 for the 'pseries' machine,
>>> caused by the new "dual" interrupt controller model.  Specifically,
>>> qemu can crash when used with KVM if a 'system_reset' is requested
>>> while there's active I/O in the guest.
>>>
>>> The problem is that in spapr_machine_reset() we:
>>>
>>> 1. Reset the CAS vector state
>>> 	spapr_ovec_cleanup(spapr->ov5_cas);
>>>
>>> 2. Reset all devices
>>> 	qemu_devices_reset()
>>>
>>> 3. Reset the irq subsystem
>>> 	spapr_irq_reset();
>>>
>>> However (1) implicitly changes the interrupt delivery mode, because
>>> whether we're using XICS or XIVE depends on the CAS state.  We don't
>>> properly initialize the new irq mode until (3) though - in particular
>>> setting up the KVM devices.
>>>
>>> During (2), we can temporarily drop the BQL allowing some irqs to be
>>> delivered which will go to an irq system that's not properly set up.
>>>
>>> Specifically, if the previous guest was in (KVM) XIVE mode, the CAS
>>> reset will put us back in XICS mode.  kvm_kernel_irqchip() still
>>> returns true, because XIVE was using KVM, however XICS doesn't have
>>> its KVM components initialized and kernel_xics_fd == -1.  When the irq
>>> is delivered it goes via ics_kvm_set_irq() which assert()s that
>>> kernel_xics_fd != -1.
>>>
>>> This change addresses the problem by delaying the CAS reset until
>>> after the devices reset.  The device reset should quiesce all the
>>> devices so we won't get irqs delivered while we mess around with the
>>> IRQ.  The CAS reset and irq re-initialize should also now be under the
>>> same BQL critical section so nothing else should be able to interrupt
>>> it either.
>>>
>>> We also move the spapr_irq_msi_reset() used in one of the legacy irq
>>> modes, since it logically makes sense at the same point as the
>>> spapr_irq_reset() (it's essentially an equivalent operation for older
>>> machine types).  Since we don't need to switch between different
>>> interrupt controllers for those old machine types it shouldn't
>>> actually be broken in those cases though.
>>>
>>> Cc: Cédric Le Goater <clg@kaod.org>
>>>
>>> Fixes: b2e22477 "spapr: add a 'reset' method to the sPAPR IRQ backend"
>>> Fixes: 13db0cd9 "spapr: introduce a new sPAPR IRQ backend supporting
>>>                  XIVE and XICS"
>>> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
>>> ---
>>>  hw/ppc/spapr.c | 24 ++++++++++++------------
>>>  1 file changed, 12 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>>> index 821f0d4a49..12ed4b065c 100644
>>> --- a/hw/ppc/spapr.c
>>> +++ b/hw/ppc/spapr.c
>>> @@ -1726,6 +1726,18 @@ static void spapr_machine_reset(MachineState *machine)
>>>          spapr_setup_hpt_and_vrma(spapr);
>>>      }
>>>  
>>> +    /*
>>> +     * NVLink2-connected GPU RAM needs to be placed on a separate NUMA node.
>>> +     * We assign a new numa ID per GPU in spapr_pci_collect_nvgpu() which is
>>> +     * called from vPHB reset handler so we initialize the counter here.
>>> +     * If no NUMA is configured from the QEMU side, we start from 1 as GPU RAM
>>> +     * must be equally distant from any other node.
>>> +     * The final value of spapr->gpu_numa_id is going to be written to
>>> +     * max-associativity-domains in spapr_build_fdt().
>>> +     */
>>> +    spapr->gpu_numa_id = MAX(1, nb_numa_nodes);
>>> +    qemu_devices_reset();
>>> +
>>>      /*
>>>       * If this reset wasn't generated by CAS, we should reset our
>>>       * negotiated options and start from scratch
>>> @@ -1741,18 +1753,6 @@ static void spapr_machine_reset(MachineState *machine)
>>>          spapr_irq_msi_reset(spapr);
>>>      }
>>>  
>>> -    /*
>>> -     * NVLink2-connected GPU RAM needs to be placed on a separate NUMA node.
>>> -     * We assign a new numa ID per GPU in spapr_pci_collect_nvgpu() which is
>>> -     * called from vPHB reset handler so we initialize the counter here.
>>> -     * If no NUMA is configured from the QEMU side, we start from 1 as GPU RAM
>>> -     * must be equally distant from any other node.
>>> -     * The final value of spapr->gpu_numa_id is going to be written to
>>> -     * max-associativity-domains in spapr_build_fdt().
>>> -     */
>>> -    spapr->gpu_numa_id = MAX(1, nb_numa_nodes);
>>> -    qemu_devices_reset();
>>> -
>>>      /*
>>>       * This is fixing some of the default configuration of the XIVE
>>>       * devices. To be called after the reset of the machine devices.
>>>
>>
>> This commit breaks migration between POWER8 <-> POWER9 hosts:
>>
>> qemu-system-ppc64: error while loading state for instance 0x1 of device 'cpu'
>> qemu-system-ppc64: load of migration failed: Operation not permitted
>>
>> Using a guest with a running 4.18 kernel (RHEL 8) and "-M pseries,max-cpu-compat=power8" on both sides.
>>
> >> There is no problem if both hosts are of the same kind (P8 <-> P8 or P9 <-> P9).
> 
> Crud, I was afraid there might be some subtle dependency on the
> reverse order.

It seems the side effect of the patch in comment 5 is to add a supplementary field, compat_pvr, to each CPU in the migration stream:

        {
            "name": "cpu",
            "instance_id": 0,
            "vmsd_name": "cpu",
            "version": 5,
...
            "subsections": [
...
                {
                    "vmsd_name": "cpu/compat",
                    "version": 1,
                    "fields": [
                        {
                            "name": "compat_pvr",
                            "type": "uint32",
                            "size": 4
                        }
                    ]
                }
            ]
        },
...

What seems to happen is that compat_pvr is not propagated correctly to
all CPUs.

Originally, spapr_machine_reset() called ppc_set_compat() to set the
value to max_compat_pvr for the first CPU, and this was propagated to all
CPUs by spapr_cpu_reset().
Now, as spapr_cpu_reset() is called before that, the value is no longer
propagated to all CPUs.

A simple fix seems to be:

--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1752,7 +1752,7 @@ static void spapr_machine_reset(MachineState *machine)
         spapr_ovec_cleanup(spapr->ov5_cas);
         spapr->ov5_cas = spapr_ovec_new();
 
-        ppc_set_compat(first_ppc_cpu, spapr->max_compat_pvr, &error_fatal);
+        ppc_set_compat_all(spapr->max_compat_pvr, &error_fatal);
     }
 
     /*

Thanks,
Laurent



* Re: [Qemu-devel] [Qemu-ppc] [PULL 1/2] spapr: Reset CAS & IRQ subsystem after devices
  2019-08-23 13:47       ` Laurent Vivier
@ 2019-08-26  7:47         ` David Gibson
  0 siblings, 0 replies; 11+ messages in thread
From: David Gibson @ 2019-08-26  7:47 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: peter.maydell, groug, qemu-ppc, clg, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 7460 bytes --]

On Fri, Aug 23, 2019 at 03:47:52PM +0200, Laurent Vivier wrote:
> On 23/08/2019 07:39, David Gibson wrote:
> > On Thu, Aug 22, 2019 at 10:07:09PM +0200, Laurent Vivier wrote:
> >> On 13/08/2019 08:59, David Gibson wrote:
> >>> This fixes a nasty regression in qemu-4.1 for the 'pseries' machine,
> >>> caused by the new "dual" interrupt controller model.  Specifically,
> >>> qemu can crash when used with KVM if a 'system_reset' is requested
> >>> while there's active I/O in the guest.
> >>>
> >>> The problem is that in spapr_machine_reset() we:
> >>>
> >>> 1. Reset the CAS vector state
> >>> 	spapr_ovec_cleanup(spapr->ov5_cas);
> >>>
> >>> 2. Reset all devices
> >>> 	qemu_devices_reset()
> >>>
> >>> 3. Reset the irq subsystem
> >>> 	spapr_irq_reset();
> >>>
> >>> However (1) implicitly changes the interrupt delivery mode, because
> >>> whether we're using XICS or XIVE depends on the CAS state.  We don't
> >>> properly initialize the new irq mode until (3) though - in particular
> >>> setting up the KVM devices.
> >>>
> >>> During (2), we can temporarily drop the BQL allowing some irqs to be
> >>> delivered which will go to an irq system that's not properly set up.
> >>>
> >>> Specifically, if the previous guest was in (KVM) XIVE mode, the CAS
> >>> reset will put us back in XICS mode.  kvm_kernel_irqchip() still
> >>> returns true, because XIVE was using KVM, however XICS doesn't have
> >>> its KVM components initialized and kernel_xics_fd == -1.  When the irq
> >>> is delivered it goes via ics_kvm_set_irq() which assert()s that
> >>> kernel_xics_fd != -1.
> >>>
> >>> This change addresses the problem by delaying the CAS reset until
> >>> after the devices reset.  The device reset should quiesce all the
> >>> devices so we won't get irqs delivered while we mess around with the
> >>> IRQ.  The CAS reset and irq re-initialize should also now be under the
> >>> same BQL critical section so nothing else should be able to interrupt
> >>> it either.
> >>>
> >>> We also move the spapr_irq_msi_reset() used in one of the legacy irq
> >>> modes, since it logically makes sense at the same point as the
> >>> spapr_irq_reset() (it's essentially an equivalent operation for older
> >>> machine types).  Since we don't need to switch between different
> >>> interrupt controllers for those old machine types it shouldn't
> >>> actually be broken in those cases though.
> >>>
> >>> Cc: Cédric Le Goater <clg@kaod.org>
> >>>
> >>> Fixes: b2e22477 "spapr: add a 'reset' method to the sPAPR IRQ backend"
> >>> Fixes: 13db0cd9 "spapr: introduce a new sPAPR IRQ backend supporting
> >>>                  XIVE and XICS"
> >>> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> >>> ---
> >>>  hw/ppc/spapr.c | 24 ++++++++++++------------
> >>>  1 file changed, 12 insertions(+), 12 deletions(-)
> >>>
> >>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >>> index 821f0d4a49..12ed4b065c 100644
> >>> --- a/hw/ppc/spapr.c
> >>> +++ b/hw/ppc/spapr.c
> >>> @@ -1726,6 +1726,18 @@ static void spapr_machine_reset(MachineState *machine)
> >>>          spapr_setup_hpt_and_vrma(spapr);
> >>>      }
> >>>  
> >>> +    /*
> >>> +     * NVLink2-connected GPU RAM needs to be placed on a separate NUMA node.
> >>> +     * We assign a new numa ID per GPU in spapr_pci_collect_nvgpu() which is
> >>> +     * called from vPHB reset handler so we initialize the counter here.
> >>> +     * If no NUMA is configured from the QEMU side, we start from 1 as GPU RAM
> >>> +     * must be equally distant from any other node.
> >>> +     * The final value of spapr->gpu_numa_id is going to be written to
> >>> +     * max-associativity-domains in spapr_build_fdt().
> >>> +     */
> >>> +    spapr->gpu_numa_id = MAX(1, nb_numa_nodes);
> >>> +    qemu_devices_reset();
> >>> +
> >>>      /*
> >>>       * If this reset wasn't generated by CAS, we should reset our
> >>>       * negotiated options and start from scratch
> >>> @@ -1741,18 +1753,6 @@ static void spapr_machine_reset(MachineState *machine)
> >>>          spapr_irq_msi_reset(spapr);
> >>>      }
> >>>  
> >>> -    /*
> >>> -     * NVLink2-connected GPU RAM needs to be placed on a separate NUMA node.
> >>> -     * We assign a new numa ID per GPU in spapr_pci_collect_nvgpu() which is
> >>> -     * called from vPHB reset handler so we initialize the counter here.
> >>> -     * If no NUMA is configured from the QEMU side, we start from 1 as GPU RAM
> >>> -     * must be equally distant from any other node.
> >>> -     * The final value of spapr->gpu_numa_id is going to be written to
> >>> -     * max-associativity-domains in spapr_build_fdt().
> >>> -     */
> >>> -    spapr->gpu_numa_id = MAX(1, nb_numa_nodes);
> >>> -    qemu_devices_reset();
> >>> -
> >>>      /*
> >>>       * This is fixing some of the default configuration of the XIVE
> >>>       * devices. To be called after the reset of the machine devices.
> >>>
> >>
> >> This commit breaks migration between POWER8 <-> POWER9 hosts:
> >>
> >> qemu-system-ppc64: error while loading state for instance 0x1 of device 'cpu'
> >> qemu-system-ppc64: load of migration failed: Operation not permitted
> >>
> >> Using a guest with a running 4.18 kernel (RHEL 8) and "-M pseries,max-cpu-compat=power8" on both sides.
> >>
> >> There is no problem if both hosts are of the same kind (P8 <-> P8 or P9 <-> P9).
> > 
> > Crud, I was afraid there might be some subtle dependency on the
> > reverse order.
> 
> It seems the side effect of the patch in comment 5 is to add a supplementary field, compat_pvr, to each CPU in the migration stream:
> 
>         {
>             "name": "cpu",
>             "instance_id": 0,
>             "vmsd_name": "cpu",
>             "version": 5,
> ...
>             "subsections": [
> ...
>                 {
>                     "vmsd_name": "cpu/compat",
>                     "version": 1,
>                     "fields": [
>                         {
>                             "name": "compat_pvr",
>                             "type": "uint32",
>                             "size": 4
>                         }
>                     ]
>                 }
>             ]
>         },
> ...
> 
> What seems to happen is that compat_pvr is not propagated correctly to
> all CPUs.
> 
> Originally, spapr_machine_reset() called ppc_set_compat() to set the
> value to max_compat_pvr for the first CPU, and this was propagated to all
> CPUs by spapr_cpu_reset().
> Now, as spapr_cpu_reset() is called before that, the value is no longer
> propagated to all CPUs.
> 
> A simple fix seems to be:
> 
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -1752,7 +1752,7 @@ static void spapr_machine_reset(MachineState *machine)
>          spapr_ovec_cleanup(spapr->ov5_cas);
>          spapr->ov5_cas = spapr_ovec_new();
>  
> -        ppc_set_compat(first_ppc_cpu, spapr->max_compat_pvr, &error_fatal);
> +        ppc_set_compat_all(spapr->max_compat_pvr, &error_fatal);
>      }
>  
>      /*
> 

As discussed on our call, I think this is a good fix, at least in the
medium term.  Please add some explanatory comments and post it
formally ASAP.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]


end of thread, other threads:[~2019-08-26  8:38 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-13  6:59 [Qemu-devel] [PULL 0/2] ppc-for-4.1 queue 20190813 David Gibson
2019-08-13  6:59 ` [Qemu-devel] [PULL 1/2] spapr: Reset CAS & IRQ subsystem after devices David Gibson
2019-08-22 20:07   ` [Qemu-devel] [Qemu-ppc] " Laurent Vivier
2019-08-23  5:39     ` David Gibson
2019-08-23 13:47       ` Laurent Vivier
2019-08-26  7:47         ` David Gibson
2019-08-13  6:59 ` [Qemu-devel] [PULL 2/2] spapr/xive: Fix migration of hot-plugged CPUs David Gibson
2019-08-13  9:23 ` [Qemu-devel] [PULL 0/2] ppc-for-4.1 queue 20190813 Peter Maydell
2019-08-13 11:45   ` Peter Maydell
2019-08-13 12:04     ` Cédric Le Goater
2019-08-13 14:16     ` David Gibson
