From: Laurent Vivier <lvivier@redhat.com>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: peter.maydell@linaro.org, groug@kaod.org, qemu-ppc@nongnu.org,
	clg@kaod.org, qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [Qemu-ppc] [PULL 1/2] spapr: Reset CAS & IRQ subsystem after devices
Date: Fri, 23 Aug 2019 15:47:52 +0200	[thread overview]
Message-ID: <ab12dbd1-b925-d384-986f-151f12e3a056@redhat.com> (raw)
In-Reply-To: <20190823053944.GC3027@umbus.fritz.box>

On 23/08/2019 07:39, David Gibson wrote:
> On Thu, Aug 22, 2019 at 10:07:09PM +0200, Laurent Vivier wrote:
>> On 13/08/2019 08:59, David Gibson wrote:
>>> This fixes a nasty regression in qemu-4.1 for the 'pseries' machine,
>>> caused by the new "dual" interrupt controller model.  Specifically,
>>> qemu can crash when used with KVM if a 'system_reset' is requested
>>> while there's active I/O in the guest.
>>>
>>> The problem is that in spapr_machine_reset() we:
>>>
>>> 1. Reset the CAS vector state
>>> 	spapr_ovec_cleanup(spapr->ov5_cas);
>>>
>>> 2. Reset all devices
>>> 	qemu_devices_reset()
>>>
>>> 3. Reset the irq subsystem
>>> 	spapr_irq_reset();
>>>
>>> However (1) implicitly changes the interrupt delivery mode, because
>>> whether we're using XICS or XIVE depends on the CAS state.  We don't
>>> properly initialize the new irq mode until (3) though - in particular
>>> setting up the KVM devices.
>>>
>>> During (2), we can temporarily drop the BQL allowing some irqs to be
>>> delivered which will go to an irq system that's not properly set up.
>>>
>>> Specifically, if the previous guest was in (KVM) XIVE mode, the CAS
>>> reset will put us back in XICS mode.  kvm_kernel_irqchip() still
>>> returns true, because XIVE was using KVM, however XICS doesn't have
>>> its KVM components initialized and kernel_xics_fd == -1.  When the irq
>>> is delivered it goes via ics_kvm_set_irq() which assert()s that
>>> kernel_xics_fd != -1.
>>>
>>> This change addresses the problem by delaying the CAS reset until
>>> after the devices reset.  The device reset should quiesce all the
>>> devices so we won't get irqs delivered while we mess around with the
>>> IRQ.  The CAS reset and irq re-initialize should also now be under the
>>> same BQL critical section so nothing else should be able to interrupt
>>> it either.
>>>
>>> We also move the spapr_irq_msi_reset() used in one of the legacy irq
>>> modes, since it logically makes sense at the same point as the
>>> spapr_irq_reset() (it's essentially an equivalent operation for older
>>> machine types).  Since we don't need to switch between different
>>> interrupt controllers for those old machine types it shouldn't
>>> actually be broken in those cases though.
>>>
>>> Cc: Cédric Le Goater <clg@kaod.org>
>>>
>>> Fixes: b2e22477 "spapr: add a 'reset' method to the sPAPR IRQ backend"
>>> Fixes: 13db0cd9 "spapr: introduce a new sPAPR IRQ backend supporting
>>>                  XIVE and XICS"
>>> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
>>> ---
>>>  hw/ppc/spapr.c | 24 ++++++++++++------------
>>>  1 file changed, 12 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>>> index 821f0d4a49..12ed4b065c 100644
>>> --- a/hw/ppc/spapr.c
>>> +++ b/hw/ppc/spapr.c
>>> @@ -1726,6 +1726,18 @@ static void spapr_machine_reset(MachineState *machine)
>>>          spapr_setup_hpt_and_vrma(spapr);
>>>      }
>>>  
>>> +    /*
>>> +     * NVLink2-connected GPU RAM needs to be placed on a separate NUMA node.
>>> +     * We assign a new numa ID per GPU in spapr_pci_collect_nvgpu() which is
>>> +     * called from vPHB reset handler so we initialize the counter here.
>>> +     * If no NUMA is configured from the QEMU side, we start from 1 as GPU RAM
>>> +     * must be equally distant from any other node.
>>> +     * The final value of spapr->gpu_numa_id is going to be written to
>>> +     * max-associativity-domains in spapr_build_fdt().
>>> +     */
>>> +    spapr->gpu_numa_id = MAX(1, nb_numa_nodes);
>>> +    qemu_devices_reset();
>>> +
>>>      /*
>>>       * If this reset wasn't generated by CAS, we should reset our
>>>       * negotiated options and start from scratch
>>> @@ -1741,18 +1753,6 @@ static void spapr_machine_reset(MachineState *machine)
>>>          spapr_irq_msi_reset(spapr);
>>>      }
>>>  
>>> -    /*
>>> -     * NVLink2-connected GPU RAM needs to be placed on a separate NUMA node.
>>> -     * We assign a new numa ID per GPU in spapr_pci_collect_nvgpu() which is
>>> -     * called from vPHB reset handler so we initialize the counter here.
>>> -     * If no NUMA is configured from the QEMU side, we start from 1 as GPU RAM
>>> -     * must be equally distant from any other node.
>>> -     * The final value of spapr->gpu_numa_id is going to be written to
>>> -     * max-associativity-domains in spapr_build_fdt().
>>> -     */
>>> -    spapr->gpu_numa_id = MAX(1, nb_numa_nodes);
>>> -    qemu_devices_reset();
>>> -
>>>      /*
>>>       * This is fixing some of the default configuration of the XIVE
>>>       * devices. To be called after the reset of the machine devices.
>>>
>>
>> This commit breaks migration between POWER8 <-> POWER9 hosts:
>>
>> qemu-system-ppc64: error while loading state for instance 0x1 of device 'cpu'
>> qemu-system-ppc64: load of migration failed: Operation not permitted
>>
>> Using a guest with a running 4.18 kernel (RHEL 8) and "-M pseries,max-cpu-compat=power8" on both sides.
>>
>> There is no problem if both hosts are of the same kind (P8 <-> P8 or P9 <-> P9).
> 
> Crud, I was afraid there might be some subtle dependency on the
> reverse order.

It seems the side effect of the patch in comment 5 is to add a supplementary field, compat_pvr, to the CPUs in the migration stream:

        {
            "name": "cpu",
            "instance_id": 0,
            "vmsd_name": "cpu",
            "version": 5,
...
            "subsections": [
...
                {
                    "vmsd_name": "cpu/compat",
                    "version": 1,
                    "fields": [
                        {
                            "name": "compat_pvr",
                            "type": "uint32",
                            "size": 4
                        }
                    ]
                }
            ]
        },
...

What seems to happen is that compat_pvr is not propagated correctly to
all CPUs.

Originally, spapr_machine_reset() called ppc_set_compat() to set the
value to max_compat_pvr for the first CPU, and this was propagated to
all CPUs by spapr_cpu_reset(). Now, as spapr_cpu_reset() is called
before that, the value is not propagated to the other CPUs.

A simple fix seems to be:

--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1752,7 +1752,7 @@ static void spapr_machine_reset(MachineState *machine)
         spapr_ovec_cleanup(spapr->ov5_cas);
         spapr->ov5_cas = spapr_ovec_new();
 
-        ppc_set_compat(first_ppc_cpu, spapr->max_compat_pvr, &error_fatal);
+        ppc_set_compat_all(spapr->max_compat_pvr, &error_fatal);
     }
 
     /*

Thanks,
Laurent



Thread overview: 11+ messages
2019-08-13  6:59 [Qemu-devel] [PULL 0/2] ppc-for-4.1 queue 20190813 David Gibson
2019-08-13  6:59 ` [Qemu-devel] [PULL 1/2] spapr: Reset CAS & IRQ subsystem after devices David Gibson
2019-08-22 20:07   ` [Qemu-devel] [Qemu-ppc] " Laurent Vivier
2019-08-23  5:39     ` David Gibson
2019-08-23 13:47       ` Laurent Vivier [this message]
2019-08-26  7:47         ` David Gibson
2019-08-13  6:59 ` [Qemu-devel] [PULL 2/2] spapr/xive: Fix migration of hot-plugged CPUs David Gibson
2019-08-13  9:23 ` [Qemu-devel] [PULL 0/2] ppc-for-4.1 queue 20190813 Peter Maydell
2019-08-13 11:45   ` Peter Maydell
2019-08-13 12:04     ` Cédric Le Goater
2019-08-13 14:16     ` David Gibson
