* XIVE VFIO kernel resample failure in INTx mode under heavy load
@ 2022-03-11 18:35 Timothy Pearson
  2022-03-11 18:53 ` Timothy Pearson
  2022-03-14 22:09 ` Alex Williamson
  0 siblings, 2 replies; 15+ messages in thread
From: Timothy Pearson @ 2022-03-11 18:35 UTC (permalink / raw)
  To: qemu-devel

All,

I've been struggling for some time with what is looking like a potential bug in QEMU/KVM on the POWER9 platform.  It appears that in XIVE mode, when the in-kernel IRQ chip is enabled, an external device that rapidly asserts IRQs via the legacy INTx level mechanism will only receive one interrupt in the KVM guest.

Changing any one of those items appears to avoid the glitch, e.g. XICS mode with the in-kernel IRQ chip works (all interrupts are passed through), and XIVE mode with the in-kernel IRQ chip disabled also works.  We are also not seeing any problems in XIVE mode with the in-kernel chip from MSI/MSI-X devices.

The device in question is a real-time card that needs to raise an interrupt every 1ms.  It works perfectly on the host, but fails in the guest -- with the in-kernel IRQ chip and XIVE enabled, it receives exactly one interrupt, at which point the host continues to see INTx+ but the guest sees INTx-, and the IRQ handler in the guest kernel is never reentered.

We have also seen some very rare glitches where, over a long period of time, we can enter a similar deadlock in XICS mode.  Disabling the in-kernel IRQ chip in XIVE mode will also lead to the lockup with this device, since the userspace IRQ emulation cannot keep up with the rapid interrupt firing (measurements show around 100ms required to process each interrupt in user mode).

My understanding is that the resample mechanism does some clever tricks with level IRQs, but that QEMU needs to check whether the IRQ is still asserted by the device on guest EOI.  Since a failure here would explain these symptoms, I'm wondering if there is a bug in either QEMU or KVM for POWER / pSeries (sPAPR) where the IRQ is not resampled and therefore not re-fired in the guest?

Unfortunately I lack the resources at the moment to dig through the QEMU codebase and try to find the bug.  Any IBMers here that might be able to help out?  I can provide access to a test setup if desired.

Thanks!


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: XIVE VFIO kernel resample failure in INTx mode under heavy load
  2022-03-11 18:35 XIVE VFIO kernel resample failure in INTx mode under heavy load Timothy Pearson
@ 2022-03-11 18:53 ` Timothy Pearson
  2022-03-14 22:09 ` Alex Williamson
  1 sibling, 0 replies; 15+ messages in thread
From: Timothy Pearson @ 2022-03-11 18:53 UTC (permalink / raw)
  To: qemu-devel

Correction -- the desynchronization appears to be on the DisINTx line.

Host:
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=slow >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-

Guest:
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=slow >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-

This is with the driver stuck, not receiving any interrupts in the guest despite the card issuing them every 1ms.

----- Original Message -----
> From: "Timothy Pearson" <tpearson@raptorengineering.com>
> To: "qemu-devel" <qemu-devel@nongnu.org>
> Sent: Friday, March 11, 2022 12:35:45 PM
> Subject: XIVE VFIO kernel resample failure in INTx mode under heavy load

> All,
> 
> I've been struggling for some time with what is looking like a potential bug in
> QEMU/KVM on the POWER9 platform.  It appears that in XIVE mode, when the
> in-kernel IRQ chip is enabled, an external device that rapidly asserts IRQs via
> the legacy INTx level mechanism will only receive one interrupt in the KVM
> guest.
> 
> Changing any one of those items appears to avoid the glitch, e.g. XICS mode with
> the in-kernel IRQ chip works (all interrupts are passed through), and XIVE mode
> with the in-kernel IRQ chip disabled also works.  We are also not seeing any
> problems in XIVE mode with the in-kernel chip from MSI/MSI-X devices.
> 
> The device in question is a real time card that needs to raise an interrupt
> every 1ms.  It works perfectly on the host, but fails in the guest -- with the
> in-kernel IRQ chip and XIVE enabled, it receives exactly one interrupt, at
> which point the host continues to see INTx+ but the guest sees INTX-, and the
> IRQ handler in the guest kernel is never reentered.
> 
> We have also seen some very rare glitches where, over a long period of time, we
> can enter a similar deadlock in XICS mode.  Disabling the in-kernel IRQ chip in
> XIVE mode will also lead to the lockup with this device, since the userspace
> IRQ emulation cannot keep up with the rapid interrupt firing (measurements show
> around 100ms required for processing each interrupt in the user mode).
> 
> My understanding is the resample mechanism does some clever tricks with level
> IRQs, but that QEMU needs to check if the IRQ is still asserted by the device
> on guest EOI.  Since a failure here would explain these symptoms I'm wondering
> if there is a bug in either QEMU or KVM for POWER / pSeries (SPAPr) where the
> IRQ is not resampled and therefore not re-fired in the guest?
> 
> Unfortunately I lack the resources at the moment to dig through the QEMU
> codebase and try to find the bug.  Any IBMers here that might be able to help
> out?  I can provide access to a test setup if desired.
> 
> Thanks!



* Re: XIVE VFIO kernel resample failure in INTx mode under heavy load
  2022-03-11 18:35 XIVE VFIO kernel resample failure in INTx mode under heavy load Timothy Pearson
  2022-03-11 18:53 ` Timothy Pearson
@ 2022-03-14 22:09 ` Alex Williamson
  2022-03-16 16:29   ` Cédric Le Goater
  2022-03-22  8:31   ` Alexey Kardashevskiy
  1 sibling, 2 replies; 15+ messages in thread
From: Alex Williamson @ 2022-03-14 22:09 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: Alexey Kardashevskiy, qemu-devel

[Cc +Alexey]

On Fri, 11 Mar 2022 12:35:45 -0600 (CST)
Timothy Pearson <tpearson@raptorengineering.com> wrote:

> All,
> 
> I've been struggling for some time with what is looking like a
> potential bug in QEMU/KVM on the POWER9 platform.  It appears that in
> XIVE mode, when the in-kernel IRQ chip is enabled, an external device
> that rapidly asserts IRQs via the legacy INTx level mechanism will
> only receive one interrupt in the KVM guest.
> 
> Changing any one of those items appears to avoid the glitch, e.g.
> XICS mode with the in-kernel IRQ chip works (all interrupts are
> passed through), and XIVE mode with the in-kernel IRQ chip disabled
> also works.  We are also not seeing any problems in XIVE mode with
> the in-kernel chip from MSI/MSI-X devices.
> 
> The device in question is a real time card that needs to raise an
> interrupt every 1ms.  It works perfectly on the host, but fails in
> the guest -- with the in-kernel IRQ chip and XIVE enabled, it
> receives exactly one interrupt, at which point the host continues to
> see INTx+ but the guest sees INTX-, and the IRQ handler in the guest
> kernel is never reentered.
> 
> We have also seen some very rare glitches where, over a long period
> of time, we can enter a similar deadlock in XICS mode.  Disabling the
> in-kernel IRQ chip in XIVE mode will also lead to the lockup with
> this device, since the userspace IRQ emulation cannot keep up with
> the rapid interrupt firing (measurements show around 100ms required
> for processing each interrupt in the user mode).
> 
> My understanding is the resample mechanism does some clever tricks
> with level IRQs, but that QEMU needs to check if the IRQ is still
> asserted by the device on guest EOI.  Since a failure here would
> explain these symptoms I'm wondering if there is a bug in either QEMU
> or KVM for POWER / pSeries (SPAPr) where the IRQ is not resampled and
> therefore not re-fired in the guest?
> 
> Unfortunately I lack the resources at the moment to dig through the
> QEMU codebase and try to find the bug.  Any IBMers here that might be
> able to help out?  I can provide access to a test setup if desired.

Your experiments with in-kernel vs QEMU irqchip would suggest to me
that both the device and the generic INTx handling code are working
correctly, though it's hard to say that definitively given the massive
timing differences.

As an experiment, does anything change with the "nointxmask=1" vfio-pci
module option?

Adding Alexey, I have zero XIVE knowledge myself. Thanks,

Alex




* Re: XIVE VFIO kernel resample failure in INTx mode under heavy load
  2022-03-14 22:09 ` Alex Williamson
@ 2022-03-16 16:29   ` Cédric Le Goater
  2022-03-16 19:16     ` Cédric Le Goater
  2022-03-22  8:31   ` Alexey Kardashevskiy
  1 sibling, 1 reply; 15+ messages in thread
From: Cédric Le Goater @ 2022-03-16 16:29 UTC (permalink / raw)
  To: Alex Williamson, Timothy Pearson
  Cc: Alexey Kardashevskiy, list@suse.de:PowerPC, qemu-devel

Hello,


> I've been struggling for some time with what is looking like a
> potential bug in QEMU/KVM on the POWER9 platform.  It appears that
> in XIVE mode, when the in-kernel IRQ chip is enabled, an external
> device that rapidly asserts IRQs via the legacy INTx level mechanism
> will only receive one interrupt in the KVM guest.

Indeed. I could reproduce with a pass-through PCI adapter using
'pci=nomsi'. The virtio devices operate correctly but the network
adapter only receives one event (*):


$ cat /proc/interrupts
            CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
  16:       2198       1378       1519       1216          0          0          0          0  XIVE-IPI   0 Edge      IPI-0
  17:          0          0          0          0       2003       1936       1335       1507  XIVE-IPI   1 Edge      IPI-1
  18:          0       6401          0          0          0          0          0          0  XIVE-IRQ 4609 Level     virtio3, virtio0, virtio2
  19:          0          0          0          0          0        204          0          0  XIVE-IRQ 4610 Level     virtio1
  20:          0          0          0          0          0          0          0          0  XIVE-IRQ 4608 Level     xhci-hcd:usb1
  21:          0          1          0          0          0          0          0          0  XIVE-IRQ 4612 Level     eth1 (*)
  23:          0          0          0          0          0          0          0          0  XIVE-IRQ 4096 Edge      RAS_EPOW
  24:          0          0          0          0          0          0          0          0  XIVE-IRQ 4592 Edge      hvc_console
  26:          0          0          0          0          0          0          0          0  XIVE-IRQ 4097 Edge      RAS_HOTPLUG

> Changing any one of those items appears to avoid the glitch, e.g. XICS

XICS is very different from XIVE. The driver implements the previous
interrupt controller architecture (P5-P8), and the hypervisor mediates
the delivery to the guest. With XIVE, vCPUs are signaled directly by
the IC. Under KVM, we use a different KVM device for each mode:

* KVM XIVE is a XICS-on-XIVE implementation (P9/P10 hosts) for guests
   not using the XIVE native interface. RHEL7, for instance.
* KVM XIVE native is a XIVE implementation (P9/P10 hosts) for guests
   using the XIVE native interface. Linux > 4.14.
* KVM XICS is for P8 hosts (no XIVE HW).

VFIO adds some complexity with the source events. I think the problem
comes from the assertion state. I will talk about it later.

> mode with the in-kernel IRQ chip works (all interrupts are passed
> through),

All interrupts are passed through with XIVE as well. Run 'info pic' in
the monitor. On the host, check the IRQ mapping in the debugfs files:

   /sys/kernel/debug/powerpc/kvm-xive-*

> and XIVE mode with the in-kernel IRQ chip disabled also works. 

In that case, no KVM device backs the QEMU device and all state
is in one place.

> We
> are also not seeing any problems in XIVE mode with the in-kernel
> chip from MSI/MSI-X devices.

Yes. pass-through devices are expected to operate correctly :)
  
> The device in question is a real time card that needs to raise an
> interrupt every 1ms.  It works perfectly on the host, but fails in
> the guest -- with the in-kernel IRQ chip and XIVE enabled, it
> receives exactly one interrupt, at which point the host continues to
> see INTx+ but the guest sees INTX-, and the IRQ handler in the guest
> kernel is never reentered.

ok. Same symptom as the scenario above.

> We have also seen some very rare glitches where, over a long period
> of time, we can enter a similar deadlock in XICS mode.

with the in-kernel XICS IRQ chip?

> Disabling
> the in-kernel IRQ chip in XIVE mode will also lead to the lockup
> with this device, since the userspace IRQ emulation cannot keep up
> with the rapid interrupt firing (measurements show around 100ms
> required for processing each interrupt in the user mode).

MSI emulation in QEMU is indeed slower (35%). LSI is very slow because
it is handled as a special case in the device/driver. To maintain the
assertion state, all LSI handling is done with a special HCALL,
H_INT_ESB, which is implemented in QEMU. This generates a lot of back
and forth in the KVM stack.
  
> My understanding is the resample mechanism does some clever tricks
> with level IRQs, but that QEMU needs to check if the IRQ is still
> asserted by the device on guest EOI.

Yes, the problem is in that area.

> Since a failure here would
> explain these symptoms I'm wondering if there is a bug in either
> QEMU or KVM for POWER / pSeries (SPAPr) where the IRQ is not
> resampled and therefore not re-fired in the guest?

KVM I would say. The assertion state is maintained in KVM for the KVM
XICS-on-XIVE implementation and in QEMU for the KVM XIVE native
device. These are good candidates. I will take a look.

(We might have never tested that path of the code with a passthrough
device using only INTx)
  
Thanks,

C.




* Re: XIVE VFIO kernel resample failure in INTx mode under heavy load
  2022-03-16 16:29   ` Cédric Le Goater
@ 2022-03-16 19:16     ` Cédric Le Goater
  2022-04-13  4:56       ` Alexey Kardashevskiy
  0 siblings, 1 reply; 15+ messages in thread
From: Cédric Le Goater @ 2022-03-16 19:16 UTC (permalink / raw)
  To: Alex Williamson, Timothy Pearson
  Cc: Alexey Kardashevskiy, list@suse.de:PowerPC, qemu-devel

Timothy,

On 3/16/22 17:29, Cédric Le Goater wrote:
> Hello,
> 
> 
>> I've been struggling for some time with what is looking like a
>> potential bug in QEMU/KVM on the POWER9 platform.  It appears that
>> in XIVE mode, when the in-kernel IRQ chip is enabled, an external
>> device that rapidly asserts IRQs via the legacy INTx level mechanism
>> will only receive one interrupt in the KVM guest.
> 
> Indeed. I could reproduce with a pass-through PCI adapter using
> 'pci=nomsi'. The virtio devices operate correctly but the network
> adapter only receives one event (*):
> 
> 
> $ cat /proc/interrupts
>             CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
>   16:       2198       1378       1519       1216          0          0          0          0  XIVE-IPI   0 Edge      IPI-0
>   17:          0          0          0          0       2003       1936       1335       1507  XIVE-IPI   1 Edge      IPI-1
>   18:          0       6401          0          0          0          0          0          0  XIVE-IRQ 4609 Level     virtio3, virtio0, virtio2
>   19:          0          0          0          0          0        204          0          0  XIVE-IRQ 4610 Level     virtio1
>   20:          0          0          0          0          0          0          0          0  XIVE-IRQ 4608 Level     xhci-hcd:usb1
>   21:          0          1          0          0          0          0          0          0  XIVE-IRQ 4612 Level     eth1 (*)
>   23:          0          0          0          0          0          0          0          0  XIVE-IRQ 4096 Edge      RAS_EPOW
>   24:          0          0          0          0          0          0          0          0  XIVE-IRQ 4592 Edge      hvc_console
>   26:          0          0          0          0          0          0          0          0  XIVE-IRQ 4097 Edge      RAS_HOTPLUG
> 
>> Changing any one of those items appears to avoid the glitch, e.g. XICS
> 
> XICS is very different from XIVE. The driver implements the previous
> interrupt controller architecture (P5-P8) and the hypervisor mediates
> the delivery to the guest. With XIVE, vCPUs are directly signaled by
> the IC. When under KVM, we use different KVM devices for each mode :
> 
> * KVM XIVE is a XICS-on-XIVE implementation (P9/P10 hosts) for guests
>    not using the XIVE native interface. RHEL7 for instance.
> * KVM XIVE native is a XIVE implementation (P9/P10 hosts) for guests
>    using the XIVE native interface. Linux > 4.14.
> * KVM XICS is for P8 hosts (no XIVE HW)
> 
> VFIO adds some complexity with the source events. I think the problem
> comes from the assertion state. I will talk about it later.
> 
>> mode with the in-kernel IRQ chip works (all interrupts are passed
>> through),
> 
> All interrupts are passed through using XIVE also. Run 'info pic' in
> the monitor. On the host, check the IRQ mapping in the debugfs file :
> 
>    /sys/kernel/debug/powerpc/kvm-xive-*
> 
>> and XIVE mode with the in-kernel IRQ chip disabled also works. 
> 
> In that case, no KVM device backs the QEMU device and all state
> is in one place.
> 
>> We
>> are also not seeing any problems in XIVE mode with the in-kernel
>> chip from MSI/MSI-X devices.
> 
> Yes. pass-through devices are expected to operate correctly :)
> 
>> The device in question is a real time card that needs to raise an
>> interrupt every 1ms.  It works perfectly on the host, but fails in
>> the guest -- with the in-kernel IRQ chip and XIVE enabled, it
>> receives exactly one interrupt, at which point the host continues to
>> see INTx+ but the guest sees INTX-, and the IRQ handler in the guest
>> kernel is never reentered.
> 
> ok. Same symptom as the scenario above.
> 
>> We have also seen some very rare glitches where, over a long period
>> of time, we can enter a similar deadlock in XICS mode.
> 
> with the in-kernel XICS IRQ chip ?
> 
>> Disabling
>> the in-kernel IRQ chip in XIVE mode will also lead to the lockup
>> with this device, since the userspace IRQ emulation cannot keep up
>> with the rapid interrupt firing (measurements show around 100ms
>> required for processing each interrupt in the user mode).
> 
> MSI emulation in QEMU is slower indeed (35%). LSI is very slow because
> it is handled as a special case in the device/driver. To maintain the
> assertion state, all LSI handling is done with a special HCALL :
> H_INT_ESB which is implemented in QEMU. This generates a lot of back
> and forth in the KVM stack.
> 
>> My understanding is the resample mechanism does some clever tricks
>> with level IRQs, but that QEMU needs to check if the IRQ is still
>> asserted by the device on guest EOI.
> 
> Yes. the problem is in that area.
> 
>> Since a failure here would
>> explain these symptoms I'm wondering if there is a bug in either
>> QEMU or KVM for POWER / pSeries (SPAPr) where the IRQ is not
>> resampled and therefore not re-fired in the guest?
> 
> KVM I would say. The assertion state is maintained in KVM for the KVM
> XICS-on-XIVE implementation and in QEMU for the KVM XIVE native
> device. These are good candidates. I will take a look.

All works fine with KVM_CAP_IRQFD_RESAMPLE=false in QEMU. Can you
please try this workaround for now? I could reach 934 Mbits/sec on the
passthrough device.

I clearly overlooked that part and it has been 3 years.

Thanks,

C.




* Re: XIVE VFIO kernel resample failure in INTx mode under heavy load
  2022-03-14 22:09 ` Alex Williamson
  2022-03-16 16:29   ` Cédric Le Goater
@ 2022-03-22  8:31   ` Alexey Kardashevskiy
  1 sibling, 0 replies; 15+ messages in thread
From: Alexey Kardashevskiy @ 2022-03-22  8:31 UTC (permalink / raw)
  To: Alex Williamson, Timothy Pearson; +Cc: qemu-devel



On 15/03/2022 09:09, Alex Williamson wrote:
> [Cc +Alexey]
> 
> On Fri, 11 Mar 2022 12:35:45 -0600 (CST)
> Timothy Pearson <tpearson@raptorengineering.com> wrote:
> 
>> All,
>>
>> I've been struggling for some time with what is looking like a
>> potential bug in QEMU/KVM on the POWER9 platform.  It appears that in
>> XIVE mode, when the in-kernel IRQ chip is enabled, an external device
>> that rapidly asserts IRQs via the legacy INTx level mechanism will
>> only receive one interrupt in the KVM guest.
>>
>> Changing any one of those items appears to avoid the glitch, e.g.
>> XICS mode with the in-kernel IRQ chip works (all interrupts are
>> passed through), and XIVE mode with the in-kernel IRQ chip disabled
>> also works.  We are also not seeing any problems in XIVE mode with
>> the in-kernel chip from MSI/MSI-X devices.
>>
>> The device in question is a real time card that needs to raise an
>> interrupt every 1ms.  It works perfectly on the host, but fails in
>> the guest -- with the in-kernel IRQ chip and XIVE enabled, it
>> receives exactly one interrupt, at which point the host continues to
>> see INTx+ but the guest sees INTX-, and the IRQ handler in the guest
>> kernel is never reentered.
>>
>> We have also seen some very rare glitches where, over a long period
>> of time, we can enter a similar deadlock in XICS mode.  Disabling the
>> in-kernel IRQ chip in XIVE mode will also lead to the lockup with
>> this device, since the userspace IRQ emulation cannot keep up with
>> the rapid interrupt firing (measurements show around 100ms required
>> for processing each interrupt in the user mode).
>>
>> My understanding is the resample mechanism does some clever tricks
>> with level IRQs, but that QEMU needs to check if the IRQ is still
>> asserted by the device on guest EOI.  Since a failure here would
>> explain these symptoms I'm wondering if there is a bug in either QEMU
>> or KVM for POWER / pSeries (SPAPr) where the IRQ is not resampled and
>> therefore not re-fired in the guest?
>>
>> Unfortunately I lack the resources at the moment to dig through the
>> QEMU codebase and try to find the bug.  Any IBMers here that might be
>> able to help out?  I can provide access to a test setup if desired.
> 
> Your experiments with in-kernel vs QEMU irqchip would suggest to me
> that both the device and the generic INTx handling code are working
> correctly, though it's hard to say that definitively given the massive
> timing differences.
> 
> As an experiment, does anything change with the "nointxmask=1" vfio-pci
> module option?
> 
> Adding Alexey, I have zero XIVE knowledge myself. Thanks,

Sorry about the delay, I'll get to it; I just need to first figure out 
the host crashes on >128GB VMs on POWER8 with passthrough :-/


--
Alexey



* Re: XIVE VFIO kernel resample failure in INTx mode under heavy load
  2022-03-16 19:16     ` Cédric Le Goater
@ 2022-04-13  4:56       ` Alexey Kardashevskiy
  2022-04-13  6:26         ` Alexey Kardashevskiy
  2022-04-14 12:31         ` Cédric Le Goater
  0 siblings, 2 replies; 15+ messages in thread
From: Alexey Kardashevskiy @ 2022-04-13  4:56 UTC (permalink / raw)
  To: Cédric Le Goater, Alex Williamson, Timothy Pearson
  Cc: Frederic Barrat, list@suse.de:PowerPC, qemu-devel,
	Nicholas Piggin, David Gibson



On 3/17/22 06:16, Cédric Le Goater wrote:
> Timothy,
> 
> On 3/16/22 17:29, Cédric Le Goater wrote:
>> Hello,
>>
>>
>>> I've been struggling for some time with what is looking like a
>>> potential bug in QEMU/KVM on the POWER9 platform.  It appears that
>>> in XIVE mode, when the in-kernel IRQ chip is enabled, an external
>>> device that rapidly asserts IRQs via the legacy INTx level mechanism
>>> will only receive one interrupt in the KVM guest.
>>
>> Indeed. I could reproduce with a pass-through PCI adapter using
>> 'pci=nomsi'. The virtio devices operate correctly but the network
>> adapter only receives one event (*):
>>
>>
>> $ cat /proc/interrupts
>>             CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
>>   16:       2198       1378       1519       1216          0          0          0          0  XIVE-IPI   0 Edge      IPI-0
>>   17:          0          0          0          0       2003       1936       1335       1507  XIVE-IPI   1 Edge      IPI-1
>>   18:          0       6401          0          0          0          0          0          0  XIVE-IRQ 4609 Level     virtio3, virtio0, virtio2
>>   19:          0          0          0          0          0        204          0          0  XIVE-IRQ 4610 Level     virtio1
>>   20:          0          0          0          0          0          0          0          0  XIVE-IRQ 4608 Level     xhci-hcd:usb1
>>   21:          0          1          0          0          0          0          0          0  XIVE-IRQ 4612 Level     eth1 (*)
>>   23:          0          0          0          0          0          0          0          0  XIVE-IRQ 4096 Edge      RAS_EPOW
>>   24:          0          0          0          0          0          0          0          0  XIVE-IRQ 4592 Edge      hvc_console
>>   26:          0          0          0          0          0          0          0          0  XIVE-IRQ 4097 Edge      RAS_HOTPLUG
>>
>>> Changing any one of those items appears to avoid the glitch, e.g. XICS
>>
>> XICS is very different from XIVE. The driver implements the previous
>> interrupt controller architecture (P5-P8) and the hypervisor mediates
>> the delivery to the guest. With XIVE, vCPUs are directly signaled by
>> the IC. When under KVM, we use different KVM devices for each mode :
>>
>> * KVM XIVE is a XICS-on-XIVE implementation (P9/P10 hosts) for guests
>>    not using the XIVE native interface. RHEL7 for instance.
>> * KVM XIVE native is a XIVE implementation (P9/P10 hosts) for guests
>>    using the XIVE native interface. Linux > 4.14.
>> * KVM XICS is for P8 hosts (no XIVE HW)
>>
>> VFIO adds some complexity with the source events. I think the problem
>> comes from the assertion state. I will talk about it later.
>>
>>> mode with the in-kernel IRQ chip works (all interrupts are passed
>>> through),
>>
>> All interrupts are passed through using XIVE also. Run 'info pic' in
>> the monitor. On the host, check the IRQ mapping in the debugfs file :
>>
>>    /sys/kernel/debug/powerpc/kvm-xive-*
>>
>>> and XIVE mode with the in-kernel IRQ chip disabled also works. 
>>
>> In that case, no KVM device backs the QEMU device and all state
>> is in one place.
>>
>>> We
>>> are also not seeing any problems in XIVE mode with the in-kernel
>>> chip from MSI/MSI-X devices.
>>
>> Yes. pass-through devices are expected to operate correctly :)
>>
>>> The device in question is a real time card that needs to raise an
>>> interrupt every 1ms.  It works perfectly on the host, but fails in
>>> the guest -- with the in-kernel IRQ chip and XIVE enabled, it
>>> receives exactly one interrupt, at which point the host continues to
>>> see INTx+ but the guest sees INTX-, and the IRQ handler in the guest
>>> kernel is never reentered.
>>
>> ok. Same symptom as the scenario above.
>>
>>> We have also seen some very rare glitches where, over a long period
>>> of time, we can enter a similar deadlock in XICS mode.
>>
>> with the in-kernel XICS IRQ chip ?
>>
>>> Disabling
>>> the in-kernel IRQ chip in XIVE mode will also lead to the lockup
>>> with this device, since the userspace IRQ emulation cannot keep up
>>> with the rapid interrupt firing (measurements show around 100ms
>>> required for processing each interrupt in the user mode).
>>
>> MSI emulation in QEMU is slower indeed (35%). LSI is very slow because
>> it is handled as a special case in the device/driver. To maintain the
>> assertion state, all LSI handling is done with a special HCALL :
>> H_INT_ESB which is implemented in QEMU. This generates a lot of back
>> and forth in the KVM stack.
>>
>>> My understanding is the resample mechanism does some clever tricks
>>> with level IRQs, but that QEMU needs to check if the IRQ is still
>>> asserted by the device on guest EOI.
>>
>> Yes. the problem is in that area.
>>
>>> Since a failure here would
>>> explain these symptoms I'm wondering if there is a bug in either
>>> QEMU or KVM for POWER / pSeries (SPAPr) where the IRQ is not
>>> resampled and therefore not re-fired in the guest?
>>
>> KVM I would say. The assertion state is maintained in KVM for the KVM
>> XICS-on-XIVE implementation and in QEMU for the KVM XIVE native
>> device. These are good candidates. I will take a look.
> 
> All works fine with KVM_CAP_IRQFD_RESAMPLE=false in QEMU. Can you please
> try this workaround for now ? I could reach 934 Mbits/sec on the passthru
> device.
> 
> I clearly overlooked that part and it has been 3 years.


Disabling KVM_CAP_IRQFD_RESAMPLE on XIVE-capable machines seems to be 
the right fix, actually.

XIVE == baremetal/VM POWER9 and newer.
XICS == baremetal/VM POWER8 and older, or VMs on any POWER (backward 
compat.).

Tested on POWER9 with a passed-through XHCI host, "-append pci=nomsi", 
and "-machine pseries,ic-mode=xics,kernel_irqchip=on" (and s/xics/xive/).

When it is XIVE-on-XIVE (host and guest both use XIVE), INTx is 
emulated in QEMU's H_INT_ESB handler, and IRQFD_RESAMPLE is simply 
useless in that case (it is designed to eliminate going to userspace 
for the EOI->INTx unmasking): there is no pathway to call the 
eventfd's irqfd_resampler_ack() from QEMU. So the VM's XHCI device 
receives exactly one interrupt and that is it. "kernel_irqchip=off" 
fixes it (obviously).

When it is XICS-on-XIVE (host is XIVE, guest is XICS), the VM receives 
100000 interrupts and then freezes (__report_bad_irq() is called). 
This happens because, unlike XICS-on-XICS, the host XIVE's 
xive_(rm|vm)_h_eoi() does not call irqfd_resampler_ack(). This fixes it:

=============
diff --git a/arch/powerpc/kvm/book3s_xive_template.c b/arch/powerpc/kvm/book3s_xive_template.c
index b0015e05d99a..9f0d8e5c7f4b 100644
--- a/arch/powerpc/kvm/book3s_xive_template.c
+++ b/arch/powerpc/kvm/book3s_xive_template.c
@@ -595,6 +595,8 @@ X_STATIC int GLUE(X_PFX,h_eoi)(struct kvm_vcpu *vcpu, unsigned long xirr)
 	xc->hw_cppr = xc->cppr;
 	__x_writeb(xc->cppr, __x_tima + TM_QW1_OS + TM_CPPR);
 
+	kvm_notify_acked_irq(vcpu->kvm, 0, irq);
+
 	return rc;
 }
=============

The host's XICS does call kvm_notify_acked_irq() (I did not test that, 
but the code seems to be doing so).

After re-reading what I just wrote, I am leaning towards disabling the 
use of KVM_CAP_IRQFD_RESAMPLE, as it seems to have last worked on 
POWER8 and never since :)

Did I miss something in the picture (hey Cedric)?


--
Alexey



* Re: XIVE VFIO kernel resample failure in INTx mode under heavy load
  2022-04-13  4:56       ` Alexey Kardashevskiy
@ 2022-04-13  6:26         ` Alexey Kardashevskiy
  2022-04-14 12:41           ` Cédric Le Goater
  2022-04-14 12:31         ` Cédric Le Goater
  1 sibling, 1 reply; 15+ messages in thread
From: Alexey Kardashevskiy @ 2022-04-13  6:26 UTC (permalink / raw)
  To: Cédric Le Goater, Alex Williamson, Timothy Pearson
  Cc: Frederic Barrat, list@suse.de:PowerPC, qemu-devel,
	Nicholas Piggin, David Gibson



On 4/13/22 14:56, Alexey Kardashevskiy wrote:
> 
> 
> On 3/17/22 06:16, Cédric Le Goater wrote:
>> Timothy,
>>
>> On 3/16/22 17:29, Cédric Le Goater wrote:
>>> Hello,
>>>
>>>
>>>> I've been struggling for some time with what is looking like a
>>>> potential bug in QEMU/KVM on the POWER9 platform.  It appears that
>>>> in XIVE mode, when the in-kernel IRQ chip is enabled, an external
>>>> device that rapidly asserts IRQs via the legacy INTx level mechanism
>>>> will only receive one interrupt in the KVM guest.
>>>
>>> Indeed. I could reproduce with a pass-through PCI adapter using
>>> 'pci=nomsi'. The virtio devices operate correctly but the network
>>> adapter only receives one event (*):
>>>
>>>
>>> $ cat /proc/interrupts
>>>             CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
>>>   16:       2198       1378       1519       1216          0          0          0          0  XIVE-IPI   0 Edge      IPI-0
>>>   17:          0          0          0          0       2003       1936       1335       1507  XIVE-IPI   1 Edge      IPI-1
>>>   18:          0       6401          0          0          0          0          0          0  XIVE-IRQ 4609 Level     virtio3, virtio0, virtio2
>>>   19:          0          0          0          0          0        204          0          0  XIVE-IRQ 4610 Level     virtio1
>>>   20:          0          0          0          0          0          0          0          0  XIVE-IRQ 4608 Level     xhci-hcd:usb1
>>>   21:          0          1          0          0          0          0          0          0  XIVE-IRQ 4612 Level     eth1 (*)
>>>   23:          0          0          0          0          0          0          0          0  XIVE-IRQ 4096 Edge      RAS_EPOW
>>>   24:          0          0          0          0          0          0          0          0  XIVE-IRQ 4592 Edge      hvc_console
>>>   26:          0          0          0          0          0          0          0          0  XIVE-IRQ 4097 Edge      RAS_HOTPLUG
>>>
>>>> Changing any one of those items appears to avoid the glitch, e.g. XICS
>>>
>>> XICS is very different from XIVE. The driver implements the previous
>>> interrupt controller architecture (P5-P8) and the hypervisor mediates
>>> the delivery to the guest. With XIVE, vCPUs are directly signaled by
>>> the IC. When under KVM, we use different KVM devices for each mode :
>>>
>>> * KVM XIVE is a XICS-on-XIVE implementation (P9/P10 hosts) for guests
>>>    not using the XIVE native interface. RHEL7 for instance.
>>> * KVM XIVE native is a XIVE implementation (P9/P10 hosts) for guests
>>>    using the XIVE native interface. Linux > 4.14.
>>> * KVM XICS is for P8 hosts (no XIVE HW)
>>>
>>> VFIO adds some complexity with the source events. I think the problem
>>> comes from the assertion state. I will talk about it later.
>>>
>>>> mode with the in-kernel IRQ chip works (all interrupts are passed
>>>> through),
>>>
>>> All interrupts are passed through using XIVE also. Run 'info pic' in
>>> the monitor. On the host, check the IRQ mapping in the debugfs file :
>>>
>>>    /sys/kernel/debug/powerpc/kvm-xive-*
>>>
>>>> and XIVE mode with the in-kernel IRQ chip disabled also works. 
>>>
>>> In that case, no KVM device backs the QEMU device and all state
>>> is in one place.
>>>
>>>> We
>>>> are also not seeing any problems in XIVE mode with the in-kernel
>>>> chip from MSI/MSI-X devices.
>>>
>>> Yes. pass-through devices are expected to operate correctly :)
>>>
>>>> The device in question is a real time card that needs to raise an
>>>> interrupt every 1ms.  It works perfectly on the host, but fails in
>>>> the guest -- with the in-kernel IRQ chip and XIVE enabled, it
>>>> receives exactly one interrupt, at which point the host continues to
>>>> see INTx+ but the guest sees INTX-, and the IRQ handler in the guest
>>>> kernel is never reentered.
>>>
>>> ok. Same symptom as the scenario above.
>>>
>>>> We have also seen some very rare glitches where, over a long period
>>>> of time, we can enter a similar deadlock in XICS mode.
>>>
>>> with the in-kernel XICS IRQ chip ?
>>>
>>>> Disabling
>>>> the in-kernel IRQ chip in XIVE mode will also lead to the lockup
>>>> with this device, since the userspace IRQ emulation cannot keep up
>>>> with the rapid interrupt firing (measurements show around 100ms
>>>> required for processing each interrupt in the user mode).
>>>
>>> MSI emulation in QEMU is slower indeed (35%). LSI is very slow because
>>> it is handled as a special case in the device/driver. To maintain the
>>> assertion state, all LSI handling is done with a special HCALL :
>>> H_INT_ESB which is implemented in QEMU. This generates a lot of back
>>> and forth in the KVM stack.
>>>
>>>> My understanding is the resample mechanism does some clever tricks
>>>> with level IRQs, but that QEMU needs to check if the IRQ is still
>>>> asserted by the device on guest EOI.
>>>
>>> Yes. the problem is in that area.
>>>
>>>> Since a failure here would
>>>> explain these symptoms I'm wondering if there is a bug in either
>>>> QEMU or KVM for POWER / pSeries (SPAPr) where the IRQ is not
>>>> resampled and therefore not re-fired in the guest?
>>>
>>> KVM I would say. The assertion state is maintained in KVM for the KVM
>>> XICS-on-XIVE implementation and in QEMU for the KVM XIVE native
>>> device. These are good candidates. I will take a look.
>>
>> All works fine with KVM_CAP_IRQFD_RESAMPLE=false in QEMU. Can you please
>> try this workaround for now ? I could reach 934 Mbits/sec on the passthru
>> device.
>>
>> I clearly overlooked that part and it has been 3 years.
> 
> 
> Disabling KVM_CAP_IRQFD_RESAMPLE on XIVE-capable machines seems to be 
> the right fix actually.
> 
> XIVE == baremetal/vm POWER9 and newer.
> XICS == baremetal/vm POWER8 and older, or VMs on any POWER (backward 
> compat.).
> 
> Tested on POWER9 with a passed through XHCI host and "-append pci=nomsi" 
> and "-machine pseries,ic-mode=xics,kernel_irqchip=on" (and s/xics/xive/).
> 
> When it is XIVE-on-XIVE (host and guest are XIVE), INTx is emulated in 
> the QEMU's H_INT_ESB handler and IRQFD_RESAMPLE is just useless in such 
> case (as it is designed to eliminate going to the userspace for the 
> EOI->INTx unmasking) and there is no pathway to call the eventfd's 
> irqfd_resampler_ack() from QEMU. So the VM's XHCI device receives 
> exactly 1 interrupt and that is it. "kernel_irqchip=off" fixes it 
> (obviously).
> 
> When it is XICS-on-XIVE (host is XIVE and guest is XICS), then the VM 
> receives 100000 interrupts and then it gets frozen (__report_bad_irq() 
> is called). Which happens because (unlike XICS-on-XICS), the host XIVE's 
> xive_(rm|vm)_h_eoi() does not call irqfd_resampler_ack(). This fixes it:
> 
> =============
> diff --git a/arch/powerpc/kvm/book3s_xive_template.c 
> b/arch/powerpc/kvm/book3s_xive_template.c
> index b0015e05d99a..9f0d8e5c7f4b 100644
> --- a/arch/powerpc/kvm/book3s_xive_template.c
> +++ b/arch/powerpc/kvm/book3s_xive_template.c
> @@ -595,6 +595,8 @@ X_STATIC int GLUE(X_PFX,h_eoi)(struct kvm_vcpu 
> *vcpu, unsigned long xirr)
>          xc->hw_cppr = xc->cppr;
>          __x_writeb(xc->cppr, __x_tima + TM_QW1_OS + TM_CPPR);
> 
> 
> +       kvm_notify_acked_irq(vcpu->kvm, 0, irq);
> +
>          return rc;
>   }
> =============
> 
> The host's XICS does call kvm_notify_acked_irq() (I did not test that 
> but the code seems to be doing so).
> 
> After re-reading what I just wrote, I am leaning towards disabling use 
> of KVM_CAP_IRQFD_RESAMPLE as it seems last worked on POWER8 and never 
> since :)
> 
> Did I miss something in the picture (hey Cedric)?



How about disabling it like this?

=====
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 5bfd4aa9e5aa..c999f7b1ab1b 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -732,7 +732,7 @@ static PCIINTxRoute spapr_route_intx_pin_to_irq(void 
*opaque, int pin)
      SpaprPhbState *sphb = SPAPR_PCI_HOST_BRIDGE(opaque);
      PCIINTxRoute route;

-    route.mode = PCI_INTX_ENABLED;
+    route.mode = PCI_INTX_DISABLED;

=====
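The effect of that one-line change can be sketched as follows (toy model, hedged: `vfio_intx_enable` and the returned path names are illustrative, not QEMU's actual code). VFIO only wires up the KVM irqfd/resamplefd fast path when the PHB reports a usable INTx route, so reporting PCI_INTX_DISABLED forces the userspace fallback:

```python
# Toy model: how the INTx route reported by the PHB gates the KVM
# resample fast path in a VFIO-like setup. Illustrative names only.

PCI_INTX_ENABLED, PCI_INTX_DISABLED = "enabled", "disabled"

def vfio_intx_enable(route_mode):
    """Return which delivery path a VFIO-like layer would pick."""
    if route_mode == PCI_INTX_ENABLED:
        return "kvm-irqfd-resample"   # fast path, broken on XIVE hosts
    return "userspace-eventfd"        # slower, but resampling works

print(vfio_intx_enable(PCI_INTX_DISABLED))   # -> userspace-eventfd
```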


(btw what the heck is PCI_INTX_INVERTED for?)


--
Alexey


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: XIVE VFIO kernel resample failure in INTx mode under heavy load
  2022-04-13  4:56       ` Alexey Kardashevskiy
  2022-04-13  6:26         ` Alexey Kardashevskiy
@ 2022-04-14 12:31         ` Cédric Le Goater
  2022-04-19  1:55           ` Alexey Kardashevskiy
  1 sibling, 1 reply; 15+ messages in thread
From: Cédric Le Goater @ 2022-04-14 12:31 UTC (permalink / raw)
  To: Alexey Kardashevskiy, Alex Williamson, Timothy Pearson
  Cc: Frederic Barrat, list@suse.de:PowerPC, qemu-devel,
	Nicholas Piggin, David Gibson

Hello Alexey,

Thanks for taking over.


On 4/13/22 06:56, Alexey Kardashevskiy wrote:
> 
> 
> On 3/17/22 06:16, Cédric Le Goater wrote:
>> Timothy,
>>
>> On 3/16/22 17:29, Cédric Le Goater wrote:
>>> Hello,
>>>
>>>
>>>> I've been struggling for some time with what is looking like a
>>>> potential bug in QEMU/KVM on the POWER9 platform.  It appears that
>>>> in XIVE mode, when the in-kernel IRQ chip is enabled, an external
>>>> device that rapidly asserts IRQs via the legacy INTx level mechanism
>>>> will only receive one interrupt in the KVM guest.
>>>
>>> Indeed. I could reproduce with a pass-through PCI adapter using
>>> 'pci=nomsi'. The virtio devices operate correctly but the network
>>> adapter only receives one event (*):
>>>
>>>
>>> $ cat /proc/interrupts
>>>             CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
>>>   16:       2198       1378       1519       1216          0          0          0          0  XIVE-IPI   0 Edge      IPI-0
>>>   17:          0          0          0          0       2003       1936       1335       1507  XIVE-IPI   1 Edge      IPI-1
>>>   18:          0       6401          0          0          0          0          0          0  XIVE-IRQ 4609 Level     virtio3, virtio0, virtio2
>>>   19:          0          0          0          0          0        204          0          0  XIVE-IRQ 4610 Level     virtio1
>>>   20:          0          0          0          0          0          0          0          0  XIVE-IRQ 4608 Level     xhci-hcd:usb1
>>>   21:          0          1          0          0          0          0          0          0  XIVE-IRQ 4612 Level     eth1 (*)
>>>   23:          0          0          0          0          0          0          0          0  XIVE-IRQ 4096 Edge      RAS_EPOW
>>>   24:          0          0          0          0          0          0          0          0  XIVE-IRQ 4592 Edge      hvc_console
>>>   26:          0          0          0          0          0          0          0          0  XIVE-IRQ 4097 Edge      RAS_HOTPLUG
>>>
>>>> Changing any one of those items appears to avoid the glitch, e.g. XICS
>>>
>>> XICS is very different from XIVE. The driver implements the previous
>>> interrupt controller architecture (P5-P8) and the hypervisor mediates
>>> the delivery to the guest. With XIVE, vCPUs are directly signaled by
>>> the IC. When under KVM, we use different KVM devices for each mode :
>>>
>>> * KVM XIVE is a XICS-on-XIVE implementation (P9/P10 hosts) for guests
>>>    not using the XIVE native interface. RHEL7 for instance.
>>> * KVM XIVE native is a XIVE implementation (P9/P10 hosts) for guests
>>>    using the XIVE native interface. Linux > 4.14.
>>> * KVM XICS is for P8 hosts (no XIVE HW)
>>>
>>> VFIO adds some complexity with the source events. I think the problem
>>> comes from the assertion state. I will talk about it later.
>>>
>>>> mode with the in-kernel IRQ chip works (all interrupts are passed
>>>> through),
>>>
>>> All interrupts are passed through using XIVE also. Run 'info pic' in
>>> the monitor. On the host, check the IRQ mapping in the debugfs file :
>>>
>>>    /sys/kernel/debug/powerpc/kvm-xive-*
>>>
>>>> and XIVE mode with the in-kernel IRQ chip disabled also works. 
>>>
>>> In that case, no KVM device backs the QEMU device and all state
>>> is in one place.
>>>
>>>> We
>>>> are also not seeing any problems in XIVE mode with the in-kernel
>>>> chip from MSI/MSI-X devices.
>>>
>>> Yes. pass-through devices are expected to operate correctly :)
>>>
>>>> The device in question is a real time card that needs to raise an
>>>> interrupt every 1ms.  It works perfectly on the host, but fails in
>>>> the guest -- with the in-kernel IRQ chip and XIVE enabled, it
>>>> receives exactly one interrupt, at which point the host continues to
>>>> see INTx+ but the guest sees INTX-, and the IRQ handler in the guest
>>>> kernel is never reentered.
>>>
>>> ok. Same symptom as the scenario above.
>>>
>>>> We have also seen some very rare glitches where, over a long period
>>>> of time, we can enter a similar deadlock in XICS mode.
>>>
>>> with the in-kernel XICS IRQ chip ?
>>>
>>>> Disabling
>>>> the in-kernel IRQ chip in XIVE mode will also lead to the lockup
>>>> with this device, since the userspace IRQ emulation cannot keep up
>>>> with the rapid interrupt firing (measurements show around 100ms
>>>> required for processing each interrupt in the user mode).
>>>
>>> MSI emulation in QEMU is slower indeed (35%). LSI is very slow because
>>> it is handled as a special case in the device/driver. To maintain the
>>> assertion state, all LSI handling is done with a special HCALL :
>>> H_INT_ESB which is implemented in QEMU. This generates a lot of back
>>> and forth in the KVM stack.
>>>
>>>> My understanding is the resample mechanism does some clever tricks
>>>> with level IRQs, but that QEMU needs to check if the IRQ is still
>>>> asserted by the device on guest EOI.
>>>
>>> Yes. the problem is in that area.
>>>
>>>> Since a failure here would
>>>> explain these symptoms I'm wondering if there is a bug in either
>>>> QEMU or KVM for POWER / pSeries (SPAPr) where the IRQ is not
>>>> resampled and therefore not re-fired in the guest?
>>>
>>> KVM I would say. The assertion state is maintained in KVM for the KVM
>>> XICS-on-XIVE implementation and in QEMU for the KVM XIVE native
>>> device. These are good candidates. I will take a look.
>>
>> All works fine with KVM_CAP_IRQFD_RESAMPLE=false in QEMU. Can you please
>> try this workaround for now ? I could reach 934 Mbits/sec on the passthru
>> device.
>>
>> I clearly overlooked that part and it has been 3 years.
> 
> 
> Disabling KVM_CAP_IRQFD_RESAMPLE on XIVE-capable machines seems to be the right fix actually.
> 
> XIVE == baremetal/vm POWER9 and newer.
> XICS == baremetal/vm POWER8 and older, or VMs on any POWER (backward compat.).

yes. You can force XICS on POWER9 using 'max-cpu-compat' or 'ic-mode'.

> Tested on POWER9 with a passed through XHCI host and "-append pci=nomsi" and "-machine pseries,ic-mode=xics,kernel_irqchip=on" (and s/xics/xive/).

ok. This is deactivating the default XIVE (P9+) mode at the platform level,
hence forcing the XICS (P8) mode in a POWER9 guest running on a POWER9 host.
It is also deactivating MSI, forcing INTx usage in the kernel and forcing
the use of the KVM irqchip device to make sure we are not emulating in QEMU.

We are far from the default scenario but this is it !

> When it is XIVE-on-XIVE (host and guest are XIVE), 

We call this mode : XIVE native, or exploitation, mode. Anyhow, it is always
XIVE under the hood on a POWER9/POWER10 box.

> INTx is emulated in the QEMU's H_INT_ESB handler 

LSI are indeed all handled at the QEMU level with the H_INT_ESB hcall.
If I remember well, this is because we wanted a simple way to synthesize
the interrupt trigger upon EOI when the level is still asserted. Doing
this way is compatible for both kernel_irqchip=off/on modes because the
level is maintained in QEMU.

This is different for the other two XICS KVM devices which maintain the
assertion level in KVM.
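That "synthesize the trigger upon EOI while the level is still asserted" behaviour can be sketched like this (toy model in Python, not QEMU's actual internals; `LsiSource` and `esb_eoi` are made-up names standing in for the H_INT_ESB handling described above):

```python
# Toy model of userspace LSI handling: the assertion level is tracked
# in userspace, and an EOI re-fires the interrupt if the device still
# holds the line. Illustrative only.

class LsiSource:
    def __init__(self):
        self.asserted = False   # current INTx level, kept in userspace
        self.fired = 0          # events pushed to the guest

    def set_level(self, level):
        was = self.asserted
        self.asserted = level
        if level and not was:
            self.fired += 1     # rising edge: deliver to the guest

    def esb_eoi(self):
        # EOI path: if the device still holds the line, synthesize a
        # new trigger so the guest sees the still-pending level.
        if self.asserted:
            self.fired += 1

src = LsiSource()
src.set_level(True)   # device asserts INTx -> first delivery
src.esb_eoi()         # guest EOIs while line still high -> re-fired
src.set_level(False)  # device finally deasserts
src.esb_eoi()         # nothing pending, no new event
print(src.fired)      # -> 2
```

Because the level lives in userspace, this works identically with kernel_irqchip=off or on, which is the compatibility point made above.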

> and IRQFD_RESAMPLE is just useless in such case (as it is designed to eliminate going to the userspace for the EOI->INTx unmasking) and there is no pathway to call the eventfd's irqfd_resampler_ack() from QEMU. So the VM's XHCI device receives exactly 1 interrupt and that is it. "kernel_irqchip=off" fixes it (obviously).

yes.

> When it is XICS-on-XIVE (host is XIVE and guest is XICS), 

yes (FYI, we have similar glue in skiboot ...)

> then the VM receives 100000 interrupts and then it gets frozen (__report_bad_irq() is called). Which happens because (unlike XICS-on-XICS), the host XIVE's xive_(rm|vm)_h_eoi() does not call irqfd_resampler_ack(). This fixes it:
> 
> =============
> diff --git a/arch/powerpc/kvm/book3s_xive_template.c b/arch/powerpc/kvm/book3s_xive_template.c
> index b0015e05d99a..9f0d8e5c7f4b 100644
> --- a/arch/powerpc/kvm/book3s_xive_template.c
> +++ b/arch/powerpc/kvm/book3s_xive_template.c
> @@ -595,6 +595,8 @@ X_STATIC int GLUE(X_PFX,h_eoi)(struct kvm_vcpu *vcpu, unsigned long xirr)
>          xc->hw_cppr = xc->cppr;
>          __x_writeb(xc->cppr, __x_tima + TM_QW1_OS + TM_CPPR);
> 
> 
> +       kvm_notify_acked_irq(vcpu->kvm, 0, irq);
> +
>          return rc;
>   }
> =============

OK. XICS-on-XIVE is also broken then :/ what about XIVE-on-XIVE ?

> 
> The host's XICS does call kvm_notify_acked_irq() (I did not test that but the code seems to be doing so).
> 
> After re-reading what I just wrote, I am leaning towards disabling use of KVM_CAP_IRQFD_RESAMPLE as it seems last worked on POWER8 and never since :)

and it would fix XIVE-on-XIVE.

Are you saying that passthru on POWER8 is broken ? fully or only INTx ?

> Did I miss something in the picture (hey Cedric)?

You seem to have all combinations in mind: host OS, KVM, QEMU, guest OS.

For the record, here is a documentation we did:

   https://qemu.readthedocs.io/en/latest/specs/ppc-spapr-xive.html

It might need some updates.

Thanks,

C.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: XIVE VFIO kernel resample failure in INTx mode under heavy load
  2022-04-13  6:26         ` Alexey Kardashevskiy
@ 2022-04-14 12:41           ` Cédric Le Goater
  2022-04-21  3:07             ` Alexey Kardashevskiy
  0 siblings, 1 reply; 15+ messages in thread
From: Cédric Le Goater @ 2022-04-14 12:41 UTC (permalink / raw)
  To: Alexey Kardashevskiy, Alex Williamson, Timothy Pearson
  Cc: Frederic Barrat, list@suse.de:PowerPC, qemu-devel,
	Nicholas Piggin, David Gibson


>> After re-reading what I just wrote, I am leaning towards disabling use of KVM_CAP_IRQFD_RESAMPLE as it seems last worked on POWER8 and never since :)
>>
>> Did I miss something in the picture (hey Cedric)?
> 
> How about disabling it like this?
> 
> =====
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index 5bfd4aa9e5aa..c999f7b1ab1b 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -732,7 +732,7 @@ static PCIINTxRoute spapr_route_intx_pin_to_irq(void *opaque, int pin)
>       SpaprPhbState *sphb = SPAPR_PCI_HOST_BRIDGE(opaque);
>       PCIINTxRoute route;
> 
> -    route.mode = PCI_INTX_ENABLED;
> +    route.mode = PCI_INTX_DISABLED;
> 
> =====

I like it.


You now know how to test all the combinations :) Prepare your matrix,
variables are :

  * Host OS		POWER8, POWER9+
  * KVM device		XICS (P8), XICS-on-XIVE (P9), XIVE-on-XIVE (P9)
  * kernel_irqchip	off, on
  * ic-mode		xics, xive
  * Guest OS		msi or nomsi

Ideally you should check TCG, but that's like kernel_irqchip=off.

Cheers,

C.
  
> 
> (btw what the heck is PCI_INTX_INVERTED for?)
> 
> 
> -- 
> Alexey



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: XIVE VFIO kernel resample failure in INTx mode under heavy load
  2022-04-14 12:31         ` Cédric Le Goater
@ 2022-04-19  1:55           ` Alexey Kardashevskiy
  2022-04-19  7:35             ` Cédric Le Goater
  0 siblings, 1 reply; 15+ messages in thread
From: Alexey Kardashevskiy @ 2022-04-19  1:55 UTC (permalink / raw)
  To: Cédric Le Goater, Alex Williamson, Timothy Pearson
  Cc: Frederic Barrat, list@suse.de:PowerPC, qemu-devel,
	Nicholas Piggin, David Gibson



On 14/04/2022 22:31, Cédric Le Goater wrote:
> Hello Alexey,
> 
> Thanks for taking over.
> 
> 
> On 4/13/22 06:56, Alexey Kardashevskiy wrote:
>>
>>
>> On 3/17/22 06:16, Cédric Le Goater wrote:
>>> Timothy,
>>>
>>> On 3/16/22 17:29, Cédric Le Goater wrote:
>>>> Hello,
>>>>
>>>>
>>>>> I've been struggling for some time with what is looking like a
>>>>> potential bug in QEMU/KVM on the POWER9 platform.  It appears that
>>>>> in XIVE mode, when the in-kernel IRQ chip is enabled, an external
>>>>> device that rapidly asserts IRQs via the legacy INTx level mechanism
>>>>> will only receive one interrupt in the KVM guest.
>>>>
>>>> Indeed. I could reproduce with a pass-through PCI adapter using
>>>> 'pci=nomsi'. The virtio devices operate correctly but the network
>>>> adapter only receives one event (*):
>>>>
>>>>
>>>> $ cat /proc/interrupts
>>>>             CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
>>>>   16:       2198       1378       1519       1216          0          0          0          0  XIVE-IPI   0 Edge      IPI-0
>>>>   17:          0          0          0          0       2003       1936       1335       1507  XIVE-IPI   1 Edge      IPI-1
>>>>   18:          0       6401          0          0          0          0          0          0  XIVE-IRQ 4609 Level     virtio3, virtio0, virtio2
>>>>   19:          0          0          0          0          0        204          0          0  XIVE-IRQ 4610 Level     virtio1
>>>>   20:          0          0          0          0          0          0          0          0  XIVE-IRQ 4608 Level     xhci-hcd:usb1
>>>>   21:          0          1          0          0          0          0          0          0  XIVE-IRQ 4612 Level     eth1 (*)
>>>>   23:          0          0          0          0          0          0          0          0  XIVE-IRQ 4096 Edge      RAS_EPOW
>>>>   24:          0          0          0          0          0          0          0          0  XIVE-IRQ 4592 Edge      hvc_console
>>>>   26:          0          0          0          0          0          0          0          0  XIVE-IRQ 4097 Edge      RAS_HOTPLUG
>>>>
>>>>> Changing any one of those items appears to avoid the glitch, e.g. XICS
>>>>
>>>> XICS is very different from XIVE. The driver implements the previous
>>>> interrupt controller architecture (P5-P8) and the hypervisor mediates
>>>> the delivery to the guest. With XIVE, vCPUs are directly signaled by
>>>> the IC. When under KVM, we use different KVM devices for each mode :
>>>>
>>>> * KVM XIVE is a XICS-on-XIVE implementation (P9/P10 hosts) for guests
>>>>    not using the XIVE native interface. RHEL7 for instance.
>>>> * KVM XIVE native is a XIVE implementation (P9/P10 hosts) for guests
>>>>    using the XIVE native interface. Linux > 4.14.
>>>> * KVM XICS is for P8 hosts (no XIVE HW)
>>>>
>>>> VFIO adds some complexity with the source events. I think the problem
>>>> comes from the assertion state. I will talk about it later.
>>>>
>>>>> mode with the in-kernel IRQ chip works (all interrupts are passed
>>>>> through),
>>>>
>>>> All interrupts are passed through using XIVE also. Run 'info pic' in
>>>> the monitor. On the host, check the IRQ mapping in the debugfs file :
>>>>
>>>>    /sys/kernel/debug/powerpc/kvm-xive-*
>>>>
>>>>> and XIVE mode with the in-kernel IRQ chip disabled also works. 
>>>>
>>>> In that case, no KVM device backs the QEMU device and all state
>>>> is in one place.
>>>>
>>>>> We
>>>>> are also not seeing any problems in XIVE mode with the in-kernel
>>>>> chip from MSI/MSI-X devices.
>>>>
>>>> Yes. pass-through devices are expected to operate correctly :)
>>>>
>>>>> The device in question is a real time card that needs to raise an
>>>>> interrupt every 1ms.  It works perfectly on the host, but fails in
>>>>> the guest -- with the in-kernel IRQ chip and XIVE enabled, it
>>>>> receives exactly one interrupt, at which point the host continues to
>>>>> see INTx+ but the guest sees INTX-, and the IRQ handler in the guest
>>>>> kernel is never reentered.
>>>>
>>>> ok. Same symptom as the scenario above.
>>>>
>>>>> We have also seen some very rare glitches where, over a long period
>>>>> of time, we can enter a similar deadlock in XICS mode.
>>>>
>>>> with the in-kernel XICS IRQ chip ?
>>>>
>>>>> Disabling
>>>>> the in-kernel IRQ chip in XIVE mode will also lead to the lockup
>>>>> with this device, since the userspace IRQ emulation cannot keep up
>>>>> with the rapid interrupt firing (measurements show around 100ms
>>>>> required for processing each interrupt in the user mode).
>>>>
>>>> MSI emulation in QEMU is slower indeed (35%). LSI is very slow because
>>>> it is handled as a special case in the device/driver. To maintain the
>>>> assertion state, all LSI handling is done with a special HCALL :
>>>> H_INT_ESB which is implemented in QEMU. This generates a lot of back
>>>> and forth in the KVM stack.
>>>>
>>>>> My understanding is the resample mechanism does some clever tricks
>>>>> with level IRQs, but that QEMU needs to check if the IRQ is still
>>>>> asserted by the device on guest EOI.
>>>>
>>>> Yes. the problem is in that area.
>>>>
>>>>> Since a failure here would
>>>>> explain these symptoms I'm wondering if there is a bug in either
>>>>> QEMU or KVM for POWER / pSeries (SPAPr) where the IRQ is not
>>>>> resampled and therefore not re-fired in the guest?
>>>>
>>>> KVM I would say. The assertion state is maintained in KVM for the KVM
>>>> XICS-on-XIVE implementation and in QEMU for the KVM XIVE native
>>>> device. These are good candidates. I will take a look.
>>>
>>> All works fine with KVM_CAP_IRQFD_RESAMPLE=false in QEMU. Can you please
>>> try this workaround for now ? I could reach 934 Mbits/sec on the 
>>> passthru
>>> device.
>>>
>>> I clearly overlooked that part and it has been 3 years.
>>
>>
>> Disabling KVM_CAP_IRQFD_RESAMPLE on XIVE-capable machines seems to be 
>> the right fix actually.
>>
>> XIVE == baremetal/vm POWER9 and newer.
>> XICS == baremetal/vm POWER8 and older, or VMs on any POWER (backward 
>> compat.).
> 
> yes. You can force XICS on POWER9 using 'max-cpu-compat' or 'ic-mode'.
> 
>> Tested on POWER9 with a passed through XHCI host and "-append 
>> pci=nomsi" and "-machine pseries,ic-mode=xics,kernel_irqchip=on" (and 
>> s/xics/xive/).
> 
> ok. This is deactivating the default XIVE (P9+) mode at the platform level,
> hence forcing the XICS (P8) mode in a POWER9 guest running on a POWER9 
> host.
> It is also deactivating MSI, forcing INTx usage in the kernel and forcing
> the use of the KVM irqchip device to make sure we are not emulating in 
> QEMU.
> 
> We are far from the default scenario but this is it !


well, "-machine pseries,ic-mode=xive,kernel_irqchip=on" is the default. 
"pci=nomsi" is not, but since that actual device is only capable of INTx, 
the default settings expose the problem.



>> When it is XIVE-on-XIVE (host and guest are XIVE), 
> 
> We call this mode : XIVE native, or exploitation, mode. Anyhow, it is 
> always
> XIVE under the hood on a POWER9/POWER10 box.
> 
>> INTx is emulated in the QEMU's H_INT_ESB handler 
> 
> LSI are indeed all handled at the QEMU level with the H_INT_ESB hcall.
> If I remember well, this is because we wanted a simple way to synthesize
> the interrupt trigger upon EOI when the level is still asserted. Doing
> this way is compatible for both kernel_irqchip=off/on modes because the
> level is maintained in QEMU.
> 
> This is different for the other two XICS KVM devices which maintain the
> assertion level in KVM.
> 
>> and IRQFD_RESAMPLE is just useless in such case (as it is designed to 
>> eliminate going to the userspace for the EOI->INTx unmasking) and 
>> there is no pathway to call the eventfd's irqfd_resampler_ack() from 
>> QEMU. So the VM's XHCI device receives exactly 1 interrupt and that is 
>> it. "kernel_irqchip=off" fixes it (obviously).
> 
> yes.
> 
>> When it is XICS-on-XIVE (host is XIVE and guest is XICS), 
> 
> yes (FYI, we have similar glue in skiboot ...)
> 
>> then the VM receives 100000 interrupts and then it gets frozen 
>> (__report_bad_irq() is called). Which happens because (unlike 
>> XICS-on-XICS), the host XIVE's xive_(rm|vm)_h_eoi() does not call 
>> irqfd_resampler_ack(). This fixes it:
>>
>> =============
>> diff --git a/arch/powerpc/kvm/book3s_xive_template.c 
>> b/arch/powerpc/kvm/book3s_xive_template.c
>> index b0015e05d99a..9f0d8e5c7f4b 100644
>> --- a/arch/powerpc/kvm/book3s_xive_template.c
>> +++ b/arch/powerpc/kvm/book3s_xive_template.c
>> @@ -595,6 +595,8 @@ X_STATIC int GLUE(X_PFX,h_eoi)(struct kvm_vcpu 
>> *vcpu, unsigned long xirr)
>>          xc->hw_cppr = xc->cppr;
>>          __x_writeb(xc->cppr, __x_tima + TM_QW1_OS + TM_CPPR);
>>
>>
>> +       kvm_notify_acked_irq(vcpu->kvm, 0, irq);
>> +
>>          return rc;
>>   }
>> =============
> 
> OK. XICS-on-XIVE is also broken then :/ what about XIVE-on-XIVE ?


Not sure I am following (or you are) :) INTx is broken on P9 in either 
mode. MSI works in both.

> 
>>
>> The host's XICS does call kvm_notify_acked_irq() (I did not test that 
>> but the code seems to be doing so).
>>
>> After re-reading what I just wrote, I am leaning towards disabling use 
>> of KVM_CAP_IRQFD_RESAMPLE as it seems last worked on POWER8 and never 
>> since :)
> 
> and it would fix XIVE-on-XIVE.
> 
> Are you saying that passthru on POWER8 is broken ? fully or only INTx ?


No, the opposite - P8 works fine, kvm_notify_acked_irq() is there.


>> Did I miss something in the picture (hey Cedric)?
> 
> You seem to have all combinations in mind: host OS, KVM, QEMU, guest OS
> 
> For the record, here is a documentation we did:
> 
>    https://qemu.readthedocs.io/en/latest/specs/ppc-spapr-xive.html
> 
> It might need some updates.

When I read this, a quote from The Simpsons pops into my mind: “Dear 
Mr. President, there are too many states nowadays. Please eliminate 
three. I am NOT a crackpot.” :)

> 
> Thanks,
> 
> C.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: XIVE VFIO kernel resample failure in INTx mode under heavy load
  2022-04-19  1:55           ` Alexey Kardashevskiy
@ 2022-04-19  7:35             ` Cédric Le Goater
  0 siblings, 0 replies; 15+ messages in thread
From: Cédric Le Goater @ 2022-04-19  7:35 UTC (permalink / raw)
  To: Alexey Kardashevskiy, Alex Williamson, Timothy Pearson
  Cc: Frederic Barrat, list@suse.de:PowerPC, qemu-devel,
	Nicholas Piggin, David Gibson

>>> Tested on POWER9 with a passed through XHCI host and "-append pci=nomsi" and "-machine pseries,ic-mode=xics,kernel_irqchip=on" (and s/xics/xive/).
>>
>> ok. This is deactivating the default XIVE (P9+) mode at the platform level,
>> hence forcing the XICS (P8) mode in a POWER9 guest running on a POWER9 host.
>> It is also deactivating MSI, forcing INTx usage in the kernel and forcing
>> the use of the KVM irqchip device to make sure we are not emulating in QEMU.
>>
>> We are far from the default scenario but this is it !
> 
> 
> well, "-machine pseries,ic-mode=xive,kernel_irqchip=on" is the default. 

The default is a 'dual' ic-mode, so XICS+XIVE are announced by CAS.
kernel_irqchip is not strictly enforced, so QEMU could fall back to
an emulated irqchip if needed.

> "pci=nomsi" is not, but since that actual device is only capable of INTx,
> the default settings expose the problem.
> 
> 
> 
>>> When it is XIVE-on-XIVE (host and guest are XIVE), 
>>
>> We call this mode : XIVE native, or exploitation, mode. Anyhow, it is always
>> XIVE under the hood on a POWER9/POWER10 box.
>>
>>> INTx is emulated in the QEMU's H_INT_ESB handler 
>>
>> LSI are indeed all handled at the QEMU level with the H_INT_ESB hcall.
>> If I remember well, this is because we wanted a simple way to synthesize
>> the interrupt trigger upon EOI when the level is still asserted. Doing
>> this way is compatible for both kernel_irqchip=off/on modes because the
>> level is maintained in QEMU.
>>
>> This is different for the other two XICS KVM devices which maintain the
>> assertion level in KVM.
>>
>>> and IRQFD_RESAMPLE is just useless in such a case (as it is designed to eliminate going to userspace for the EOI->INTx unmasking) and there is no pathway to call the eventfd's irqfd_resampler_ack() from QEMU. So the VM's XHCI device receives exactly 1 interrupt and that is it. "kernel_irqchip=off" fixes it (obviously).
>>
>> yes.
>>
>>> When it is XICS-on-XIVE (host is XIVE and guest is XICS), 
>>
>> yes (FYI, we have similar glue in skiboot ...)
>>
>>> then the VM receives 100000 interrupts and then it gets frozen (__report_bad_irq() is called). Which happens because (unlike XICS-on-XICS), the host XIVE's xive_(rm|vm)_h_eoi() does not call irqfd_resampler_ack(). This fixes it:
>>>
>>> =============
>>> diff --git a/arch/powerpc/kvm/book3s_xive_template.c b/arch/powerpc/kvm/book3s_xive_template.c
>>> index b0015e05d99a..9f0d8e5c7f4b 100644
>>> --- a/arch/powerpc/kvm/book3s_xive_template.c
>>> +++ b/arch/powerpc/kvm/book3s_xive_template.c
>>> @@ -595,6 +595,8 @@ X_STATIC int GLUE(X_PFX,h_eoi)(struct kvm_vcpu *vcpu, unsigned long xirr)
>>>          xc->hw_cppr = xc->cppr;
>>>          __x_writeb(xc->cppr, __x_tima + TM_QW1_OS + TM_CPPR);
>>>
>>>
>>> +       kvm_notify_acked_irq(vcpu->kvm, 0, irq);
>>> +
>>>          return rc;
>>>   }
>>> =============
>>
>> OK. XICS-on-XIVE is also broken then :/ what about XIVE-on-XIVE?
> 
> 
> Not sure I am following (or you are) :) INTx is broken on P9 in either mode. MSI works in both.

Sorry, my question was not clear. The above fixed XICS-on-XIVE but
not XIVE-on-XIVE, and I was asking about that. Disabling resample
seems to be the solution for all.

>>
>>>
>>> The host's XICS does call kvm_notify_acked_irq() (I did not test that but the code seems to be doing so).
>>>
>>> After re-reading what I just wrote, I am leaning towards disabling use of KVM_CAP_IRQFD_RESAMPLE as it seems last worked on POWER8 and never since :)
>>
>> and it would fix XIVE-on-XIVE.
>>
>> Are you saying that passthru on POWER8 is broken? Fully or only INTx?
> 
> 
> No, the opposite - P8 works fine, kvm_notify_acked_irq() is there.
> 
> 
>>> Did I miss something in the picture (hey Cedric)?
>>
>> You seem to have all combination in mind: host OS, KVM, QEMU, guest OS
>>
>> For the record, here is a documentation we did:
>>
>>    https://qemu.readthedocs.io/en/latest/specs/ppc-spapr-xive.html
>>
>> It might need some updates.
> 
> When I read this, a quote from The Simpsons pops up in my mind: “Dear Mr. President there are too many states nowadays. Please eliminate three. I am NOT a crackpot.” :)

Yes. It blew my mind for some time ... :)

Thanks,

C.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: XIVE VFIO kernel resample failure in INTx mode under heavy load
  2022-04-14 12:41           ` Cédric Le Goater
@ 2022-04-21  3:07             ` Alexey Kardashevskiy
  2022-04-21  6:35               ` Cédric Le Goater
  0 siblings, 1 reply; 15+ messages in thread
From: Alexey Kardashevskiy @ 2022-04-21  3:07 UTC (permalink / raw)
  To: Cédric Le Goater, Alex Williamson, Timothy Pearson
  Cc: Frederic Barrat, list@suse.de:PowerPC, qemu-devel,
	Nicholas Piggin, David Gibson



On 14/04/2022 22:41, Cédric Le Goater wrote:
> 
>>> After re-reading what I just wrote, I am leaning towards disabling 
>>> use of KVM_CAP_IRQFD_RESAMPLE as it seems last worked on POWER8 and 
>>> never since :)
>>>
>>> Did I miss something in the picture (hey Cedric)?
>>
>> How about disabling it like this?
>>
>> =====
>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>> index 5bfd4aa9e5aa..c999f7b1ab1b 100644
>> --- a/hw/ppc/spapr_pci.c
>> +++ b/hw/ppc/spapr_pci.c
>> @@ -732,7 +732,7 @@ static PCIINTxRoute spapr_route_intx_pin_to_irq(void *opaque, int pin)
>>       SpaprPhbState *sphb = SPAPR_PCI_HOST_BRIDGE(opaque);
>>       PCIINTxRoute route;
>>
>> -    route.mode = PCI_INTX_ENABLED;
>> +    route.mode = PCI_INTX_DISABLED;
>>
>> =====
> 
> I like it.


The only thing is that this resampling works on POWER8/XICS and removing 
it there is not great. So far the sPAPR PHB was unaware of the underlying 
interrupt controller, was it not?


> 
> You now know how to test all the combinations :) Prepare your matrix,
> variables are :
> 
>   * Host OS        POWER8, POWER9+
>   * KVM device        XICS (P8), XICS-on-XIVE (P9), XIVE-on-XIVE (P9)
>   * kernel_irqchip    off, on
>   * ic-mode        xics, xive
>   * Guest OS        msi or nomsi
> 
> Ideally you should check TCG, but that's like kernel_irqchip=off.
> 
> Cheers,
> 
> C.
> 
>>
>> (btw what the heck is PCI_INTX_INVERTED for?)
>>
>>
>> -- 
>> Alexey
> 


--
Alexey



* Re: XIVE VFIO kernel resample failure in INTx mode under heavy load
  2022-04-21  3:07             ` Alexey Kardashevskiy
@ 2022-04-21  6:35               ` Cédric Le Goater
  2024-04-15 16:33                 ` Timothy Pearson
  0 siblings, 1 reply; 15+ messages in thread
From: Cédric Le Goater @ 2022-04-21  6:35 UTC (permalink / raw)
  To: Alexey Kardashevskiy, Alex Williamson, Timothy Pearson
  Cc: Frederic Barrat, list@suse.de:PowerPC, qemu-devel,
	Nicholas Piggin, David Gibson

On 4/21/22 05:07, Alexey Kardashevskiy wrote:
> 
> 
> On 14/04/2022 22:41, Cédric Le Goater wrote:
>>
>>>> After re-reading what I just wrote, I am leaning towards disabling use of KVM_CAP_IRQFD_RESAMPLE as it seems last worked on POWER8 and never since :)
>>>>
>>>> Did I miss something in the picture (hey Cedric)?
>>>
>>> How about disabling it like this?
>>>
>>> =====
>>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>>> index 5bfd4aa9e5aa..c999f7b1ab1b 100644
>>> --- a/hw/ppc/spapr_pci.c
>>> +++ b/hw/ppc/spapr_pci.c
>>> @@ -732,7 +732,7 @@ static PCIINTxRoute spapr_route_intx_pin_to_irq(void *opaque, int pin)
>>>       SpaprPhbState *sphb = SPAPR_PCI_HOST_BRIDGE(opaque);
>>>       PCIINTxRoute route;
>>>
>>> -    route.mode = PCI_INTX_ENABLED;
>>> +    route.mode = PCI_INTX_DISABLED;
>>>
>>> =====
>>
>> I like it.
> 
> 
> The only thing is that this resampling works on POWER8/XICS and 
> removing it there is not great. So far the sPAPR PHB was unaware of 
> the underlying interrupt controller, was it not?

It is. The dynamic change of the underlying irqchip in QEMU and
in KVM required that for CAS. Of course, plenty is done behind the
devices' backs when this happens; see spapr_irq.

There are some quirks related to LPM with VIO devices in Linux.
This is the only case I know about.

Thanks,

C.



* Re: XIVE VFIO kernel resample failure in INTx mode under heavy load
  2022-04-21  6:35               ` Cédric Le Goater
@ 2024-04-15 16:33                 ` Timothy Pearson
  0 siblings, 0 replies; 15+ messages in thread
From: Timothy Pearson @ 2024-04-15 16:33 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Alexey Kardashevskiy, Alex Williamson, Timothy Pearson,
	list@suse.de:PowerPC, qemu-devel, Frederic Barrat, npiggin,
	David Gibson



----- Original Message -----
> From: "Cédric Le Goater" <clg@kaod.org>
> To: "Alexey Kardashevskiy" <aik@ozlabs.ru>, "Alex Williamson" <alex.williamson@redhat.com>, "Timothy Pearson"
> <tpearson@raptorengineering.com>
> Cc: "list@suse.de:PowerPC" <qemu-ppc@nongnu.org>, "qemu-devel" <qemu-devel@nongnu.org>, "Frederic Barrat"
> <fbarrat@linux.ibm.com>, "npiggin" <npiggin@gmail.com>, "David Gibson" <david@gibson.dropbear.id.au>
> Sent: Thursday, April 21, 2022 1:35:50 AM
> Subject: Re: XIVE VFIO kernel resample failure in INTx mode under heavy load

> On 4/21/22 05:07, Alexey Kardashevskiy wrote:
>> 
>> 
>> On 14/04/2022 22:41, Cédric Le Goater wrote:
>>>
>>>>> After re-reading what I just wrote, I am leaning towards disabling use of
>>>>> KVM_CAP_IRQFD_RESAMPLE as it seems last worked on POWER8 and never since :)
>>>>>
>>>>> Did I miss something in the picture (hey Cedric)?
>>>>
>>>> How about disabling it like this?
>>>>
>>>> =====
>>>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>>>> index 5bfd4aa9e5aa..c999f7b1ab1b 100644
>>>> --- a/hw/ppc/spapr_pci.c
>>>> +++ b/hw/ppc/spapr_pci.c
>>>> @@ -732,7 +732,7 @@ static PCIINTxRoute spapr_route_intx_pin_to_irq(void *opaque, int pin)
>>>>       SpaprPhbState *sphb = SPAPR_PCI_HOST_BRIDGE(opaque);
>>>>       PCIINTxRoute route;
>>>>
>>>> -    route.mode = PCI_INTX_ENABLED;
>>>> +    route.mode = PCI_INTX_DISABLED;
>>>>
>>>> =====
>>>
>>> I like it.
>> 
>> 
>> The only thing is that this resampling works on POWER8/XICS and
>> removing it there is not great. So far the sPAPR PHB was unaware of
>> the underlying interrupt controller, was it not?
> 
> It is. The dynamic change of the underlying irqchip in QEMU and
> in KVM required that for CAS. Of course, plenty is done behind the
> devices' backs when this happens; see spapr_irq.
> 
> There are some quirks related to LPM with VIO devices in Linux.
> This is the only case I know about.
> 
> Thanks,
> 
> C.

Unfortunately this remains quite broken, and after a kernel upgrade (including the purported fix [1]) and a QEMU upgrade, we have now completely lost the ability to get the card working in the guest with *any* combination of parameters.

In guest XIVE mode with irqchip on it passes through a handful of interrupts, then dies.  In guest XICS mode we're dropping the majority of the interrupts.  This is all on POWER9.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/powerpc/kvm/powerpc.c?id=52882b9c7a761b2b4e44717d6fbd1ed94c601b7f



end of thread, other threads:[~2024-04-15 16:34 UTC | newest]

Thread overview: 15+ messages
2022-03-11 18:35 XIVE VFIO kernel resample failure in INTx mode under heavy load Timothy Pearson
2022-03-11 18:53 ` Timothy Pearson
2022-03-14 22:09 ` Alex Williamson
2022-03-16 16:29   ` Cédric Le Goater
2022-03-16 19:16     ` Cédric Le Goater
2022-04-13  4:56       ` Alexey Kardashevskiy
2022-04-13  6:26         ` Alexey Kardashevskiy
2022-04-14 12:41           ` Cédric Le Goater
2022-04-21  3:07             ` Alexey Kardashevskiy
2022-04-21  6:35               ` Cédric Le Goater
2024-04-15 16:33                 ` Timothy Pearson
2022-04-14 12:31         ` Cédric Le Goater
2022-04-19  1:55           ` Alexey Kardashevskiy
2022-04-19  7:35             ` Cédric Le Goater
2022-03-22  8:31   ` Alexey Kardashevskiy
