From: "Cédric Le Goater" <clg@kaod.org>
To: Alex Williamson <alex.williamson@redhat.com>,
	Timothy Pearson <tpearson@raptorengineering.com>
Cc: Alexey Kardashevskiy <aik@ozlabs.ru>,
	"list@suse.de:PowerPC" <qemu-ppc@nongnu.org>,
	qemu-devel@nongnu.org
Subject: Re: XIVE VFIO kernel resample failure in INTx mode under heavy load
Date: Wed, 16 Mar 2022 20:16:39 +0100
Message-ID: <6f0a92ca-9f53-b8b8-e85d-43f4da36200d@kaod.org> (raw)
In-Reply-To: <9638ec8f-2edf-97df-0c14-95ae2344dc70@kaod.org>

Timothy,

On 3/16/22 17:29, Cédric Le Goater wrote:
> Hello,
> 
> 
>> I've been struggling for some time with what is looking like a
>> potential bug in QEMU/KVM on the POWER9 platform.  It appears that
>> in XIVE mode, when the in-kernel IRQ chip is enabled, an external
>> device that rapidly asserts IRQs via the legacy INTx level mechanism
>> will only receive one interrupt in the KVM guest.
> 
> Indeed. I could reproduce with a pass-through PCI adapter using
> 'pci=nomsi'. The virtio devices operate correctly but the network
> adapter only receives one event (*):
> 
> 
> $ cat /proc/interrupts
>             CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
>   16:       2198       1378       1519       1216          0          0          0          0  XIVE-IPI   0 Edge      IPI-0
>   17:          0          0          0          0       2003       1936       1335       1507  XIVE-IPI   1 Edge      IPI-1
>   18:          0       6401          0          0          0          0          0          0  XIVE-IRQ 4609 Level     virtio3, virtio0, virtio2
>   19:          0          0          0          0          0        204          0          0  XIVE-IRQ 4610 Level     virtio1
>   20:          0          0          0          0          0          0          0          0  XIVE-IRQ 4608 Level     xhci-hcd:usb1
>   21:          0          1          0          0          0          0          0          0  XIVE-IRQ 4612 Level     eth1 (*)
>   23:          0          0          0          0          0          0          0          0  XIVE-IRQ 4096 Edge      RAS_EPOW
>   24:          0          0          0          0          0          0          0          0  XIVE-IRQ 4592 Edge      hvc_console
>   26:          0          0          0          0          0          0          0          0  XIVE-IRQ 4097 Edge      RAS_HOTPLUG
> 
>> Changing any one of those items appears to avoid the glitch, e.g. XICS
> 
> XICS is very different from XIVE. The driver implements the previous
> interrupt controller architecture (P5-P8) and the hypervisor mediates
> the delivery to the guest. With XIVE, vCPUs are directly signaled by
> the IC. When under KVM, we use different KVM devices for each mode:
> 
> * KVM XIVE is a XICS-on-XIVE implementation (P9/P10 hosts) for guests
>    not using the XIVE native interface. RHEL7 for instance.
> * KVM XIVE native is a XIVE implementation (P9/P10 hosts) for guests
>    using the XIVE native interface. Linux > 4.14.
> * KVM XICS is for P8 hosts (no XIVE HW)
> 
> VFIO adds some complexity with the source events. I think the problem
> comes from the assertion state. I will talk about it later.
> 
>> mode with the in-kernel IRQ chip works (all interrupts are passed
>> through),
> 
> All interrupts are passed through using XIVE also. Run 'info pic' in
> the monitor. On the host, check the IRQ mapping in the debugfs files:
> 
>    /sys/kernel/debug/powerpc/kvm-xive-*
> 
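[ A quick host-side check of that mapping could look like the sketch
below. The debugfs glob is the one quoted above; the exact entry names
depend on the kernel version, the files need root and a mounted
debugfs, and they only exist while a guest with an in-kernel XIVE
device is running, so the loop prints a note when nothing matches. ]

```shell
# Dump the KVM XIVE IRQ mapping for every running guest, if any.
# Prints a note when no kvm-xive debugfs entries exist on this host.
for f in /sys/kernel/debug/powerpc/kvm-xive-*; do
    if [ -e "$f" ]; then
        echo "== $f =="
        cat "$f"
    else
        echo "no kvm-xive debugfs entries on this host"
    fi
done
```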
>> and XIVE mode with the in-kernel IRQ chip disabled also works. 
> 
> In that case, no KVM device backs the QEMU device and all state
> is in one place.
> 
>> We
>> are also not seeing any problems in XIVE mode with the in-kernel
>> chip from MSI/MSI-X devices.
> 
> Yes. MSI/MSI-X pass-through devices are expected to operate correctly :)
> 
>> The device in question is a real time card that needs to raise an
>> interrupt every 1ms.  It works perfectly on the host, but fails in
>> the guest -- with the in-kernel IRQ chip and XIVE enabled, it
>> receives exactly one interrupt, at which point the host continues to
>> see INTx+ but the guest sees INTx-, and the IRQ handler in the guest
>> kernel is never reentered.
> 
> ok. Same symptom as the scenario above.
> 
>> We have also seen some very rare glitches where, over a long period
>> of time, we can enter a similar deadlock in XICS mode.
> 
> with the in-kernel XICS IRQ chip ?
> 
>> Disabling
>> the in-kernel IRQ chip in XIVE mode will also lead to the lockup
>> with this device, since the userspace IRQ emulation cannot keep up
>> with the rapid interrupt firing (measurements show around 100ms
>> required for processing each interrupt in the user mode).
> 
> MSI emulation in QEMU is indeed slower (about 35%). LSI is much
> slower because it is handled as a special case in the device/driver.
> To maintain the assertion state, all LSI handling goes through a
> special hcall, H_INT_ESB, which is implemented in QEMU. This
> generates a lot of back and forth in the KVM stack.
> 
>> My understanding is the resample mechanism does some clever tricks
>> with level IRQs, but that QEMU needs to check if the IRQ is still
>> asserted by the device on guest EOI.
> 
> Yes. the problem is in that area.
> 
>> Since a failure here would
>> explain these symptoms I'm wondering if there is a bug in either
>> QEMU or KVM for POWER / pSeries (SPAPr) where the IRQ is not
>> resampled and therefore not re-fired in the guest?
> 
> KVM I would say. The assertion state is maintained in KVM for the KVM
> XICS-on-XIVE implementation and in QEMU for the KVM XIVE native
> device. These are good candidates. I will take a look.

Everything works fine with KVM_CAP_IRQFD_RESAMPLE set to false in QEMU.
Can you please try this workaround for now? I could reach 934 Mbits/sec
on the passthrough device.

I clearly overlooked that part and it has been 3 years.

Thanks,

C.




Thread overview: 15+ messages
2022-03-11 18:35 XIVE VFIO kernel resample failure in INTx mode under heavy load Timothy Pearson
2022-03-11 18:53 ` Timothy Pearson
2022-03-14 22:09 ` Alex Williamson
2022-03-16 16:29   ` Cédric Le Goater
2022-03-16 19:16     ` Cédric Le Goater [this message]
2022-04-13  4:56       ` Alexey Kardashevskiy
2022-04-13  6:26         ` Alexey Kardashevskiy
2022-04-14 12:41           ` Cédric Le Goater
2022-04-21  3:07             ` Alexey Kardashevskiy
2022-04-21  6:35               ` Cédric Le Goater
2024-04-15 16:33                 ` Timothy Pearson
2022-04-14 12:31         ` Cédric Le Goater
2022-04-19  1:55           ` Alexey Kardashevskiy
2022-04-19  7:35             ` Cédric Le Goater
2022-03-22  8:31   ` Alexey Kardashevskiy
