* KVM pci-assign - iommu width is not sufficient for mapped address
@ 2016-01-07 10:18 Shyam
  2016-01-07 14:10 ` Alex Williamson
  0 siblings, 1 reply; 9+ messages in thread
From: Shyam @ 2016-01-07 10:18 UTC (permalink / raw)
  To: kvm

Hi All,

We are using Linux Kernel 3.18.19 for running KVM VM's with
pci-assign'ed SRIOV VF interfaces.

We understand that VFIO is the new recommended way, but unfortunately
it reduces performance significantly on our IO workloads (up to the
order of 40-50%) when compared to pci-assign. We run trusted VMs &
expose services to the external world. Since we control the VMs, the
IOMMU security that VFIO provides is not strictly mandatory for us,
but the performance we get with pci-assign matters much more.

We observe a strange behaviour that has already been discussed on this
list: spawning a VM triggers the following fault, which results in
qemu-kvm crashing

Jan  7 09:41:57 q6-s1 kernel: [90037.228477] intel_iommu_map: iommu
width (48) is not sufficient for the mapped address (fffffffffe001000)
Jan  7 09:41:57 q6-s1 kernel: [90037.308229]
kvm_iommu_map_address:iommu failed to map pfn=95000

We observe that this problem happens only when the guest runs a Linux
kernel up to 3.5 (guests on 3.2 through 3.5 all cause the above
crash), and it doesn't happen with any guest kernel >= 3.6.

So something changed in the guest between kernels 3.5 and 3.6 that
stops exposing this problem. We have two questions:
1 - we understand that VFIO suffered a similar problem & it was fixed
with https://github.com/qemu/qemu/commit/d3a2fd9b29e43e202315d5e99399b99622469c4a.
Alex Williamson suggested that the KVM driver needs an equivalent
version of the fix. Can anybody offer hints on where this fix should
be made?
2 - Any insight into what changed in the Linux kernel between 3.5 and
3.6 on the guest side that avoids this problem?

Any help/input greatly appreciated. Thanks!

--Shyam
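
[Editorial note: the numbers in the fault line can be sanity-checked
directly. A 48-bit IOMMU can only map IOVAs below 2^48, while the
logged address 0xfffffffffe001000 needs all 64 bits. A quick
illustrative check, not part of the original report:]

```python
# The intel_iommu_map message complains that the mapped address does
# not fit within the IOMMU's reported 48-bit address width.
addr = 0xfffffffffe001000   # address from the fault message above
iommu_width = 48            # bits, from the same message

print(addr.bit_length())         # 64: this address needs 64 bits
print(addr >= 1 << iommu_width)  # True: outside the 48-bit window
```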

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: KVM pci-assign - iommu width is not sufficient for mapped address
  2016-01-07 10:18 KVM pci-assign - iommu width is not sufficient for mapped address Shyam
@ 2016-01-07 14:10 ` Alex Williamson
  2016-01-08  4:17   ` Shyam
  0 siblings, 1 reply; 9+ messages in thread
From: Alex Williamson @ 2016-01-07 14:10 UTC (permalink / raw)
  To: Shyam, kvm

On Thu, 2016-01-07 at 15:48 +0530, Shyam wrote:
> Hi All,
> 
> We are using Linux Kernel 3.18.19 for running KVM VM's with
> pci-assign'ed SRIOV VF interfaces.
> 
> We understand that VFIO is the new recommended way, but unfortunately
> it reduces performance significantly on our IO workloads (upto the
> order of 40-50%) when compared to pci-passthrough. We run trusted
> VM's
> & expose services to the external world. Since we control the VM's,
> IOMMU security with VFIO is not exactly mandatory, but performance is
> much more important that we get with pci-assign.
> 
> We observe a strange behaviour that has already been discussed in
> this
> forum which is upon a VM spawn it causes the following fault
> resulting
> in qemu-kvm crashing
> 
> Jan  7 09:41:57 q6-s1 kernel: [90037.228477] intel_iommu_map: iommu
> width (48) is not sufficient for the mapped address
> (fffffffffe001000)
> Jan  7 09:41:57 q6-s1 kernel: [90037.308229]
> kvm_iommu_map_address:iommu failed to map pfn=95000
> 
> We observe that this problem happens only if guest linux running 3.5
> kernel is spun up & this problem doesnt happen when running guest
> linux with 3.6 kernel (i.e. all guest with kernels like 3.2 etc up
> till 3.5 causes the above crash whereas any guest kernel >=3.6 doesnt
> cause this issue).
> 
> So something changed between kernel 3.5 to 3.6 in the guest that
> doesnt expose this problem. We have two questions:
> 1 - we understand that VFIO suffered a similar problem & it was fixed
> with https://github.com/qemu/qemu/commit/d3a2fd9b29e43e202315d5e99399
> b99622469c4a.
> Alex Williamson suggested that KVM driver needs an equivalent version
> of the fix. Can anybody suggest hints on where this fix should be
> made?
> 2 - Any insights on what changes in linux kernel between 3.5 to 3.6
> on
> the guest that avoids this problem?
> 
> Any helps/input greatly appreciated. Thanks!

Legacy KVM device assignment is deprecated, so I'd suggest your efforts
are better spent reporting and trying to fix any performance difference
you're seeing between pci-assign and vfio-pci.  I have a really hard
time believing there's anywhere close to a 40-50% difference.  What's
the device?  What's the workload?  At some point you're likely to find
that pci-assign is no longer even present.  Thanks,

Alex


* Re: KVM pci-assign - iommu width is not sufficient for mapped address
  2016-01-07 14:10 ` Alex Williamson
@ 2016-01-08  4:17   ` Shyam
  2016-01-08  4:53     ` Alex Williamson
  0 siblings, 1 reply; 9+ messages in thread
From: Shyam @ 2016-01-08  4:17 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm

Hi Alex,

Thanks for your inputs.

We are using Mellanox ConnectX-3 SRIOV-capable NICs with iSER. We
assign these VFs to the VM. The VM connects to a few SSD drives
through iSER. For this performance test, we expose the same SSDs
through iSER from the VM to external servers & run vdbench 4K
read/write workloads; this is where we see the significant performance
drop when using vfio. These VMs have 8 hyper-threads from an Intel
E5-2680 v3 host & 32GB RAM. The key observation is that with vfio the
CPU saturates much earlier & hence doesn't allow us to scale IOPS.

I will open a separate mail thread about this performance degradation
using vfio, with numbers. In the meantime, if you can suggest how to
dig into the performance issue, or which logs you would prefer for
VFIO debugging, that will help me gather the needed info for you.

Thanks.

--Shyam

On Thu, Jan 7, 2016 at 7:40 PM, Alex Williamson
<alex.williamson@redhat.com> wrote:
> On Thu, 2016-01-07 at 15:48 +0530, Shyam wrote:
>> Hi All,
>>
>> We are using Linux Kernel 3.18.19 for running KVM VM's with
>> pci-assign'ed SRIOV VF interfaces.
>>
>> We understand that VFIO is the new recommended way, but unfortunately
>> it reduces performance significantly on our IO workloads (upto the
>> order of 40-50%) when compared to pci-passthrough. We run trusted
>> VM's
>> & expose services to the external world. Since we control the VM's,
>> IOMMU security with VFIO is not exactly mandatory, but performance is
>> much more important that we get with pci-assign.
>>
>> We observe a strange behaviour that has already been discussed in
>> this
>> forum which is upon a VM spawn it causes the following fault
>> resulting
>> in qemu-kvm crashing
>>
>> Jan  7 09:41:57 q6-s1 kernel: [90037.228477] intel_iommu_map: iommu
>> width (48) is not sufficient for the mapped address
>> (fffffffffe001000)
>> Jan  7 09:41:57 q6-s1 kernel: [90037.308229]
>> kvm_iommu_map_address:iommu failed to map pfn=95000
>>
>> We observe that this problem happens only if guest linux running 3.5
>> kernel is spun up & this problem doesnt happen when running guest
>> linux with 3.6 kernel (i.e. all guest with kernels like 3.2 etc up
>> till 3.5 causes the above crash whereas any guest kernel >=3.6 doesnt
>> cause this issue).
>>
>> So something changed between kernel 3.5 to 3.6 in the guest that
>> doesnt expose this problem. We have two questions:
>> 1 - we understand that VFIO suffered a similar problem & it was fixed
>> with https://github.com/qemu/qemu/commit/d3a2fd9b29e43e202315d5e99399
>> b99622469c4a.
>> Alex Williamson suggested that KVM driver needs an equivalent version
>> of the fix. Can anybody suggest hints on where this fix should be
>> made?
>> 2 - Any insights on what changes in linux kernel between 3.5 to 3.6
>> on
>> the guest that avoids this problem?
>>
>> Any helps/input greatly appreciated. Thanks!
>
> Legacy KVM device assignment is deprecated, so I'd suggest your efforts
> are better spent reporting and trying to fix any performance difference
> you're seeing between pci-assign and vfio-pci.  I have a really hard
> time believing there's anywhere close to a 40-50% difference.  What's
> the device?  What's the workload?  At some point you're likely to find
> that pci-assign is no longer even present.  Thanks,
>
> Alex


* Re: KVM pci-assign - iommu width is not sufficient for mapped address
  2016-01-08  4:17   ` Shyam
@ 2016-01-08  4:53     ` Alex Williamson
  2016-01-08  6:52       ` Shyam
  0 siblings, 1 reply; 9+ messages in thread
From: Alex Williamson @ 2016-01-08  4:53 UTC (permalink / raw)
  To: Shyam; +Cc: kvm

On Fri, 2016-01-08 at 09:47 +0530, Shyam wrote:
> Hi Alex,
> 
> Thanks for your inputs.
> 
> We are using Mellanox ConnectX-3 iSER SRIOV capable NICs. We
> provision
> these VF's into the VM. The VM connects to few SSD drives through
> iSER. For this performance test, if we expose the same SSDs through
> iSER out of VM to servers & run vdbench 4K read/write workloads we
> see
> this significant performance drop when using vfio. These VM's have 8
> hyper-threads from Intel E5-2680 v3 server & 32GB RAM. The key
> observation is with vfio the cpu saturates much earlier & hence
> cannot
> allow us to scale IOPs.
> 
> I will open a separate mail thread about this performance degradation
> using vfio with numbers. In the meantime if you can suggest how to
> look for performance issue or what logs you would prefer for VFIO
> debugging it will help in getting the needed info for you.

Hi Shyam,

For the degree of performance loss you're experiencing, I'd suspect
some sort of KVM acceleration is disabled.  Would it be possible to
reproduce your testing on a host running something like Fedora 23 or
RHEL7/Centos7 where we know that the kernel and QEMU are fully enabled
for vfio?

Other useful information:

 * QEMU command line or libvirt logs for VM in each configuration
 * lspci -vvv of VF from host while in operation in each config
 * QEMU version
 * grep VFIO /boot/config-`uname -r` (or wherever the running kernel
   config is on your system)

For a well-behaved VF, device assignment should mostly set up VM
access and get out of the way; there should be little opportunity to
inflict such a large performance difference.  If we can't spot
anything obvious and it's reproducible on a known kernel and QEMU, we
can look into tracing to see what might be happening.  Thanks,

Alex


* Re: KVM pci-assign - iommu width is not sufficient for mapped address
  2016-01-08  4:53     ` Alex Williamson
@ 2016-01-08  6:52       ` Shyam
  2016-01-08 18:52         ` Alex Williamson
  0 siblings, 1 reply; 9+ messages in thread
From: Shyam @ 2016-01-08  6:52 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm

Hi Alex,

It will be hard to reproduce this on Fedora/RHEL. We have Ubuntu based
server/VM & I can shift to any kernel/qemu/vfio versions that you
recommend.

Both our Host & Guest run Ubuntu Trusty (Ubuntu 14.04.3 LTS) with
Linux Kernel version 3.18.19 (from
http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.18.19-vivid/).

Qemu version on the host is
QEMU emulator version 2.0.0 (Debian 2.0.0+dfsg-2ubuntu1.21), Copyright
(c) 2003-2008 Fabrice Bellard

We are using 8 x Intel RMS3CC080 SSDs for this test. We expose these
SSDs to the VM (through iSER) & then set up a dm-stripe over them
within the VM. We carve two 100GB dm-linear devices out of this stripe
& expose them through SCST to an external server. The external server
connects to these devices over iSER with 4 multipath paths (policy:
queue-length:0) per device. From the external server we run fio with 4
threads, each with 64 outstanding IOs of 100% 4K random reads.

This is the performance difference we see

with PCI-assign to the VM
randrw 100:0 64iodepth 4thr 4kb - R: 550,224K wait_us:2,245 cpu
tot:85.57 usr:3.96 sys:31.55 iow:50.06

i.e. we get 137-140K IOPs or 550MB/s

with VFIO to the VM
randrw 100:0 64iodepth 4thr 4kb - R: 309,432K wait_us:3,964 cpu
tot:78.58 usr:2.28 sys:18.00 iow:58.30

i.e. we get 77-80K IOPs or 310MB/s
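
[Editorial note: the IOPS figures follow directly from the reported
read rates at the 4K block size, shown here for clarity:]

```python
def kib_per_sec_to_iops(kib_per_sec, block_kib=4):
    """Convert a fio read rate in KiB/s to IOPS at the given block size."""
    return kib_per_sec // block_kib

print(kib_per_sec_to_iops(550_224))  # 137556 -> ~137K IOPS with pci-assign
print(kib_per_sec_to_iops(309_432))  # 77358  -> ~77K IOPS with vfio-pci
```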

The only change between the two runs is that the VM is spawned with
VFIO instead of pci-assign. There is no other difference in software
versions or settings.

$ grep VFIO /boot/config-`uname -r`
CONFIG_VFIO_IOMMU_TYPE1=m
CONFIG_VFIO=m
CONFIG_VFIO_PCI=m
CONFIG_VFIO_PCI_VGA=y
CONFIG_KVM_VFIO=y

I uploaded QEMU command-line & lspci outputs at
https://www.dropbox.com/s/imbqn0274i6hhnz/vfio_issue.tgz?dl=0

Please let me know if you have any issues downloading it.

Please let us know if you can see any KVM acceleration being disabled,
& suggest next steps for debugging with VFIO tracing. Thanks for your
help!

--Shyam

On Fri, Jan 8, 2016 at 10:23 AM, Alex Williamson
<alex.williamson@redhat.com> wrote:
> On Fri, 2016-01-08 at 09:47 +0530, Shyam wrote:
>> Hi Alex,
>>
>> Thanks for your inputs.
>>
>> We are using Mellanox ConnectX-3 iSER SRIOV capable NICs. We
>> provision
>> these VF's into the VM. The VM connects to few SSD drives through
>> iSER. For this performance test, if we expose the same SSDs through
>> iSER out of VM to servers & run vdbench 4K read/write workloads we
>> see
>> this significant performance drop when using vfio. These VM's have 8
>> hyper-threads from Intel E5-2680 v3 server & 32GB RAM. The key
>> observation is with vfio the cpu saturates much earlier & hence
>> cannot
>> allow us to scale IOPs.
>>
>> I will open a separate mail thread about this performance degradation
>> using vfio with numbers. In the meantime if you can suggest how to
>> look for performance issue or what logs you would prefer for VFIO
>> debugging it will help in getting the needed info for you.
>
> Hi Shyam,
>
> For the degree of performance loss you're experiencing, I'd suspect
> some sort of KVM acceleration is disabled.  Would it be possible to
> reproduce your testing on a host running something like Fedora 23 or
> RHEL7/Centos7 where we know that the kernel and QEMU are fully enabled
> for vfio?
>
> Other useful information:
>
>  * QEMU command line or libvirt logs for VM in each configuration
>  * lspci -vvv of VF from host while in operation in each config
>  * QEMU version
>  * grep VFIO /boot/config-`uname -r` (or wherever the running kernel
>    config is on your system)
> For a well behaved VF, device assignment should mostly setup VM access
> and get out of the way, there should be little opportunity to inflict
> such a high performance difference.  If we can't spot anything obvious
> and it's reproducible on a known kernel and QEMU, we can look into
> tracing to see what might be happening.  Thanks,
>
> Alex


* Re: KVM pci-assign - iommu width is not sufficient for mapped address
  2016-01-08  6:52       ` Shyam
@ 2016-01-08 18:52         ` Alex Williamson
  2016-01-11 10:11           ` Shyam
  0 siblings, 1 reply; 9+ messages in thread
From: Alex Williamson @ 2016-01-08 18:52 UTC (permalink / raw)
  To: Shyam; +Cc: kvm

On Fri, 2016-01-08 at 12:22 +0530, Shyam wrote:
> Hi Alex,
> 
> It will be hard to reproduce this on Fedora/RHEL. We have Ubuntu
> based
> server/VM & I can shift to any kernel/qemu/vfio versions that you
> recommend.
> 
> Both our Host & Guest run Ubuntu Trusty (Ubuntu 14.04.3 LTS) with
> Linux Kernel version 3.18.19 (from
> http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.18.19-vivid/).
> 
> Qemu version on the host is
> QEMU emulator version 2.0.0 (Debian 2.0.0+dfsg-2ubuntu1.21),
> Copyright
> (c) 2003-2008 Fabrice Bellard
> 
> We are using 8 X Intel RMS3CC080 SSD's for this test. We expose these
> SSD's to the VM (through iSER) & then setup dm-stripe over them
> within
> the VM. We create two dm-linear out of this at 100GB size & expose
> through SCST to an external server. External server iSER connects to
> these devices & have multipath 4Xpaths (policy: queue-length:0) per
> device. From external server we run fio with 4 threads & each with
> 64-outstanding IOs of 100% 4K random-reads.
> 
> This is the performance difference we see
> 
> with PCI-assign to the VM
> randrw 100:0 64iodepth 4thr 4kb - R: 550,224K wait_us:2,245 cpu
> tot:85.57 usr:3.96 sys:31.55 iow:50.06
> 
> i.e. we get 137-140K IOPs or 550MB/s
> 
> with VFIO to the VM
> randrw 100:0 64iodepth 4thr 4kb - R: 309,432K wait_us:3,964 cpu
> tot:78.58 usr:2.28 sys:18.00 iow:58.30
> 
> i.e. we get 77-80K IOPs or 310MB/s
> 
> The only change between the two runs is to have a VM that is spawned
> with VFIO instead of pci-assign. There is no other difference in
> software versions or any settings.
> 
> $ grep VFIO /boot/config-`uname -r`
> CONFIG_VFIO_IOMMU_TYPE1=m
> CONFIG_VFIO=m
> CONFIG_VFIO_PCI=m
> CONFIG_VFIO_PCI_VGA=y
> CONFIG_KVM_VFIO=y
> 
> I uploaded QEMU command-line & lspci outputs at
> https://www.dropbox.com/s/imbqn0274i6hhnz/vfio_issue.tgz?dl=0
> 
> Pls let me know if you have any issues in downloading it.
> 
> Please let us know if you see any KVM acceleration is disabled &
> suggested next steps to debug with VFIO tracing. Thanks for your
> help!

Thanks for the logs, everything appears to be setup correctly.  One
suspicion I have is the difference between pci-assign and vfio-pci in
the way the MSI-X Pending Bits Array (PBA) is handled.  Legacy KVM
device assignment handles MSI-X itself and ignores the PBA.  On this
hardware the MSI-X vector table and PBA are nicely aligned on separate
4k pages, which means that pci-assign will give the VM direct access to
everything on the PBA page.  On the other hand, vfio-pci registers MSI-
X with QEMU, which does handle the PBA.  The vast majority of drivers
never use the PBA and the PCI spec includes an implementation note
suggesting that hardware vendors include additional alignment to
prevent MSI-X structures from overlapping with other registers.  My
hypothesis is that this device perhaps does not abide by that
recommendation and may be regularly accessing the PBA page, thus
causing a vfio-pci assigned device to trap through to QEMU more
regularly than a legacy assigned device.
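
[Editorial note: the hypothesis can be pictured with the layout Alex
describes. With the vector table and PBA on separate 4K pages,
pci-assign can hand the entire PBA page to the guest, while vfio-pci's
emulated PBA region forces a trap to QEMU on every access to that
page. A toy model with hypothetical offsets, for illustration only:]

```python
PAGE = 0x1000  # 4K host page

def page_of(bar_offset):
    """Which 4K page within the BAR an offset falls on."""
    return bar_offset // PAGE

# Hypothetical MSI-X layout matching the description above: vector
# table and PBA nicely aligned on separate 4K pages within the BAR.
table_offset, pba_offset = 0x0000, 0x1000

# pci-assign only withholds the vector-table page, so the guest can
# touch the PBA page (and any device registers sharing it) directly,
# without a VM exit.
assert page_of(table_offset) != page_of(pba_offset)

# vfio-pci overlays an emulated PBA MemoryRegion at pba_offset, so any
# guest access to that page traps to QEMU -- cheap if the page holds
# only the rarely-read PBA, expensive if device registers share it.
```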

If I could ask you to build and run a new QEMU, I think we can easily
test this hypothesis by making vfio-pci behave more like pci-assign.
The following patch is based on QEMU 2.5 and simply skips the step of
placing the PBA memory region overlapping the device, allowing direct
access in this case.  The patch is easily adaptable to older versions
of QEMU, but if we need to do any further tracing, it's probably best
to do so on 2.5 anyway.  This is only a proof of concept; if it proves
to be the culprit we'll need to think about how to handle it more
cleanly.  Here's the patch:

diff --git a/hw/pci/msix.c b/hw/pci/msix.c
index 64c93d8..a5ad18c 100644
--- a/hw/pci/msix.c
+++ b/hw/pci/msix.c
@@ -291,7 +291,7 @@ int msix_init(struct PCIDevice *dev, unsigned short nentries,
     memory_region_add_subregion(table_bar, table_offset, &dev->msix_table_mmio);
     memory_region_init_io(&dev->msix_pba_mmio, OBJECT(dev), &msix_pba_mmio_ops, dev,
                           "msix-pba", pba_size);
-    memory_region_add_subregion(pba_bar, pba_offset, &dev->msix_pba_mmio);
+    /* memory_region_add_subregion(pba_bar, pba_offset, &dev->msix_pba_mmio); */
 
     return 0;
 }
@@ -369,7 +369,7 @@ void msix_uninit(PCIDevice *dev, MemoryRegion *table_bar, MemoryRegion *pba_bar)
     dev->msix_cap = 0;
     msix_free_irq_entries(dev);
     dev->msix_entries_nr = 0;
-    memory_region_del_subregion(pba_bar, &dev->msix_pba_mmio);
+    /* memory_region_del_subregion(pba_bar, &dev->msix_pba_mmio); */
     g_free(dev->msix_pba);
     dev->msix_pba = NULL;
     memory_region_del_subregion(table_bar, &dev->msix_table_mmio);

Thanks,
Alex


* Re: KVM pci-assign - iommu width is not sufficient for mapped address
  2016-01-08 18:52         ` Alex Williamson
@ 2016-01-11 10:11           ` Shyam
  2016-01-11 23:20             ` Alex Williamson
  0 siblings, 1 reply; 9+ messages in thread
From: Shyam @ 2016-01-11 10:11 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm

Hi Alex,

You are spot on!

Applying your patch on QEMU 2.5.50 (latest from github master) fully
solves the performance issue. We are able to get back to pci-assign
performance numbers. Great!

Can you please see about formalizing this patch into a clean fix? I
will be happy to test additional patches for you. Thanks a lot for
your help!

--Shyam

On Sat, Jan 9, 2016 at 12:22 AM, Alex Williamson
<alex.williamson@redhat.com> wrote:
> On Fri, 2016-01-08 at 12:22 +0530, Shyam wrote:
>> Hi Alex,
>>
>> It will be hard to reproduce this on Fedora/RHEL. We have Ubuntu
>> based
>> server/VM & I can shift to any kernel/qemu/vfio versions that you
>> recommend.
>>
>> Both our Host & Guest run Ubuntu Trusty (Ubuntu 14.04.3 LTS) with
>> Linux Kernel version 3.18.19 (from
>> http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.18.19-vivid/).
>>
>> Qemu version on the host is
>> QEMU emulator version 2.0.0 (Debian 2.0.0+dfsg-2ubuntu1.21),
>> Copyright
>> (c) 2003-2008 Fabrice Bellard
>>
>> We are using 8 X Intel RMS3CC080 SSD's for this test. We expose these
>> SSD's to the VM (through iSER) & then setup dm-stripe over them
>> within
>> the VM. We create two dm-linear out of this at 100GB size & expose
>> through SCST to an external server. External server iSER connects to
>> these devices & have multipath 4Xpaths (policy: queue-length:0) per
>> device. From external server we run fio with 4 threads & each with
>> 64-outstanding IOs of 100% 4K random-reads.
>>
>> This is the performance difference we see
>>
>> with PCI-assign to the VM
>> randrw 100:0 64iodepth 4thr 4kb - R: 550,224K wait_us:2,245 cpu
>> tot:85.57 usr:3.96 sys:31.55 iow:50.06
>>
>> i.e. we get 137-140K IOPs or 550MB/s
>>
>> with VFIO to the VM
>> randrw 100:0 64iodepth 4thr 4kb - R: 309,432K wait_us:3,964 cpu
>> tot:78.58 usr:2.28 sys:18.00 iow:58.30
>>
>> i.e. we get 77-80K IOPs or 310MB/s
>>
>> The only change between the two runs is to have a VM that is spawned
>> with VFIO instead of pci-assign. There is no other difference in
>> software versions or any settings.
>>
>> $ grep VFIO /boot/config-`uname -r`
>> CONFIG_VFIO_IOMMU_TYPE1=m
>> CONFIG_VFIO=m
>> CONFIG_VFIO_PCI=m
>> CONFIG_VFIO_PCI_VGA=y
>> CONFIG_KVM_VFIO=y
>>
>> I uploaded QEMU command-line & lspci outputs at
>> https://www.dropbox.com/s/imbqn0274i6hhnz/vfio_issue.tgz?dl=0
>>
>> Pls let me know if you have any issues in downloading it.
>>
>> Please let us know if you see any KVM acceleration is disabled &
>> suggested next steps to debug with VFIO tracing. Thanks for your
>> help!
>
> Thanks for the logs, everything appears to be setup correctly.  One
> suspicion I have is the difference between pci-assign and vfio-pci in
> the way the MSI-X Pending Bits Array (PBA) is handled.  Legacy KVM
> device assignment handles MSI-X itself and ignores the PBA.  On this
> hardware the MSI-X vector table and PBA are nicely aligned on separate
> 4k pages, which means that pci-assign will give the VM direct access to
> everything on the PBA page.  On the other hand, vfio-pci registers MSI-
> X with QEMU, which does handle the PBA.  The vast majority of drivers
> never use the PBA and the PCI spec includes an implementation note
> suggesting that hardware vendors include additional alignment to
> prevent MSI-X structures from overlapping with other registers.  My
> hypothesis is that this device perhaps does not abide by that
> recommendation and may be regularly accessing the PBA page, thus
> causing a vfio-pci assigned device to trap through to QEMU more
> regularly than a legacy assigned device.
>
> If I could ask you to build and run a new QEMU, I think we can easily
> test this hypothesis by making vfio-pci behave more like pci-assign.
>  The following patch is based on QEMU 2.5 and simply skips the step of
> placing the PBA memory region overlapping the device, allowing direct
> access in this case.  The patch is easily adaptable to older versions
> of QEMU, but if we need to do any further tracing, it's probably best
> to do so on 2.5 anyway.  This is only a proof of concept, if it proves
> to be the culprit we'll need to think about how to handle it more
> cleanly.  Here's the patch:
>
> diff --git a/hw/pci/msix.c b/hw/pci/msix.c
> index 64c93d8..a5ad18c 100644
> --- a/hw/pci/msix.c
> +++ b/hw/pci/msix.c
> @@ -291,7 +291,7 @@ int msix_init(struct PCIDevice *dev, unsigned short nentries,
>      memory_region_add_subregion(table_bar, table_offset, &dev->msix_table_mmio);
>      memory_region_init_io(&dev->msix_pba_mmio, OBJECT(dev), &msix_pba_mmio_ops, dev,
>                            "msix-pba", pba_size);
> -    memory_region_add_subregion(pba_bar, pba_offset, &dev->msix_pba_mmio);
> +    /* memory_region_add_subregion(pba_bar, pba_offset, &dev->msix_pba_mmio); */
>
>      return 0;
>  }
> @@ -369,7 +369,7 @@ void msix_uninit(PCIDevice *dev, MemoryRegion *table_bar, MemoryRegion *pba_bar)
>      dev->msix_cap = 0;
>      msix_free_irq_entries(dev);
>      dev->msix_entries_nr = 0;
> -    memory_region_del_subregion(pba_bar, &dev->msix_pba_mmio);
> +    /* memory_region_del_subregion(pba_bar, &dev->msix_pba_mmio); */
>      g_free(dev->msix_pba);
>      dev->msix_pba = NULL;
>      memory_region_del_subregion(table_bar, &dev->msix_table_mmio);
>
> Thanks,
> Alex


* Re: KVM pci-assign - iommu width is not sufficient for mapped address
  2016-01-11 10:11           ` Shyam
@ 2016-01-11 23:20             ` Alex Williamson
  2016-01-12  7:04               ` Shyam
  0 siblings, 1 reply; 9+ messages in thread
From: Alex Williamson @ 2016-01-11 23:20 UTC (permalink / raw)
  To: Shyam; +Cc: kvm

On Mon, 2016-01-11 at 15:41 +0530, Shyam wrote:
> Hi Alex,
> 
> You are spot on!
> 
> Applying your patch on QEMU 2.5.50 (latest from github master) solves
> the performance issue fully. We are able to get back to pci-assign
> performance numbers. Great!
> 
> Can you please see how to formalize this patch cleanly? I will be
> happy to test additional patches for you. Thanks a lot for your help!

Hi Shyam,

Thanks for the testing.  I'm really tempted to just disable PBA
emulation altogether, but I came up with the below patch which enables
it only in the off chance that it's needed.  Patch is against current
qemu.git, please test.  Thanks!

Alex

commit 4f97c12c9f801fabdd3405758408f516e8ea1a80
Author: Alex Williamson <alex.williamson@redhat.com>
Date:   Mon Jan 11 10:44:13 2016 -0700

    vfio/pci: Lazy PBA emulation
    
    The PCI spec recommends devices use additional alignment for MSI-X
    data structures to allow software to map them to separate processor
    pages.  One advantage of doing this is that we can emulate those data
    structures without a significant performance impact to the operation
    of the device.  Some devices fail to implement that suggestion and
    assigned device performance suffers.
    
    One such case of this is a Mellanox MT27500 series, ConnectX-3 VF,
    where the MSI-X vector table and PBA are aligned on separate 4K
    pages.  If PBA emulation is enabled, performance suffers.  It's not
    clear how much value we get from PBA emulation, but the solution here
    is to only lazily enable the emulated PBA when a masked MSI-X vector
    fires.  We then attempt to more aggressively disable the PBA memory
    region any time a vector is unmasked.  The expectation is then that
    a typical VM will run entirely with PBA emulation disabled, and only
    when used is that emulation re-enabled.
    
    Reported-by: Shyam Kaushik <shyam.kaushik@gmail.com>
    Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 1fb868c..e66c47f 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -356,6 +356,13 @@ static void vfio_msi_interrupt(void *opaque)
     if (vdev->interrupt == VFIO_INT_MSIX) {
         get_msg = msix_get_message;
         notify = msix_notify;
+
+        /* A masked vector firing needs to use the PBA, enable it */
+        if (msix_is_masked(&vdev->pdev, nr)) {
+            set_bit(nr, vdev->msix->pending);
+            memory_region_set_enabled(&vdev->pdev.msix_pba_mmio, true);
+            trace_vfio_msix_pba_enable(vdev->vbasedev.name);
+        }
     } else if (vdev->interrupt == VFIO_INT_MSI) {
         get_msg = msi_get_message;
         notify = msi_notify;
@@ -535,6 +542,14 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
         }
     }
 
+    /* Disable PBA emulation when nothing more is pending. */
+    clear_bit(nr, vdev->msix->pending);
+    if (find_first_bit(vdev->msix->pending,
+                       vdev->nr_vectors) == vdev->nr_vectors) {
+        memory_region_set_enabled(&vdev->pdev.msix_pba_mmio, false);
+        trace_vfio_msix_pba_disable(vdev->vbasedev.name);
+    }
+
     return 0;
 }
 
@@ -738,6 +753,9 @@ static void vfio_msix_disable(VFIOPCIDevice *vdev)
 
     vfio_msi_disable_common(vdev);
 
+    memset(vdev->msix->pending, 0,
+           BITS_TO_LONGS(vdev->msix->entries) * sizeof(unsigned long));
+
     trace_vfio_msix_disable(vdev->vbasedev.name);
 }
 
@@ -1251,6 +1269,8 @@ static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos)
 {
     int ret;
 
+    vdev->msix->pending = g_malloc0(BITS_TO_LONGS(vdev->msix->entries) *
+                                    sizeof(unsigned long));
     ret = msix_init(&vdev->pdev, vdev->msix->entries,
                     &vdev->bars[vdev->msix->table_bar].region.mem,
                     vdev->msix->table_bar, vdev->msix->table_offset,
@@ -1264,6 +1284,24 @@ static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos)
         return ret;
     }
 
+    /*
+     * The PCI spec suggests that devices provide additional alignment for
+     * MSI-X structures and avoid overlapping non-MSI-X related registers.
+     * For an assigned device, this hopefully means that emulation of MSI-X
+     * structures does not affect the performance of the device.  If devices
+     * fail to provide that alignment, a significant performance penalty may
+     * result, for instance Mellanox MT27500 VFs:
+     * http://www.spinics.net/lists/kvm/msg125881.html
+     *
+     * The PBA is simply not that important for such a serious regression and
+     * most drivers do not appear to look at it.  The solution for this is to
+     * disable the PBA MemoryRegion unless it's being used.  We disable it
+     * here and only enable it if a masked vector fires through QEMU.  As the
+     * vector-use notifier is called, which occurs on unmask, we test whether
+     * PBA emulation is needed and again disable if not.
+     */
+    memory_region_set_enabled(&vdev->pdev.msix_pba_mmio, false);
+
     return 0;
 }
 
@@ -1275,6 +1313,7 @@ static void vfio_teardown_msi(VFIOPCIDevice *vdev)
         msix_uninit(&vdev->pdev,
                     &vdev->bars[vdev->msix->table_bar].region.mem,
                     &vdev->bars[vdev->msix->pba_bar].region.mem);
+        g_free(vdev->msix->pending);
     }
 }
 
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index f004d52..6256587 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -95,6 +95,7 @@ typedef struct VFIOMSIXInfo {
     uint32_t pba_offset;
     MemoryRegion mmap_mem;
     void *mmap;
+    unsigned long *pending;
 } VFIOMSIXInfo;
 
 typedef struct VFIOPCIDevice {
diff --git a/trace-events b/trace-events
index 934a7b6..c9ac144 100644
--- a/trace-events
+++ b/trace-events
@@ -1631,6 +1631,8 @@ vfio_msi_interrupt(const char *name, int index, uint64_t addr, int data) " (%s)
 vfio_msix_vector_do_use(const char *name, int index) " (%s) vector %d used"
 vfio_msix_vector_release(const char *name, int index) " (%s) vector %d released"
 vfio_msix_enable(const char *name) " (%s)"
+vfio_msix_pba_disable(const char *name) " (%s)"
+vfio_msix_pba_enable(const char *name) " (%s)"
 vfio_msix_disable(const char *name) " (%s)"
 vfio_msi_enable(const char *name, int nr_vectors) " (%s) Enabled %d MSI vectors"
 vfio_msi_disable(const char *name) " (%s)"


* Re: KVM pci-assign - iommu width is not sufficient for mapped address
  2016-01-11 23:20             ` Alex Williamson
@ 2016-01-12  7:04               ` Shyam
  0 siblings, 0 replies; 9+ messages in thread
From: Shyam @ 2016-01-12  7:04 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm

Hi Alex,

I tested your new patch & it works great! It yields performance similar
to your disable-PBA-emulation patch.

So I think you are good to commit. Once you commit we will start using
qemu.git/master.

Thanks again for your great support!

--Shyam


On Tue, Jan 12, 2016 at 4:50 AM, Alex Williamson
<alex.williamson@redhat.com> wrote:
> On Mon, 2016-01-11 at 15:41 +0530, Shyam wrote:
>> Hi Alex,
>>
>> You are spot on!
>>
>> Applying your patch on QEMU 2.5.50 (latest from github master) solves
>> the performance issue fully. We are able to get back to pci-assign
>> performance numbers. Great!
>>
>> Can you please see how to formalize this patch cleanly? I will be
>> happy to test additional patches for you. Thanks a lot for your help!
>
> Hi Shyam,
>
> Thanks for the testing.  I'm really tempted to just disable PBA
> emulation altogether, but I came up with the below patch, which enables
> it only on the off chance that it's needed.  Patch is against current
> qemu.git, please test.  Thanks!
>
> Alex
>
> commit 4f97c12c9f801fabdd3405758408f516e8ea1a80
> Author: Alex Williamson <alex.williamson@redhat.com>
> Date:   Mon Jan 11 10:44:13 2016 -0700
>
>     vfio/pci: Lazy PBA emulation
>
>     The PCI spec recommends devices use additional alignment for MSI-X
>     data structures to allow software to map them to separate processor
>     pages.  One advantage of doing this is that we can emulate those data
>     structures without a significant performance impact to the operation
>     of the device.  Some devices fail to implement that suggestion and
>     assigned device performance suffers.
>
>     One such case of this is a Mellanox MT27500 series, ConnectX-3 VF,
>     where the MSI-X vector table and PBA are aligned on separate 4K
>     pages.  If PBA emulation is enabled, performance suffers.  It's not
>     clear how much value we get from PBA emulation, but the solution here
>     is to only lazily enable the emulated PBA when a masked MSI-X vector
>     fires.  We then attempt to more aggressively disable the PBA memory
>     region any time a vector is unmasked.  The expectation is then that
>     a typical VM will run entirely with PBA emulation disabled, and only
>     when used is that emulation re-enabled.
>
>     Reported-by: Shyam Kaushik <shyam.kaushik@gmail.com>
>     Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 1fb868c..e66c47f 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -356,6 +356,13 @@ static void vfio_msi_interrupt(void *opaque)
>      if (vdev->interrupt == VFIO_INT_MSIX) {
>          get_msg = msix_get_message;
>          notify = msix_notify;
> +
> +        /* A masked vector firing needs to use the PBA, enable it */
> +        if (msix_is_masked(&vdev->pdev, nr)) {
> +            set_bit(nr, vdev->msix->pending);
> +            memory_region_set_enabled(&vdev->pdev.msix_pba_mmio, true);
> +            trace_vfio_msix_pba_enable(vdev->vbasedev.name);
> +        }
>      } else if (vdev->interrupt == VFIO_INT_MSI) {
>          get_msg = msi_get_message;
>          notify = msi_notify;
> @@ -535,6 +542,14 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
>          }
>      }
>
> +    /* Disable PBA emulation when nothing more is pending. */
> +    clear_bit(nr, vdev->msix->pending);
> +    if (find_first_bit(vdev->msix->pending,
> +                       vdev->nr_vectors) == vdev->nr_vectors) {
> +        memory_region_set_enabled(&vdev->pdev.msix_pba_mmio, false);
> +        trace_vfio_msix_pba_disable(vdev->vbasedev.name);
> +    }
> +
>      return 0;
>  }
>
> @@ -738,6 +753,9 @@ static void vfio_msix_disable(VFIOPCIDevice *vdev)
>
>      vfio_msi_disable_common(vdev);
>
> +    memset(vdev->msix->pending, 0,
> +           BITS_TO_LONGS(vdev->msix->entries) * sizeof(unsigned long));
> +
>      trace_vfio_msix_disable(vdev->vbasedev.name);
>  }
>
> @@ -1251,6 +1269,8 @@ static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos)
>  {
>      int ret;
>
> +    vdev->msix->pending = g_malloc0(BITS_TO_LONGS(vdev->msix->entries) *
> +                                    sizeof(unsigned long));
>      ret = msix_init(&vdev->pdev, vdev->msix->entries,
>                      &vdev->bars[vdev->msix->table_bar].region.mem,
>                      vdev->msix->table_bar, vdev->msix->table_offset,
> @@ -1264,6 +1284,24 @@ static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos)
>          return ret;
>      }
>
> +    /*
> +     * The PCI spec suggests that devices provide additional alignment for
> +     * MSI-X structures and avoid overlapping non-MSI-X related registers.
> +     * For an assigned device, this hopefully means that emulation of MSI-X
> +     * structures does not affect the performance of the device.  If devices
> +     * fail to provide that alignment, a significant performance penalty may
> +     * result, for instance Mellanox MT27500 VFs:
> +     * http://www.spinics.net/lists/kvm/msg125881.html
> +     *
> +     * The PBA is simply not that important for such a serious regression and
> +     * most drivers do not appear to look at it.  The solution for this is to
> +     * disable the PBA MemoryRegion unless it's being used.  We disable it
> +     * here and only enable it if a masked vector fires through QEMU.  As the
> +     * vector-use notifier is called, which occurs on unmask, we test whether
> +     * PBA emulation is needed and again disable if not.
> +     */
> +    memory_region_set_enabled(&vdev->pdev.msix_pba_mmio, false);
> +
>      return 0;
>  }
>
> @@ -1275,6 +1313,7 @@ static void vfio_teardown_msi(VFIOPCIDevice *vdev)
>          msix_uninit(&vdev->pdev,
>                      &vdev->bars[vdev->msix->table_bar].region.mem,
>                      &vdev->bars[vdev->msix->pba_bar].region.mem);
> +        g_free(vdev->msix->pending);
>      }
>  }
>
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index f004d52..6256587 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -95,6 +95,7 @@ typedef struct VFIOMSIXInfo {
>      uint32_t pba_offset;
>      MemoryRegion mmap_mem;
>      void *mmap;
> +    unsigned long *pending;
>  } VFIOMSIXInfo;
>
>  typedef struct VFIOPCIDevice {
> diff --git a/trace-events b/trace-events
> index 934a7b6..c9ac144 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -1631,6 +1631,8 @@ vfio_msi_interrupt(const char *name, int index, uint64_t addr, int data) " (%s)
>  vfio_msix_vector_do_use(const char *name, int index) " (%s) vector %d used"
>  vfio_msix_vector_release(const char *name, int index) " (%s) vector %d released"
>  vfio_msix_enable(const char *name) " (%s)"
> +vfio_msix_pba_disable(const char *name) " (%s)"
> +vfio_msix_pba_enable(const char *name) " (%s)"
>  vfio_msix_disable(const char *name) " (%s)"
>  vfio_msi_enable(const char *name, int nr_vectors) " (%s) Enabled %d MSI vectors"
>  vfio_msi_disable(const char *name) " (%s)"

