All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
@ 2012-04-05  3:42 Alex Williamson
  2012-04-05  7:28 ` Jan Kiszka
                   ` (4 more replies)
  0 siblings, 5 replies; 41+ messages in thread
From: Alex Williamson @ 2012-04-05  3:42 UTC (permalink / raw)
  To: kvm; +Cc: alex.williamson, jan.kiszka, mst, jbaron

We've hit a kernel host panic, when issuing a 'system_reset' with an
82576 nic assigned and a Windows guest. Host system is a PowerEdge R815.

[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 32993
[Hardware Error]: APEI generic hardware error status
[Hardware Error]: severity: 1, fatal
[Hardware Error]: section: 0, severity: 1, fatal
[Hardware Error]: flags: 0x01
[Hardware Error]: primary
[Hardware Error]: section_type: PCIe error
[Hardware Error]: port_type: 0, PCIe end point
[Hardware Error]: version: 1.0
[Hardware Error]: command: 0x0000, status: 0x0010
[Hardware Error]: device_id: 0000:08:00.0
[Hardware Error]: slot: 1
[Hardware Error]: secondary_bus: 0x00
[Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
[Hardware Error]: class_code: 000002
[Hardware Error]: aer_status: 0x00100000, aer_mask: 0x00018000
[Hardware Error]: Unsupported Request
[Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
[Hardware Error]: aer_uncor_severity: 0x00067011
[Hardware Error]: aer_tlp_header: 40001001 0020000f edbf800c 01000000
[Hardware Error]: section: 1, severity: 1, fatal
[Hardware Error]: flags: 0x01
[Hardware Error]: primary
[Hardware Error]: section_type: PCIe error
[Hardware Error]: port_type: 0, PCIe end point
[Hardware Error]: version: 1.0
[Hardware Error]: command: 0x0000, status: 0x0010
[Hardware Error]: device_id: 0000:08:00.0
[Hardware Error]: slot: 1
[Hardware Error]: secondary_bus: 0x00
[Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
[Hardware Error]: class_code: 000002
[Hardware Error]: aer_status: 0x00100000, aer_mask: 0x00018000
[Hardware Error]: Unsupported Request
[Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
[Hardware Error]: aer_uncor_severity: 0x00067011
[Hardware Error]: aer_tlp_header: 40001001 0020000f edbf800c 01000000
Kernel panic - not syncing: Fatal hardware error!
Pid: 0, comm: swapper Not tainted 2.6.32-242.el6.x86_64 #1
Call Trace:
 <NMI>  [<ffffffff814f2fe5>] ? panic+0xa0/0x168
 [<ffffffff812f919c>] ? ghes_notify_nmi+0x17c/0x180
 [<ffffffff814f91d5>] ? notifier_call_chain+0x55/0x80
 [<ffffffff814f923a>] ? atomic_notifier_call_chain+0x1a/0x20
 [<ffffffff8109667e>] ? notify_die+0x2e/0x30
 [<ffffffff814f6e81>] ? do_nmi+0x1a1/0x2b0
 [<ffffffff814f6760>] ? nmi+0x20/0x30
 [<ffffffff8103762b>] ? native_safe_halt+0xb/0x10
 <<EOE>>  [<ffffffff8101495d>] ? default_idle+0x4d/0xb0
 [<ffffffff81009e06>] ? cpu_idle+0xb6/0x110
 [<ffffffff814da63a>] ? rest_init+0x7a/0x80
 [<ffffffff81c1ff7b>] ? start_kernel+0x424/0x430
 [<ffffffff81c1f33a>] ? x86_64_start_reservations+0x125/0x129
 [<ffffffff81c1f438>] ? x86_64_start_kernel+0xfa/0x109

The root cause of the problem is that the 'reset_assigned_device()' code
first writes a 0 to the command register. Then, when qemu subsequently does
a kvm_deassign_irq() (called by assign_irq(), in the system_reset path),
the kernel ends up calling '__msix_mask_irq()', which performs a write to
the memory mapped msi vector space. Since, we've explicitly told the device
to disallow mmio access (via the 0 write to the command register), we end
up with the above 'Unsupported Request'.

The fix here is to first disable MSI-X, before doing the reset.  We also
disable MSI, leaving the device in INTx mode.  In this way, the device is
a known state after reset, and we avoid touching msi memory mapped space
on any subsequent 'kvm_deassign_irq()'.

Thanks to Michael S. Tsirkin for help in understanding what was going on
here and Jason Baron, the original debugger of this problem.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

Jason is out of the office for a couple weeks, so I'll try to resolve
this while he's away.  Somehow the emulated config updates were lost
in Jason's original posting, so I've fixed that and taken Jan's suggestion
to simply call into the update functions instead of open coding the
interrupt disable.  I think there still may be some disagreements about
how to handle guest generated errors in the host, but that's a large
project whereas this is something we should be doing at reset anyway,
and even if only a workaround, resolves the problem above.

 hw/device-assignment.c |   23 +++++++++++++++++++++++
 1 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 89823f1..2e6b93e 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -1613,6 +1613,29 @@ static void reset_assigned_device(DeviceState *dev)
     const char reset[] = "1";
     int fd, ret;
 
+    /*
+     * If a guest is reset without being shutdown, MSI/MSI-X can still
+     * be running.  We want to return the device to a known state on
+     * reset, so disable those here.  We especially do not want MSI-X
+     * enabled since it lives in MMIO space, which is about to get
+     * disabled.
+     */
+    if (adev->irq_requested_type & KVM_DEV_IRQ_GUEST_MSIX) {
+        uint16_t ctrl = pci_get_word(pci_dev->config +
+                                     pci_dev->msix_cap + PCI_MSIX_FLAGS);
+
+        pci_set_word(pci_dev->config + pci_dev->msix_cap + PCI_MSIX_FLAGS,
+                     ctrl & ~PCI_MSIX_FLAGS_ENABLE);
+        assigned_dev_update_msix(pci_dev);
+    } else if (adev->irq_requested_type & KVM_DEV_IRQ_GUEST_MSI) {
+        uint8_t ctrl = pci_get_byte(pci_dev->config +
+                                    pci_dev->msi_cap + PCI_MSI_FLAGS);
+
+        pci_set_byte(pci_dev->config + pci_dev->msi_cap + PCI_MSI_FLAGS,
+                     ctrl & ~PCI_MSI_FLAGS_ENABLE);
+        assigned_dev_update_msi(pci_dev);
+    }
+
     snprintf(reset_file, sizeof(reset_file),
              "/sys/bus/pci/devices/%04x:%02x:%02x.%01x/reset",
              adev->host.seg, adev->host.bus, adev->host.dev, adev->host.func);


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
  2012-04-05  3:42 [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path Alex Williamson
@ 2012-04-05  7:28 ` Jan Kiszka
  2012-04-05  9:34 ` Michael S. Tsirkin
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 41+ messages in thread
From: Jan Kiszka @ 2012-04-05  7:28 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, mst, jbaron

On 2012-04-05 05:42, Alex Williamson wrote:
> We've hit a kernel host panic, when issuing a 'system_reset' with an
> 82576 nic assigned and a Windows guest. Host system is a PowerEdge R815.
> 
> [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 32993
> [Hardware Error]: APEI generic hardware error status
> [Hardware Error]: severity: 1, fatal
> [Hardware Error]: section: 0, severity: 1, fatal
> [Hardware Error]: flags: 0x01
> [Hardware Error]: primary
> [Hardware Error]: section_type: PCIe error
> [Hardware Error]: port_type: 0, PCIe end point
> [Hardware Error]: version: 1.0
> [Hardware Error]: command: 0x0000, status: 0x0010
> [Hardware Error]: device_id: 0000:08:00.0
> [Hardware Error]: slot: 1
> [Hardware Error]: secondary_bus: 0x00
> [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
> [Hardware Error]: class_code: 000002
> [Hardware Error]: aer_status: 0x00100000, aer_mask: 0x00018000
> [Hardware Error]: Unsupported Request
> [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
> [Hardware Error]: aer_uncor_severity: 0x00067011
> [Hardware Error]: aer_tlp_header: 40001001 0020000f edbf800c 01000000
> [Hardware Error]: section: 1, severity: 1, fatal
> [Hardware Error]: flags: 0x01
> [Hardware Error]: primary
> [Hardware Error]: section_type: PCIe error
> [Hardware Error]: port_type: 0, PCIe end point
> [Hardware Error]: version: 1.0
> [Hardware Error]: command: 0x0000, status: 0x0010
> [Hardware Error]: device_id: 0000:08:00.0
> [Hardware Error]: slot: 1
> [Hardware Error]: secondary_bus: 0x00
> [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
> [Hardware Error]: class_code: 000002
> [Hardware Error]: aer_status: 0x00100000, aer_mask: 0x00018000
> [Hardware Error]: Unsupported Request
> [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
> [Hardware Error]: aer_uncor_severity: 0x00067011
> [Hardware Error]: aer_tlp_header: 40001001 0020000f edbf800c 01000000
> Kernel panic - not syncing: Fatal hardware error!
> Pid: 0, comm: swapper Not tainted 2.6.32-242.el6.x86_64 #1
> Call Trace:
>  <NMI>  [<ffffffff814f2fe5>] ? panic+0xa0/0x168
>  [<ffffffff812f919c>] ? ghes_notify_nmi+0x17c/0x180
>  [<ffffffff814f91d5>] ? notifier_call_chain+0x55/0x80
>  [<ffffffff814f923a>] ? atomic_notifier_call_chain+0x1a/0x20
>  [<ffffffff8109667e>] ? notify_die+0x2e/0x30
>  [<ffffffff814f6e81>] ? do_nmi+0x1a1/0x2b0
>  [<ffffffff814f6760>] ? nmi+0x20/0x30
>  [<ffffffff8103762b>] ? native_safe_halt+0xb/0x10
>  <<EOE>>  [<ffffffff8101495d>] ? default_idle+0x4d/0xb0
>  [<ffffffff81009e06>] ? cpu_idle+0xb6/0x110
>  [<ffffffff814da63a>] ? rest_init+0x7a/0x80
>  [<ffffffff81c1ff7b>] ? start_kernel+0x424/0x430
>  [<ffffffff81c1f33a>] ? x86_64_start_reservations+0x125/0x129
>  [<ffffffff81c1f438>] ? x86_64_start_kernel+0xfa/0x109
> 
> The root cause of the problem is that the 'reset_assigned_device()' code
> first writes a 0 to the command register. Then, when qemu subsequently does
> a kvm_deassign_irq() (called by assign_irq(), in the system_reset path),
> the kernel ends up calling '__msix_mask_irq()', which performs a write to
> the memory mapped msi vector space. Since, we've explicitly told the device
> to disallow mmio access (via the 0 write to the command register), we end
> up with the above 'Unsupported Request'.
> 
> The fix here is to first disable MSI-X, before doing the reset.  We also
> disable MSI, leaving the device in INTx mode.  In this way, the device is
> a known state after reset, and we avoid touching msi memory mapped space
> on any subsequent 'kvm_deassign_irq()'.
> 
> Thanks to Michael S. Tsirkin for help in understanding what was going on
> here and Jason Baron, the original debugger of this problem.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
> 
> Jason is out of the office for a couple weeks, so I'll try to resolve
> this while he's away.  Somehow the emulated config updates were lost
> in Jason's original posting, so I've fixed that and taken Jan's suggestion
> to simply call into the update functions instead of open coding the
> interrupt disable.  I think there still may be some disagreements about
> how to handle guest generated errors in the host, but that's a large
> project whereas this is something we should be doing at reset anyway,
> and even if only a workaround, resolves the problem above.
> 
>  hw/device-assignment.c |   23 +++++++++++++++++++++++
>  1 files changed, 23 insertions(+), 0 deletions(-)
> 
> diff --git a/hw/device-assignment.c b/hw/device-assignment.c
> index 89823f1..2e6b93e 100644
> --- a/hw/device-assignment.c
> +++ b/hw/device-assignment.c
> @@ -1613,6 +1613,29 @@ static void reset_assigned_device(DeviceState *dev)
>      const char reset[] = "1";
>      int fd, ret;
>  
> +    /*
> +     * If a guest is reset without being shutdown, MSI/MSI-X can still
> +     * be running.  We want to return the device to a known state on
> +     * reset, so disable those here.  We especially do not want MSI-X

To be precised: It is the specified state after reset which we fail to
restore so far.

> +     * enabled since it lives in MMIO space, which is about to get
> +     * disabled.
> +     */
> +    if (adev->irq_requested_type & KVM_DEV_IRQ_GUEST_MSIX) {
> +        uint16_t ctrl = pci_get_word(pci_dev->config +
> +                                     pci_dev->msix_cap + PCI_MSIX_FLAGS);
> +
> +        pci_set_word(pci_dev->config + pci_dev->msix_cap + PCI_MSIX_FLAGS,
> +                     ctrl & ~PCI_MSIX_FLAGS_ENABLE);
> +        assigned_dev_update_msix(pci_dev);
> +    } else if (adev->irq_requested_type & KVM_DEV_IRQ_GUEST_MSI) {
> +        uint8_t ctrl = pci_get_byte(pci_dev->config +
> +                                    pci_dev->msi_cap + PCI_MSI_FLAGS);
> +
> +        pci_set_byte(pci_dev->config + pci_dev->msi_cap + PCI_MSI_FLAGS,
> +                     ctrl & ~PCI_MSI_FLAGS_ENABLE);
> +        assigned_dev_update_msi(pci_dev);
> +    }
> +
>      snprintf(reset_file, sizeof(reset_file),
>               "/sys/bus/pci/devices/%04x:%02x:%02x.%01x/reset",
>               adev->host.seg, adev->host.bus, adev->host.dev, adev->host.func);
> 

Reviewed-by: Jan Kiszka <jan.kiszka@siemens.com>

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
  2012-04-05  3:42 [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path Alex Williamson
  2012-04-05  7:28 ` Jan Kiszka
@ 2012-04-05  9:34 ` Michael S. Tsirkin
  2012-04-05 14:42   ` Alex Williamson
  2012-04-08 13:14 ` Avi Kivity
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 41+ messages in thread
From: Michael S. Tsirkin @ 2012-04-05  9:34 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, jan.kiszka, jbaron

On Wed, Apr 04, 2012 at 09:42:32PM -0600, Alex Williamson wrote:
> We've hit a kernel host panic, when issuing a 'system_reset' with an
> 82576 nic assigned and a Windows guest. Host system is a PowerEdge R815.
> 
> [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 32993
> [Hardware Error]: APEI generic hardware error status
> [Hardware Error]: severity: 1, fatal
> [Hardware Error]: section: 0, severity: 1, fatal
> [Hardware Error]: flags: 0x01
> [Hardware Error]: primary
> [Hardware Error]: section_type: PCIe error
> [Hardware Error]: port_type: 0, PCIe end point
> [Hardware Error]: version: 1.0
> [Hardware Error]: command: 0x0000, status: 0x0010
> [Hardware Error]: device_id: 0000:08:00.0
> [Hardware Error]: slot: 1
> [Hardware Error]: secondary_bus: 0x00
> [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
> [Hardware Error]: class_code: 000002
> [Hardware Error]: aer_status: 0x00100000, aer_mask: 0x00018000
> [Hardware Error]: Unsupported Request
> [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
> [Hardware Error]: aer_uncor_severity: 0x00067011
> [Hardware Error]: aer_tlp_header: 40001001 0020000f edbf800c 01000000
> [Hardware Error]: section: 1, severity: 1, fatal
> [Hardware Error]: flags: 0x01
> [Hardware Error]: primary
> [Hardware Error]: section_type: PCIe error
> [Hardware Error]: port_type: 0, PCIe end point
> [Hardware Error]: version: 1.0
> [Hardware Error]: command: 0x0000, status: 0x0010
> [Hardware Error]: device_id: 0000:08:00.0
> [Hardware Error]: slot: 1
> [Hardware Error]: secondary_bus: 0x00
> [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
> [Hardware Error]: class_code: 000002
> [Hardware Error]: aer_status: 0x00100000, aer_mask: 0x00018000
> [Hardware Error]: Unsupported Request
> [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
> [Hardware Error]: aer_uncor_severity: 0x00067011
> [Hardware Error]: aer_tlp_header: 40001001 0020000f edbf800c 01000000
> Kernel panic - not syncing: Fatal hardware error!
> Pid: 0, comm: swapper Not tainted 2.6.32-242.el6.x86_64 #1
> Call Trace:
>  <NMI>  [<ffffffff814f2fe5>] ? panic+0xa0/0x168
>  [<ffffffff812f919c>] ? ghes_notify_nmi+0x17c/0x180
>  [<ffffffff814f91d5>] ? notifier_call_chain+0x55/0x80
>  [<ffffffff814f923a>] ? atomic_notifier_call_chain+0x1a/0x20
>  [<ffffffff8109667e>] ? notify_die+0x2e/0x30
>  [<ffffffff814f6e81>] ? do_nmi+0x1a1/0x2b0
>  [<ffffffff814f6760>] ? nmi+0x20/0x30
>  [<ffffffff8103762b>] ? native_safe_halt+0xb/0x10
>  <<EOE>>  [<ffffffff8101495d>] ? default_idle+0x4d/0xb0
>  [<ffffffff81009e06>] ? cpu_idle+0xb6/0x110
>  [<ffffffff814da63a>] ? rest_init+0x7a/0x80
>  [<ffffffff81c1ff7b>] ? start_kernel+0x424/0x430
>  [<ffffffff81c1f33a>] ? x86_64_start_reservations+0x125/0x129
>  [<ffffffff81c1f438>] ? x86_64_start_kernel+0xfa/0x109
> 
> The root cause of the problem is that the 'reset_assigned_device()' code
> first writes a 0 to the command register. Then, when qemu subsequently does
> a kvm_deassign_irq() (called by assign_irq(), in the system_reset path),
> the kernel ends up calling '__msix_mask_irq()', which performs a write to
> the memory mapped msi vector space. Since, we've explicitly told the device
> to disallow mmio access (via the 0 write to the command register), we end
> up with the above 'Unsupported Request'.
> 
> The fix here is to first disable MSI-X, before doing the reset.  We also
> disable MSI, leaving the device in INTx mode.  In this way, the device is
> a known state after reset, and we avoid touching msi memory mapped space
> on any subsequent 'kvm_deassign_irq()'.
> 
> Thanks to Michael S. Tsirkin for help in understanding what was going on
> here and Jason Baron, the original debugger of this problem.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
> 
> Jason is out of the office for a couple weeks, so I'll try to resolve
> this while he's away.  Somehow the emulated config updates were lost
> in Jason's original posting, so I've fixed that and taken Jan's suggestion
> to simply call into the update functions instead of open coding the
> interrupt disable.  I think there still may be some disagreements about
> how to handle guest generated errors in the host, but that's a large
> project whereas this is something we should be doing at reset anyway,
> and even if only a workaround, resolves the problem above.

I don't think there's a disagreement: don't we all agree
they should be forwarded to qemu and on the guest if possible,
ignored if not?

>  hw/device-assignment.c |   23 +++++++++++++++++++++++
>  1 files changed, 23 insertions(+), 0 deletions(-)
> 
> diff --git a/hw/device-assignment.c b/hw/device-assignment.c
> index 89823f1..2e6b93e 100644
> --- a/hw/device-assignment.c
> +++ b/hw/device-assignment.c
> @@ -1613,6 +1613,29 @@ static void reset_assigned_device(DeviceState *dev)
>      const char reset[] = "1";
>      int fd, ret;
>  
> +    /*
> +     * If a guest is reset without being shutdown, MSI/MSI-X can still
> +     * be running.  We want to return the device to a known state on
> +     * reset, so disable those here.

So far so good.

> +     * We especially do not want MSI-X
> +     * enabled since it lives in MMIO space, which is about to get
> +     * disabled.

I think we are better off dropping the above, because it's
a bug that we disable MMIO space on the physical device:
we did not enable it (kvm did) let's not disable.

> +     */
> +    if (adev->irq_requested_type & KVM_DEV_IRQ_GUEST_MSIX) {
> +        uint16_t ctrl = pci_get_word(pci_dev->config +
> +                                     pci_dev->msix_cap + PCI_MSIX_FLAGS);
> +
> +        pci_set_word(pci_dev->config + pci_dev->msix_cap + PCI_MSIX_FLAGS,
> +                     ctrl & ~PCI_MSIX_FLAGS_ENABLE);
> +        assigned_dev_update_msix(pci_dev);
> +    } else if (adev->irq_requested_type & KVM_DEV_IRQ_GUEST_MSI) {
> +        uint8_t ctrl = pci_get_byte(pci_dev->config +
> +                                    pci_dev->msi_cap + PCI_MSI_FLAGS);
> +
> +        pci_set_byte(pci_dev->config + pci_dev->msi_cap + PCI_MSI_FLAGS,
> +                     ctrl & ~PCI_MSI_FLAGS_ENABLE);
> +        assigned_dev_update_msi(pci_dev);
> +    }
> +
>      snprintf(reset_file, sizeof(reset_file),
>               "/sys/bus/pci/devices/%04x:%02x:%02x.%01x/reset",
>               adev->host.seg, adev->host.bus, adev->host.dev, adev->host.func);


I think this patch is a good bugfix. But I also think we must
(separately) do something else as well. Specifically:
we allow guest to access command register. This
is a bad idea since then it can trigger exact same
crash. There are other bits there
that guest has no business triggering.
We can let guest control bus master
and intx mask, the rest should be emulated.




-- 
MST

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
  2012-04-05  9:34 ` Michael S. Tsirkin
@ 2012-04-05 14:42   ` Alex Williamson
  2012-04-05 15:04     ` Michael S. Tsirkin
  0 siblings, 1 reply; 41+ messages in thread
From: Alex Williamson @ 2012-04-05 14:42 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, jan.kiszka, jbaron

On Thu, 2012-04-05 at 12:34 +0300, Michael S. Tsirkin wrote:
> On Wed, Apr 04, 2012 at 09:42:32PM -0600, Alex Williamson wrote:
> > We've hit a kernel host panic, when issuing a 'system_reset' with an
> > 82576 nic assigned and a Windows guest. Host system is a PowerEdge R815.
> > 
> > [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 32993
> > [Hardware Error]: APEI generic hardware error status
> > [Hardware Error]: severity: 1, fatal
> > [Hardware Error]: section: 0, severity: 1, fatal
> > [Hardware Error]: flags: 0x01
> > [Hardware Error]: primary
> > [Hardware Error]: section_type: PCIe error
> > [Hardware Error]: port_type: 0, PCIe end point
> > [Hardware Error]: version: 1.0
> > [Hardware Error]: command: 0x0000, status: 0x0010
> > [Hardware Error]: device_id: 0000:08:00.0
> > [Hardware Error]: slot: 1
> > [Hardware Error]: secondary_bus: 0x00
> > [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
> > [Hardware Error]: class_code: 000002
> > [Hardware Error]: aer_status: 0x00100000, aer_mask: 0x00018000
> > [Hardware Error]: Unsupported Request
> > [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
> > [Hardware Error]: aer_uncor_severity: 0x00067011
> > [Hardware Error]: aer_tlp_header: 40001001 0020000f edbf800c 01000000
> > [Hardware Error]: section: 1, severity: 1, fatal
> > [Hardware Error]: flags: 0x01
> > [Hardware Error]: primary
> > [Hardware Error]: section_type: PCIe error
> > [Hardware Error]: port_type: 0, PCIe end point
> > [Hardware Error]: version: 1.0
> > [Hardware Error]: command: 0x0000, status: 0x0010
> > [Hardware Error]: device_id: 0000:08:00.0
> > [Hardware Error]: slot: 1
> > [Hardware Error]: secondary_bus: 0x00
> > [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
> > [Hardware Error]: class_code: 000002
> > [Hardware Error]: aer_status: 0x00100000, aer_mask: 0x00018000
> > [Hardware Error]: Unsupported Request
> > [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
> > [Hardware Error]: aer_uncor_severity: 0x00067011
> > [Hardware Error]: aer_tlp_header: 40001001 0020000f edbf800c 01000000
> > Kernel panic - not syncing: Fatal hardware error!
> > Pid: 0, comm: swapper Not tainted 2.6.32-242.el6.x86_64 #1
> > Call Trace:
> >  <NMI>  [<ffffffff814f2fe5>] ? panic+0xa0/0x168
> >  [<ffffffff812f919c>] ? ghes_notify_nmi+0x17c/0x180
> >  [<ffffffff814f91d5>] ? notifier_call_chain+0x55/0x80
> >  [<ffffffff814f923a>] ? atomic_notifier_call_chain+0x1a/0x20
> >  [<ffffffff8109667e>] ? notify_die+0x2e/0x30
> >  [<ffffffff814f6e81>] ? do_nmi+0x1a1/0x2b0
> >  [<ffffffff814f6760>] ? nmi+0x20/0x30
> >  [<ffffffff8103762b>] ? native_safe_halt+0xb/0x10
> >  <<EOE>>  [<ffffffff8101495d>] ? default_idle+0x4d/0xb0
> >  [<ffffffff81009e06>] ? cpu_idle+0xb6/0x110
> >  [<ffffffff814da63a>] ? rest_init+0x7a/0x80
> >  [<ffffffff81c1ff7b>] ? start_kernel+0x424/0x430
> >  [<ffffffff81c1f33a>] ? x86_64_start_reservations+0x125/0x129
> >  [<ffffffff81c1f438>] ? x86_64_start_kernel+0xfa/0x109
> > 
> > The root cause of the problem is that the 'reset_assigned_device()' code
> > first writes a 0 to the command register. Then, when qemu subsequently does
> > a kvm_deassign_irq() (called by assign_irq(), in the system_reset path),
> > the kernel ends up calling '__msix_mask_irq()', which performs a write to
> > the memory mapped msi vector space. Since, we've explicitly told the device
> > to disallow mmio access (via the 0 write to the command register), we end
> > up with the above 'Unsupported Request'.
> > 
> > The fix here is to first disable MSI-X, before doing the reset.  We also
> > disable MSI, leaving the device in INTx mode.  In this way, the device is
> > a known state after reset, and we avoid touching msi memory mapped space
> > on any subsequent 'kvm_deassign_irq()'.
> > 
> > Thanks to Michael S. Tsirkin for help in understanding what was going on
> > here and Jason Baron, the original debugger of this problem.
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> > 
> > Jason is out of the office for a couple weeks, so I'll try to resolve
> > this while he's away.  Somehow the emulated config updates were lost
> > in Jason's original posting, so I've fixed that and taken Jan's suggestion
> > to simply call into the update functions instead of open coding the
> > interrupt disable.  I think there still may be some disagreements about
> > how to handle guest generated errors in the host, but that's a large
> > project whereas this is something we should be doing at reset anyway,
> > and even if only a workaround, resolves the problem above.
> 
> I don't think there's a disagreement: don't we all agree
> they should be forwarded to qemu and on the guest if possible,
> ignored if not?

With AER perhaps, but I'm not sure we have that flexibility with APEI,
but I'd need to read up on the spec.  It seems APEI defines that the
BIOS gets first shot at handling the error and gets to dictate how the
OS handles it.  In this case, deciding that it's fatal.

> >  hw/device-assignment.c |   23 +++++++++++++++++++++++
> >  1 files changed, 23 insertions(+), 0 deletions(-)
> > 
> > diff --git a/hw/device-assignment.c b/hw/device-assignment.c
> > index 89823f1..2e6b93e 100644
> > --- a/hw/device-assignment.c
> > +++ b/hw/device-assignment.c
> > @@ -1613,6 +1613,29 @@ static void reset_assigned_device(DeviceState *dev)
> >      const char reset[] = "1";
> >      int fd, ret;
> >  
> > +    /*
> > +     * If a guest is reset without being shutdown, MSI/MSI-X can still
> > +     * be running.  We want to return the device to a known state on
> > +     * reset, so disable those here.
> 
> So far so good.
> 
> > +     * We especially do not want MSI-X
> > +     * enabled since it lives in MMIO space, which is about to get
> > +     * disabled.
> 
> I think we are better off dropping the above, because it's
> a bug that we disable MMIO space on the physical device:
> we did not enable it (kvm did) let's not disable.

I'd like to investigate removing the command register clearing, but it
did solve a bug at the time it went in.  That will be a separate patch,
so we can remove the above sentence when that's removed.  It may be a
bug, but it's describing the current behavior.

> > +     */
> > +    if (adev->irq_requested_type & KVM_DEV_IRQ_GUEST_MSIX) {
> > +        uint16_t ctrl = pci_get_word(pci_dev->config +
> > +                                     pci_dev->msix_cap + PCI_MSIX_FLAGS);
> > +
> > +        pci_set_word(pci_dev->config + pci_dev->msix_cap + PCI_MSIX_FLAGS,
> > +                     ctrl & ~PCI_MSIX_FLAGS_ENABLE);
> > +        assigned_dev_update_msix(pci_dev);
> > +    } else if (adev->irq_requested_type & KVM_DEV_IRQ_GUEST_MSI) {
> > +        uint8_t ctrl = pci_get_byte(pci_dev->config +
> > +                                    pci_dev->msi_cap + PCI_MSI_FLAGS);
> > +
> > +        pci_set_byte(pci_dev->config + pci_dev->msi_cap + PCI_MSI_FLAGS,
> > +                     ctrl & ~PCI_MSI_FLAGS_ENABLE);
> > +        assigned_dev_update_msi(pci_dev);
> > +    }
> > +
> >      snprintf(reset_file, sizeof(reset_file),
> >               "/sys/bus/pci/devices/%04x:%02x:%02x.%01x/reset",
> >               adev->host.seg, adev->host.bus, adev->host.dev, adev->host.func);
> 
> 
> I think this patch is a good bugfix. But I also think we must
> (separately) do something else as well. Specifically:
> we allow guest to access command register. This
> is a bad idea since then it can trigger exact same
> crash. There are other bits there
> that guest has no business triggering.
> We can let guest control bus master
> and intx mask, the rest should be emulated.

Yep, I agree.  An yes, let's handle it in a separate patch.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
  2012-04-05 14:42   ` Alex Williamson
@ 2012-04-05 15:04     ` Michael S. Tsirkin
  0 siblings, 0 replies; 41+ messages in thread
From: Michael S. Tsirkin @ 2012-04-05 15:04 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, jan.kiszka, jbaron

On Thu, Apr 05, 2012 at 08:42:03AM -0600, Alex Williamson wrote:
> > So far so good.
> > 
> > > +     * We especially do not want MSI-X
> > > +     * enabled since it lives in MMIO space, which is about to get
> > > +     * disabled.
> > 
> > I think we are better off dropping the above, because it's
> > a bug that we disable MMIO space on the physical device:
> > we did not enable it (kvm did) let's not disable.
> 
> I'd like to investigate removing the command register clearing, but it
> did solve a bug at the time it went in.  That will be a separate patch,
> so we can remove the above sentence when that's removed.  It may be a
> bug, but it's describing the current behavior.

Fair enough.
Acked-by: Michael S. Tsirkin <mst@redhat.com>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
  2012-04-05  3:42 [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path Alex Williamson
  2012-04-05  7:28 ` Jan Kiszka
  2012-04-05  9:34 ` Michael S. Tsirkin
@ 2012-04-08 13:14 ` Avi Kivity
  2012-04-08 13:17   ` Michael S. Tsirkin
  2012-04-16 14:03 ` Alex Williamson
  2012-04-17  0:55 ` Marcelo Tosatti
  4 siblings, 1 reply; 41+ messages in thread
From: Avi Kivity @ 2012-04-08 13:14 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, jan.kiszka, mst, jbaron

On 04/05/2012 06:42 AM, Alex Williamson wrote:
> We've hit a kernel host panic, when issuing a 'system_reset' with an
> 82576 nic assigned and a Windows guest. Host system is a PowerEdge R815.
>
> [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 32993
> [Hardware Error]: APEI generic hardware error status
> [Hardware Error]: severity: 1, fatal
> [Hardware Error]: section: 0, severity: 1, fatal
> [Hardware Error]: flags: 0x01
> [Hardware Error]: primary
> [Hardware Error]: section_type: PCIe error
> [Hardware Error]: port_type: 0, PCIe end point
> [Hardware Error]: version: 1.0
> [Hardware Error]: command: 0x0000, status: 0x0010
> [Hardware Error]: device_id: 0000:08:00.0
> [Hardware Error]: slot: 1
> [Hardware Error]: secondary_bus: 0x00
> [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
> [Hardware Error]: class_code: 000002
> [Hardware Error]: aer_status: 0x00100000, aer_mask: 0x00018000
> [Hardware Error]: Unsupported Request
> [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
> [Hardware Error]: aer_uncor_severity: 0x00067011
> [Hardware Error]: aer_tlp_header: 40001001 0020000f edbf800c 01000000
> [Hardware Error]: section: 1, severity: 1, fatal
> [Hardware Error]: flags: 0x01
> [Hardware Error]: primary
> [Hardware Error]: section_type: PCIe error
> [Hardware Error]: port_type: 0, PCIe end point
> [Hardware Error]: version: 1.0
> [Hardware Error]: command: 0x0000, status: 0x0010
> [Hardware Error]: device_id: 0000:08:00.0
> [Hardware Error]: slot: 1
> [Hardware Error]: secondary_bus: 0x00
> [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
> [Hardware Error]: class_code: 000002
> [Hardware Error]: aer_status: 0x00100000, aer_mask: 0x00018000
> [Hardware Error]: Unsupported Request
> [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
> [Hardware Error]: aer_uncor_severity: 0x00067011
> [Hardware Error]: aer_tlp_header: 40001001 0020000f edbf800c 01000000
> Kernel panic - not syncing: Fatal hardware error!
> Pid: 0, comm: swapper Not tainted 2.6.32-242.el6.x86_64 #1
> Call Trace:
>  <NMI>  [<ffffffff814f2fe5>] ? panic+0xa0/0x168
>  [<ffffffff812f919c>] ? ghes_notify_nmi+0x17c/0x180
>  [<ffffffff814f91d5>] ? notifier_call_chain+0x55/0x80
>  [<ffffffff814f923a>] ? atomic_notifier_call_chain+0x1a/0x20
>  [<ffffffff8109667e>] ? notify_die+0x2e/0x30
>  [<ffffffff814f6e81>] ? do_nmi+0x1a1/0x2b0
>  [<ffffffff814f6760>] ? nmi+0x20/0x30
>  [<ffffffff8103762b>] ? native_safe_halt+0xb/0x10
>  <<EOE>>  [<ffffffff8101495d>] ? default_idle+0x4d/0xb0
>  [<ffffffff81009e06>] ? cpu_idle+0xb6/0x110
>  [<ffffffff814da63a>] ? rest_init+0x7a/0x80
>  [<ffffffff81c1ff7b>] ? start_kernel+0x424/0x430
>  [<ffffffff81c1f33a>] ? x86_64_start_reservations+0x125/0x129
>  [<ffffffff81c1f438>] ? x86_64_start_kernel+0xfa/0x109
>
> The root cause of the problem is that the 'reset_assigned_device()' code
> first writes a 0 to the command register. Then, when qemu subsequently does
> a kvm_deassign_irq() (called by assign_irq(), in the system_reset path),
> the kernel ends up calling '__msix_mask_irq()', which performs a write to
> the memory mapped msi vector space. Since, we've explicitly told the device
> to disallow mmio access (via the 0 write to the command register), we end
> up with the above 'Unsupported Request'.
>
> The fix here is to first disable MSI-X, before doing the reset.  We also
> disable MSI, leaving the device in INTx mode.  In this way, the device is
> a known state after reset, and we avoid touching msi memory mapped space
> on any subsequent 'kvm_deassign_irq()'.
>
> Thanks to Michael S. Tsirkin for help in understanding what was going on
> here and Jason Baron, the original debugger of this problem.
>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>
> Jason is out of the office for a couple weeks, so I'll try to resolve
> this while he's away.  Somehow the emulated config updates were lost
> in Jason's original posting, so I've fixed that and taken Jan's suggestion
> to simply call into the update functions instead of open coding the
> interrupt disable.  I think there still may be some disagreements about
> how to handle guest generated errors in the host, but that's a large
> project whereas this is something we should be doing at reset anyway,
> and even if only a workaround, resolves the problem above.
>
>  hw/device-assignment.c |   23 +++++++++++++++++++++++
>  1 files changed, 23 insertions(+), 0 deletions(-)
>
> diff --git a/hw/device-assignment.c b/hw/device-assignment.c
> index 89823f1..2e6b93e 100644
> --- a/hw/device-assignment.c
> +++ b/hw/device-assignment.c
> @@ -1613,6 +1613,29 @@ static void reset_assigned_device(DeviceState *dev)
>      const char reset[] = "1";
>      int fd, ret;
>  
> +    /*
> +     * If a guest is reset without being shutdown, MSI/MSI-X can still
> +     * be running.  We want to return the device to a known state on
> +     * reset, so disable those here.  We especially do not want MSI-X
> +     * enabled since it lives in MMIO space, which is about to get
> +     * disabled.
> +     */
> +    if (adev->irq_requested_type & KVM_DEV_IRQ_GUEST_MSIX) {
> +        uint16_t ctrl = pci_get_word(pci_dev->config +
> +                                     pci_dev->msix_cap + PCI_MSIX_FLAGS);
> +
> +        pci_set_word(pci_dev->config + pci_dev->msix_cap + PCI_MSIX_FLAGS,
> +                     ctrl & ~PCI_MSIX_FLAGS_ENABLE);
> +        assigned_dev_update_msix(pci_dev);
> +    } else if (adev->irq_requested_type & KVM_DEV_IRQ_GUEST_MSI) {
> +        uint8_t ctrl = pci_get_byte(pci_dev->config +
> +                                    pci_dev->msi_cap + PCI_MSI_FLAGS);
> +
> +        pci_set_byte(pci_dev->config + pci_dev->msi_cap + PCI_MSI_FLAGS,
> +                     ctrl & ~PCI_MSI_FLAGS_ENABLE);
> +        assigned_dev_update_msi(pci_dev);
> +    }
> +


Don't we FLR the device, which ought to disable MSI on the real device?

So it seems to me the correct approach is to synchronize the emulated
config space from the real config space after FLR.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
  2012-04-08 13:14 ` Avi Kivity
@ 2012-04-08 13:17   ` Michael S. Tsirkin
  2012-04-08 13:18     ` Avi Kivity
  0 siblings, 1 reply; 41+ messages in thread
From: Michael S. Tsirkin @ 2012-04-08 13:17 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Alex Williamson, kvm, jan.kiszka, jbaron

On Sun, Apr 08, 2012 at 04:14:29PM +0300, Avi Kivity wrote:
> On 04/05/2012 06:42 AM, Alex Williamson wrote:
> > We've hit a kernel host panic, when issuing a 'system_reset' with an
> > 82576 nic assigned and a Windows guest. Host system is a PowerEdge R815.
> >
> > [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 32993
> > [Hardware Error]: APEI generic hardware error status
> > [Hardware Error]: severity: 1, fatal
> > [Hardware Error]: section: 0, severity: 1, fatal
> > [Hardware Error]: flags: 0x01
> > [Hardware Error]: primary
> > [Hardware Error]: section_type: PCIe error
> > [Hardware Error]: port_type: 0, PCIe end point
> > [Hardware Error]: version: 1.0
> > [Hardware Error]: command: 0x0000, status: 0x0010
> > [Hardware Error]: device_id: 0000:08:00.0
> > [Hardware Error]: slot: 1
> > [Hardware Error]: secondary_bus: 0x00
> > [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
> > [Hardware Error]: class_code: 000002
> > [Hardware Error]: aer_status: 0x00100000, aer_mask: 0x00018000
> > [Hardware Error]: Unsupported Request
> > [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
> > [Hardware Error]: aer_uncor_severity: 0x00067011
> > [Hardware Error]: aer_tlp_header: 40001001 0020000f edbf800c 01000000
> > [Hardware Error]: section: 1, severity: 1, fatal
> > [Hardware Error]: flags: 0x01
> > [Hardware Error]: primary
> > [Hardware Error]: section_type: PCIe error
> > [Hardware Error]: port_type: 0, PCIe end point
> > [Hardware Error]: version: 1.0
> > [Hardware Error]: command: 0x0000, status: 0x0010
> > [Hardware Error]: device_id: 0000:08:00.0
> > [Hardware Error]: slot: 1
> > [Hardware Error]: secondary_bus: 0x00
> > [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
> > [Hardware Error]: class_code: 000002
> > [Hardware Error]: aer_status: 0x00100000, aer_mask: 0x00018000
> > [Hardware Error]: Unsupported Request
> > [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
> > [Hardware Error]: aer_uncor_severity: 0x00067011
> > [Hardware Error]: aer_tlp_header: 40001001 0020000f edbf800c 01000000
> > Kernel panic - not syncing: Fatal hardware error!
> > Pid: 0, comm: swapper Not tainted 2.6.32-242.el6.x86_64 #1
> > Call Trace:
> >  <NMI>  [<ffffffff814f2fe5>] ? panic+0xa0/0x168
> >  [<ffffffff812f919c>] ? ghes_notify_nmi+0x17c/0x180
> >  [<ffffffff814f91d5>] ? notifier_call_chain+0x55/0x80
> >  [<ffffffff814f923a>] ? atomic_notifier_call_chain+0x1a/0x20
> >  [<ffffffff8109667e>] ? notify_die+0x2e/0x30
> >  [<ffffffff814f6e81>] ? do_nmi+0x1a1/0x2b0
> >  [<ffffffff814f6760>] ? nmi+0x20/0x30
> >  [<ffffffff8103762b>] ? native_safe_halt+0xb/0x10
> >  <<EOE>>  [<ffffffff8101495d>] ? default_idle+0x4d/0xb0
> >  [<ffffffff81009e06>] ? cpu_idle+0xb6/0x110
> >  [<ffffffff814da63a>] ? rest_init+0x7a/0x80
> >  [<ffffffff81c1ff7b>] ? start_kernel+0x424/0x430
> >  [<ffffffff81c1f33a>] ? x86_64_start_reservations+0x125/0x129
> >  [<ffffffff81c1f438>] ? x86_64_start_kernel+0xfa/0x109
> >
> > The root cause of the problem is that the 'reset_assigned_device()' code
> > first writes a 0 to the command register. Then, when qemu subsequently does
> > a kvm_deassign_irq() (called by assign_irq(), in the system_reset path),
> > the kernel ends up calling '__msix_mask_irq()', which performs a write to
> > the memory mapped msi vector space. Since, we've explicitly told the device
> > to disallow mmio access (via the 0 write to the command register), we end
> > up with the above 'Unsupported Request'.
> >
> > The fix here is to first disable MSI-X, before doing the reset.  We also
> > disable MSI, leaving the device in INTx mode.  In this way, the device is
> > a known state after reset, and we avoid touching msi memory mapped space
> > on any subsequent 'kvm_deassign_irq()'.
> >
> > Thanks to Michael S. Tsirkin for help in understanding what was going on
> > here and Jason Baron, the original debugger of this problem.
> >
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> >
> > Jason is out of the office for a couple weeks, so I'll try to resolve
> > this while he's away.  Somehow the emulated config updates were lost
> > in Jason's original posting, so I've fixed that and taken Jan's suggestion
> > to simply call into the update functions instead of open coding the
> > interrupt disable.  I think there still may be some disagreements about
> > how to handle guest generated errors in the host, but that's a large
> > project whereas this is something we should be doing at reset anyway,
> > and even if only a workaround, resolves the problem above.
> >
> >  hw/device-assignment.c |   23 +++++++++++++++++++++++
> >  1 files changed, 23 insertions(+), 0 deletions(-)
> >
> > diff --git a/hw/device-assignment.c b/hw/device-assignment.c
> > index 89823f1..2e6b93e 100644
> > --- a/hw/device-assignment.c
> > +++ b/hw/device-assignment.c
> > @@ -1613,6 +1613,29 @@ static void reset_assigned_device(DeviceState *dev)
> >      const char reset[] = "1";
> >      int fd, ret;
> >  
> > +    /*
> > +     * If a guest is reset without being shutdown, MSI/MSI-X can still
> > +     * be running.  We want to return the device to a known state on
> > +     * reset, so disable those here.  We especially do not want MSI-X
> > +     * enabled since it lives in MMIO space, which is about to get
> > +     * disabled.
> > +     */
> > +    if (adev->irq_requested_type & KVM_DEV_IRQ_GUEST_MSIX) {
> > +        uint16_t ctrl = pci_get_word(pci_dev->config +
> > +                                     pci_dev->msix_cap + PCI_MSIX_FLAGS);
> > +
> > +        pci_set_word(pci_dev->config + pci_dev->msix_cap + PCI_MSIX_FLAGS,
> > +                     ctrl & ~PCI_MSIX_FLAGS_ENABLE);
> > +        assigned_dev_update_msix(pci_dev);
> > +    } else if (adev->irq_requested_type & KVM_DEV_IRQ_GUEST_MSI) {
> > +        uint8_t ctrl = pci_get_byte(pci_dev->config +
> > +                                    pci_dev->msi_cap + PCI_MSI_FLAGS);
> > +
> > +        pci_set_byte(pci_dev->config + pci_dev->msi_cap + PCI_MSI_FLAGS,
> > +                     ctrl & ~PCI_MSI_FLAGS_ENABLE);
> > +        assigned_dev_update_msi(pci_dev);
> > +    }
> > +
> 
> 
> Don't we FLR the device, which ought to disable MSI on the real device?

AFAIK we call pci_reset, which saves device state, does an FLR
and then restores the state. I think this might include msi as well.

> So it seems to me the correct approach is to synchronize the emulated
> config space from the real config space after FLR.
> 
> -- 
> error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
  2012-04-08 13:17   ` Michael S. Tsirkin
@ 2012-04-08 13:18     ` Avi Kivity
  2012-04-08 13:21       ` Michael S. Tsirkin
  0 siblings, 1 reply; 41+ messages in thread
From: Avi Kivity @ 2012-04-08 13:18 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Alex Williamson, kvm, jan.kiszka, jbaron

On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote:
> > 
> > Don't we FLR the device, which ought to disable MSI on the real device?
>
> AFAIK we call pci_reset, which saves device state, does an FLR
> and then restores the state. I think this might include msi as well.
>
>

Then that is wrong as well, no?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
  2012-04-08 13:18     ` Avi Kivity
@ 2012-04-08 13:21       ` Michael S. Tsirkin
  2012-04-08 13:24         ` Avi Kivity
  0 siblings, 1 reply; 41+ messages in thread
From: Michael S. Tsirkin @ 2012-04-08 13:21 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Alex Williamson, kvm, jan.kiszka, jbaron

On Sun, Apr 08, 2012 at 04:18:29PM +0300, Avi Kivity wrote:
> On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote:
> > > 
> > > Don't we FLR the device, which ought to disable MSI on the real device?
> >
> > AFAIK we call pci_reset, which saves device state, does an FLR
> > and then restores the state. I think this might include msi as well.
> >
> >
> 
> Then that is wrong as well, no?

Not as such assuming we disable msi/msix first :)

> -- 
> error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
  2012-04-08 13:21       ` Michael S. Tsirkin
@ 2012-04-08 13:24         ` Avi Kivity
  2012-04-08 13:30           ` Michael S. Tsirkin
  0 siblings, 1 reply; 41+ messages in thread
From: Avi Kivity @ 2012-04-08 13:24 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Alex Williamson, kvm, jan.kiszka, jbaron

On 04/08/2012 04:21 PM, Michael S. Tsirkin wrote:
> On Sun, Apr 08, 2012 at 04:18:29PM +0300, Avi Kivity wrote:
> > On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote:
> > > > 
> > > > Don't we FLR the device, which ought to disable MSI on the real device?
> > >
> > > AFAIK we call pci_reset, which saves device state, does an FLR
> > > and then restores the state. I think this might include msi as well.
> > >
> > >
> > 
> > Then that is wrong as well, no?
>
> Not as such assuming we disable msi/msix first :)

I think we need to fix both, no?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
  2012-04-08 13:24         ` Avi Kivity
@ 2012-04-08 13:30           ` Michael S. Tsirkin
  2012-04-08 13:41             ` Avi Kivity
  0 siblings, 1 reply; 41+ messages in thread
From: Michael S. Tsirkin @ 2012-04-08 13:30 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Alex Williamson, kvm, jan.kiszka, jbaron

On Sun, Apr 08, 2012 at 04:24:11PM +0300, Avi Kivity wrote:
> On 04/08/2012 04:21 PM, Michael S. Tsirkin wrote:
> > On Sun, Apr 08, 2012 at 04:18:29PM +0300, Avi Kivity wrote:
> > > On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote:
> > > > > 
> > > > > Don't we FLR the device, which ought to disable MSI on the real device?
> > > >
> > > > AFAIK we call pci_reset, which saves device state, does an FLR
> > > > and then restores the state. I think this might include msi as well.
> > > >
> > > >
> > > 
> > > Then that is wrong as well, no?
> >
> > Not as such assuming we disable msi/msix first :)
> 
> I think we need to fix both, no?

Isn't this what this patch does?

> -- 
> error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
  2012-04-08 13:30           ` Michael S. Tsirkin
@ 2012-04-08 13:41             ` Avi Kivity
  2012-04-08 13:53               ` Michael S. Tsirkin
  0 siblings, 1 reply; 41+ messages in thread
From: Avi Kivity @ 2012-04-08 13:41 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Alex Williamson, kvm, jan.kiszka, jbaron

On 04/08/2012 04:30 PM, Michael S. Tsirkin wrote:
> On Sun, Apr 08, 2012 at 04:24:11PM +0300, Avi Kivity wrote:
> > On 04/08/2012 04:21 PM, Michael S. Tsirkin wrote:
> > > On Sun, Apr 08, 2012 at 04:18:29PM +0300, Avi Kivity wrote:
> > > > On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote:
> > > > > > 
> > > > > > Don't we FLR the device, which ought to disable MSI on the real device?
> > > > >
> > > > > AFAIK we call pci_reset, which saves device state, does an FLR
> > > > > and then restores the state. I think this might include msi as well.
> > > > >
> > > > >
> > > > 
> > > > Then that is wrong as well, no?
> > >
> > > Not as such assuming we disable msi/msix first :)
> > 
> > I think we need to fix both, no?
>
> Isn't this what this patch does?

If we change pci_reset() (or a variant that we call) to reset MSI, and
update qemu to synchronize from the device after pci_reset(), then we
achieve the same result, in a different way.

Since reset can change other config space registers, we achieve
correctness for more of them.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
  2012-04-08 13:41             ` Avi Kivity
@ 2012-04-08 13:53               ` Michael S. Tsirkin
  2012-04-08 14:01                 ` Avi Kivity
  0 siblings, 1 reply; 41+ messages in thread
From: Michael S. Tsirkin @ 2012-04-08 13:53 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Alex Williamson, kvm, jan.kiszka, jbaron

On Sun, Apr 08, 2012 at 04:41:27PM +0300, Avi Kivity wrote:
> On 04/08/2012 04:30 PM, Michael S. Tsirkin wrote:
> > On Sun, Apr 08, 2012 at 04:24:11PM +0300, Avi Kivity wrote:
> > > On 04/08/2012 04:21 PM, Michael S. Tsirkin wrote:
> > > > On Sun, Apr 08, 2012 at 04:18:29PM +0300, Avi Kivity wrote:
> > > > > On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote:
> > > > > > > 
> > > > > > > Don't we FLR the device, which ought to disable MSI on the real device?
> > > > > >
> > > > > > AFAIK we call pci_reset, which saves device state, does an FLR
> > > > > > and then restores the state. I think this might include msi as well.
> > > > > >
> > > > > >
> > > > > 
> > > > > Then that is wrong as well, no?
> > > >
> > > > Not as such assuming we disable msi/msix first :)
> > > 
> > > I think we need to fix both, no?
> >
> > Isn't this what this patch does?
> 
> If we change pci_reset() (or a variant that we call) to reset MSI, and
> update qemu to synchronize from the device after pci_reset(), then we
> achieve the same result, in a different way.

MSI vectors are set up by kvm in the host. So we should not
abruptly drop that by a sysfs write: would need to
synchronise with kvm. Once we do, there's nothing left
for pci_reset to do.


> Since reset can change other config space registers, we achieve
> correctness for more of them.

Which other registers do you have in mind?

> -- 
> error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
  2012-04-08 13:53               ` Michael S. Tsirkin
@ 2012-04-08 14:01                 ` Avi Kivity
  2012-04-08 14:42                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 41+ messages in thread
From: Avi Kivity @ 2012-04-08 14:01 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Alex Williamson, kvm, jan.kiszka, jbaron

On 04/08/2012 04:53 PM, Michael S. Tsirkin wrote:
> On Sun, Apr 08, 2012 at 04:41:27PM +0300, Avi Kivity wrote:
> > On 04/08/2012 04:30 PM, Michael S. Tsirkin wrote:
> > > On Sun, Apr 08, 2012 at 04:24:11PM +0300, Avi Kivity wrote:
> > > > On 04/08/2012 04:21 PM, Michael S. Tsirkin wrote:
> > > > > On Sun, Apr 08, 2012 at 04:18:29PM +0300, Avi Kivity wrote:
> > > > > > On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote:
> > > > > > > > 
> > > > > > > > Don't we FLR the device, which ought to disable MSI on the real device?
> > > > > > >
> > > > > > > AFAIK we call pci_reset, which saves device state, does an FLR
> > > > > > > and then restores the state. I think this might include msi as well.
> > > > > > >
> > > > > > >
> > > > > > 
> > > > > > Then that is wrong as well, no?
> > > > >
> > > > > Not as such assuming we disable msi/msix first :)
> > > > 
> > > > I think we need to fix both, no?
> > >
> > > Isn't this what this patch does?
> > 
> > If we change pci_reset() (or a variant that we call) to reset MSI, and
> > update qemu to synchronize from the device after pci_reset(), then we
> > achieve the same result, in a different way.
>
> MSI vectors are set up by kvm in the host. So we should not
> abruptly drop that by a sysfs write: would need to
> synchronise with kvm. Once we do, there's nothing left
> for pci_reset to do.

I'm thinking about this flow:

  FLR the device
  for each emulated register
     read it from the hardware
     if different from emulated register:
        update the internal model (for example, disabling MSI in kvm if
needed)
        set emulated register to hardware value

> > Since reset can change other config space registers, we achieve
> > correctness for more of them.
>
> Which other registers do you have in mind?

BARs for example.  We may have our own reset for this, but isn't copying
the hardware values more trustworthy?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
  2012-04-08 14:01                 ` Avi Kivity
@ 2012-04-08 14:42                   ` Michael S. Tsirkin
  2012-04-08 15:26                     ` Avi Kivity
  0 siblings, 1 reply; 41+ messages in thread
From: Michael S. Tsirkin @ 2012-04-08 14:42 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Alex Williamson, kvm, jan.kiszka, jbaron

On Sun, Apr 08, 2012 at 05:01:36PM +0300, Avi Kivity wrote:
> On 04/08/2012 04:53 PM, Michael S. Tsirkin wrote:
> > On Sun, Apr 08, 2012 at 04:41:27PM +0300, Avi Kivity wrote:
> > > On 04/08/2012 04:30 PM, Michael S. Tsirkin wrote:
> > > > On Sun, Apr 08, 2012 at 04:24:11PM +0300, Avi Kivity wrote:
> > > > > On 04/08/2012 04:21 PM, Michael S. Tsirkin wrote:
> > > > > > On Sun, Apr 08, 2012 at 04:18:29PM +0300, Avi Kivity wrote:
> > > > > > > On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote:
> > > > > > > > > 
> > > > > > > > > Don't we FLR the device, which ought to disable MSI on the real device?
> > > > > > > >
> > > > > > > > AFAIK we call pci_reset, which saves device state, does an FLR
> > > > > > > > and then restores the state. I think this might include msi as well.
> > > > > > > >
> > > > > > > >
> > > > > > > 
> > > > > > > Then that is wrong as well, no?
> > > > > >
> > > > > > Not as such assuming we disable msi/msix first :)
> > > > > 
> > > > > I think we need to fix both, no?
> > > >
> > > > Isn't this what this patch does?
> > > 
> > > If we change pci_reset() (or a variant that we call) to reset MSI, and
> > > update qemu to synchronize from the device after pci_reset(), then we
> > > achieve the same result, in a different way.
> >
> > MSI vectors are set up by kvm in the host. So we should not
> > abruptly drop that by a sysfs write: would need to
> > synchronise with kvm. Once we do, there's nothing left
> > for pci_reset to do.
> 
> I'm thinking about this flow:
> 
>   FLR the device
>   for each emulated register
>      read it from the hardware
>      if different from emulated register:
>         update the internal model (for example, disabling MSI in kvm if
> needed)

If we do it this way we get back the problem this patch
is trying to solve: MSIX assigned while device
memory is disabled would cause unsupported request errors.

>         set emulated register to hardware value

Yes, I see what you are trying to say now.
Unfortunately that's not enough: we also
need to restore the registers afterwards for
device to become useful again.
Doing this in kernel seems more robust, otherwise
we risk losing the device if qemu gets killed
before it has restored the registers.


> > > Since reset can change other config space registers, we achieve
> > > correctness for more of them.
> >
> > Which other registers do you have in mind?
> 
> BARs for example.  We may have our own reset for this, but isn't copying
> the hardware values more trustworthy?

BAR values in host and guest are unrelated.
If pci_reset didn't restore BAR values we won't
be able to operate the device.


> -- 
> error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
  2012-04-08 14:42                   ` Michael S. Tsirkin
@ 2012-04-08 15:26                     ` Avi Kivity
  2012-04-08 15:46                       ` Michael S. Tsirkin
  0 siblings, 1 reply; 41+ messages in thread
From: Avi Kivity @ 2012-04-08 15:26 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Alex Williamson, kvm, jan.kiszka, jbaron

On 04/08/2012 05:42 PM, Michael S. Tsirkin wrote:
> On Sun, Apr 08, 2012 at 05:01:36PM +0300, Avi Kivity wrote:
> > On 04/08/2012 04:53 PM, Michael S. Tsirkin wrote:
> > > On Sun, Apr 08, 2012 at 04:41:27PM +0300, Avi Kivity wrote:
> > > > On 04/08/2012 04:30 PM, Michael S. Tsirkin wrote:
> > > > > On Sun, Apr 08, 2012 at 04:24:11PM +0300, Avi Kivity wrote:
> > > > > > On 04/08/2012 04:21 PM, Michael S. Tsirkin wrote:
> > > > > > > On Sun, Apr 08, 2012 at 04:18:29PM +0300, Avi Kivity wrote:
> > > > > > > > On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote:
> > > > > > > > > > 
> > > > > > > > > > Don't we FLR the device, which ought to disable MSI on the real device?
> > > > > > > > >
> > > > > > > > > AFAIK we call pci_reset, which saves device state, does an FLR
> > > > > > > > > and then restores the state. I think this might include msi as well.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > 
> > > > > > > > Then that is wrong as well, no?
> > > > > > >
> > > > > > > Not as such assuming we disable msi/msix first :)
> > > > > > 
> > > > > > I think we need to fix both, no?
> > > > >
> > > > > Isn't this what this patch does?
> > > > 
> > > > If we change pci_reset() (or a variant that we call) to reset MSI, and
> > > > update qemu to synchronize from the device after pci_reset(), then we
> > > > achieve the same result, in a different way.
> > >
> > > MSI vectors are set up by kvm in the host. So we should not
> > > abruptly drop that by a sysfs write: would need to
> > > synchronise with kvm. Once we do, there's nothing left
> > > for pci_reset to do.
> > 
> > I'm thinking about this flow:
> > 
> >   FLR the device
> >   for each emulated register
> >      read it from the hardware
> >      if different from emulated register:
> >         update the internal model (for example, disabling MSI in kvm if
> > needed)
>
> If we do it this way we get back the problem this patch
> is trying to solve: MSIX assigned while device
> memory is disabled would cause unsupported request errors.

Why is that?  FLR would presumably disable MSI in the device, and this
line would disable it in kvm as well.

> >         set emulated register to hardware value
>
> Yes, I see what you are trying to say now.
> Unfortunately that's not enough: we also
> need to restore the registers afterwards for
> device to become useful again.

I guess this is correct for the MSIX BAR.  But is it correct for MSIX
enable/disable?

> Doing this in kernel seems more robust, otherwise
> we risk losing the device if qemu gets killed
> before it has restored the registers.

Doesn't the driver have to enable MSIX if it attaches to the device at
that point, anyway?


> > > > Since reset can change other config space registers, we achieve
> > > > correctness for more of them.
> > >
> > > Which other registers do you have in mind?
> > 
> > BARs for example.  We may have our own reset for this, but isn't copying
> > the hardware values more trustworthy?
>
> BAR values in host and guest are unrelated.
> If pci_reset didn't restore BAR values we won't
> be able to operate the device.
>

Right.

I guess candidates are those that are initialized with
assigned_dev_emulate_config_*()?  Hard to see which ones because they're
mass initialized.



-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
  2012-04-08 15:26                     ` Avi Kivity
@ 2012-04-08 15:46                       ` Michael S. Tsirkin
  2012-04-08 15:50                         ` Avi Kivity
  0 siblings, 1 reply; 41+ messages in thread
From: Michael S. Tsirkin @ 2012-04-08 15:46 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Alex Williamson, kvm, jan.kiszka, jbaron

On Sun, Apr 08, 2012 at 06:26:28PM +0300, Avi Kivity wrote:
> On 04/08/2012 05:42 PM, Michael S. Tsirkin wrote:
> > On Sun, Apr 08, 2012 at 05:01:36PM +0300, Avi Kivity wrote:
> > > On 04/08/2012 04:53 PM, Michael S. Tsirkin wrote:
> > > > On Sun, Apr 08, 2012 at 04:41:27PM +0300, Avi Kivity wrote:
> > > > > On 04/08/2012 04:30 PM, Michael S. Tsirkin wrote:
> > > > > > On Sun, Apr 08, 2012 at 04:24:11PM +0300, Avi Kivity wrote:
> > > > > > > On 04/08/2012 04:21 PM, Michael S. Tsirkin wrote:
> > > > > > > > On Sun, Apr 08, 2012 at 04:18:29PM +0300, Avi Kivity wrote:
> > > > > > > > > On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote:
> > > > > > > > > > > 
> > > > > > > > > > > Don't we FLR the device, which ought to disable MSI on the real device?
> > > > > > > > > >
> > > > > > > > > > AFAIK we call pci_reset, which saves device state, does an FLR
> > > > > > > > > > and then restores the state. I think this might include msi as well.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > 
> > > > > > > > > Then that is wrong as well, no?
> > > > > > > >
> > > > > > > > Not as such assuming we disable msi/msix first :)
> > > > > > > 
> > > > > > > I think we need to fix both, no?
> > > > > >
> > > > > > Isn't this what this patch does?
> > > > > 
> > > > > If we change pci_reset() (or a variant that we call) to reset MSI, and
> > > > > update qemu to synchronize from the device after pci_reset(), then we
> > > > > achieve the same result, in a different way.
> > > >
> > > > MSI vectors are set up by kvm in the host. So we should not
> > > > abruptly drop that by a sysfs write: would need to
> > > > synchronise with kvm. Once we do, there's nothing left
> > > > for pci_reset to do.
> > > 
> > > I'm thinking about this flow:
> > > 
> > >   FLR the device
> > >   for each emulated register
> > >      read it from the hardware
> > >      if different from emulated register:
> > >         update the internal model (for example, disabling MSI in kvm if
> > > needed)
> >
> > If we do it this way we get back the problem this patch
> > is trying to solve: MSIX assigned while device
> > memory is disabled would cause unsupported request errors.
> 
> Why is that?  FLR would presumably disable MSI in the device, and this
> line would disable it in kvm as well.

The bug is that device memory is disabled (FLR would do that)
while MSI is enabled in kvm. The fix is to
disable MSI in kvm first.

> > >         set emulated register to hardware value
> >
> > Yes, I see what you are trying to say now.
> > Unfortunately that's not enough: we also
> > need to restore the registers afterwards for
> > device to become useful again.
> 
> I guess this is correct for the MSIX BAR.  But is it correct for MSIX
> enable/disable?

Probably not.

> > Doing this in kernel seems more robust, otherwise
> > we risk losing the device if qemu gets killed
> > before it has restored the registers.
> 
> Doesn't the driver have to enable MSIX if it attaches to the device at
> that point, anyway?

Yes. I'm talking about things like enabling memory, setting up irq register,
etc though. Most of this setup is done by bios.

> > > > > Since reset can change other config space registers, we achieve
> > > > > correctness for more of them.
> > > >
> > > > Which other registers do you have in mind?
> > > 
> > > BARs for example.  We may have our own reset for this, but isn't copying
> > > the hardware values more trustworthy?
> >
> > BAR values in host and guest are unrelated.
> > If pci_reset didn't restore BAR values we won't
> > be able to operate the device.
> >
> 
> Right.
> 
> I guess candidates are those that are initialized with
> assigned_dev_emulate_config_*()?  Hard to see which ones because they're
> mass initialized.
> 
> 
> 
> -- 
> error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
  2012-04-08 15:46                       ` Michael S. Tsirkin
@ 2012-04-08 15:50                         ` Avi Kivity
  2012-04-08 16:04                           ` Michael S. Tsirkin
  0 siblings, 1 reply; 41+ messages in thread
From: Avi Kivity @ 2012-04-08 15:50 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Alex Williamson, kvm, jan.kiszka, jbaron

On 04/08/2012 06:46 PM, Michael S. Tsirkin wrote:
> > > > 
> > > > I'm thinking about this flow:
> > > > 
> > > >   FLR the device
> > > >   for each emulated register
> > > >      read it from the hardware
> > > >      if different from emulated register:
> > > >         update the internal model (for example, disabling MSI in kvm if
> > > > needed)
> > >
> > > If we do it this way we get back the problem this patch
> > > is trying to solve: MSIX assigned while device
> > > memory is disabled would cause unsupported request errors.
> > 
> > Why is that?  FLR would presumably disable MSI in the device, and this
> > line would disable it in kvm as well.
>
> The bug is that device memory is disabled (FLR would do that)
> while MSI is enabled in kvm. The fix is to
> disable MSI in kvm first.

Yes, no need to repeat.  My question is whether my pseudo-code does the
same and whether or not if it is better (when applied to all emulated
config space).

> > > Doing this in kernel seems more robust, otherwise
> > > we risk losing the device if qemu gets killed
> > > before it has restored the registers.
> > 
> > Doesn't the driver have to enable MSIX if it attaches to the device at
> > that point, anyway?
>
> Yes. I'm talking about things like enabling memory, setting up irq register,
> etc though. Most of this setup is done by bios.

I see.  So should we have a pci_reset_function() variant that limits
itself to restoring just those bits?



-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
  2012-04-08 15:50                         ` Avi Kivity
@ 2012-04-08 16:04                           ` Michael S. Tsirkin
  2012-04-08 16:08                             ` Avi Kivity
  0 siblings, 1 reply; 41+ messages in thread
From: Michael S. Tsirkin @ 2012-04-08 16:04 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Alex Williamson, kvm, jan.kiszka, jbaron

On Sun, Apr 08, 2012 at 06:50:27PM +0300, Avi Kivity wrote:
> On 04/08/2012 06:46 PM, Michael S. Tsirkin wrote:
> > > > > 
> > > > > I'm thinking about this flow:
> > > > > 
> > > > >   FLR the device
> > > > >   for each emulated register
> > > > >      read it from the hardware
> > > > >      if different from emulated register:
> > > > >         update the internal model (for example, disabling MSI in kvm if
> > > > > needed)
> > > >
> > > > If we do it this way we get back the problem this patch
> > > > is trying to solve: MSIX assigned while device
> > > > memory is disabled would cause unsupported request errors.
> > > 
> > > Why is that?  FLR would presumably disable MSI in the device, and this
> > > line would disable it in kvm as well.
> >
> > The bug is that device memory is disabled (FLR would do that)
> > while MSI is enabled in kvm. The fix is to
> > disable MSI in kvm first.
> 
> Yes, no need to repeat.  My question is whether my pseudo-code does the
> same

It doesn't seem to: FLR (disabling memory) is followed
by MSI disable in kvm instead of the reverse.

> and whether or not if it is better (when applied to all emulated
> config space).

I'm not sure.
I would like to see an example of a register that you have
in mind.

> > > > Doing this in kernel seems more robust, otherwise
> > > > we risk losing the device if qemu gets killed
> > > > before it has restored the registers.
> > > 
> > > Doesn't the driver have to enable MSIX if it attaches to the device at
> > > that point, anyway?
> >
> > Yes. I'm talking about things like enabling memory, setting up irq register,
> > etc though. Most of this setup is done by bios.
> 
> I see.  So should we have a pci_reset_function() variant that limits
> itself to restoring just those bits?

We only need kernel to restore whatever qemu emulates, but
kernel doesn't know what that is.
What kind of interface do you have in mind?

> -- 
> error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
  2012-04-08 16:04                           ` Michael S. Tsirkin
@ 2012-04-08 16:08                             ` Avi Kivity
  2012-04-08 17:37                               ` Jan Kiszka
  0 siblings, 1 reply; 41+ messages in thread
From: Avi Kivity @ 2012-04-08 16:08 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Alex Williamson, kvm, jan.kiszka, jbaron

On 04/08/2012 07:04 PM, Michael S. Tsirkin wrote:
> On Sun, Apr 08, 2012 at 06:50:27PM +0300, Avi Kivity wrote:
> > On 04/08/2012 06:46 PM, Michael S. Tsirkin wrote:
> > > > > > 
> > > > > > I'm thinking about this flow:
> > > > > > 
> > > > > >   FLR the device
> > > > > >   for each emulated register
> > > > > >      read it from the hardware
> > > > > >      if different from emulated register:
> > > > > >         update the internal model (for example, disabling MSI in kvm if
> > > > > > needed)
> > > > >
> > > > > If we do it this way we get back the problem this patch
> > > > > is trying to solve: MSIX assigned while device
> > > > > memory is disabled would cause unsupported request errors.
> > > > 
> > > > Why is that?  FLR would presumably disable MSI in the device, and this
> > > > line would disable it in kvm as well.
> > >
> > > The bug is that device memory is disabled (FLR would do that)
> > > while MSI is enabled in kvm. The fix is to
> > > disable MSI in kvm first.
> > 
> > Yes, no need to repeat.  My question is whether my pseudo-code does the
> > same
>
> It doesn't seem to: FLR (disabling memory) is followed
> by MSI disable in kvm instead of the reverse.

Ah, so the problem is the ordering?  I see.

> > and whether or not if it is better (when applied to all emulated
> > config space).
>
> I'm not sure.
> I would like to see an example of a register that you have
> in mind.

I went over the PCI registers and saw none that would be affected.

> > >
> > > Yes. I'm talking about things like enabling memory, setting up irq register,
> > > etc though. Most of this setup is done by bios.
> > 
> > I see.  So should we have a pci_reset_function() variant that limits
> > itself to restoring just those bits?
>
> We only need kernel to restore whatever qemu emulates, but
> kernel doesn't know what that is.
> What kind of interface do you have in mind?
>

The same as pci_reset_function(), but leaves MSI clear.

I guess it's not worth it if the ordering problem is there.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
  2012-04-08 16:08                             ` Avi Kivity
@ 2012-04-08 17:37                               ` Jan Kiszka
  2012-04-08 18:18                                 ` Michael S. Tsirkin
  2012-04-09  8:35                                 ` Avi Kivity
  0 siblings, 2 replies; 41+ messages in thread
From: Jan Kiszka @ 2012-04-08 17:37 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Michael S. Tsirkin, Alex Williamson, kvm, jbaron

[-- Attachment #1: Type: text/plain, Size: 2894 bytes --]

On 2012-04-08 18:08, Avi Kivity wrote:
> On 04/08/2012 07:04 PM, Michael S. Tsirkin wrote:
>> On Sun, Apr 08, 2012 at 06:50:27PM +0300, Avi Kivity wrote:
>>> On 04/08/2012 06:46 PM, Michael S. Tsirkin wrote:
>>>>>>>
>>>>>>> I'm thinking about this flow:
>>>>>>>
>>>>>>>   FLR the device
>>>>>>>   for each emulated register
>>>>>>>      read it from the hardware
>>>>>>>      if different from emulated register:
>>>>>>>         update the internal model (for example, disabling MSI in kvm if
>>>>>>> needed)
>>>>>>
>>>>>> If we do it this way we get back the problem this patch
>>>>>> is trying to solve: MSIX assigned while device
>>>>>> memory is disabled would cause unsupported request errors.
>>>>>
>>>>> Why is that?  FLR would presumably disable MSI in the device, and this
>>>>> line would disable it in kvm as well.
>>>>
>>>> The bug is that device memory is disabled (FLR would do that)
>>>> while MSI is enabled in kvm. The fix is to
>>>> disable MSI in kvm first.
>>>
>>> Yes, no need to repeat.  My question is whether my pseudo-code does the
>>> same
>>
>> It doesn't seem to: FLR (disabling memory) is followed
>> by MSI disable in kvm instead of the reverse.
> 
> Ah, so the problem is the ordering?  I see.
> 
>>> and whether or not if it is better (when applied to all emulated
>>> config space).
>>
>> I'm not sure.
>> I would like to see an example of a register that you have
>> in mind.
> 
> I went over the PCI registers and saw none that would be affected.
> 
>>>>
>>>> Yes. I'm talking about things like enabling memory, setting up irq register,
>>>> etc though. Most of this setup is done by bios.
>>>
>>> I see.  So should we have a pci_reset_function() variant that limits
>>> itself to restoring just those bits?
>>
>> We only need kernel to restore whatever qemu emulates, but
>> kernel doesn't know what that is.
>> What kind of interface do you have in mind?
>>
> 
> The same as pci_reset_function(), but leaves MSI clear.
> 
> I guess it's not worth it if the ordering problem is there.

The core problem is not the ordering. The problem is that the kernel is
susceptible to ordering mistakes of userspace. And that is because the
kernel panics on PCI errors of devices that are in user hands - a
critical kernel bug IMHO. Proper reset of MSI or even the whole PCI
config space is another issue, but one the kernel should not worry about
- still, it should be fixed (therefore this patch).

But even if we disallowed userland to disable MMIO and PIO access to the
device, we would be be able to exclude that there are secrete channels
in the device's interface having the same effect. So we likely need to
enhance PCI error handling to catch and handle faults for certain
devices differently - those we cannot trust to behave properly while
they are under userland/guest control.

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
  2012-04-08 17:37                               ` Jan Kiszka
@ 2012-04-08 18:18                                 ` Michael S. Tsirkin
  2012-04-08 18:39                                   ` Jan Kiszka
  2012-04-09  8:35                                 ` Avi Kivity
  1 sibling, 1 reply; 41+ messages in thread
From: Michael S. Tsirkin @ 2012-04-08 18:18 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Avi Kivity, Alex Williamson, kvm, jbaron

On Sun, Apr 08, 2012 at 07:37:57PM +0200, Jan Kiszka wrote:
> On 2012-04-08 18:08, Avi Kivity wrote:
> > On 04/08/2012 07:04 PM, Michael S. Tsirkin wrote:
> >> On Sun, Apr 08, 2012 at 06:50:27PM +0300, Avi Kivity wrote:
> >>> On 04/08/2012 06:46 PM, Michael S. Tsirkin wrote:
> >>>>>>>
> >>>>>>> I'm thinking about this flow:
> >>>>>>>
> >>>>>>>   FLR the device
> >>>>>>>   for each emulated register
> >>>>>>>      read it from the hardware
> >>>>>>>      if different from emulated register:
> >>>>>>>         update the internal model (for example, disabling MSI in kvm if
> >>>>>>> needed)
> >>>>>>
> >>>>>> If we do it this way we get back the problem this patch
> >>>>>> is trying to solve: MSIX assigned while device
> >>>>>> memory is disabled would cause unsupported request errors.
> >>>>>
> >>>>> Why is that?  FLR would presumably disable MSI in the device, and this
> >>>>> line would disable it in kvm as well.
> >>>>
> >>>> The bug is that device memory is disabled (FLR would do that)
> >>>> while MSI is enabled in kvm. The fix is to
> >>>> disable MSI in kvm first.
> >>>
> >>> Yes, no need to repeat.  My question is whether my pseudo-code does the
> >>> same
> >>
> >> It doesn't seem to: FLR (disabling memory) is followed
> >> by MSI disable in kvm instead of the reverse.
> > 
> > Ah, so the problem is the ordering?  I see.
> > 
> >>> and whether or not if it is better (when applied to all emulated
> >>> config space).
> >>
> >> I'm not sure.
> >> I would like to see an example of a register that you have
> >> in mind.
> > 
> > I went over the PCI registers and saw none that would be affected.
> > 
> >>>>
> >>>> Yes. I'm talking about things like enabling memory, setting up irq register,
> >>>> etc though. Most of this setup is done by bios.
> >>>
> >>> I see.  So should we have a pci_reset_function() variant that limits
> >>> itself to restoring just those bits?
> >>
> >> We only need kernel to restore whatever qemu emulates, but
> >> kernel doesn't know what that is.
> >> What kind of interface do you have in mind?
> >>
> > 
> > The same as pci_reset_function(), but leaves MSI clear.
> > 
> > I guess it's not worth it if the ordering problem is there.
> 
> The core problem is not the ordering. The problem is that the kernel is
> susceptible to ordering mistakes of userspace.
> And that is because the
> kernel panics on PCI errors of devices that are in user hands - a
> critical kernel bug IMHO.

I'm not sure. The pci sysfs interface
is by design not secured against malicious users,
isn't it?

> Proper reset of MSI or even the whole PCI
> config space is another issue, but one the kernel should not worry about
> - still, it should be fixed (therefore this patch).
> But even if we disallowed userland to disable MMIO and PIO access to the
> device, we would be be able to exclude that there are secrete channels
> in the device's interface having the same effect.

I'm not sure I agree here.  If there are secret channels to the device
that let it violate the PCI express spec, it can probably break the SRIOV
security model. And then you can do much more than just crash the host.

> So we likely need to
> enhance PCI error handling to catch and handle faults for certain
> devices differently - those we cannot trust to behave properly while
> they are under userland/guest control.
> 
> Jan
> 


I agree - forwarding errors to the guest would actually be very useful - but
I think we also need to analyse the problem carefully,
and prevent as many ways as we can for guest to cause trouble.

And there is another issue here: unsuppported request errors
should not cause kernel panics IMO.

There's also the issue that qemu let guest control the MMIO/PIO
bits in the command register.

So there are multiple bugs.

-- 
MST

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
  2012-04-08 18:18                                 ` Michael S. Tsirkin
@ 2012-04-08 18:39                                   ` Jan Kiszka
  2012-04-08 20:35                                     ` Michael S. Tsirkin
  0 siblings, 1 reply; 41+ messages in thread
From: Jan Kiszka @ 2012-04-08 18:39 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Avi Kivity, Alex Williamson, kvm, jbaron

[-- Attachment #1: Type: text/plain, Size: 4670 bytes --]

On 2012-04-08 20:18, Michael S. Tsirkin wrote:
> On Sun, Apr 08, 2012 at 07:37:57PM +0200, Jan Kiszka wrote:
>> On 2012-04-08 18:08, Avi Kivity wrote:
>>> On 04/08/2012 07:04 PM, Michael S. Tsirkin wrote:
>>>> On Sun, Apr 08, 2012 at 06:50:27PM +0300, Avi Kivity wrote:
>>>>> On 04/08/2012 06:46 PM, Michael S. Tsirkin wrote:
>>>>>>>>>
>>>>>>>>> I'm thinking about this flow:
>>>>>>>>>
>>>>>>>>>   FLR the device
>>>>>>>>>   for each emulated register
>>>>>>>>>      read it from the hardware
>>>>>>>>>      if different from emulated register:
>>>>>>>>>         update the internal model (for example, disabling MSI in kvm if
>>>>>>>>> needed)
>>>>>>>>
>>>>>>>> If we do it this way we get back the problem this patch
>>>>>>>> is trying to solve: MSIX assigned while device
>>>>>>>> memory is disabled would cause unsupported request errors.
>>>>>>>
>>>>>>> Why is that?  FLR would presumably disable MSI in the device, and this
>>>>>>> line would disable it in kvm as well.
>>>>>>
>>>>>> The bug is that device memory is disabled (FLR would do that)
>>>>>> while MSI is enabled in kvm. The fix is to
>>>>>> disable MSI in kvm first.
>>>>>
>>>>> Yes, no need to repeat.  My question is whether my pseudo-code does the
>>>>> same
>>>>
>>>> It doesn't seem to: FLR (disabling memory) is followed
>>>> by MSI disable in kvm instead of the reverse.
>>>
>>> Ah, so the problem is the ordering?  I see.
>>>
>>>>> and whether or not if it is better (when applied to all emulated
>>>>> config space).
>>>>
>>>> I'm not sure.
>>>> I would like to see an example of a register that you have
>>>> in mind.
>>>
>>> I went over the PCI registers and saw none that would be affected.
>>>
>>>>>>
>>>>>> Yes. I'm talking about things like enabling memory, setting up irq register,
>>>>>> etc though. Most of this setup is done by bios.
>>>>>
>>>>> I see.  So should we have a pci_reset_function() variant that limits
>>>>> itself to restoring just those bits?
>>>>
>>>> We only need kernel to restore whatever qemu emulates, but
>>>> kernel doesn't know what that is.
>>>> What kind of interface do you have in mind?
>>>>
>>>
>>> The same as pci_reset_function(), but leaves MSI clear.
>>>
>>> I guess it's not worth it if the ordering problem is there.
>>
>> The core problem is not the ordering. The problem is that the kernel is
>> susceptible to ordering mistakes of userspace.
>> And that is because the
>> kernel panics on PCI errors of devices that are in user hands - a
>> critical kernel bug IMHO.
> 
> I'm not sure. The pci sysfs interface
> is by design not secured against malicious users,
> isn't it?

That's surely true for devices outside of IOMMU protection. But do we
really have to give up when we encapsulate and isolate them that way?
Provided we moderate access to the sysfs resources via libvirt or some
other management service.

> 
>> Proper reset of MSI or even the whole PCI
>> config space is another issue, but one the kernel should not worry about
>> - still, it should be fixed (therefore this patch).
>> But even if we disallowed userland to disable MMIO and PIO access to the
>> device, we would be be able to exclude that there are secrete channels
>> in the device's interface having the same effect.
> 
> I'm not sure I agree here.  If there are secret channels to the device
> that let it violate the PCI express spec, it can probably break the SRIOV
> security model. And then you can do much more than just crash the host.

Maybe, but there are also other devices. And if a guest reprograms it
(firmware update...) and makes it stop reacting on requests, we may get
the same effect. That would also be some kind of a "secrete channel".

> 
>> So we likely need to
>> enhance PCI error handling to catch and handle faults for certain
>> devices differently - those we cannot trust to behave properly while
>> they are under userland/guest control.
>>
>> Jan
>>
> 
> 
> I agree - forwarding errors to the guest would actually be very useful - but
> I think we also need to analyse the problem carefully,
> and prevent as many ways as we can for guest to cause trouble.

If possible, the protection should target userspace which would
automatically include guests. Only if that is not feasible with
reasonable effort, we have to rely on QEMU to save the host.

> 
> And there is another issue here: unsuppported request errors
> should not cause kernel panics IMO.
> 
> There's also the issue that qemu let guest control the MMIO/PIO
> bits in the command register.
> 
> So there are multiple bugs.
> 

Yep, that's true.

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
  2012-04-08 18:39                                   ` Jan Kiszka
@ 2012-04-08 20:35                                     ` Michael S. Tsirkin
  0 siblings, 0 replies; 41+ messages in thread
From: Michael S. Tsirkin @ 2012-04-08 20:35 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Avi Kivity, Alex Williamson, kvm, jbaron

On Sun, Apr 08, 2012 at 08:39:35PM +0200, Jan Kiszka wrote:
> On 2012-04-08 20:18, Michael S. Tsirkin wrote:
> > On Sun, Apr 08, 2012 at 07:37:57PM +0200, Jan Kiszka wrote:
> >> On 2012-04-08 18:08, Avi Kivity wrote:
> >>> On 04/08/2012 07:04 PM, Michael S. Tsirkin wrote:
> >>>> On Sun, Apr 08, 2012 at 06:50:27PM +0300, Avi Kivity wrote:
> >>>>> On 04/08/2012 06:46 PM, Michael S. Tsirkin wrote:
> >>>>>>>>>
> >>>>>>>>> I'm thinking about this flow:
> >>>>>>>>>
> >>>>>>>>>   FLR the device
> >>>>>>>>>   for each emulated register
> >>>>>>>>>      read it from the hardware
> >>>>>>>>>      if different from emulated register:
> >>>>>>>>>         update the internal model (for example, disabling MSI in kvm if
> >>>>>>>>> needed)
> >>>>>>>>
> >>>>>>>> If we do it this way we get back the problem this patch
> >>>>>>>> is trying to solve: MSIX assigned while device
> >>>>>>>> memory is disabled would cause unsupported request errors.
> >>>>>>>
> >>>>>>> Why is that?  FLR would presumably disable MSI in the device, and this
> >>>>>>> line would disable it in kvm as well.
> >>>>>>
> >>>>>> The bug is that device memory is disabled (FLR would do that)
> >>>>>> while MSI is enabled in kvm. The fix is to
> >>>>>> disable MSI in kvm first.
> >>>>>
> >>>>> Yes, no need to repeat.  My question is whether my pseudo-code does the
> >>>>> same
> >>>>
> >>>> It doesn't seem to: FLR (disabling memory) is followed
> >>>> by MSI disable in kvm instead of the reverse.
> >>>
> >>> Ah, so the problem is the ordering?  I see.
> >>>
> >>>>> and whether or not if it is better (when applied to all emulated
> >>>>> config space).
> >>>>
> >>>> I'm not sure.
> >>>> I would like to see an example of a register that you have
> >>>> in mind.
> >>>
> >>> I went over the PCI registers and saw none that would be affected.
> >>>
> >>>>>>
> >>>>>> Yes. I'm talking about things like enabling memory, setting up irq register,
> >>>>>> etc though. Most of this setup is done by bios.
> >>>>>
> >>>>> I see.  So should we have a pci_reset_function() variant that limits
> >>>>> itself to restoring just those bits?
> >>>>
> >>>> We only need kernel to restore whatever qemu emulates, but
> >>>> kernel doesn't know what that is.
> >>>> What kind of interface do you have in mind?
> >>>>
> >>>
> >>> The same as pci_reset_function(), but leaves MSI clear.
> >>>
> >>> I guess it's not worth it if the ordering problem is there.
> >>
> >> The core problem is not the ordering. The problem is that the kernel is
> >> susceptible to ordering mistakes of userspace.
> >> And that is because the
> >> kernel panics on PCI errors of devices that are in user hands - a
> >> critical kernel bug IMHO.
> > 
> > I'm not sure. The pci sysfs interface
> > is by design not secured against malicious users,
> > isn't it?
> 
> That's surely true for devices outside of IOMMU protection. But do we
> really have to give up when we encapsulate and isolate them that way?
> Provided we moderate access to the sysfs resources via libvirt or some
> other management service.

We don't have to give up but we'd have to build such an
interface: /config attribute is not it.

> > 
> >> Proper reset of MSI or even the whole PCI
> >> config space is another issue, but one the kernel should not worry about
> >> - still, it should be fixed (therefore this patch).
> >> But even if we disallowed userland to disable MMIO and PIO access to the
> >> device, we would be be able to exclude that there are secrete channels
> >> in the device's interface having the same effect.
> > 
> > I'm not sure I agree here.  If there are secret channels to the device
> > that let it violate the PCI express spec, it can probably break the SRIOV
> > security model. And then you can do much more than just crash the host.
> 
> Maybe, but there are also other devices. And if a guest reprograms it
> (firmware update...) and makes it stop reacting on requests, we may get
> the same effect. That would also be some kind of a "secrete channel".

Right. So it looks like SRIOV VF is the only type of device that
is safe to assign to a guest:

Presumably, SRIOV VFs don't let driver program the firmware.
And I think SRIOV VFs don't have MMIO/PIO enable bits either,
and the BAR isn't programmable through the VF...

> > 
> >> So we likely need to
> >> enhance PCI error handling to catch and handle faults for certain
> >> devices differently - those we cannot trust to behave properly while
> >> they are under userland/guest control.
> >>
> >> Jan
> >>
> > 
> > 
> > I agree - forwarding errors to the guest would actually be very useful - but
> > I think we also need to analyse the problem carefully,
> > and prevent as many ways as we can for guest to cause trouble.
> 
> If possible, the protection should target userspace which would
> automatically include guests. Only if that is not feasible with
> reasonable effort, we have to rely on QEMU to save the host.

Defence in depth is best, right?

> > 
> > And there is another issue here: unsuppported request errors
> > should not cause kernel panics IMO.
> > 
> > There's also the issue that qemu let guest control the MMIO/PIO
> > bits in the command register.
> > 
> > So there are multiple bugs.
> > 
> 
> Yep, that's true.
> 
> Jan
> 



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
  2012-04-08 17:37                               ` Jan Kiszka
  2012-04-08 18:18                                 ` Michael S. Tsirkin
@ 2012-04-09  8:35                                 ` Avi Kivity
  2012-04-10 16:55                                   ` Alex Williamson
  1 sibling, 1 reply; 41+ messages in thread
From: Avi Kivity @ 2012-04-09  8:35 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Michael S. Tsirkin, Alex Williamson, kvm, jbaron

On 04/08/2012 08:37 PM, Jan Kiszka wrote:
> The core problem is not the ordering. The problem is that the kernel is
> susceptible to ordering mistakes of userspace. And that is because the
> kernel panics on PCI errors of devices that are in user hands - a
> critical kernel bug IMHO. 

Certainly.  But this userspace patch won't fix it.

> Proper reset of MSI or even the whole PCI
> config space is another issue, but one the kernel should not worry about
> - still, it should be fixed (therefore this patch).

And I was asking what is the right way to do it.  Reset the device and
read back the register values, or do an emulated reset and push down the
register values.

> But even if we disallowed userland to disable MMIO and PIO access to the
> device, we would be be able to exclude that there are secrete channels
> in the device's interface having the same effect. So we likely need to
> enhance PCI error handling to catch and handle faults for certain
> devices differently - those we cannot trust to behave properly while
> they are under userland/guest control.

Why not all of them?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
  2012-04-09  8:35                                 ` Avi Kivity
@ 2012-04-10 16:55                                   ` Alex Williamson
  0 siblings, 0 replies; 41+ messages in thread
From: Alex Williamson @ 2012-04-10 16:55 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Jan Kiszka, Michael S. Tsirkin, kvm, jbaron

On Mon, 2012-04-09 at 11:35 +0300, Avi Kivity wrote:
> On 04/08/2012 08:37 PM, Jan Kiszka wrote:
> > The core problem is not the ordering. The problem is that the kernel is
> > susceptible to ordering mistakes of userspace. And that is because the
> > kernel panics on PCI errors of devices that are in user hands - a
> > critical kernel bug IMHO. 
> 
> Certainly.  But this userspace patch won't fix it.

No, it won't in general and I don't think it makes sense to mangle
pci-sysfs config space support to the nuances of a user space driver.
We really need a userspace driver interface that limits the config space
interactions and provides a channel to support error reporting and
userspace recovery.  This type of thing can be done with VFIO if we
could ever get off the ground and get some consensus around it.  Please
feel free to contribute to that discussion if you ever want to get away
from this clunky device assignment interface we have now.

> > Proper reset of MSI or even the whole PCI
> > config space is another issue, but one the kernel should not worry about
> > - still, it should be fixed (therefore this patch).
> 
> And I was asking what is the right way to do it.  Reset the device and
> read back the register values, or do an emulated reset and push down the
> register values.

Reading back the register values is currently a noop since the kernel
restores the config space to the incoming state after reset.  KVM does
stash away the original config space of the device to be restored prior
to releasing the device.  We could restore to that each time, but that
would mean implementing a device reset ioctl in kvm, and we'd still need
this patch for compatibility and we still have the issues Michael brings
up with the config restore updating things like MSI that we need to then
manually sync with kvm.  I fear suggesting it, but perhaps another way
to achieve this result would be to de-assign and re-assign the device in
reset.

> > But even if we disallowed userland to disable MMIO and PIO access to the
> > device, we would be be able to exclude that there are secrete channels
> > in the device's interface having the same effect. So we likely need to
> > enhance PCI error handling to catch and handle faults for certain
> > devices differently - those we cannot trust to behave properly while
> > they are under userland/guest control.
> 
> Why not all of them?

I think Jan is probably suggesting that we do need user space error
handling for all userland/guest controlled devices, but some classes of
errors on certain devices may be handled automatically by the userspace
interface layer... which we could do with VFIO (well, assuming the APEI
spec let's us nak the bios reporting a fatal error).  So do we want to
invent new solutions for each of these or do we want to move to a new
interface?  Thanks,

Alex


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
  2012-04-05  3:42 [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path Alex Williamson
                   ` (2 preceding siblings ...)
  2012-04-08 13:14 ` Avi Kivity
@ 2012-04-16 14:03 ` Alex Williamson
  2012-04-16 14:31   ` Avi Kivity
  2012-04-16 15:06   ` Michael S. Tsirkin
  2012-04-17  0:55 ` Marcelo Tosatti
  4 siblings, 2 replies; 41+ messages in thread
From: Alex Williamson @ 2012-04-16 14:03 UTC (permalink / raw)
  To: kvm, Avi Kivity, Marcelo Tosatti; +Cc: jan.kiszka, mst, jbaron

The discussion on this patch seems to have fizzled, with no clear short
term solution.  Jan gave a Reviewed-by and Michael an Acked-by for this
patch.  There's work left to do, but I request that we pull in this fix
now.  Thanks,

Alex

On Wed, 2012-04-04 at 21:42 -0600, Alex Williamson wrote:
> We've hit a kernel host panic, when issuing a 'system_reset' with an
> 82576 nic assigned and a Windows guest. Host system is a PowerEdge R815.
> 
> [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 32993
> [Hardware Error]: APEI generic hardware error status
> [Hardware Error]: severity: 1, fatal
> [Hardware Error]: section: 0, severity: 1, fatal
> [Hardware Error]: flags: 0x01
> [Hardware Error]: primary
> [Hardware Error]: section_type: PCIe error
> [Hardware Error]: port_type: 0, PCIe end point
> [Hardware Error]: version: 1.0
> [Hardware Error]: command: 0x0000, status: 0x0010
> [Hardware Error]: device_id: 0000:08:00.0
> [Hardware Error]: slot: 1
> [Hardware Error]: secondary_bus: 0x00
> [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
> [Hardware Error]: class_code: 000002
> [Hardware Error]: aer_status: 0x00100000, aer_mask: 0x00018000
> [Hardware Error]: Unsupported Request
> [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
> [Hardware Error]: aer_uncor_severity: 0x00067011
> [Hardware Error]: aer_tlp_header: 40001001 0020000f edbf800c 01000000
> [Hardware Error]: section: 1, severity: 1, fatal
> [Hardware Error]: flags: 0x01
> [Hardware Error]: primary
> [Hardware Error]: section_type: PCIe error
> [Hardware Error]: port_type: 0, PCIe end point
> [Hardware Error]: version: 1.0
> [Hardware Error]: command: 0x0000, status: 0x0010
> [Hardware Error]: device_id: 0000:08:00.0
> [Hardware Error]: slot: 1
> [Hardware Error]: secondary_bus: 0x00
> [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
> [Hardware Error]: class_code: 000002
> [Hardware Error]: aer_status: 0x00100000, aer_mask: 0x00018000
> [Hardware Error]: Unsupported Request
> [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
> [Hardware Error]: aer_uncor_severity: 0x00067011
> [Hardware Error]: aer_tlp_header: 40001001 0020000f edbf800c 01000000
> Kernel panic - not syncing: Fatal hardware error!
> Pid: 0, comm: swapper Not tainted 2.6.32-242.el6.x86_64 #1
> Call Trace:
>  <NMI>  [<ffffffff814f2fe5>] ? panic+0xa0/0x168
>  [<ffffffff812f919c>] ? ghes_notify_nmi+0x17c/0x180
>  [<ffffffff814f91d5>] ? notifier_call_chain+0x55/0x80
>  [<ffffffff814f923a>] ? atomic_notifier_call_chain+0x1a/0x20
>  [<ffffffff8109667e>] ? notify_die+0x2e/0x30
>  [<ffffffff814f6e81>] ? do_nmi+0x1a1/0x2b0
>  [<ffffffff814f6760>] ? nmi+0x20/0x30
>  [<ffffffff8103762b>] ? native_safe_halt+0xb/0x10
>  <<EOE>>  [<ffffffff8101495d>] ? default_idle+0x4d/0xb0
>  [<ffffffff81009e06>] ? cpu_idle+0xb6/0x110
>  [<ffffffff814da63a>] ? rest_init+0x7a/0x80
>  [<ffffffff81c1ff7b>] ? start_kernel+0x424/0x430
>  [<ffffffff81c1f33a>] ? x86_64_start_reservations+0x125/0x129
>  [<ffffffff81c1f438>] ? x86_64_start_kernel+0xfa/0x109
> 
> The root cause of the problem is that the 'reset_assigned_device()' code
> first writes a 0 to the command register. Then, when qemu subsequently does
> a kvm_deassign_irq() (called by assign_irq(), in the system_reset path),
> the kernel ends up calling '__msix_mask_irq()', which performs a write to
> the memory mapped msi vector space. Since, we've explicitly told the device
> to disallow mmio access (via the 0 write to the command register), we end
> up with the above 'Unsupported Request'.
> 
> The fix here is to first disable MSI-X, before doing the reset.  We also
> disable MSI, leaving the device in INTx mode.  In this way, the device is
> a known state after reset, and we avoid touching msi memory mapped space
> on any subsequent 'kvm_deassign_irq()'.
> 
> Thanks to Michael S. Tsirkin for help in understanding what was going on
> here and Jason Baron, the original debugger of this problem.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
> 
> Jason is out of the office for a couple weeks, so I'll try to resolve
> this while he's away.  Somehow the emulated config updates were lost
> in Jason's original posting, so I've fixed that and taken Jan's suggestion
> to simply call into the update functions instead of open coding the
> interrupt disable.  I think there still may be some disagreements about
> how to handle guest generated errors in the host, but that's a large
> project whereas this is something we should be doing at reset anyway,
> and even if only a workaround, resolves the problem above.
> 
>  hw/device-assignment.c |   23 +++++++++++++++++++++++
>  1 files changed, 23 insertions(+), 0 deletions(-)
> 
> diff --git a/hw/device-assignment.c b/hw/device-assignment.c
> index 89823f1..2e6b93e 100644
> --- a/hw/device-assignment.c
> +++ b/hw/device-assignment.c
> @@ -1613,6 +1613,29 @@ static void reset_assigned_device(DeviceState *dev)
>      const char reset[] = "1";
>      int fd, ret;
>  
> +    /*
> +     * If a guest is reset without being shutdown, MSI/MSI-X can still
> +     * be running.  We want to return the device to a known state on
> +     * reset, so disable those here.  We especially do not want MSI-X
> +     * enabled since it lives in MMIO space, which is about to get
> +     * disabled.
> +     */
> +    if (adev->irq_requested_type & KVM_DEV_IRQ_GUEST_MSIX) {
> +        uint16_t ctrl = pci_get_word(pci_dev->config +
> +                                     pci_dev->msix_cap + PCI_MSIX_FLAGS);
> +
> +        pci_set_word(pci_dev->config + pci_dev->msix_cap + PCI_MSIX_FLAGS,
> +                     ctrl & ~PCI_MSIX_FLAGS_ENABLE);
> +        assigned_dev_update_msix(pci_dev);
> +    } else if (adev->irq_requested_type & KVM_DEV_IRQ_GUEST_MSI) {
> +        uint8_t ctrl = pci_get_byte(pci_dev->config +
> +                                    pci_dev->msi_cap + PCI_MSI_FLAGS);
> +
> +        pci_set_byte(pci_dev->config + pci_dev->msi_cap + PCI_MSI_FLAGS,
> +                     ctrl & ~PCI_MSI_FLAGS_ENABLE);
> +        assigned_dev_update_msi(pci_dev);
> +    }
> +
>      snprintf(reset_file, sizeof(reset_file),
>               "/sys/bus/pci/devices/%04x:%02x:%02x.%01x/reset",
>               adev->host.seg, adev->host.bus, adev->host.dev, adev->host.func);
> 




^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
  2012-04-16 14:03 ` Alex Williamson
@ 2012-04-16 14:31   ` Avi Kivity
  2012-04-16 15:06   ` Michael S. Tsirkin
  1 sibling, 0 replies; 41+ messages in thread
From: Avi Kivity @ 2012-04-16 14:31 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, Marcelo Tosatti, jan.kiszka, mst, jbaron

On 04/16/2012 05:03 PM, Alex Williamson wrote:
> The discussion on this patch seems to have fizzled, with no clear short
> term solution.  Jan gave a Reviewed-by and Michael an Acked-by for this
> patch.  There's work left to do, but I request that we pull in this fix
> now.  Thanks,
>

I agree (but am not committing this week).

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
  2012-04-16 14:03 ` Alex Williamson
  2012-04-16 14:31   ` Avi Kivity
@ 2012-04-16 15:06   ` Michael S. Tsirkin
  2012-04-16 15:10     ` Jan Kiszka
                       ` (2 more replies)
  1 sibling, 3 replies; 41+ messages in thread
From: Michael S. Tsirkin @ 2012-04-16 15:06 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, Avi Kivity, Marcelo Tosatti, jan.kiszka, jbaron

On Mon, Apr 16, 2012 at 08:03:17AM -0600, Alex Williamson wrote:
> The discussion on this patch seems to have fizzled, with no clear short
> term solution.

I think we are in concensus, it's just that there are
multiple bugs still left to fix.

First, we need to prevent guest from touching command
register except for the bus master bit. Something like
the below? Compiled only.

device-assignment: don't touch pci command register

Real command register is under kernel control:
it includes bits for triggering SERR, marking
BARs as invalid and such which are under host
kernel control. Don't touch any except bus master
which is ok to put under guest control.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

---
diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 89823f1..9ebce49 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -501,7 +501,6 @@ static int get_real_device(AssignedDevice *pci_dev, uint16_t r_seg,
     FILE *f;
     unsigned long long start, end, size, flags;
     uint16_t id;
-    struct stat statbuf;
     PCIRegion *rp;
     PCIDevRegions *dev = &pci_dev->real_device;
 
@@ -610,12 +609,8 @@ again:
     pci_dev->dev.config[2] = id & 0xff;
     pci_dev->dev.config[3] = (id & 0xff00) >> 8;
 
-    /* dealing with virtual function device */
-    snprintf(name, sizeof(name), "%sphysfn/", dir);
-    if (!stat(name, &statbuf)) {
-        /* always provide the written value on readout */
-        assigned_dev_emulate_config_read(pci_dev, PCI_COMMAND, 2);
-    }
+    /* Pass bus master writes to device. */
+    pci_dev->emulate_config_write[PCI_COMMAND] &= ~PCI_COMMAND_MASTER;
 
     dev->region_number = r;
     return 0;
@@ -782,14 +777,6 @@ static int assign_device(AssignedDevice *dev)
                 "cause host memory corruption if the device issues DMA write "
                 "requests!\n");
     }
-    if (dev->features & ASSIGNED_DEVICE_SHARE_INTX_MASK &&
-        kvm_has_intx_set_mask()) {
-        assigned_dev_data.flags |= KVM_DEV_ASSIGN_PCI_2_3;
-
-        /* hide host-side INTx masking from the guest */
-        dev->emulate_config_read[PCI_COMMAND + 1] |=
-            PCI_COMMAND_INTX_DISABLE >> 8;
-    }
 
     r = kvm_assign_pci_device(kvm_state, &assigned_dev_data);
     if (r < 0) {
@@ -1631,10 +1618,10 @@ static void reset_assigned_device(DeviceState *dev)
     }
 
     /*
-     * When a 0 is written to the command register, the device is logically
+     * When a 0 is written to the bus master register, the device is logically
      * disconnected from the PCI bus. This avoids further DMA transfers.
      */
-    assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 2);
+    assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 1);
 }
 
 static int assigned_initfn(struct PCIDevice *pci_dev)
@@ -1658,7 +1645,6 @@ static int assigned_initfn(struct PCIDevice *pci_dev)
      * device initialization.
      */
     assigned_dev_emulate_config_read(dev, 0, PCI_CONFIG_SPACE_SIZE);
-    assigned_dev_direct_config_read(dev, PCI_COMMAND, 2);
     assigned_dev_direct_config_read(dev, PCI_STATUS, 2);
     assigned_dev_direct_config_read(dev, PCI_REVISION_ID, 1);
     assigned_dev_direct_config_read(dev, PCI_CLASS_PROG, 3);

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
  2012-04-16 15:06   ` Michael S. Tsirkin
@ 2012-04-16 15:10     ` Jan Kiszka
  2012-04-16 16:08       ` Michael S. Tsirkin
  2012-04-16 16:12     ` Jason Baron
  2012-04-16 19:07     ` Alex Williamson
  2 siblings, 1 reply; 41+ messages in thread
From: Jan Kiszka @ 2012-04-16 15:10 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alex Williamson, kvm, Avi Kivity, Marcelo Tosatti, jbaron

On 2012-04-16 17:06, Michael S. Tsirkin wrote:
> On Mon, Apr 16, 2012 at 08:03:17AM -0600, Alex Williamson wrote:
>> The discussion on this patch seems to have fizzled, with no clear short
>> term solution.
> 
> I think we are in concensus, it's just that there are
> multiple bugs still left to fix.
> 
> First, we need to prevent guest from touching command
> register except for the bus master bit. Something like

INTx is harmless as well. Denying access breaks masking for the guest.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
  2012-04-16 15:10     ` Jan Kiszka
@ 2012-04-16 16:08       ` Michael S. Tsirkin
  2012-04-16 16:13         ` Jan Kiszka
  0 siblings, 1 reply; 41+ messages in thread
From: Michael S. Tsirkin @ 2012-04-16 16:08 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Alex Williamson, kvm, Avi Kivity, Marcelo Tosatti, jbaron

On Mon, Apr 16, 2012 at 05:10:07PM +0200, Jan Kiszka wrote:
> On 2012-04-16 17:06, Michael S. Tsirkin wrote:
> > On Mon, Apr 16, 2012 at 08:03:17AM -0600, Alex Williamson wrote:
> >> The discussion on this patch seems to have fizzled, with no clear short
> >> term solution.
> > 
> > I think we are in concensus, it's just that there are
> > multiple bugs still left to fix.
> > 
> > First, we need to prevent guest from touching command
> > register except for the bus master bit. Something like
> 
> INTx is harmless as well. Denying access breaks masking for the guest.

We currently mark intx as emulated, no?
Anyway - care fixing it properly?

> Jan
> 
> -- 
> Siemens AG, Corporate Technology, CT T DE IT 1
> Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
  2012-04-16 15:06   ` Michael S. Tsirkin
  2012-04-16 15:10     ` Jan Kiszka
@ 2012-04-16 16:12     ` Jason Baron
  2012-04-16 16:34       ` Michael S. Tsirkin
  2012-04-16 19:07     ` Alex Williamson
  2 siblings, 1 reply; 41+ messages in thread
From: Jason Baron @ 2012-04-16 16:12 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alex Williamson, kvm, Avi Kivity, Marcelo Tosatti, jan.kiszka

On Mon, Apr 16, 2012 at 06:06:40PM +0300, Michael S. Tsirkin wrote:
> On Mon, Apr 16, 2012 at 08:03:17AM -0600, Alex Williamson wrote:
> > The discussion on this patch seems to have fizzled, with no clear short
> > term solution.
> 
> I think we are in concensus, it's just that there are
> multiple bugs still left to fix.
> 
> First, we need to prevent guest from touching command
> register except for the bus master bit. Something like
> the below? Compiled only.
> 
> device-assignment: don't touch pci command register
> 
> Real command register is under kernel control:
> it includes bits for triggering SERR, marking
> BARs as invalid and such which are under host
> kernel control. Don't touch any except bus master
> which is ok to put under guest control.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> 
> ---
> diff --git a/hw/device-assignment.c b/hw/device-assignment.c
> index 89823f1..9ebce49 100644
> --- a/hw/device-assignment.c
> +++ b/hw/device-assignment.c
> @@ -501,7 +501,6 @@ static int get_real_device(AssignedDevice *pci_dev, uint16_t r_seg,
>      FILE *f;
>      unsigned long long start, end, size, flags;
>      uint16_t id;
> -    struct stat statbuf;
>      PCIRegion *rp;
>      PCIDevRegions *dev = &pci_dev->real_device;
>  
> @@ -610,12 +609,8 @@ again:
>      pci_dev->dev.config[2] = id & 0xff;
>      pci_dev->dev.config[3] = (id & 0xff00) >> 8;
>  
> -    /* dealing with virtual function device */
> -    snprintf(name, sizeof(name), "%sphysfn/", dir);
> -    if (!stat(name, &statbuf)) {
> -        /* always provide the written value on readout */
> -        assigned_dev_emulate_config_read(pci_dev, PCI_COMMAND, 2);
> -    }
> +    /* Pass bus master writes to device. */
> +    pci_dev->emulate_config_write[PCI_COMMAND] &= ~PCI_COMMAND_MASTER;
>  
>      dev->region_number = r;
>      return 0;
> @@ -782,14 +777,6 @@ static int assign_device(AssignedDevice *dev)
>                  "cause host memory corruption if the device issues DMA write "
>                  "requests!\n");
>      }
> -    if (dev->features & ASSIGNED_DEVICE_SHARE_INTX_MASK &&
> -        kvm_has_intx_set_mask()) {
> -        assigned_dev_data.flags |= KVM_DEV_ASSIGN_PCI_2_3;
> -
> -        /* hide host-side INTx masking from the guest */
> -        dev->emulate_config_read[PCI_COMMAND + 1] |=
> -            PCI_COMMAND_INTX_DISABLE >> 8;
> -    }
>  
>      r = kvm_assign_pci_device(kvm_state, &assigned_dev_data);
>      if (r < 0) {
> @@ -1631,10 +1618,10 @@ static void reset_assigned_device(DeviceState *dev)
>      }
>  
>      /*
> -     * When a 0 is written to the command register, the device is logically
> +     * When a 0 is written to the bus master register, the device is logically
>       * disconnected from the PCI bus. This avoids further DMA transfers.
>       */
> -    assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 2);
> +    assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 1);
>  }

This is still going to disable mmio, is the intent to just clear the bus
master bit, ie bit 2? Or is this patch meant to be in addition to the
one Alex posted?

Thanks,

-Jason

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
  2012-04-16 16:08       ` Michael S. Tsirkin
@ 2012-04-16 16:13         ` Jan Kiszka
  2012-04-16 16:36           ` Michael S. Tsirkin
  0 siblings, 1 reply; 41+ messages in thread
From: Jan Kiszka @ 2012-04-16 16:13 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alex Williamson, kvm, Avi Kivity, Marcelo Tosatti, jbaron

On 2012-04-16 18:08, Michael S. Tsirkin wrote:
> On Mon, Apr 16, 2012 at 05:10:07PM +0200, Jan Kiszka wrote:
>> On 2012-04-16 17:06, Michael S. Tsirkin wrote:
>>> On Mon, Apr 16, 2012 at 08:03:17AM -0600, Alex Williamson wrote:
>>>> The discussion on this patch seems to have fizzled, with no clear short
>>>> term solution.
>>>
>>> I think we are in concensus, it's just that there are
>>> multiple bugs still left to fix.
>>>
>>> First, we need to prevent guest from touching command
>>> register except for the bus master bit. Something like
>>
>> INTx is harmless as well. Denying access breaks masking for the guest.
> 
> We currently mark intx as emulated, no?

If we use the mask also in the kernel for IRQ sharing, we *monitor* it.

> Anyway - care fixing it properly?

There is nothing to fix.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
  2012-04-16 16:12     ` Jason Baron
@ 2012-04-16 16:34       ` Michael S. Tsirkin
  0 siblings, 0 replies; 41+ messages in thread
From: Michael S. Tsirkin @ 2012-04-16 16:34 UTC (permalink / raw)
  To: Jason Baron; +Cc: Alex Williamson, kvm, Avi Kivity, Marcelo Tosatti, jan.kiszka

On Mon, Apr 16, 2012 at 12:12:52PM -0400, Jason Baron wrote:
> On Mon, Apr 16, 2012 at 06:06:40PM +0300, Michael S. Tsirkin wrote:
> > On Mon, Apr 16, 2012 at 08:03:17AM -0600, Alex Williamson wrote:
> > > The discussion on this patch seems to have fizzled, with no clear short
> > > term solution.
> > 
> > I think we are in concensus, it's just that there are
> > multiple bugs still left to fix.
> > 
> > First, we need to prevent guest from touching command
> > register except for the bus master bit. Something like
> > the below? Compiled only.
> > 
> > device-assignment: don't touch pci command register
> > 
> > Real command register is under kernel control:
> > it includes bits for triggering SERR, marking
> > BARs as invalid and such which are under host
> > kernel control. Don't touch any except bus master
> > which is ok to put under guest control.
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > 
> > ---
> > diff --git a/hw/device-assignment.c b/hw/device-assignment.c
> > index 89823f1..9ebce49 100644
> > --- a/hw/device-assignment.c
> > +++ b/hw/device-assignment.c
> > @@ -501,7 +501,6 @@ static int get_real_device(AssignedDevice *pci_dev, uint16_t r_seg,
> >      FILE *f;
> >      unsigned long long start, end, size, flags;
> >      uint16_t id;
> > -    struct stat statbuf;
> >      PCIRegion *rp;
> >      PCIDevRegions *dev = &pci_dev->real_device;
> >  
> > @@ -610,12 +609,8 @@ again:
> >      pci_dev->dev.config[2] = id & 0xff;
> >      pci_dev->dev.config[3] = (id & 0xff00) >> 8;
> >  
> > -    /* dealing with virtual function device */
> > -    snprintf(name, sizeof(name), "%sphysfn/", dir);
> > -    if (!stat(name, &statbuf)) {
> > -        /* always provide the written value on readout */
> > -        assigned_dev_emulate_config_read(pci_dev, PCI_COMMAND, 2);
> > -    }
> > +    /* Pass bus master writes to device. */
> > +    pci_dev->emulate_config_write[PCI_COMMAND] &= ~PCI_COMMAND_MASTER;
> >  
> >      dev->region_number = r;
> >      return 0;
> > @@ -782,14 +777,6 @@ static int assign_device(AssignedDevice *dev)
> >                  "cause host memory corruption if the device issues DMA write "
> >                  "requests!\n");
> >      }
> > -    if (dev->features & ASSIGNED_DEVICE_SHARE_INTX_MASK &&
> > -        kvm_has_intx_set_mask()) {
> > -        assigned_dev_data.flags |= KVM_DEV_ASSIGN_PCI_2_3;
> > -
> > -        /* hide host-side INTx masking from the guest */
> > -        dev->emulate_config_read[PCI_COMMAND + 1] |=
> > -            PCI_COMMAND_INTX_DISABLE >> 8;
> > -    }
> >  
> >      r = kvm_assign_pci_device(kvm_state, &assigned_dev_data);
> >      if (r < 0) {
> > @@ -1631,10 +1618,10 @@ static void reset_assigned_device(DeviceState *dev)
> >      }
> >  
> >      /*
> > -     * When a 0 is written to the command register, the device is logically
> > +     * When a 0 is written to the bus master register, the device is logically
> >       * disconnected from the PCI bus. This avoids further DMA transfers.
> >       */
> > -    assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 2);
> > +    assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 1);
> >  }
> 
> This is still going to disable mmio,

I think it won't since all other bits are marked as emulated now.

> is the intent to just clear the bus
> master bit, ie bit 2? Or is this patch meant to be in addition to the
> one Alex posted?
> 
> Thanks,
> 
> -Jason

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
  2012-04-16 16:13         ` Jan Kiszka
@ 2012-04-16 16:36           ` Michael S. Tsirkin
  2012-04-16 16:38             ` Jan Kiszka
  0 siblings, 1 reply; 41+ messages in thread
From: Michael S. Tsirkin @ 2012-04-16 16:36 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Alex Williamson, kvm, Avi Kivity, Marcelo Tosatti, jbaron

On Mon, Apr 16, 2012 at 06:13:03PM +0200, Jan Kiszka wrote:
> On 2012-04-16 18:08, Michael S. Tsirkin wrote:
> > On Mon, Apr 16, 2012 at 05:10:07PM +0200, Jan Kiszka wrote:
> >> On 2012-04-16 17:06, Michael S. Tsirkin wrote:
> >>> On Mon, Apr 16, 2012 at 08:03:17AM -0600, Alex Williamson wrote:
> >>>> The discussion on this patch seems to have fizzled, with no clear short
> >>>> term solution.
> >>>
> >>> I think we are in concensus, it's just that there are
> >>> multiple bugs still left to fix.
> >>>
> >>> First, we need to prevent guest from touching command
> >>> register except for the bus master bit. Something like
> >>
> >> INTx is harmless as well. Denying access breaks masking for the guest.
> > 
> > We currently mark intx as emulated, no?
> 
> If we use the mask also in the kernel for IRQ sharing, we *monitor* it.
> 
> > Anyway - care fixing it properly?
> 
> There is nothing to fix.
> 
> Jan

Guest access  to command register is a bug.
E.g. it has no business enabling SERR as SERR crashes
host when triggered.

> -- 
> Siemens AG, Corporate Technology, CT T DE IT 1
> Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
  2012-04-16 16:36           ` Michael S. Tsirkin
@ 2012-04-16 16:38             ` Jan Kiszka
  2012-04-16 17:12               ` Michael S. Tsirkin
  0 siblings, 1 reply; 41+ messages in thread
From: Jan Kiszka @ 2012-04-16 16:38 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alex Williamson, kvm, Avi Kivity, Marcelo Tosatti, jbaron

On 2012-04-16 18:36, Michael S. Tsirkin wrote:
> On Mon, Apr 16, 2012 at 06:13:03PM +0200, Jan Kiszka wrote:
>> On 2012-04-16 18:08, Michael S. Tsirkin wrote:
>>> On Mon, Apr 16, 2012 at 05:10:07PM +0200, Jan Kiszka wrote:
>>>> On 2012-04-16 17:06, Michael S. Tsirkin wrote:
>>>>> On Mon, Apr 16, 2012 at 08:03:17AM -0600, Alex Williamson wrote:
>>>>>> The discussion on this patch seems to have fizzled, with no clear short
>>>>>> term solution.
>>>>>
>>>>> I think we are in concensus, it's just that there are
>>>>> multiple bugs still left to fix.
>>>>>
>>>>> First, we need to prevent guest from touching command
>>>>> register except for the bus master bit. Something like
>>>>
>>>> INTx is harmless as well. Denying access breaks masking for the guest.
>>>
>>> We currently mark intx as emulated, no?
>>
>> If we use the mask also in the kernel for IRQ sharing, we *monitor* it.
>>
>>> Anyway - care fixing it properly?
>>
>> There is nothing to fix.
>>
>> Jan
> 
> Guest access  to command register is a bug.
> E.g. it has no business enabling SERR as SERR crashes
> host when triggered.

Maybe a misunderstanding. I agree that other bits are not guest
business. But INTx should be excluded from this filtering and handled as
is, i.e. passed through.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
  2012-04-16 16:38             ` Jan Kiszka
@ 2012-04-16 17:12               ` Michael S. Tsirkin
  2012-04-16 18:47                 ` Jan Kiszka
  0 siblings, 1 reply; 41+ messages in thread
From: Michael S. Tsirkin @ 2012-04-16 17:12 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Alex Williamson, kvm, Avi Kivity, Marcelo Tosatti, jbaron

On Mon, Apr 16, 2012 at 06:38:22PM +0200, Jan Kiszka wrote:
> On 2012-04-16 18:36, Michael S. Tsirkin wrote:
> > On Mon, Apr 16, 2012 at 06:13:03PM +0200, Jan Kiszka wrote:
> >> On 2012-04-16 18:08, Michael S. Tsirkin wrote:
> >>> On Mon, Apr 16, 2012 at 05:10:07PM +0200, Jan Kiszka wrote:
> >>>> On 2012-04-16 17:06, Michael S. Tsirkin wrote:
> >>>>> On Mon, Apr 16, 2012 at 08:03:17AM -0600, Alex Williamson wrote:
> >>>>>> The discussion on this patch seems to have fizzled, with no clear short
> >>>>>> term solution.
> >>>>>
> >>>>> I think we are in concensus, it's just that there are
> >>>>> multiple bugs still left to fix.
> >>>>>
> >>>>> First, we need to prevent guest from touching command
> >>>>> register except for the bus master bit. Something like
> >>>>
> >>>> INTx is harmless as well. Denying access breaks masking for the guest.
> >>>
> >>> We currently mark intx as emulated, no?
> >>
> >> If we use the mask also in the kernel for IRQ sharing, we *monitor* it.
> >>
> >>> Anyway - care fixing it properly?
> >>
> >> There is nothing to fix.
> >>
> >> Jan
> > 
> > Guest access  to command register is a bug.
> > E.g. it has no business enabling SERR as SERR crashes
> > host when triggered.
> 
> Maybe a misunderstanding. I agree that other bits are not guest
> business. But INTx should be excluded from this filtering and handled as
> is, i.e. passed through.
> 
> Jan

Ah, I see what you mean.
So let's add
    pci_dev->emulate_config_write[PCI_COMMAND + 1] &= ~(PCI_COMMAND_INTX_DISABLE >> 8);
on top?


> -- 
> Siemens AG, Corporate Technology, CT T DE IT 1
> Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
  2012-04-16 17:12               ` Michael S. Tsirkin
@ 2012-04-16 18:47                 ` Jan Kiszka
  0 siblings, 0 replies; 41+ messages in thread
From: Jan Kiszka @ 2012-04-16 18:47 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alex Williamson, kvm, Avi Kivity, Marcelo Tosatti, jbaron

On 2012-04-16 19:12, Michael S. Tsirkin wrote:
> On Mon, Apr 16, 2012 at 06:38:22PM +0200, Jan Kiszka wrote:
>> On 2012-04-16 18:36, Michael S. Tsirkin wrote:
>>> On Mon, Apr 16, 2012 at 06:13:03PM +0200, Jan Kiszka wrote:
>>>> On 2012-04-16 18:08, Michael S. Tsirkin wrote:
>>>>> On Mon, Apr 16, 2012 at 05:10:07PM +0200, Jan Kiszka wrote:
>>>>>> On 2012-04-16 17:06, Michael S. Tsirkin wrote:
>>>>>>> On Mon, Apr 16, 2012 at 08:03:17AM -0600, Alex Williamson wrote:
>>>>>>>> The discussion on this patch seems to have fizzled, with no clear short
>>>>>>>> term solution.
>>>>>>>
>>>>>>> I think we are in concensus, it's just that there are
>>>>>>> multiple bugs still left to fix.
>>>>>>>
>>>>>>> First, we need to prevent guest from touching command
>>>>>>> register except for the bus master bit. Something like
>>>>>>
>>>>>> INTx is harmless as well. Denying access breaks masking for the guest.
>>>>>
>>>>> We currently mark intx as emulated, no?
>>>>
>>>> If we use the mask also in the kernel for IRQ sharing, we *monitor* it.
>>>>
>>>>> Anyway - care fixing it properly?
>>>>
>>>> There is nothing to fix.
>>>>
>>>> Jan
>>>
>>> Guest access  to command register is a bug.
>>> E.g. it has no business enabling SERR as SERR crashes
>>> host when triggered.
>>
>> Maybe a misunderstanding. I agree that other bits are not guest
>> business. But INTx should be excluded from this filtering and handled as
>> is, i.e. passed through.
>>
>> Jan
> 
> Ah, I see what you mean.
> So let's add
>     pci_dev->emulate_config_write[PCI_COMMAND + 1] &= ~(PCI_COMMAND_INTX_DISABLE >> 8);
> on top?

Looks good.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
  2012-04-16 15:06   ` Michael S. Tsirkin
  2012-04-16 15:10     ` Jan Kiszka
  2012-04-16 16:12     ` Jason Baron
@ 2012-04-16 19:07     ` Alex Williamson
  2012-04-16 19:47       ` Michael S. Tsirkin
  2 siblings, 1 reply; 41+ messages in thread
From: Alex Williamson @ 2012-04-16 19:07 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, Avi Kivity, Marcelo Tosatti, jan.kiszka, jbaron

On Mon, 2012-04-16 at 18:06 +0300, Michael S. Tsirkin wrote:
> On Mon, Apr 16, 2012 at 08:03:17AM -0600, Alex Williamson wrote:
> > The discussion on this patch seems to have fizzled, with no clear short
> > term solution.
> 
> I think we are in concensus, it's just that there are
> multiple bugs still left to fix.
> 
> First, we need to prevent guest from touching command
> register except for the bus master bit. Something like
> the below? Compiled only.
> 
> device-assignment: don't touch pci command register
> 
> Real command register is under kernel control:
> it includes bits for triggering SERR, marking
> BARs as invalid and such which are under host
> kernel control. Don't touch any except bus master
> which is ok to put under guest control.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> 
> ---
> diff --git a/hw/device-assignment.c b/hw/device-assignment.c
> index 89823f1..9ebce49 100644
> --- a/hw/device-assignment.c
> +++ b/hw/device-assignment.c
> @@ -501,7 +501,6 @@ static int get_real_device(AssignedDevice *pci_dev, uint16_t r_seg,
>      FILE *f;
>      unsigned long long start, end, size, flags;
>      uint16_t id;
> -    struct stat statbuf;
>      PCIRegion *rp;
>      PCIDevRegions *dev = &pci_dev->real_device;
>  
> @@ -610,12 +609,8 @@ again:
>      pci_dev->dev.config[2] = id & 0xff;
>      pci_dev->dev.config[3] = (id & 0xff00) >> 8;
>  
> -    /* dealing with virtual function device */
> -    snprintf(name, sizeof(name), "%sphysfn/", dir);
> -    if (!stat(name, &statbuf)) {
> -        /* always provide the written value on readout */
> -        assigned_dev_emulate_config_read(pci_dev, PCI_COMMAND, 2);
> -    }
> +    /* Pass bus master writes to device. */
> +    pci_dev->emulate_config_write[PCI_COMMAND] &= ~PCI_COMMAND_MASTER;
>  
>      dev->region_number = r;
>      return 0;
> @@ -782,14 +777,6 @@ static int assign_device(AssignedDevice *dev)
>                  "cause host memory corruption if the device issues DMA write "
>                  "requests!\n");
>      }
> -    if (dev->features & ASSIGNED_DEVICE_SHARE_INTX_MASK &&
> -        kvm_has_intx_set_mask()) {
> -        assigned_dev_data.flags |= KVM_DEV_ASSIGN_PCI_2_3;

This doesn't look right ^^^, we'll never make use of host side INTx
disable support that way.  Thanks,

Alex

> -
> -        /* hide host-side INTx masking from the guest */
> -        dev->emulate_config_read[PCI_COMMAND + 1] |=
> -            PCI_COMMAND_INTX_DISABLE >> 8;
> -    }
>  
>      r = kvm_assign_pci_device(kvm_state, &assigned_dev_data);
>      if (r < 0) {
> @@ -1631,10 +1618,10 @@ static void reset_assigned_device(DeviceState *dev)
>      }
>  
>      /*
> -     * When a 0 is written to the command register, the device is logically
> +     * When a 0 is written to the bus master register, the device is logically
>       * disconnected from the PCI bus. This avoids further DMA transfers.
>       */
> -    assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 2);
> +    assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 1);
>  }
>  
>  static int assigned_initfn(struct PCIDevice *pci_dev)
> @@ -1658,7 +1645,6 @@ static int assigned_initfn(struct PCIDevice *pci_dev)
>       * device initialization.
>       */
>      assigned_dev_emulate_config_read(dev, 0, PCI_CONFIG_SPACE_SIZE);
> -    assigned_dev_direct_config_read(dev, PCI_COMMAND, 2);
>      assigned_dev_direct_config_read(dev, PCI_STATUS, 2);
>      assigned_dev_direct_config_read(dev, PCI_REVISION_ID, 1);
>      assigned_dev_direct_config_read(dev, PCI_CLASS_PROG, 3);




^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
  2012-04-16 19:07     ` Alex Williamson
@ 2012-04-16 19:47       ` Michael S. Tsirkin
  0 siblings, 0 replies; 41+ messages in thread
From: Michael S. Tsirkin @ 2012-04-16 19:47 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, Avi Kivity, Marcelo Tosatti, jan.kiszka, jbaron

On Mon, Apr 16, 2012 at 01:07:56PM -0600, Alex Williamson wrote:
> On Mon, 2012-04-16 at 18:06 +0300, Michael S. Tsirkin wrote:
> > On Mon, Apr 16, 2012 at 08:03:17AM -0600, Alex Williamson wrote:
> > > The discussion on this patch seems to have fizzled, with no clear short
> > > term solution.
> > 
> > I think we are in concensus, it's just that there are
> > multiple bugs still left to fix.
> > 
> > First, we need to prevent guest from touching command
> > register except for the bus master bit. Something like
> > the below? Compiled only.
> > 
> > device-assignment: don't touch pci command register
> > 
> > Real command register is under kernel control:
> > it includes bits for triggering SERR, marking
> > BARs as invalid and such which are under host
> > kernel control. Don't touch any except bus master
> > which is ok to put under guest control.
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > 
> > ---
> > diff --git a/hw/device-assignment.c b/hw/device-assignment.c
> > index 89823f1..9ebce49 100644
> > --- a/hw/device-assignment.c
> > +++ b/hw/device-assignment.c
> > @@ -501,7 +501,6 @@ static int get_real_device(AssignedDevice *pci_dev, uint16_t r_seg,
> >      FILE *f;
> >      unsigned long long start, end, size, flags;
> >      uint16_t id;
> > -    struct stat statbuf;
> >      PCIRegion *rp;
> >      PCIDevRegions *dev = &pci_dev->real_device;
> >  
> > @@ -610,12 +609,8 @@ again:
> >      pci_dev->dev.config[2] = id & 0xff;
> >      pci_dev->dev.config[3] = (id & 0xff00) >> 8;
> >  
> > -    /* dealing with virtual function device */
> > -    snprintf(name, sizeof(name), "%sphysfn/", dir);
> > -    if (!stat(name, &statbuf)) {
> > -        /* always provide the written value on readout */
> > -        assigned_dev_emulate_config_read(pci_dev, PCI_COMMAND, 2);
> > -    }
> > +    /* Pass bus master writes to device. */
> > +    pci_dev->emulate_config_write[PCI_COMMAND] &= ~PCI_COMMAND_MASTER;
> >  
> >      dev->region_number = r;
> >      return 0;
> > @@ -782,14 +777,6 @@ static int assign_device(AssignedDevice *dev)
> >                  "cause host memory corruption if the device issues DMA write "
> >                  "requests!\n");
> >      }
> > -    if (dev->features & ASSIGNED_DEVICE_SHARE_INTX_MASK &&
> > -        kvm_has_intx_set_mask()) {
> > -        assigned_dev_data.flags |= KVM_DEV_ASSIGN_PCI_2_3;
> 
> This doesn't look right ^^^, we'll never make use of host side INTx
> disable support that way.  Thanks,
> 
> Alex

Right.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
  2012-04-05  3:42 [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path Alex Williamson
                   ` (3 preceding siblings ...)
  2012-04-16 14:03 ` Alex Williamson
@ 2012-04-17  0:55 ` Marcelo Tosatti
  4 siblings, 0 replies; 41+ messages in thread
From: Marcelo Tosatti @ 2012-04-17  0:55 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, jan.kiszka, mst, jbaron

On Wed, Apr 04, 2012 at 09:42:32PM -0600, Alex Williamson wrote:
> We've hit a kernel host panic, when issuing a 'system_reset' with an
> 82576 nic assigned and a Windows guest. Host system is a PowerEdge R815.
> 
> [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 32993
> [Hardware Error]: APEI generic hardware error status
> [Hardware Error]: severity: 1, fatal
> [Hardware Error]: section: 0, severity: 1, fatal
> [Hardware Error]: flags: 0x01
> [Hardware Error]: primary
> [Hardware Error]: section_type: PCIe error
> [Hardware Error]: port_type: 0, PCIe end point
> [Hardware Error]: version: 1.0
> [Hardware Error]: command: 0x0000, status: 0x0010
> [Hardware Error]: device_id: 0000:08:00.0
> [Hardware Error]: slot: 1
> [Hardware Error]: secondary_bus: 0x00
> [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
> [Hardware Error]: class_code: 000002
> [Hardware Error]: aer_status: 0x00100000, aer_mask: 0x00018000
> [Hardware Error]: Unsupported Request
> [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
> [Hardware Error]: aer_uncor_severity: 0x00067011
> [Hardware Error]: aer_tlp_header: 40001001 0020000f edbf800c 01000000
> [Hardware Error]: section: 1, severity: 1, fatal
> [Hardware Error]: flags: 0x01
> [Hardware Error]: primary
> [Hardware Error]: section_type: PCIe error
> [Hardware Error]: port_type: 0, PCIe end point
> [Hardware Error]: version: 1.0
> [Hardware Error]: command: 0x0000, status: 0x0010
> [Hardware Error]: device_id: 0000:08:00.0
> [Hardware Error]: slot: 1
> [Hardware Error]: secondary_bus: 0x00
> [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
> [Hardware Error]: class_code: 000002
> [Hardware Error]: aer_status: 0x00100000, aer_mask: 0x00018000
> [Hardware Error]: Unsupported Request
> [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
> [Hardware Error]: aer_uncor_severity: 0x00067011
> [Hardware Error]: aer_tlp_header: 40001001 0020000f edbf800c 01000000
> Kernel panic - not syncing: Fatal hardware error!
> Pid: 0, comm: swapper Not tainted 2.6.32-242.el6.x86_64 #1
> Call Trace:
>  <NMI>  [<ffffffff814f2fe5>] ? panic+0xa0/0x168
>  [<ffffffff812f919c>] ? ghes_notify_nmi+0x17c/0x180
>  [<ffffffff814f91d5>] ? notifier_call_chain+0x55/0x80
>  [<ffffffff814f923a>] ? atomic_notifier_call_chain+0x1a/0x20
>  [<ffffffff8109667e>] ? notify_die+0x2e/0x30
>  [<ffffffff814f6e81>] ? do_nmi+0x1a1/0x2b0
>  [<ffffffff814f6760>] ? nmi+0x20/0x30
>  [<ffffffff8103762b>] ? native_safe_halt+0xb/0x10
>  <<EOE>>  [<ffffffff8101495d>] ? default_idle+0x4d/0xb0
>  [<ffffffff81009e06>] ? cpu_idle+0xb6/0x110
>  [<ffffffff814da63a>] ? rest_init+0x7a/0x80
>  [<ffffffff81c1ff7b>] ? start_kernel+0x424/0x430
>  [<ffffffff81c1f33a>] ? x86_64_start_reservations+0x125/0x129
>  [<ffffffff81c1f438>] ? x86_64_start_kernel+0xfa/0x109
> 
> The root cause of the problem is that the 'reset_assigned_device()' code
> first writes a 0 to the command register. Then, when qemu subsequently does
> a kvm_deassign_irq() (called by assign_irq(), in the system_reset path),
> the kernel ends up calling '__msix_mask_irq()', which performs a write to
> the memory mapped msi vector space. Since, we've explicitly told the device
> to disallow mmio access (via the 0 write to the command register), we end
> up with the above 'Unsupported Request'.
> 
> The fix here is to first disable MSI-X, before doing the reset.  We also
> disable MSI, leaving the device in INTx mode.  In this way, the device is
> a known state after reset, and we avoid touching msi memory mapped space
> on any subsequent 'kvm_deassign_irq()'.
> 
> Thanks to Michael S. Tsirkin for help in understanding what was going on
> here and Jason Baron, the original debugger of this problem.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

Applied, thanks.

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2012-04-17  3:05 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-04-05  3:42 [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path Alex Williamson
2012-04-05  7:28 ` Jan Kiszka
2012-04-05  9:34 ` Michael S. Tsirkin
2012-04-05 14:42   ` Alex Williamson
2012-04-05 15:04     ` Michael S. Tsirkin
2012-04-08 13:14 ` Avi Kivity
2012-04-08 13:17   ` Michael S. Tsirkin
2012-04-08 13:18     ` Avi Kivity
2012-04-08 13:21       ` Michael S. Tsirkin
2012-04-08 13:24         ` Avi Kivity
2012-04-08 13:30           ` Michael S. Tsirkin
2012-04-08 13:41             ` Avi Kivity
2012-04-08 13:53               ` Michael S. Tsirkin
2012-04-08 14:01                 ` Avi Kivity
2012-04-08 14:42                   ` Michael S. Tsirkin
2012-04-08 15:26                     ` Avi Kivity
2012-04-08 15:46                       ` Michael S. Tsirkin
2012-04-08 15:50                         ` Avi Kivity
2012-04-08 16:04                           ` Michael S. Tsirkin
2012-04-08 16:08                             ` Avi Kivity
2012-04-08 17:37                               ` Jan Kiszka
2012-04-08 18:18                                 ` Michael S. Tsirkin
2012-04-08 18:39                                   ` Jan Kiszka
2012-04-08 20:35                                     ` Michael S. Tsirkin
2012-04-09  8:35                                 ` Avi Kivity
2012-04-10 16:55                                   ` Alex Williamson
2012-04-16 14:03 ` Alex Williamson
2012-04-16 14:31   ` Avi Kivity
2012-04-16 15:06   ` Michael S. Tsirkin
2012-04-16 15:10     ` Jan Kiszka
2012-04-16 16:08       ` Michael S. Tsirkin
2012-04-16 16:13         ` Jan Kiszka
2012-04-16 16:36           ` Michael S. Tsirkin
2012-04-16 16:38             ` Jan Kiszka
2012-04-16 17:12               ` Michael S. Tsirkin
2012-04-16 18:47                 ` Jan Kiszka
2012-04-16 16:12     ` Jason Baron
2012-04-16 16:34       ` Michael S. Tsirkin
2012-04-16 19:07     ` Alex Williamson
2012-04-16 19:47       ` Michael S. Tsirkin
2012-04-17  0:55 ` Marcelo Tosatti

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.