On 2012-03-30 21:18, Jason Baron wrote: > We've hit a kernel host panic, when issuing a 'system_reset' with a 82576 nic > assigned and a Windows guest. Host system is a PowerEdge R815. > > [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 32993 > [Hardware Error]: APEI generic hardware error status > [Hardware Error]: severity: 1, fatal > [Hardware Error]: section: 0, severity: 1, fatal > [Hardware Error]: flags: 0x01 > [Hardware Error]: primary > [Hardware Error]: section_type: PCIe error > [Hardware Error]: port_type: 0, PCIe end point > [Hardware Error]: version: 1.0 > [Hardware Error]: command: 0x0000, status: 0x0010 > [Hardware Error]: device_id: 0000:08:00.0 > [Hardware Error]: slot: 1 > [Hardware Error]: secondary_bus: 0x00 > [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9 > [Hardware Error]: class_code: 000002 > [Hardware Error]: aer_status: 0x00100000, aer_mask: 0x00018000 > [Hardware Error]: Unsupported Request > [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID > [Hardware Error]: aer_uncor_severity: 0x00067011 > [Hardware Error]: aer_tlp_header: 40001001 0020000f edbf800c 01000000 > [Hardware Error]: section: 1, severity: 1, fatal > [Hardware Error]: flags: 0x01 > [Hardware Error]: primary > [Hardware Error]: section_type: PCIe error > [Hardware Error]: port_type: 0, PCIe end point > [Hardware Error]: version: 1.0 > [Hardware Error]: command: 0x0000, status: 0x0010 > [Hardware Error]: device_id: 0000:08:00.0 > [Hardware Error]: slot: 1 > [Hardware Error]: secondary_bus: 0x00 > [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9 > [Hardware Error]: class_code: 000002 > [Hardware Error]: aer_status: 0x00100000, aer_mask: 0x00018000 > [Hardware Error]: Unsupported Request > [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID > [Hardware Error]: aer_uncor_severity: 0x00067011 > [Hardware Error]: aer_tlp_header: 40001001 0020000f edbf800c 01000000 > Kernel panic - not syncing: Fatal hardware error! > Pid: 0, comm: swapper Not tainted 2.6.32-242.el6.x86_64 #1 > Call Trace: > [] ? panic+0xa0/0x168 > [] ? ghes_notify_nmi+0x17c/0x180 > [] ? notifier_call_chain+0x55/0x80 > [] ? atomic_notifier_call_chain+0x1a/0x20 > [] ? notify_die+0x2e/0x30 > [] ? do_nmi+0x1a1/0x2b0 > [] ? nmi+0x20/0x30 > [] ? native_safe_halt+0xb/0x10 > <> [] ? default_idle+0x4d/0xb0 > [] ? cpu_idle+0xb6/0x110 > [] ? rest_init+0x7a/0x80 > [] ? start_kernel+0x424/0x430 > [] ? x86_64_start_reservations+0x125/0x129 > [] ? x86_64_start_kernel+0xfa/0x109 > > The root cause of the problem is that the 'reset_assigned_device()' code > first writes a 0 to the command register. Then, when qemu subsequently does > a kvm_deassign_irq() (called by assign_irq(), in the system_reset path), > the kernel ends up calling '__msix_mask_irq()', which performs a write to > the memory mapped msi vector space. Since, we've explicitly told the device > to disallow mmio access (via the 0 write to the command register), we end > up with the above 'Unsupported Request'. > > The fix here is to first call kvm_deassign_irq(), before doing the reset, s/fix/workaround/. This is a kernel bug if userspace can crash the system like this, no? Let's fix the kernel first and then look at what needs to be changed here. Jan > and then calling assign_irq() to put the device in an INTx mode. In this > way, the device is a known state after reset (INTx mode), and we avoid touching > msi memory mapped space on any subsequent 'kvm_deassign_irq()', since we're > in INTx mode. > > Thanks to Michael S. Tsirkin for help in understanding what was going on here. > > Signed-off-by: Jason Baron > Signed-off-by: Alex Williamson > --- > hw/device-assignment.c | 27 +++++++++++++++++++++++++++ > 1 files changed, 27 insertions(+), 0 deletions(-) > > diff --git a/hw/device-assignment.c b/hw/device-assignment.c > index 89823f1..31aed17 100644 > --- a/hw/device-assignment.c > +++ b/hw/device-assignment.c > @@ -1609,10 +1609,32 @@ static void reset_assigned_device(DeviceState *dev) > { > PCIDevice *pci_dev = DO_UPCAST(PCIDevice, qdev, dev); > AssignedDevice *adev = DO_UPCAST(AssignedDevice, dev, pci_dev); > + struct kvm_assigned_irq assigned_irq_data; > char reset_file[64]; > const char reset[] = "1"; > int fd, ret; > > + /* > + * Make sure the irq for the device is set to a consistent state of INTx > + * on reset. This also ensures that a subsequent deassign_irq/assign_irq > + * sequence (such as during 'system_reset'), does not touch memory > + * mapped msi space, since we are about to disallow that access via a > + * 0 write to the command register. In addition, the 'kvm_deassign_irq()' > + * clears the msi enable bit, thus preventing any unexpected MSIs. > + */ > + memset(&assigned_irq_data, 0, sizeof assigned_irq_data); > + assigned_irq_data.assigned_dev_id = > + calc_assigned_dev_id(adev); > + assigned_irq_data.flags = adev->irq_requested_type; > + free_dev_irq_entries(adev); > + ret = kvm_deassign_irq(kvm_state, &assigned_irq_data); > + /* -ENXIO means no assigned irq */ > + if (ret && ret != -ENXIO) { > + perror("reset_assigned_device: deassign irq"); > + } > + > + adev->irq_requested_type = 0; > + > snprintf(reset_file, sizeof(reset_file), > "/sys/bus/pci/devices/%04x:%02x:%02x.%01x/reset", > adev->host.seg, adev->host.bus, adev->host.dev, adev->host.func); > @@ -1635,6 +1657,11 @@ static void reset_assigned_device(DeviceState *dev) > * disconnected from the PCI bus. This avoids further DMA transfers. > */ > assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 2); > + > + ret = assign_irq(adev); > + if (ret) { > + perror("reset_assigned_device: assign irq"); > + } > } > > static int assigned_initfn(struct PCIDevice *pci_dev)