* [PATCH v2 0/2] VMD endpoint passthrough support
@ 2020-05-11 19:01 ` Jon Derrick
  0 siblings, 0 replies; 25+ messages in thread
From: Jon Derrick @ 2020-05-11 19:01 UTC (permalink / raw)
  To: linux-pci, qemu-devel
  Cc: Bjorn Helgaas, Lorenzo Pieralisi, virtualization,
	Christoph Hellwig, Andrzej Jakowski, Jon Derrick

This set contains two patches for Linux and one for QEMU. VMD device
8086:28C0 exposes information in registers that assists with direct
assignment passthrough. Several other VMD devices lack this
information, but it can easily be emulated to offer the same feature.

The existing VMD devices that lack the feature cannot be changed to
provide this information natively, but nothing prevents the hypervisor
from supplying it through emulation. Future VMD devices will offer the
28C0 mode natively.

The QEMU patch emulates the hardware assistance that the VMD 28C0 device
provides: a config space register claiming passthrough support, and the
shadow MEMBAR registers containing the host information for guest
address assignment in the VMD domain. The older VMD devices treat this
config space register as reserved, so it will not conflict with the
emulated bit.

The Linux patch allows guest kernels to use the passthrough information
emulated by the QEMU patch by matching on the config space register
claiming passthrough support.

Changes from v1:
v1 changed the VMD Subsystem ID to QEMU's so that the guest driver could
match against it. This was unnecessary as the VMLOCK register and shadow
membar registers could be safely emulated. Future VMDs will be aligned
on these register bits.

Added the resource bit filtering patch that got lost in the mailserver.

v1: https://lore.kernel.org/linux-pci/20200422171444.10992-1-jonathan.derrick@intel.com/

Jon Derrick (2):
  PCI: vmd: Filter resource type bits from shadow register
  PCI: vmd: Use Shadow MEMBAR registers for QEMU/KVM guests

 drivers/pci/controller/vmd.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

-- 
2.18.1


* [PATCH for QEMU v2] hw/vfio: Add VMD Passthrough Quirk
  2020-05-11 19:01 ` Jon Derrick
@ 2020-05-11 19:01   ` Jon Derrick
  -1 siblings, 0 replies; 25+ messages in thread
From: Jon Derrick @ 2020-05-11 19:01 UTC (permalink / raw)
  To: linux-pci, qemu-devel
  Cc: Bjorn Helgaas, Lorenzo Pieralisi, virtualization,
	Christoph Hellwig, Andrzej Jakowski, Jon Derrick

The VMD endpoint provides a real PCIe domain to the guest, including
bridges and endpoints. Because the VMD domain is enumerated by the guest
kernel, the guest kernel will assign Guest Physical Addresses to the
downstream endpoint BARs and bridge windows.

When the guest kernel performs MMIO to VMD sub-devices, the IOMMU
translates from the guest address space to the physical address space.
Because the bridges have been programmed with guest addresses, the
bridges will reject the translated transactions containing physical
addresses.

VMD device 28C0 natively assists passthrough by providing the Host
Physical Address in shadow registers accessible to the guest for bridge
window assignment. The shadow registers are valid if bit 1 is set in VMD
VMLOCK config register 0x70. Future VMDs will also support this feature.
Existing VMDs have config register 0x70 reserved, and will return 0 on
reads.

In order to support existing VMDs, this quirk emulates the VMLOCK and
HPA shadow registers for all VMD device ids which don't natively assist
with passthrough. The Linux VMD driver is updated to allow existing VMD
devices to query VMLOCK for passthrough support.

Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
---
 hw/vfio/pci-quirks.c | 103 +++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/pci.c        |   7 +++
 hw/vfio/pci.h        |   2 +
 hw/vfio/trace-events |   3 ++
 4 files changed, 115 insertions(+)

diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
index 2d348f8237..4060a6a95d 100644
--- a/hw/vfio/pci-quirks.c
+++ b/hw/vfio/pci-quirks.c
@@ -1709,3 +1709,106 @@ free_exit:
 
     return ret;
 }
+
+/*
+ * The VMD endpoint provides a real PCIe domain to the guest and the guest
+ * kernel performs enumeration of the VMD sub-device domain. Guest transactions
+ * to VMD sub-devices go through IOMMU translation from guest addresses to
+ * physical addresses. When MMIO goes to an endpoint after being translated to
+ * physical addresses, the bridge rejects the transaction because the window
+ * has been programmed with guest addresses.
+ *
+ * VMD can use the Host Physical Address in order to correctly program the
+ * bridge windows in its PCIe domain. VMD device 28C0 has HPA shadow registers
+ * located at offset 0x2000 in MEMBAR2 (BAR 4). The shadow registers are valid
+ * if bit 1 is set in the VMD VMLOCK config register 0x70. VMD devices without
+ * this native assistance can have these registers safely emulated as these
+ * registers are reserved.
+ */
+typedef struct VFIOVMDQuirk {
+    VFIOPCIDevice *vdev;
+    uint64_t membar_phys[2];
+} VFIOVMDQuirk;
+
+static uint64_t vfio_vmd_quirk_read(void *opaque, hwaddr addr, unsigned size)
+{
+    VFIOVMDQuirk *data = opaque;
+    uint64_t val = 0;
+
+    memcpy(&val, (void *)data->membar_phys + addr, size);
+    return val;
+}
+
+static const MemoryRegionOps vfio_vmd_quirk = {
+    .read = vfio_vmd_quirk_read,
+    .endianness = DEVICE_LITTLE_ENDIAN,
+};
+
+#define VMD_VMLOCK  0x70
+#define VMD_SHADOW  0x2000
+#define VMD_MEMBAR2 4
+
+static int vfio_vmd_emulate_shadow_registers(VFIOPCIDevice *vdev)
+{
+    VFIOQuirk *quirk;
+    VFIOVMDQuirk *data;
+    PCIDevice *pdev = &vdev->pdev;
+    int ret;
+
+    data = g_malloc0(sizeof(*data));
+    ret = pread(vdev->vbasedev.fd, data->membar_phys, 16,
+                vdev->config_offset + PCI_BASE_ADDRESS_2);
+    if (ret != 16) {
+        error_report("VMD %s cannot read MEMBARs (%d)",
+                     vdev->vbasedev.name, ret);
+        g_free(data);
+        return -EFAULT;
+    }
+
+    quirk = vfio_quirk_alloc(1);
+    quirk->data = data;
+    data->vdev = vdev;
+
+    /* Emulate Shadow Registers */
+    memory_region_init_io(quirk->mem, OBJECT(vdev), &vfio_vmd_quirk, data,
+                          "vfio-vmd-quirk", sizeof(data->membar_phys));
+    memory_region_add_subregion_overlap(vdev->bars[VMD_MEMBAR2].region.mem,
+                                        VMD_SHADOW, quirk->mem, 1);
+    memory_region_set_readonly(quirk->mem, true);
+    memory_region_set_enabled(quirk->mem, true);
+
+    QLIST_INSERT_HEAD(&vdev->bars[VMD_MEMBAR2].quirks, quirk, next);
+
+    trace_vfio_pci_vmd_quirk_shadow_regs(vdev->vbasedev.name,
+                                         data->membar_phys[0],
+                                         data->membar_phys[1]);
+
+    /* Advertise Shadow Register support */
+    pci_byte_test_and_set_mask(pdev->config + VMD_VMLOCK, 0x2);
+    pci_set_byte(pdev->wmask + VMD_VMLOCK, 0);
+    pci_set_byte(vdev->emulated_config_bits + VMD_VMLOCK, 0x2);
+
+    trace_vfio_pci_vmd_quirk_vmlock(vdev->vbasedev.name,
+                                    pci_get_byte(pdev->config + VMD_VMLOCK));
+
+    return 0;
+}
+
+int vfio_pci_vmd_init(VFIOPCIDevice *vdev)
+{
+    int ret = 0;
+
+    switch (vdev->device_id) {
+    case 0x28C0: /* Native passthrough support */
+        break;
+    /* Emulates Native passthrough support */
+    case 0x201D:
+    case 0x467F:
+    case 0x4C3D:
+    case 0x9A0B:
+        ret = vfio_vmd_emulate_shadow_registers(vdev);
+        break;
+    }
+
+    return ret;
+}
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 5e75a95129..85425a1a6f 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3024,6 +3024,13 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         }
     }
 
+    if (vdev->vendor_id == PCI_VENDOR_ID_INTEL) {
+        ret = vfio_pci_vmd_init(vdev);
+        if (ret) {
+            error_report("Failed to setup VMD");
+        }
+    }
+
     vfio_register_err_notifier(vdev);
     vfio_register_req_notifier(vdev);
     vfio_setup_resetfn_quirk(vdev);
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 0da7a20a7e..e8632d806b 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -217,6 +217,8 @@ int vfio_pci_igd_opregion_init(VFIOPCIDevice *vdev,
 int vfio_pci_nvidia_v100_ram_init(VFIOPCIDevice *vdev, Error **errp);
 int vfio_pci_nvlink2_init(VFIOPCIDevice *vdev, Error **errp);
 
+int vfio_pci_vmd_init(VFIOPCIDevice *vdev);
+
 void vfio_display_reset(VFIOPCIDevice *vdev);
 int vfio_display_probe(VFIOPCIDevice *vdev, Error **errp);
 void vfio_display_finalize(VFIOPCIDevice *vdev);
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index b1ef55a33f..ed64e477db 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -90,6 +90,9 @@ vfio_pci_nvidia_gpu_setup_quirk(const char *name, uint64_t tgt, uint64_t size) "
 vfio_pci_nvlink2_setup_quirk_ssatgt(const char *name, uint64_t tgt, uint64_t size) "%s tgt=0x%"PRIx64" size=0x%"PRIx64
 vfio_pci_nvlink2_setup_quirk_lnkspd(const char *name, uint32_t link_speed) "%s link_speed=0x%x"
 
+vfio_pci_vmd_quirk_shadow_regs(const char *name, uint64_t mb1, uint64_t mb2) "%s membar1_phys=0x%"PRIx64" membar2_phys=0x%"PRIx64
+vfio_pci_vmd_quirk_vmlock(const char *name, uint8_t vmlock) "%s vmlock=0x%x"
+
 # common.c
 vfio_region_write(const char *name, int index, uint64_t addr, uint64_t data, unsigned size) " (%s:region%d+0x%"PRIx64", 0x%"PRIx64 ", %d)"
 vfio_region_read(char *name, int index, uint64_t addr, unsigned size, uint64_t data) " (%s:region%d+0x%"PRIx64", %d) = 0x%"PRIx64
-- 
2.18.1


* [PATCH v2 1/2] PCI: vmd: Filter resource type bits from shadow register
  2020-05-11 19:01 ` Jon Derrick
@ 2020-05-11 19:01   ` Jon Derrick
  -1 siblings, 0 replies; 25+ messages in thread
From: Jon Derrick @ 2020-05-11 19:01 UTC (permalink / raw)
  To: linux-pci, qemu-devel
  Cc: Bjorn Helgaas, Lorenzo Pieralisi, virtualization,
	Christoph Hellwig, Andrzej Jakowski, Jon Derrick

Versions of VMD with the Host Physical Address shadow register use this
register to calculate the bus address offset needed for guest
passthrough of the domain. The register shadows the Host Physical
Address registers, including the resource type bits. Because those type
bits are included in the offset calculation, the VMD resources end up
over-provisioned at the front and under-provisioned at the back.

Example:
pci 10000:80:02.0: reg 0x10: [mem 0xf801fffc-0xf803fffb 64bit]

Expected:
pci 10000:80:02.0: reg 0x10: [mem 0xf8020000-0xf803ffff 64bit]

If other devices are mapped in the over-provisioned front, it could lead
to resource conflict issues with VMD or those devices.

Fixes: a1a30170138c9 ("PCI: vmd: Fix shadow offsets to reflect spec changes")
Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
---
 drivers/pci/controller/vmd.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
index dac91d60701d..e386d4eac407 100644
--- a/drivers/pci/controller/vmd.c
+++ b/drivers/pci/controller/vmd.c
@@ -445,9 +445,11 @@ static int vmd_enable_domain(struct vmd_dev *vmd, unsigned long features)
 			if (!membar2)
 				return -ENOMEM;
 			offset[0] = vmd->dev->resource[VMD_MEMBAR1].start -
-					readq(membar2 + MB2_SHADOW_OFFSET);
+					(readq(membar2 + MB2_SHADOW_OFFSET) &
+					 PCI_BASE_ADDRESS_MEM_MASK);
 			offset[1] = vmd->dev->resource[VMD_MEMBAR2].start -
-					readq(membar2 + MB2_SHADOW_OFFSET + 8);
+					(readq(membar2 + MB2_SHADOW_OFFSET + 8) &
+					 PCI_BASE_ADDRESS_MEM_MASK);
 			pci_iounmap(vmd->dev, membar2);
 		}
 	}
-- 
2.18.1


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v2 2/2] PCI: vmd: Use Shadow MEMBAR registers for QEMU/KVM guests
@ 2020-05-11 19:01   ` Jon Derrick
  0 siblings, 0 replies; 25+ messages in thread
From: Jon Derrick @ 2020-05-11 19:01 UTC (permalink / raw)
  To: linux-pci, qemu-devel
  Cc: Bjorn Helgaas, Lorenzo Pieralisi, virtualization,
	Christoph Hellwig, Andrzej Jakowski, Jon Derrick

VMD device 28C0 natively assists guest passthrough of the VMD endpoint
through the use of shadow registers that provide Host Physical Addresses
to correctly assign bridge windows. These shadow registers are only
available if VMD config space register 0x70, bit 1 is set.

For existing VMDs that don't natively support the shadow registers, VMD
config space register 0x70 is reserved and will return 0 on reads.
Future VMDs will have these registers natively in hardware, but existing
VMDs can still use this feature by emulating the config space register
and shadow registers.

QEMU has been modified to emulate this config space register and the
shadow membar registers for VMDs which don't natively support this
feature. This patch updates the supported device list to allow this
feature to be used on these VMDs.

Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
---
 drivers/pci/controller/vmd.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
index e386d4eac407..ee71d0989875 100644
--- a/drivers/pci/controller/vmd.c
+++ b/drivers/pci/controller/vmd.c
@@ -600,6 +600,7 @@ static irqreturn_t vmd_irq(int irq, void *data)
 static int vmd_probe(struct pci_dev *dev, const struct pci_device_id *id)
 {
 	struct vmd_dev *vmd;
+	unsigned long features = id->driver_data;
 	int i, err;
 
 	if (resource_size(&dev->resource[VMD_CFGBAR]) < (1 << 20))
@@ -652,7 +653,7 @@ static int vmd_probe(struct pci_dev *dev, const struct pci_device_id *id)
 
 	spin_lock_init(&vmd->cfg_lock);
 	pci_set_drvdata(dev, vmd);
-	err = vmd_enable_domain(vmd, (unsigned long) id->driver_data);
+	err = vmd_enable_domain(vmd, features);
 	if (err)
 		return err;
 
@@ -716,16 +717,20 @@ static int vmd_resume(struct device *dev)
 static SIMPLE_DEV_PM_OPS(vmd_dev_pm_ops, vmd_suspend, vmd_resume);
 
 static const struct pci_device_id vmd_ids[] = {
-	{PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_VMD_201D),},
+	{PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_VMD_201D),
+		.driver_data = VMD_FEAT_HAS_MEMBAR_SHADOW,},
 	{PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_VMD_28C0),
 		.driver_data = VMD_FEAT_HAS_MEMBAR_SHADOW |
 				VMD_FEAT_HAS_BUS_RESTRICTIONS,},
 	{PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x467f),
-		.driver_data = VMD_FEAT_HAS_BUS_RESTRICTIONS,},
+		.driver_data = VMD_FEAT_HAS_MEMBAR_SHADOW |
+				VMD_FEAT_HAS_BUS_RESTRICTIONS,},
 	{PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x4c3d),
-		.driver_data = VMD_FEAT_HAS_BUS_RESTRICTIONS,},
+		.driver_data = VMD_FEAT_HAS_MEMBAR_SHADOW |
+				VMD_FEAT_HAS_BUS_RESTRICTIONS,},
 	{PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_VMD_9A0B),
-		.driver_data = VMD_FEAT_HAS_BUS_RESTRICTIONS,},
+		.driver_data = VMD_FEAT_HAS_MEMBAR_SHADOW |
+				VMD_FEAT_HAS_BUS_RESTRICTIONS,},
 	{0,}
 };
 MODULE_DEVICE_TABLE(pci, vmd_ids);
-- 
2.18.1


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH for QEMU v2] hw/vfio: Add VMD Passthrough Quirk
@ 2020-05-11 22:59     ` Alex Williamson
  -1 siblings, 0 replies; 25+ messages in thread
From: Alex Williamson @ 2020-05-11 22:59 UTC (permalink / raw)
  To: Jon Derrick
  Cc: linux-pci, qemu-devel, Lorenzo Pieralisi, virtualization,
	Andrzej Jakowski, Bjorn Helgaas, Christoph Hellwig

On Mon, 11 May 2020 15:01:27 -0400
Jon Derrick <jonathan.derrick@intel.com> wrote:

> The VMD endpoint provides a real PCIe domain to the guest, including

Please define VMD.  I'm sure this is obvious to many, but I've had to
do some research.  The best TL;DR summary I've found is Keith's
original commit 185a383ada2e adding the controller to Linux.  If there's
something better, please let me know.

> bridges and endpoints. Because the VMD domain is enumerated by the guest
> kernel, the guest kernel will assign Guest Physical Addresses to the
> downstream endpoint BARs and bridge windows.
>
> When the guest kernel performs MMIO to VMD sub-devices, IOMMU will
> translate from the guest address space to the physical address space.
> Because the bridges have been programmed with guest addresses, the
> bridges will reject the transaction containing physical addresses.

I'm lost, what IOMMU is involved in CPU access to MMIO space?  My guess
is that since all MMIO of this domain is mapped behind the host
endpoint BARs 2 & 4 that QEMU simply accesses it via mapping of those
BARs into the VM, so it's the MMU, not the IOMMU performing those GPA
to HPA translations.  But then presumably the bridges within the domain
are scrambled because their apertures are programmed with ranges that
don't map into the VMD endpoint BARs.  Is that remotely correct?  Some
/proc/iomem output and/or lspci listing from the host to see how this
works would be useful.

> VMD device 28C0 natively assists passthrough by providing the Host
> Physical Address in shadow registers accessible to the guest for bridge
> window assignment. The shadow registers are valid if bit 1 is set in VMD
> VMLOCK config register 0x70. Future VMDs will also support this feature.
> Existing VMDs have config register 0x70 reserved, and will return 0 on
> reads.

So these shadow registers are simply exposing the host BAR2 & BAR4
addresses into the guest, so the quirk is dependent on reading those
values from the device before anyone has written to them and the BAR
emulation in the kernel kicks in (not a problem, just an observation).

Does the VMD controller code then use these base addresses to program
the bridges/endpoint within the domain?  What does the same /proc/iomem
or lspci look like inside the guest then?  It seems like we'd see the
VMD endpoint with GPA BARs, but the devices within the domain using
HPAs.  If that's remotely true, and we're not forcing an identity
mapping of this HPA range into the GPA, does the vmd controller driver
impose a TRA function on these MMIO addresses in the guest?

Sorry if I'm way off, I'm piecing things together from scant
information here.  Please Cc me on future vfio related patches.  Thanks,

Alex

 
> In order to support existing VMDs, this quirk emulates the VMLOCK and
> HPA shadow registers for all VMD device ids which don't natively assist
> with passthrough. The Linux VMD driver is updated to allow existing VMD
> devices to query VMLOCK for passthrough support.
> 
> Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
> ---
>  hw/vfio/pci-quirks.c | 103 +++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/pci.c        |   7 +++
>  hw/vfio/pci.h        |   2 +
>  hw/vfio/trace-events |   3 ++
>  4 files changed, 115 insertions(+)
> 
> diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
> index 2d348f8237..4060a6a95d 100644
> --- a/hw/vfio/pci-quirks.c
> +++ b/hw/vfio/pci-quirks.c
> @@ -1709,3 +1709,106 @@ free_exit:
>  
>      return ret;
>  }
> +
> +/*
> + * The VMD endpoint provides a real PCIe domain to the guest and the guest
> + * kernel performs enumeration of the VMD sub-device domain. Guest transactions
> + * to VMD sub-devices go through IOMMU translation from guest addresses to
> + * physical addresses. When MMIO goes to an endpoint after being translated to
> + * physical addresses, the bridge rejects the transaction because the window
> + * has been programmed with guest addresses.
> + *
> + * VMD can use the Host Physical Address in order to correctly program the
> + * bridge windows in its PCIe domain. VMD device 28C0 has HPA shadow registers
> + * located at offset 0x2000 in MEMBAR2 (BAR 4). The shadow registers are valid
> + * if bit 1 is set in the VMD VMLOCK config register 0x70. VMD devices without
> + * this native assistance can have these registers safely emulated as these
> + * registers are reserved.
> + */
> +typedef struct VFIOVMDQuirk {
> +    VFIOPCIDevice *vdev;
> +    uint64_t membar_phys[2];
> +} VFIOVMDQuirk;
> +
> +static uint64_t vfio_vmd_quirk_read(void *opaque, hwaddr addr, unsigned size)
> +{
> +    VFIOVMDQuirk *data = opaque;
> +    uint64_t val = 0;
> +
> +    memcpy(&val, (void *)data->membar_phys + addr, size);
> +    return val;
> +}
> +
> +static const MemoryRegionOps vfio_vmd_quirk = {
> +    .read = vfio_vmd_quirk_read,
> +    .endianness = DEVICE_LITTLE_ENDIAN,
> +};
> +
> +#define VMD_VMLOCK  0x70
> +#define VMD_SHADOW  0x2000
> +#define VMD_MEMBAR2 4
> +
> +static int vfio_vmd_emulate_shadow_registers(VFIOPCIDevice *vdev)
> +{
> +    VFIOQuirk *quirk;
> +    VFIOVMDQuirk *data;
> +    PCIDevice *pdev = &vdev->pdev;
> +    int ret;
> +
> +    data = g_malloc0(sizeof(*data));
> +    ret = pread(vdev->vbasedev.fd, data->membar_phys, 16,
> +                vdev->config_offset + PCI_BASE_ADDRESS_2);
> +    if (ret != 16) {
> +        error_report("VMD %s cannot read MEMBARs (%d)",
> +                     vdev->vbasedev.name, ret);
> +        g_free(data);
> +        return -EFAULT;
> +    }
> +
> +    quirk = vfio_quirk_alloc(1);
> +    quirk->data = data;
> +    data->vdev = vdev;
> +
> +    /* Emulate Shadow Registers */
> +    memory_region_init_io(quirk->mem, OBJECT(vdev), &vfio_vmd_quirk, data,
> +                          "vfio-vmd-quirk", sizeof(data->membar_phys));
> +    memory_region_add_subregion_overlap(vdev->bars[VMD_MEMBAR2].region.mem,
> +                                        VMD_SHADOW, quirk->mem, 1);
> +    memory_region_set_readonly(quirk->mem, true);
> +    memory_region_set_enabled(quirk->mem, true);
> +
> +    QLIST_INSERT_HEAD(&vdev->bars[VMD_MEMBAR2].quirks, quirk, next);
> +
> +    trace_vfio_pci_vmd_quirk_shadow_regs(vdev->vbasedev.name,
> +                                         data->membar_phys[0],
> +                                         data->membar_phys[1]);
> +
> +    /* Advertise Shadow Register support */
> +    pci_byte_test_and_set_mask(pdev->config + VMD_VMLOCK, 0x2);
> +    pci_set_byte(pdev->wmask + VMD_VMLOCK, 0);
> +    pci_set_byte(vdev->emulated_config_bits + VMD_VMLOCK, 0x2);
> +
> +    trace_vfio_pci_vmd_quirk_vmlock(vdev->vbasedev.name,
> +                                    pci_get_byte(pdev->config + VMD_VMLOCK));
> +
> +    return 0;
> +}
> +
> +int vfio_pci_vmd_init(VFIOPCIDevice *vdev)
> +{
> +    int ret = 0;
> +
> +    switch (vdev->device_id) {
> +    case 0x28C0: /* Native passthrough support */
> +        break;
> +    /* Emulates Native passthrough support */
> +    case 0x201D:
> +    case 0x467F:
> +    case 0x4C3D:
> +    case 0x9A0B:
> +        ret = vfio_vmd_emulate_shadow_registers(vdev);
> +        break;
> +    }
> +
> +    return ret;
> +}
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 5e75a95129..85425a1a6f 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -3024,6 +3024,13 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>          }
>      }
>  
> +    if (vdev->vendor_id == PCI_VENDOR_ID_INTEL) {
> +        ret = vfio_pci_vmd_init(vdev);
> +        if (ret) {
> +            error_report("Failed to setup VMD");
> +        }
> +    }
> +
>      vfio_register_err_notifier(vdev);
>      vfio_register_req_notifier(vdev);
>      vfio_setup_resetfn_quirk(vdev);
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index 0da7a20a7e..e8632d806b 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -217,6 +217,8 @@ int vfio_pci_igd_opregion_init(VFIOPCIDevice *vdev,
>  int vfio_pci_nvidia_v100_ram_init(VFIOPCIDevice *vdev, Error **errp);
>  int vfio_pci_nvlink2_init(VFIOPCIDevice *vdev, Error **errp);
>  
> +int vfio_pci_vmd_init(VFIOPCIDevice *vdev);
> +
>  void vfio_display_reset(VFIOPCIDevice *vdev);
>  int vfio_display_probe(VFIOPCIDevice *vdev, Error **errp);
>  void vfio_display_finalize(VFIOPCIDevice *vdev);
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index b1ef55a33f..ed64e477db 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -90,6 +90,9 @@ vfio_pci_nvidia_gpu_setup_quirk(const char *name, uint64_t tgt, uint64_t size) "
>  vfio_pci_nvlink2_setup_quirk_ssatgt(const char *name, uint64_t tgt, uint64_t size) "%s tgt=0x%"PRIx64" size=0x%"PRIx64
>  vfio_pci_nvlink2_setup_quirk_lnkspd(const char *name, uint32_t link_speed) "%s link_speed=0x%x"
>  
> +vfio_pci_vmd_quirk_shadow_regs(const char *name, uint64_t mb1, uint64_t mb2) "%s membar1_phys=0x%"PRIx64" membar2_phys=0x%"PRIx64
> +vfio_pci_vmd_quirk_vmlock(const char *name, uint8_t vmlock) "%s vmlock=0x%x"
> +
>  # common.c
>  vfio_region_write(const char *name, int index, uint64_t addr, uint64_t data, unsigned size) " (%s:region%d+0x%"PRIx64", 0x%"PRIx64 ", %d)"
>  vfio_region_read(char *name, int index, uint64_t addr, unsigned size, uint64_t data) " (%s:region%d+0x%"PRIx64", %d) = 0x%"PRIx64


^ permalink raw reply	[flat|nested] 25+ messages in thread

are scrambled because their apertures are programmed with ranges that
don't map into the VMD endpoint BARs.  Is that remotely correct?  Some
/proc/iomem output and/or lspci listing from the host to see how this
works would be useful.

> VMD device 28C0 natively assists passthrough by providing the Host
> Physical Address in shadow registers accessible to the guest for bridge
> window assignment. The shadow registers are valid if bit 1 is set in VMD
> VMLOCK config register 0x70. Future VMDs will also support this feature.
> Existing VMDs have config register 0x70 reserved, and will return 0 on
> reads.

So these shadow registers are simply exposing the host BAR2 & BAR4
addresses into the guest, so the quirk is dependent on reading those
values from the device before anyone has written to them and the BAR
emulation in the kernel kicks in (not a problem, just an observation).

Does the VMD controller code then use these base addresses to program
the bridges/endpoint within the domain?  What does the same /proc/iomem
or lspci look like inside the guest then?  It seems like we'd see the
VMD endpoint with GPA BARs, but the devices within the domain using
HPAs.  If that's remotely true, and we're not forcing an identity
mapping of this HPA range into the GPA, does the vmd controller driver
impose a TRA function on these MMIO addresses in the guest?

Sorry if I'm way off, I'm piecing things together from scant
information here.  Please Cc me on future vfio related patches.  Thanks,

Alex

 
> In order to support existing VMDs, this quirk emulates the VMLOCK and
> HPA shadow registers for all VMD device ids which don't natively assist
> with passthrough. The Linux VMD driver is updated to allow existing VMD
> devices to query VMLOCK for passthrough support.
> 
> Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
> ---
>  hw/vfio/pci-quirks.c | 103 +++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/pci.c        |   7 +++
>  hw/vfio/pci.h        |   2 +
>  hw/vfio/trace-events |   3 ++
>  4 files changed, 115 insertions(+)
> 
> diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
> index 2d348f8237..4060a6a95d 100644
> --- a/hw/vfio/pci-quirks.c
> +++ b/hw/vfio/pci-quirks.c
> @@ -1709,3 +1709,106 @@ free_exit:
>  
>      return ret;
>  }
> +
> +/*
> + * The VMD endpoint provides a real PCIe domain to the guest and the guest
> + * kernel performs enumeration of the VMD sub-device domain. Guest transactions
> + * to VMD sub-devices go through IOMMU translation from guest addresses to
> + * physical addresses. When MMIO goes to an endpoint after being translated to
> + * physical addresses, the bridge rejects the transaction because the window
> + * has been programmed with guest addresses.
> + *
> + * VMD can use the Host Physical Address in order to correctly program the
> + * bridge windows in its PCIe domain. VMD device 28C0 has HPA shadow registers
> + * located at offset 0x2000 in MEMBAR2 (BAR 4). The shadow registers are valid
> + * if bit 1 is set in the VMD VMLOCK config register 0x70. VMD devices without
> + * this native assistance can have these registers safely emulated as these
> + * registers are reserved.
> + */
> +typedef struct VFIOVMDQuirk {
> +    VFIOPCIDevice *vdev;
> +    uint64_t membar_phys[2];
> +} VFIOVMDQuirk;
> +
> +static uint64_t vfio_vmd_quirk_read(void *opaque, hwaddr addr, unsigned size)
> +{
> +    VFIOVMDQuirk *data = opaque;
> +    uint64_t val = 0;
> +
> +    memcpy(&val, (void *)data->membar_phys + addr, size);
> +    return val;
> +}
> +
> +static const MemoryRegionOps vfio_vmd_quirk = {
> +    .read = vfio_vmd_quirk_read,
> +    .endianness = DEVICE_LITTLE_ENDIAN,
> +};
> +
> +#define VMD_VMLOCK  0x70
> +#define VMD_SHADOW  0x2000
> +#define VMD_MEMBAR2 4
> +
> +static int vfio_vmd_emulate_shadow_registers(VFIOPCIDevice *vdev)
> +{
> +    VFIOQuirk *quirk;
> +    VFIOVMDQuirk *data;
> +    PCIDevice *pdev = &vdev->pdev;
> +    int ret;
> +
> +    data = g_malloc0(sizeof(*data));
> +    ret = pread(vdev->vbasedev.fd, data->membar_phys, 16,
> +                vdev->config_offset + PCI_BASE_ADDRESS_2);
> +    if (ret != 16) {
> +        error_report("VMD %s cannot read MEMBARs (%d)",
> +                     vdev->vbasedev.name, ret);
> +        g_free(data);
> +        return -EFAULT;
> +    }
> +
> +    quirk = vfio_quirk_alloc(1);
> +    quirk->data = data;
> +    data->vdev = vdev;
> +
> +    /* Emulate Shadow Registers */
> +    memory_region_init_io(quirk->mem, OBJECT(vdev), &vfio_vmd_quirk, data,
> +                          "vfio-vmd-quirk", sizeof(data->membar_phys));
> +    memory_region_add_subregion_overlap(vdev->bars[VMD_MEMBAR2].region.mem,
> +                                        VMD_SHADOW, quirk->mem, 1);
> +    memory_region_set_readonly(quirk->mem, true);
> +    memory_region_set_enabled(quirk->mem, true);
> +
> +    QLIST_INSERT_HEAD(&vdev->bars[VMD_MEMBAR2].quirks, quirk, next);
> +
> +    trace_vfio_pci_vmd_quirk_shadow_regs(vdev->vbasedev.name,
> +                                         data->membar_phys[0],
> +                                         data->membar_phys[1]);
> +
> +    /* Advertise Shadow Register support */
> +    pci_byte_test_and_set_mask(pdev->config + VMD_VMLOCK, 0x2);
> +    pci_set_byte(pdev->wmask + VMD_VMLOCK, 0);
> +    pci_set_byte(vdev->emulated_config_bits + VMD_VMLOCK, 0x2);
> +
> +    trace_vfio_pci_vmd_quirk_vmlock(vdev->vbasedev.name,
> +                                    pci_get_byte(pdev->config + VMD_VMLOCK));
> +
> +    return 0;
> +}
> +
> +int vfio_pci_vmd_init(VFIOPCIDevice *vdev)
> +{
> +    int ret = 0;
> +
> +    switch (vdev->device_id) {
> +    case 0x28C0: /* Native passthrough support */
> +        break;
> +    /* Emulates Native passthrough support */
> +    case 0x201D:
> +    case 0x467F:
> +    case 0x4C3D:
> +    case 0x9A0B:
> +        ret = vfio_vmd_emulate_shadow_registers(vdev);
> +        break;
> +    }
> +
> +    return ret;
> +}
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 5e75a95129..85425a1a6f 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -3024,6 +3024,13 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>          }
>      }
>  
> +    if (vdev->vendor_id == PCI_VENDOR_ID_INTEL) {
> +        ret = vfio_pci_vmd_init(vdev);
> +        if (ret) {
> +            error_report("Failed to setup VMD");
> +        }
> +    }
> +
>      vfio_register_err_notifier(vdev);
>      vfio_register_req_notifier(vdev);
>      vfio_setup_resetfn_quirk(vdev);
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index 0da7a20a7e..e8632d806b 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -217,6 +217,8 @@ int vfio_pci_igd_opregion_init(VFIOPCIDevice *vdev,
>  int vfio_pci_nvidia_v100_ram_init(VFIOPCIDevice *vdev, Error **errp);
>  int vfio_pci_nvlink2_init(VFIOPCIDevice *vdev, Error **errp);
>  
> +int vfio_pci_vmd_init(VFIOPCIDevice *vdev);
> +
>  void vfio_display_reset(VFIOPCIDevice *vdev);
>  int vfio_display_probe(VFIOPCIDevice *vdev, Error **errp);
>  void vfio_display_finalize(VFIOPCIDevice *vdev);
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index b1ef55a33f..ed64e477db 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -90,6 +90,9 @@ vfio_pci_nvidia_gpu_setup_quirk(const char *name, uint64_t tgt, uint64_t size) "
>  vfio_pci_nvlink2_setup_quirk_ssatgt(const char *name, uint64_t tgt, uint64_t size) "%s tgt=0x%"PRIx64" size=0x%"PRIx64
>  vfio_pci_nvlink2_setup_quirk_lnkspd(const char *name, uint32_t link_speed) "%s link_speed=0x%x"
>  
> +vfio_pci_vmd_quirk_shadow_regs(const char *name, uint64_t mb1, uint64_t mb2) "%s membar1_phys=0x%"PRIx64" membar2_phys=0x%"PRIx64
> +vfio_pci_vmd_quirk_vmlock(const char *name, uint8_t vmlock) "%s vmlock=0x%x"
> +
>  # common.c
>  vfio_region_write(const char *name, int index, uint64_t addr, uint64_t data, unsigned size) " (%s:region%d+0x%"PRIx64", 0x%"PRIx64 ", %d)"
>  vfio_region_read(char *name, int index, uint64_t addr, unsigned size, uint64_t data) " (%s:region%d+0x%"PRIx64", %d) = 0x%"PRIx64

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH for QEMU v2] hw/vfio: Add VMD Passthrough Quirk
  2020-05-11 22:59     ` Alex Williamson
@ 2020-05-13  0:35       ` Derrick, Jonathan
  -1 siblings, 0 replies; 25+ messages in thread
From: Derrick, Jonathan @ 2020-05-13  0:35 UTC (permalink / raw)
  To: alex.williamson
  Cc: hch, linux-pci, lorenzo.pieralisi, helgaas, virtualization,
	qemu-devel, andrzej.jakowski

Hi Alex,

I'm probably not getting the translation technical details correct.

On Mon, 2020-05-11 at 16:59 -0600, Alex Williamson wrote:
> On Mon, 11 May 2020 15:01:27 -0400
> Jon Derrick <jonathan.derrick@intel.com> wrote:
> 
> > The VMD endpoint provides a real PCIe domain to the guest, including
> 
> Please define VMD.  I'm sure this is obvious to many, but I've had to
> do some research.  The best TL;DR summary I've found is Keith's
> original commit 185a383ada2e adding the controller to Linux.  If there's
> something better, please let me know.
That's the correct commit, but I'll try to summarize the important bits
for v3.

> 
> > bridges and endpoints. Because the VMD domain is enumerated by the guest
> > kernel, the guest kernel will assign Guest Physical Addresses to the
> > downstream endpoint BARs and bridge windows.
> > 
> > When the guest kernel performs MMIO to VMD sub-devices, IOMMU will
> > translate from the guest address space to the physical address space.
> > Because the bridges have been programmed with guest addresses, the
> > bridges will reject the transaction containing physical addresses.
> 
> I'm lost, what IOMMU is involved in CPU access to MMIO space?  My guess
> is that since all MMIO of this domain is mapped behind the host
> endpoint BARs 2 & 4 that QEMU simply accesses it via mapping of those
> BARs into the VM, so it's the MMU, not the IOMMU performing those GPA
> to HPA translations.  But then presumably the bridges within the domain
> are scrambled because their apertures are programmed with ranges that
> don't map into the VMD endpoint BARs.  Is that remotely correct?  Some
> /proc/iomem output and/or lspci listing from the host to see how this
> works would be useful.
Correct. So MMU not IOMMU.

In the guest kernel, the bridges and devices in the VMD domain are
programmed with the addresses provided in the VMD endpoint's BAR2&4
(MEMBAR1&2). Because these BARs are populated with guest addresses, MMU
translates to host physical and the bridge window rejects MMIO not in
its [GPA] range.

As an example:
Host:
  94000000-97ffffff : 0000:17:05.5
    94000000-97ffffff : VMD MEMBAR1
      94000000-943fffff : PCI Bus 10000:01
        94000000-9400ffff : 10000:01:00.0
        94010000-94013fff : 10000:01:00.0
          94010000-94013fff : nvme
      94400000-947fffff : PCI Bus 10000:01
      94800000-94bfffff : PCI Bus 10000:02
        94800000-9480ffff : 10000:02:00.0
        94810000-94813fff : 10000:02:00.0
          94810000-94813fff : nvme
      94c00000-94ffffff : PCI Bus 10000:02


MEMBAR2 is similarly assigned.

> 
> > VMD device 28C0 natively assists passthrough by providing the Host
> > Physical Address in shadow registers accessible to the guest for bridge
> > window assignment. The shadow registers are valid if bit 1 is set in VMD
> > VMLOCK config register 0x70. Future VMDs will also support this feature.
> > Existing VMDs have config register 0x70 reserved, and will return 0 on
> > reads.
> 
> So these shadow registers are simply exposing the host BAR2 & BAR4
> addresses into the guest, so the quirk is dependent on reading those
> values from the device before anyone has written to them and the BAR
> emulation in the kernel kicks in (not a problem, just an observation).
It's not expected that anything will write that resource, and those
registers are read-only.
The first 0x2000 of MEMBAR2 (BAR4) contain the MSI-X tables, and mappings
to subordinate buses are 1MB-aligned.


> Does the VMD controller code then use these base addresses to program
> the bridges/endpoint within the domain?  What does the same /proc/iomem
> or lspci look like inside the guest then?  It seems like we'd see the
> VMD endpoint with GPA BARs, but the devices within the domain using
> HPAs.  If that's remotely true, and we're not forcing an identity
> mapping of this HPA range into the GPA, does the vmd controller driver
> impose a TRA function on these MMIO addresses in the guest?

This is the guest with the guest addresses:
  f8000000-fbffffff : 0000:00:07.0
    f8000000-fbffffff : VMD MEMBAR1
      f8000000-f83fffff : PCI Bus 10000:01
        f8000000-f800ffff : 10000:01:00.0
        f8010000-f8013fff : 10000:01:00.0
          f8010000-f8013fff : nvme
      f8400000-f87fffff : PCI Bus 10000:01
      f8800000-f8bfffff : PCI Bus 10000:02
        f8800000-f880ffff : 10000:02:00.0
        f8810000-f8813fff : 10000:02:00.0
          f8810000-f8813fff : nvme
      f8c00000-f8ffffff : PCI Bus 10000:02


The VMD guest driver does the translation on the supplied address using
pci_add_resource_offset prior to creating the root bus and enumerating
the domain:

	offset[0] = vmd->dev->resource[VMD_MEMBAR1].start -
			readq(membar2 + MB2_SHADOW_OFFSET);
	offset[1] = vmd->dev->resource[VMD_MEMBAR2].start -
			readq(membar2 + MB2_SHADOW_OFFSET + 8);
...
	pci_add_resource(&resources, &vmd->resources[0]);
	pci_add_resource_offset(&resources, &vmd->resources[1], offset[0]);
	pci_add_resource_offset(&resources, &vmd->resources[2], offset[1]);

	vmd->bus = pci_create_root_bus(&vmd->dev->dev, vmd->busn_start,
				       &vmd_ops, sd, &resources);


The offset becomes the CPU-to-bridge translation function for
programming the guest's VMD domain with bus addresses.


In the patched guest's lspci, for the bridge you see:
# lspci -v -s 10000:00:02.0
10000:00:02.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port C
	...
	Memory behind bridge: 94000000-943fffff


But the kernel doesn't export the offset BAR for the endpoint:
# lspci -v -s 10000:02:00.0
10000:02:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD
	...
	Memory at f8810000 (64-bit, non-prefetchable) [size=16K]
	[virtual] Expansion ROM at f8800000 [disabled] [size=64K]

The endpoint is still programmed with the offset:
# setpci -s 10000:02:00.0 10.l
94810004


> 
> Sorry if I'm way off, I'm piecing things together from scant
> information here.  Please Cc me on future vfio related patches.  Thanks,
> 
> Alex
> 
No problem

Thanks,
Jon

>  
> > In order to support existing VMDs, this quirk emulates the VMLOCK and
> > HPA shadow registers for all VMD device ids which don't natively assist
> > with passthrough. The Linux VMD driver is updated to allow existing VMD
> > devices to query VMLOCK for passthrough support.
> > 
> > Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
> > ---
> >  hw/vfio/pci-quirks.c | 103 +++++++++++++++++++++++++++++++++++++++++++
> >  hw/vfio/pci.c        |   7 +++
> >  hw/vfio/pci.h        |   2 +
> >  hw/vfio/trace-events |   3 ++
> >  4 files changed, 115 insertions(+)
> > 
> > diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
> > index 2d348f8237..4060a6a95d 100644
> > --- a/hw/vfio/pci-quirks.c
> > +++ b/hw/vfio/pci-quirks.c
> > @@ -1709,3 +1709,106 @@ free_exit:
> >  
> >      return ret;
> >  }
> > +
> > +/*
> > + * The VMD endpoint provides a real PCIe domain to the guest and the guest
> > + * kernel performs enumeration of the VMD sub-device domain. Guest transactions
> > + * to VMD sub-devices go through IOMMU translation from guest addresses to
> > + * physical addresses. When MMIO goes to an endpoint after being translated to
> > + * physical addresses, the bridge rejects the transaction because the window
> > + * has been programmed with guest addresses.
> > + *
> > + * VMD can use the Host Physical Address in order to correctly program the
> > + * bridge windows in its PCIe domain. VMD device 28C0 has HPA shadow registers
> > + * located at offset 0x2000 in MEMBAR2 (BAR 4). The shadow registers are valid
> > + * if bit 1 is set in the VMD VMLOCK config register 0x70. VMD devices without
> > + * this native assistance can have these registers safely emulated as these
> > + * registers are reserved.
> > + */
> > +typedef struct VFIOVMDQuirk {
> > +    VFIOPCIDevice *vdev;
> > +    uint64_t membar_phys[2];
> > +} VFIOVMDQuirk;
> > +
> > +static uint64_t vfio_vmd_quirk_read(void *opaque, hwaddr addr, unsigned size)
> > +{
> > +    VFIOVMDQuirk *data = opaque;
> > +    uint64_t val = 0;
> > +
> > +    memcpy(&val, (void *)data->membar_phys + addr, size);
> > +    return val;
> > +}
> > +
> > +static const MemoryRegionOps vfio_vmd_quirk = {
> > +    .read = vfio_vmd_quirk_read,
> > +    .endianness = DEVICE_LITTLE_ENDIAN,
> > +};
> > +
> > +#define VMD_VMLOCK  0x70
> > +#define VMD_SHADOW  0x2000
> > +#define VMD_MEMBAR2 4
> > +
> > +static int vfio_vmd_emulate_shadow_registers(VFIOPCIDevice *vdev)
> > +{
> > +    VFIOQuirk *quirk;
> > +    VFIOVMDQuirk *data;
> > +    PCIDevice *pdev = &vdev->pdev;
> > +    int ret;
> > +
> > +    data = g_malloc0(sizeof(*data));
> > +    ret = pread(vdev->vbasedev.fd, data->membar_phys, 16,
> > +                vdev->config_offset + PCI_BASE_ADDRESS_2);
> > +    if (ret != 16) {
> > +        error_report("VMD %s cannot read MEMBARs (%d)",
> > +                     vdev->vbasedev.name, ret);
> > +        g_free(data);
> > +        return -EFAULT;
> > +    }
> > +
> > +    quirk = vfio_quirk_alloc(1);
> > +    quirk->data = data;
> > +    data->vdev = vdev;
> > +
> > +    /* Emulate Shadow Registers */
> > +    memory_region_init_io(quirk->mem, OBJECT(vdev), &vfio_vmd_quirk, data,
> > +                          "vfio-vmd-quirk", sizeof(data->membar_phys));
> > +    memory_region_add_subregion_overlap(vdev->bars[VMD_MEMBAR2].region.mem,
> > +                                        VMD_SHADOW, quirk->mem, 1);
> > +    memory_region_set_readonly(quirk->mem, true);
> > +    memory_region_set_enabled(quirk->mem, true);
> > +
> > +    QLIST_INSERT_HEAD(&vdev->bars[VMD_MEMBAR2].quirks, quirk, next);
> > +
> > +    trace_vfio_pci_vmd_quirk_shadow_regs(vdev->vbasedev.name,
> > +                                         data->membar_phys[0],
> > +                                         data->membar_phys[1]);
> > +
> > +    /* Advertise Shadow Register support */
> > +    pci_byte_test_and_set_mask(pdev->config + VMD_VMLOCK, 0x2);
> > +    pci_set_byte(pdev->wmask + VMD_VMLOCK, 0);
> > +    pci_set_byte(vdev->emulated_config_bits + VMD_VMLOCK, 0x2);
> > +
> > +    trace_vfio_pci_vmd_quirk_vmlock(vdev->vbasedev.name,
> > +                                    pci_get_byte(pdev->config + VMD_VMLOCK));
> > +
> > +    return 0;
> > +}
> > +
> > +int vfio_pci_vmd_init(VFIOPCIDevice *vdev)
> > +{
> > +    int ret = 0;
> > +
> > +    switch (vdev->device_id) {
> > +    case 0x28C0: /* Native passthrough support */
> > +        break;
> > +    /* Emulates Native passthrough support */
> > +    case 0x201D:
> > +    case 0x467F:
> > +    case 0x4C3D:
> > +    case 0x9A0B:
> > +        ret = vfio_vmd_emulate_shadow_registers(vdev);
> > +        break;
> > +    }
> > +
> > +    return ret;
> > +}
> > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > index 5e75a95129..85425a1a6f 100644
> > --- a/hw/vfio/pci.c
> > +++ b/hw/vfio/pci.c
> > @@ -3024,6 +3024,13 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
> >          }
> >      }
> >  
> > +    if (vdev->vendor_id == PCI_VENDOR_ID_INTEL) {
> > +        ret = vfio_pci_vmd_init(vdev);
> > +        if (ret) {
> > +            error_report("Failed to setup VMD");
> > +        }
> > +    }
> > +
> >      vfio_register_err_notifier(vdev);
> >      vfio_register_req_notifier(vdev);
> >      vfio_setup_resetfn_quirk(vdev);
> > diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> > index 0da7a20a7e..e8632d806b 100644
> > --- a/hw/vfio/pci.h
> > +++ b/hw/vfio/pci.h
> > @@ -217,6 +217,8 @@ int vfio_pci_igd_opregion_init(VFIOPCIDevice *vdev,
> >  int vfio_pci_nvidia_v100_ram_init(VFIOPCIDevice *vdev, Error **errp);
> >  int vfio_pci_nvlink2_init(VFIOPCIDevice *vdev, Error **errp);
> >  
> > +int vfio_pci_vmd_init(VFIOPCIDevice *vdev);
> > +
> >  void vfio_display_reset(VFIOPCIDevice *vdev);
> >  int vfio_display_probe(VFIOPCIDevice *vdev, Error **errp);
> >  void vfio_display_finalize(VFIOPCIDevice *vdev);
> > diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> > index b1ef55a33f..ed64e477db 100644
> > --- a/hw/vfio/trace-events
> > +++ b/hw/vfio/trace-events
> > @@ -90,6 +90,9 @@ vfio_pci_nvidia_gpu_setup_quirk(const char *name, uint64_t tgt, uint64_t size) "
> >  vfio_pci_nvlink2_setup_quirk_ssatgt(const char *name, uint64_t tgt, uint64_t size) "%s tgt=0x%"PRIx64" size=0x%"PRIx64
> >  vfio_pci_nvlink2_setup_quirk_lnkspd(const char *name, uint32_t link_speed) "%s link_speed=0x%x"
> >  
> > +vfio_pci_vmd_quirk_shadow_regs(const char *name, uint64_t mb1, uint64_t mb2) "%s membar1_phys=0x%"PRIx64" membar2_phys=0x%"PRIx64
> > +vfio_pci_vmd_quirk_vmlock(const char *name, uint8_t vmlock) "%s vmlock=0x%x"
> > +
> >  # common.c
> >  vfio_region_write(const char *name, int index, uint64_t addr, uint64_t data, unsigned size) " (%s:region%d+0x%"PRIx64", 0x%"PRIx64 ", %d)"
> >  vfio_region_read(char *name, int index, uint64_t addr, unsigned size, uint64_t data) " (%s:region%d+0x%"PRIx64", %d) = 0x%"PRIx64

^ permalink raw reply	[flat|nested] 25+ messages in thread

> > +                                    pci_get_byte(pdev->config + VMD_VMLOCK));
> > +
> > +    return 0;
> > +}
> > +
> > +int vfio_pci_vmd_init(VFIOPCIDevice *vdev)
> > +{
> > +    int ret = 0;
> > +
> > +    switch (vdev->device_id) {
> > +    case 0x28C0: /* Native passthrough support */
> > +        break;
> > +    /* Emulates Native passthrough support */
> > +    case 0x201D:
> > +    case 0x467F:
> > +    case 0x4C3D:
> > +    case 0x9A0B:
> > +        ret = vfio_vmd_emulate_shadow_registers(vdev);
> > +        break;
> > +    }
> > +
> > +    return ret;
> > +}
> > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > index 5e75a95129..85425a1a6f 100644
> > --- a/hw/vfio/pci.c
> > +++ b/hw/vfio/pci.c
> > @@ -3024,6 +3024,13 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
> >          }
> >      }
> >  
> > +    if (vdev->vendor_id == PCI_VENDOR_ID_INTEL) {
> > +        ret = vfio_pci_vmd_init(vdev);
> > +        if (ret) {
> > +            error_report("Failed to setup VMD");
> > +        }
> > +    }
> > +
> >      vfio_register_err_notifier(vdev);
> >      vfio_register_req_notifier(vdev);
> >      vfio_setup_resetfn_quirk(vdev);
> > diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> > index 0da7a20a7e..e8632d806b 100644
> > --- a/hw/vfio/pci.h
> > +++ b/hw/vfio/pci.h
> > @@ -217,6 +217,8 @@ int vfio_pci_igd_opregion_init(VFIOPCIDevice *vdev,
> >  int vfio_pci_nvidia_v100_ram_init(VFIOPCIDevice *vdev, Error **errp);
> >  int vfio_pci_nvlink2_init(VFIOPCIDevice *vdev, Error **errp);
> >  
> > +int vfio_pci_vmd_init(VFIOPCIDevice *vdev);
> > +
> >  void vfio_display_reset(VFIOPCIDevice *vdev);
> >  int vfio_display_probe(VFIOPCIDevice *vdev, Error **errp);
> >  void vfio_display_finalize(VFIOPCIDevice *vdev);
> > diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> > index b1ef55a33f..ed64e477db 100644
> > --- a/hw/vfio/trace-events
> > +++ b/hw/vfio/trace-events
> > @@ -90,6 +90,9 @@ vfio_pci_nvidia_gpu_setup_quirk(const char *name, uint64_t tgt, uint64_t size) "
> >  vfio_pci_nvlink2_setup_quirk_ssatgt(const char *name, uint64_t tgt, uint64_t size) "%s tgt=0x%"PRIx64" size=0x%"PRIx64
> >  vfio_pci_nvlink2_setup_quirk_lnkspd(const char *name, uint32_t link_speed) "%s link_speed=0x%x"
> >  
> > +vfio_pci_vmd_quirk_shadow_regs(const char *name, uint64_t mb1, uint64_t mb2) "%s membar1_phys=0x%"PRIx64" membar2_phys=0x%"PRIx64
> > +vfio_pci_vmd_quirk_vmlock(const char *name, uint8_t vmlock) "%s vmlock=0x%x"
> > +
> >  # common.c
> >  vfio_region_write(const char *name, int index, uint64_t addr, uint64_t data, unsigned size) " (%s:region%d+0x%"PRIx64", 0x%"PRIx64 ", %d)"
> >  vfio_region_read(char *name, int index, uint64_t addr, unsigned size, uint64_t data) " (%s:region%d+0x%"PRIx64", %d) = 0x%"PRIx64


* Re: [PATCH for QEMU v2] hw/vfio: Add VMD Passthrough Quirk
  2020-05-13  0:35       ` Derrick, Jonathan
@ 2020-05-13 17:55         ` Alex Williamson
  0 siblings, 0 replies; 25+ messages in thread
From: Alex Williamson @ 2020-05-13 17:55 UTC (permalink / raw)
  To: Derrick, Jonathan
  Cc: hch, linux-pci, lorenzo.pieralisi, helgaas, virtualization,
	qemu-devel, andrzej.jakowski

On Wed, 13 May 2020 00:35:47 +0000
"Derrick, Jonathan" <jonathan.derrick@intel.com> wrote:

> Hi Alex,
> 
> I'm probably not getting the translation technical details correct.
> 
> On Mon, 2020-05-11 at 16:59 -0600, Alex Williamson wrote:
> > On Mon, 11 May 2020 15:01:27 -0400
> > Jon Derrick <jonathan.derrick@intel.com> wrote:
> >   
> > > The VMD endpoint provides a real PCIe domain to the guest, including  
> > 
> > Please define VMD.  I'm sure this is obvious to many, but I've had to
> > do some research.  The best TL;DR summary I've found is Keith's
> > original commit 185a383ada2e adding the controller to Linux.  If there's
> > something better, please let me know.  
> That's the correct commit, but I'll try to summarize the important bits
> for v3.
> 
> >   
> > > bridges and endpoints. Because the VMD domain is enumerated by the guest
> > > kernel, the guest kernel will assign Guest Physical Addresses to the
> > > downstream endpoint BARs and bridge windows.
> > > 
> > > When the guest kernel performs MMIO to VMD sub-devices, IOMMU will
> > > translate from the guest address space to the physical address space.
> > > Because the bridges have been programmed with guest addresses, the
> > > bridges will reject the transaction containing physical addresses.  
> > 
> > I'm lost, what IOMMU is involved in CPU access to MMIO space?  My guess
> > is that since all MMIO of this domain is mapped behind the host
> > endpoint BARs 2 & 4 that QEMU simply accesses it via mapping of those
> > BARs into the VM, so it's the MMU, not the IOMMU performing those GPA
> > to HPA translations.  But then presumably the bridges within the domain
> > are scrambled because their apertures are programmed with ranges that
> > don't map into the VMD endpoint BARs.  Is that remotely correct?  Some
> > /proc/iomem output and/or lspci listing from the host to see how this
> > works would be useful.  
> Correct. So MMU not IOMMU.
> 
> In the guest kernel, the bridges and devices in the VMD domain are
> programmed with the addresses provided in the VMD endpoint's BAR2&4
> (MEMBAR1&2). Because these BARs are populated with guest addresses, MMU
> translates to host physical and the bridge window rejects MMIO not in
> its [GPA] range.
> 
> As an example:
> Host:
>   94000000-97ffffff : 0000:17:05.5
>     94000000-97ffffff : VMD MEMBAR1
>       94000000-943fffff : PCI Bus 10000:01
>         94000000-9400ffff : 10000:01:00.0
>         94010000-94013fff : 10000:01:00.0
>           94010000-94013fff : nvme
>       94400000-947fffff : PCI Bus 10000:01
>       94800000-94bfffff : PCI Bus 10000:02
>         94800000-9480ffff : 10000:02:00.0
>         94810000-94813fff : 10000:02:00.0
>           94810000-94813fff : nvme
>       94c00000-94ffffff : PCI Bus 10000:02
> 
> 
> MEMBAR 2 is similarly assigned
> 
> >   
> > > VMD device 28C0 natively assists passthrough by providing the Host
> > > Physical Address in shadow registers accessible to the guest for bridge
> > > window assignment. The shadow registers are valid if bit 1 is set in VMD
> > > VMLOCK config register 0x70. Future VMDs will also support this feature.
> > > Existing VMDs have config register 0x70 reserved, and will return 0 on
> > > reads.  
> > 
> > So these shadow registers are simply exposing the host BAR2 & BAR4
> > addresses into the guest, so the quirk is dependent on reading those
> > values from the device before anyone has written to them and the BAR
> > emulation in the kernel kicks in (not a problem, just an observation).  
> It's not expected that anything will write to that resource, and those
> registers are read-only.
> The first 0x2000 of MEMBAR2 (BAR 4) contain the MSI-X tables, and
> mappings to subordinate buses are aligned on 1 MB boundaries.
> 
> 
> > Does the VMD controller code then use these base addresses to program
> > the bridges/endpoint within the domain?  What does the same /proc/iomem
> > or lspci look like inside the guest then?  It seems like we'd see the
> > VMD endpoint with GPA BARs, but the devices within the domain using
> > HPAs.  If that's remotely true, and we're not forcing an identity
> > mapping of this HPA range into the GPA, does the vmd controller driver
> > impose a TRA function on these MMIO addresses in the guest?  
> 
> This is the guest with the guest addresses:
>   f8000000-fbffffff : 0000:00:07.0
>     f8000000-fbffffff : VMD MEMBAR1
>       f8000000-f83fffff : PCI Bus 10000:01
>         f8000000-f800ffff : 10000:01:00.0
>         f8010000-f8013fff : 10000:01:00.0
>           f8010000-f8013fff : nvme
>       f8400000-f87fffff : PCI Bus 10000:01
>       f8800000-f8bfffff : PCI Bus 10000:02
>         f8800000-f880ffff : 10000:02:00.0
>         f8810000-f8813fff : 10000:02:00.0
>           f8810000-f8813fff : nvme
>       f8c00000-f8ffffff : PCI Bus 10000:02
> 
> 
> The VMD guest driver does the translation on the supplied address using
> pci_add_resource_offset prior to creating the root bus and enumerating
> the domain:
> 
> 	offset[0] = vmd->dev->resource[VMD_MEMBAR1].start -
> 			readq(membar2 + MB2_SHADOW_OFFSET);
> 	offset[1] = vmd->dev->resource[VMD_MEMBAR2].start -
> 			readq(membar2 + MB2_SHADOW_OFFSET + 8);
> ...
> 	pci_add_resource(&resources, &vmd->resources[0]);
> 	pci_add_resource_offset(&resources, &vmd->resources[1], offset[0]);
> 	pci_add_resource_offset(&resources, &vmd->resources[2], offset[1]);
> 
> 	vmd->bus = pci_create_root_bus(&vmd->dev->dev, vmd->busn_start,
> 				       &vmd_ops, sd, &resources);
> 
> 
> The offset becomes the CPU-to-bridge translation function for
> programming the guest's VMD domain with bus addresses.
> 
> 
> In the patched guest's lspci, for the bridge you see:
> # lspci -v -s 10000:00:02.0
> 10000:00:02.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port C
> 	...
> 	Memory behind bridge: 94000000-943fffff
> 
> 
> But the kernel doesn't export the offset BAR for the endpoint:
> # lspci -v -s 10000:02:00.0
> 10000:02:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD
> 	...
> 	Memory at f8810000 (64-bit, non-prefetchable) [size=16K]
> 	[virtual] Expansion ROM at f8800000 [disabled] [size=64K]
> 
> The endpoint is still programmed with the offset:
> # setpci -s 10000:02:00.0 10.l
> 94810004
> 
> 
> > 
> > Sorry if I'm way off, I'm piecing things together from scant
> > information here.  Please Cc me on future vfio related patches.  Thanks,
> > 
> > Alex
> >   
> No problem

Thanks for the confirmation.  The approach seems ok, but a subtle
side-effect of MemoryRegion quirks is that we're going to trap the
entire PAGE_SIZE range overlapping the quirk out to QEMU on guest
access.  The two registers at 0x2000 might be reserved for this
purpose, but is there anything else that could be performance sensitive
anywhere in that page?  If so, it might be less transparent, but more
efficient to invent a new quirk making use of config space (ex. adding
an emulated vendor capability somewhere known to be unused on the
device).  Thanks,

Alex





^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH for QEMU v2] hw/vfio: Add VMD Passthrough Quirk
@ 2020-05-13 17:55         ` Alex Williamson
  0 siblings, 0 replies; 25+ messages in thread
From: Alex Williamson @ 2020-05-13 17:55 UTC (permalink / raw)
  To: Derrick, Jonathan
  Cc: hch, linux-pci, lorenzo.pieralisi, helgaas, virtualization,
	qemu-devel, andrzej.jakowski

On Wed, 13 May 2020 00:35:47 +0000
"Derrick, Jonathan" <jonathan.derrick@intel.com> wrote:

> Hi Alex,
> 
> I'm probably not getting the translation technical details correct.
> 
> On Mon, 2020-05-11 at 16:59 -0600, Alex Williamson wrote:
> > On Mon, 11 May 2020 15:01:27 -0400
> > Jon Derrick <jonathan.derrick@intel.com> wrote:
> >   
> > > The VMD endpoint provides a real PCIe domain to the guest, including  
> > 
> > Please define VMD.  I'm sure this is obvious to many, but I've had to
> > do some research.  The best TL;DR summary I've found is Keith's
> > original commit 185a383ada2e adding the controller to Linux.  If there's
> > something better, please let me know.  
> That's the correct commit, but I'll try to summarize the important bits
> for v3.
> 
> >   
> > > bridges and endpoints. Because the VMD domain is enumerated by the guest
> > > kernel, the guest kernel will assign Guest Physical Addresses to the
> > > downstream endpoint BARs and bridge windows.
> > > 
> > > When the guest kernel performs MMIO to VMD sub-devices, IOMMU will
> > > translate from the guest address space to the physical address space.
> > > Because the bridges have been programmed with guest addresses, the
> > > bridges will reject the transaction containing physical addresses.  
> > 
> > I'm lost, what IOMMU is involved in CPU access to MMIO space?  My guess
> > is that since all MMIO of this domain is mapped behind the host
> > endpoint BARs 2 & 4 that QEMU simply accesses it via mapping of those
> > BARs into the VM, so it's the MMU, not the IOMMU performing those GPA
> > to HPA translations.  But then presumably the bridges within the domain
> > are scrambled because their apertures are programmed with ranges that
> > don't map into the VMD endpoint BARs.  Is that remotely correct?  Some
> > /proc/iomem output and/or lspci listing from the host to see how this
> > works would be useful.  
> Correct. So MMU not IOMMU.
> 
> In the guest kernel, the bridges and devices in the VMD domain are
> programmed with the addresses provided in the VMD endpoint's BAR2&4
> (MEMBAR1&2). Because these BARs are populated with guest addresses, the
> MMU translates them to host physical addresses, and the bridge windows
> reject MMIO that falls outside their [GPA] ranges.
> 
> As an example:
> Host:
>   94000000-97ffffff : 0000:17:05.5
>     94000000-97ffffff : VMD MEMBAR1
>       94000000-943fffff : PCI Bus 10000:01
>         94000000-9400ffff : 10000:01:00.0
>         94010000-94013fff : 10000:01:00.0
>           94010000-94013fff : nvme
>       94400000-947fffff : PCI Bus 10000:01
>       94800000-94bfffff : PCI Bus 10000:02
>         94800000-9480ffff : 10000:02:00.0
>         94810000-94813fff : 10000:02:00.0
>           94810000-94813fff : nvme
>       94c00000-94ffffff : PCI Bus 10000:02
> 
> 
> MEMBAR 2 is similarly assigned
> 
> >   
> > > VMD device 28C0 natively assists passthrough by providing the Host
> > > Physical Address in shadow registers accessible to the guest for bridge
> > > window assignment. The shadow registers are valid if bit 1 is set in VMD
> > > VMLOCK config register 0x70. Future VMDs will also support this feature.
> > > Existing VMDs have config register 0x70 reserved, and will return 0 on
> > > reads.  
> > 
> > So these shadow registers are simply exposing the host BAR2 & BAR4
> > addresses into the guest, so the quirk is dependent on reading those
> > values from the device before anyone has written to them and the BAR
> > emulation in the kernel kicks in (not a problem, just an observation).  
> Nothing is expected to write that resource, and those registers are
> read-only. The first 0x2000 of MEMBAR2 (BAR4) contains MSI-X tables,
> and mappings to subordinate buses are aligned on 1MB boundaries.
> 
> 
> > Does the VMD controller code then use these bases addresses to program
> > the bridges/endpoint within the domain?  What does the same /proc/iomem
> > or lspci look like inside the guest then?  It seems like we'd see the
> > VMD endpoint with GPA BARs, but the devices within the domain using
> > HPAs.  If that's remotely true, and we're not forcing an identity
> > mapping of this HPA range into the GPA, does the vmd controller driver
> > impose a TRA function on these MMIO addresses in the guest?  
> 
> This is the guest with the guest addresses:
>   f8000000-fbffffff : 0000:00:07.0
>     f8000000-fbffffff : VMD MEMBAR1
>       f8000000-f83fffff : PCI Bus 10000:01
>         f8000000-f800ffff : 10000:01:00.0
>         f8010000-f8013fff : 10000:01:00.0
>           f8010000-f8013fff : nvme
>       f8400000-f87fffff : PCI Bus 10000:01
>       f8800000-f8bfffff : PCI Bus 10000:02
>         f8800000-f880ffff : 10000:02:00.0
>         f8810000-f8813fff : 10000:02:00.0
>           f8810000-f8813fff : nvme
>       f8c00000-f8ffffff : PCI Bus 10000:02
> 
> 
> The VMD guest driver does the translation on the supplied address using
> pci_add_resource_offset prior to creating the root bus and enumerating
> the domain:
> 
> 	offset[0] = vmd->dev->resource[VMD_MEMBAR1].start -
> 			readq(membar2 + MB2_SHADOW_OFFSET);
> 	offset[1] = vmd->dev->resource[VMD_MEMBAR2].start -
> 			readq(membar2 + MB2_SHADOW_OFFSET + 8);
> ...
> 	pci_add_resource(&resources, &vmd->resources[0]);
> 	pci_add_resource_offset(&resources, &vmd->resources[1], offset[0]);
> 	pci_add_resource_offset(&resources, &vmd->resources[2], offset[1]);
> 
> 	vmd->bus = pci_create_root_bus(&vmd->dev->dev, vmd->busn_start,
> 				       &vmd_ops, sd, &resources);
> 
> 
> The offset becomes the CPU-to-bridge translation function for
> programming the guest's VMD domain with bus addresses.
> 
> 
> In the patched guest's lspci, for the bridge you see:
> # lspci -v -s 10000:00:02.0
> 10000:00:02.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port C
> 	...
> 	Memory behind bridge: 94000000-943fffff
> 
> 
> But the kernel doesn't export the offset BAR for the endpoint:
> # lspci -v -s 10000:02:00.0
> 10000:02:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD
> 	...
> 	Memory at f8810000 (64-bit, non-prefetchable) [size=16K]
> 	[virtual] Expansion ROM at f8800000 [disabled] [size=64K]
> 
> The endpoint is still programmed with the offset:
> # setpci -s 10000:02:00.0 10.l
> 94810004
> 
> 
> > 
> > Sorry if I'm way off, I'm piecing things together from scant
> > information here.  Please Cc me on future vfio related patches.  Thanks,
> > 
> > Alex
> >   
> No problem

Thanks for the confirmation.  The approach seems ok, but a subtle
side-effect of MemoryRegion quirks is that we're going to trap the
entire PAGE_SIZE range overlapping the quirk out to QEMU on guest
access.  The two registers at 0x2000 might be reserved for this
purpose, but is there anything else that could be performance sensitive
anywhere in that page?  If so, it might be less transparent, but more
efficient to invent a new quirk making use of config space (ex. adding
an emulated vendor capability somewhere known to be unused on the
device).  Thanks,

Alex

> > > In order to support existing VMDs, this quirk emulates the VMLOCK and
> > > HPA shadow registers for all VMD device ids which don't natively assist
> > > with passthrough. The Linux VMD driver is updated to allow existing VMD
> > > devices to query VMLOCK for passthrough support.
> > > 
> > > Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
> > > ---
> > >  hw/vfio/pci-quirks.c | 103 +++++++++++++++++++++++++++++++++++++++++++
> > >  hw/vfio/pci.c        |   7 +++
> > >  hw/vfio/pci.h        |   2 +
> > >  hw/vfio/trace-events |   3 ++
> > >  4 files changed, 115 insertions(+)
> > > 
> > > diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
> > > index 2d348f8237..4060a6a95d 100644
> > > --- a/hw/vfio/pci-quirks.c
> > > +++ b/hw/vfio/pci-quirks.c
> > > @@ -1709,3 +1709,106 @@ free_exit:
> > >  
> > >      return ret;
> > >  }
> > > +
> > > +/*
> > > + * The VMD endpoint provides a real PCIe domain to the guest and the guest
> > > + * kernel performs enumeration of the VMD sub-device domain. Guest transactions
> > > + * to VMD sub-devices go through IOMMU translation from guest addresses to
> > > + * physical addresses. When MMIO goes to an endpoint after being translated to
> > > + * physical addresses, the bridge rejects the transaction because the window
> > > + * has been programmed with guest addresses.
> > > + *
> > > + * VMD can use the Host Physical Address in order to correctly program the
> > > + * bridge windows in its PCIe domain. VMD device 28C0 has HPA shadow registers
> > > + * located at offset 0x2000 in MEMBAR2 (BAR 4). The shadow registers are valid
> > > + * if bit 1 is set in the VMD VMLOCK config register 0x70. VMD devices without
> > > + * this native assistance can have these registers safely emulated as these
> > > + * registers are reserved.
> > > + */
> > > +typedef struct VFIOVMDQuirk {
> > > +    VFIOPCIDevice *vdev;
> > > +    uint64_t membar_phys[2];
> > > +} VFIOVMDQuirk;
> > > +
> > > +static uint64_t vfio_vmd_quirk_read(void *opaque, hwaddr addr, unsigned size)
> > > +{
> > > +    VFIOVMDQuirk *data = opaque;
> > > +    uint64_t val = 0;
> > > +
> > > +    memcpy(&val, (void *)data->membar_phys + addr, size);
> > > +    return val;
> > > +}
> > > +
> > > +static const MemoryRegionOps vfio_vmd_quirk = {
> > > +    .read = vfio_vmd_quirk_read,
> > > +    .endianness = DEVICE_LITTLE_ENDIAN,
> > > +};
> > > +
> > > +#define VMD_VMLOCK  0x70
> > > +#define VMD_SHADOW  0x2000
> > > +#define VMD_MEMBAR2 4
> > > +
> > > +static int vfio_vmd_emulate_shadow_registers(VFIOPCIDevice *vdev)
> > > +{
> > > +    VFIOQuirk *quirk;
> > > +    VFIOVMDQuirk *data;
> > > +    PCIDevice *pdev = &vdev->pdev;
> > > +    int ret;
> > > +
> > > +    data = g_malloc0(sizeof(*data));
> > > +    ret = pread(vdev->vbasedev.fd, data->membar_phys, 16,
> > > +                vdev->config_offset + PCI_BASE_ADDRESS_2);
> > > +    if (ret != 16) {
> > > +        error_report("VMD %s cannot read MEMBARs (%d)",
> > > +                     vdev->vbasedev.name, ret);
> > > +        g_free(data);
> > > +        return -EFAULT;
> > > +    }
> > > +
> > > +    quirk = vfio_quirk_alloc(1);
> > > +    quirk->data = data;
> > > +    data->vdev = vdev;
> > > +
> > > +    /* Emulate Shadow Registers */
> > > +    memory_region_init_io(quirk->mem, OBJECT(vdev), &vfio_vmd_quirk, data,
> > > +                          "vfio-vmd-quirk", sizeof(data->membar_phys));
> > > +    memory_region_add_subregion_overlap(vdev->bars[VMD_MEMBAR2].region.mem,
> > > +                                        VMD_SHADOW, quirk->mem, 1);
> > > +    memory_region_set_readonly(quirk->mem, true);
> > > +    memory_region_set_enabled(quirk->mem, true);
> > > +
> > > +    QLIST_INSERT_HEAD(&vdev->bars[VMD_MEMBAR2].quirks, quirk, next);
> > > +
> > > +    trace_vfio_pci_vmd_quirk_shadow_regs(vdev->vbasedev.name,
> > > +                                         data->membar_phys[0],
> > > +                                         data->membar_phys[1]);
> > > +
> > > +    /* Advertise Shadow Register support */
> > > +    pci_byte_test_and_set_mask(pdev->config + VMD_VMLOCK, 0x2);
> > > +    pci_set_byte(pdev->wmask + VMD_VMLOCK, 0);
> > > +    pci_set_byte(vdev->emulated_config_bits + VMD_VMLOCK, 0x2);
> > > +
> > > +    trace_vfio_pci_vmd_quirk_vmlock(vdev->vbasedev.name,
> > > +                                    pci_get_byte(pdev->config + VMD_VMLOCK));
> > > +
> > > +    return 0;
> > > +}
> > > +
> > > +int vfio_pci_vmd_init(VFIOPCIDevice *vdev)
> > > +{
> > > +    int ret = 0;
> > > +
> > > +    switch (vdev->device_id) {
> > > +    case 0x28C0: /* Native passthrough support */
> > > +        break;
> > > +    /* Emulates Native passthrough support */
> > > +    case 0x201D:
> > > +    case 0x467F:
> > > +    case 0x4C3D:
> > > +    case 0x9A0B:
> > > +        ret = vfio_vmd_emulate_shadow_registers(vdev);
> > > +        break;
> > > +    }
> > > +
> > > +    return ret;
> > > +}
> > > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > > index 5e75a95129..85425a1a6f 100644
> > > --- a/hw/vfio/pci.c
> > > +++ b/hw/vfio/pci.c
> > > @@ -3024,6 +3024,13 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
> > >          }
> > >      }
> > >  
> > > +    if (vdev->vendor_id == PCI_VENDOR_ID_INTEL) {
> > > +        ret = vfio_pci_vmd_init(vdev);
> > > +        if (ret) {
> > > +            error_report("Failed to setup VMD");
> > > +        }
> > > +    }
> > > +
> > >      vfio_register_err_notifier(vdev);
> > >      vfio_register_req_notifier(vdev);
> > >      vfio_setup_resetfn_quirk(vdev);
> > > diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> > > index 0da7a20a7e..e8632d806b 100644
> > > --- a/hw/vfio/pci.h
> > > +++ b/hw/vfio/pci.h
> > > @@ -217,6 +217,8 @@ int vfio_pci_igd_opregion_init(VFIOPCIDevice *vdev,
> > >  int vfio_pci_nvidia_v100_ram_init(VFIOPCIDevice *vdev, Error **errp);
> > >  int vfio_pci_nvlink2_init(VFIOPCIDevice *vdev, Error **errp);
> > >  
> > > +int vfio_pci_vmd_init(VFIOPCIDevice *vdev);
> > > +
> > >  void vfio_display_reset(VFIOPCIDevice *vdev);
> > >  int vfio_display_probe(VFIOPCIDevice *vdev, Error **errp);
> > >  void vfio_display_finalize(VFIOPCIDevice *vdev);
> > > diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> > > index b1ef55a33f..ed64e477db 100644
> > > --- a/hw/vfio/trace-events
> > > +++ b/hw/vfio/trace-events
> > > @@ -90,6 +90,9 @@ vfio_pci_nvidia_gpu_setup_quirk(const char *name, uint64_t tgt, uint64_t size) "
> > >  vfio_pci_nvlink2_setup_quirk_ssatgt(const char *name, uint64_t tgt, uint64_t size) "%s tgt=0x%"PRIx64" size=0x%"PRIx64
> > >  vfio_pci_nvlink2_setup_quirk_lnkspd(const char *name, uint32_t link_speed) "%s link_speed=0x%x"
> > >  
> > > +vfio_pci_vmd_quirk_shadow_regs(const char *name, uint64_t mb1, uint64_t mb2) "%s membar1_phys=0x%"PRIx64" membar2_phys=0x%"PRIx64
> > > +vfio_pci_vmd_quirk_vmlock(const char *name, uint8_t vmlock) "%s vmlock=0x%x"
> > > +
> > >  # common.c
> > >  vfio_region_write(const char *name, int index, uint64_t addr, uint64_t data, unsigned size) " (%s:region%d+0x%"PRIx64", 0x%"PRIx64 ", %d)"
> > >  vfio_region_read(char *name, int index, uint64_t addr, unsigned size, uint64_t data) " (%s:region%d+0x%"PRIx64", %d) = 0x%"PRIx64  

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH for QEMU v2] hw/vfio: Add VMD Passthrough Quirk
  2020-05-13 17:55         ` Alex Williamson
@ 2020-05-13 19:26           ` Derrick, Jonathan
  -1 siblings, 0 replies; 25+ messages in thread
From: Derrick, Jonathan @ 2020-05-13 19:26 UTC (permalink / raw)
  To: alex.williamson
  Cc: hch, linux-pci, lorenzo.pieralisi, helgaas, virtualization,
	qemu-devel, andrzej.jakowski

On Wed, 2020-05-13 at 11:55 -0600, Alex Williamson wrote:
> On Wed, 13 May 2020 00:35:47 +0000
> "Derrick, Jonathan" <jonathan.derrick@intel.com> wrote:
> 
> > Hi Alex,
> > 
> > 
[snip]

> 
> Thanks for the confirmation.  The approach seems ok, but a subtle
> side-effect of MemoryRegion quirks is that we're going to trap the
> entire PAGE_SIZE range overlapping the quirk out to QEMU on guest
> access.  The two registers at 0x2000 might be reserved for this
> purpose, but is there anything else that could be performance sensitive
> anywhere in that page?  If so, it might be less transparent, but more
> efficient to invent a new quirk making use of config space (ex. adding
> an emulated vendor capability somewhere known to be unused on the
> device).  Thanks,
> 
> Alex

Seems like that could be a problem if we're running with huge pages and
overlapping MSI-X tables.

Gerd also recommended adding a vendor-specific capability. I don't see
any vendor-specific capabilities we could accidentally shadow, so I'll
go ahead and do that.

Thanks,
Jon

^ permalink raw reply	[flat|nested] 25+ messages in thread

> > > > @@ -90,6 +90,9 @@ vfio_pci_nvidia_gpu_setup_quirk(const char *name, uint64_t tgt, uint64_t size) "
> > > >  vfio_pci_nvlink2_setup_quirk_ssatgt(const char *name, uint64_t tgt, uint64_t size) "%s tgt=0x%"PRIx64" size=0x%"PRIx64
> > > >  vfio_pci_nvlink2_setup_quirk_lnkspd(const char *name, uint32_t link_speed) "%s link_speed=0x%x"
> > > >  
> > > > +vfio_pci_vmd_quirk_shadow_regs(const char *name, uint64_t mb1, uint64_t mb2) "%s membar1_phys=0x%"PRIx64" membar2_phys=0x%"PRIx64
> > > > +vfio_pci_vmd_quirk_vmlock(const char *name, uint8_t vmlock) "%s vmlock=0x%x"
> > > > +
> > > >  # common.c
> > > >  vfio_region_write(const char *name, int index, uint64_t addr, uint64_t data, unsigned size) " (%s:region%d+0x%"PRIx64", 0x%"PRIx64 ", %d)"
> > > >  vfio_region_read(char *name, int index, uint64_t addr, unsigned size, uint64_t data) " (%s:region%d+0x%"PRIx64", %d) = 0x%"PRIx64  

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH for QEMU v2] hw/vfio: Add VMD Passthrough Quirk
@ 2020-05-13 19:55             ` Alex Williamson
  0 siblings, 0 replies; 25+ messages in thread
From: Alex Williamson @ 2020-05-13 19:55 UTC (permalink / raw)
  To: Derrick, Jonathan
  Cc: hch, linux-pci, lorenzo.pieralisi, helgaas, virtualization,
	qemu-devel, andrzej.jakowski

On Wed, 13 May 2020 19:26:34 +0000
"Derrick, Jonathan" <jonathan.derrick@intel.com> wrote:

> On Wed, 2020-05-13 at 11:55 -0600, Alex Williamson wrote:
> > On Wed, 13 May 2020 00:35:47 +0000
> > "Derrick, Jonathan" <jonathan.derrick@intel.com> wrote:
> >   
> > > Hi Alex,
> > > 
> > >   
> [snip]
> 
> > 
> > Thanks for the confirmation.  The approach seems ok, but a subtle
> > side-effect of MemoryRegion quirks is that we're going to trap the
> > entire PAGE_SIZE range overlapping the quirk out to QEMU on guest
> > access.  The two registers at 0x2000 might be reserved for this
> > purpose, but is there anything else that could be performance sensitive
> > anywhere in that page?  If so, it might be less transparent, but more
> > efficient to invent a new quirk making use of config space (ex. adding
> > an emulated vendor capability somewhere known to be unused on the
> > device).  Thanks,
> > 
> > Alex  
> 
> Seems like that could be a problem if we're running with huge pages and
> overlapping msix tables.

FWIW, MSI-X tables are already getting trapped into QEMU for emulation.
We have a mechanism via x-msix-relocation=<OffAutoPCIBAR> in QEMU to
deal with that when it becomes a problem by emulating the MSI-X
structures in MMIO space that doesn't overlap with the actual device
(ie. virtually resizing or adding BARs).  The issue is what else can be
in that 4K page or will this device be supported on archs where the
system page size is >4K more so than huge pages (as in hugetlbfs or
transparent huge pages).  Thanks,

Alex


^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2020-05-13 19:55 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-11 19:01 [PATCH v2 0/2] VMD endpoint passthrough support Jon Derrick
2020-05-11 19:01 ` [PATCH for QEMU v2] hw/vfio: Add VMD Passthrough Quirk Jon Derrick
2020-05-11 22:59   ` Alex Williamson
2020-05-13  0:35     ` Derrick, Jonathan
2020-05-13 17:55       ` Alex Williamson
2020-05-13 19:26         ` Derrick, Jonathan
2020-05-13 19:55           ` Alex Williamson
2020-05-11 19:01 ` [PATCH v2 1/2] PCI: vmd: Filter resource type bits from shadow register Jon Derrick
2020-05-11 19:01 ` [PATCH v2 2/2] PCI: vmd: Use Shadow MEMBAR registers for QEMU/KVM guests Jon Derrick
