linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/5] vfio-pci: Add support for mmapping MSI-X table
@ 2016-04-27 12:43 Yongji Xie
  2016-04-27 12:43 ` [PATCH 1/5] PCI: Add a new PCI_BUS_FLAGS_MSI_REMAP flag Yongji Xie
                   ` (4 more replies)
  0 siblings, 5 replies; 39+ messages in thread
From: Yongji Xie @ 2016-04-27 12:43 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-pci, linuxppc-dev, iommu
  Cc: alex.williamson, bhelgaas, aik, benh, paulus, mpe, joro, warrier,
	zhong, nikunj, eric.auger, will.deacon, gwshan, David.Laight,
	alistair, ruscur, Yongji Xie

Current vfio-pci implementation disallows to mmap the page 
containing MSI-X table in case that users can write to MSI-X
table and generate an incorrect MSIs.

However, this will cause some performance issue when there
are some critical device registers in the same page as the 
MSI-X table. We have to handle the mmio access to these 
registers in QEMU emulation rather than in guest.

To solve this issue, this series allows to mmap MSI-X table 
when hardware supports the capability of interrupt remapping 
which can ensure that a given pci device can only shoot the 
MSIs assigned for it. And we introduce a new bus_flags 
PCI_BUS_FLAGS_MSI_REMAP to test this capability on PCI side 
for different archs.

The patch 3 are based on the proposed patchset[1].

[1] http://www.spinics.net/lists/kvm/msg130256.html

Yongji Xie (5):
  PCI: Add a new PCI_BUS_FLAGS_MSI_REMAP flag
  iommu: Set PCI_BUS_FLAGS_MSI_REMAP if IOMMU have capability of IRQ remapping
  PCI: Set PCI_BUS_FLAGS_MSI_REMAP if MSI controller supports IRQ remapping
  pci-ioda: Set PCI_BUS_FLAGS_MSI_REMAP for IODA host bridge
  vfio-pci: Allow to mmap MSI-X table if interrupt remapping is supported

 arch/powerpc/platforms/powernv/pci-ioda.c |    8 ++++++++
 drivers/iommu/iommu.c                     |   15 +++++++++++++++
 drivers/pci/msi.c                         |   12 ++++++++++++
 drivers/pci/probe.c                       |    3 +++
 drivers/vfio/pci/vfio_pci.c               |    7 +++++--
 drivers/vfio/pci/vfio_pci_rdwr.c          |    3 ++-
 include/linux/msi.h                       |    6 +++++-
 include/linux/pci.h                       |    1 +
 8 files changed, 51 insertions(+), 4 deletions(-)

-- 
1.7.9.5

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 1/5] PCI: Add a new PCI_BUS_FLAGS_MSI_REMAP flag
  2016-04-27 12:43 [PATCH 0/5] vfio-pci: Add support for mmapping MSI-X table Yongji Xie
@ 2016-04-27 12:43 ` Yongji Xie
  2016-05-24 20:55   ` Bjorn Helgaas
  2016-04-27 12:43 ` [PATCH 2/5] iommu: Set PCI_BUS_FLAGS_MSI_REMAP if IOMMU have capability of IRQ remapping Yongji Xie
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 39+ messages in thread
From: Yongji Xie @ 2016-04-27 12:43 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-pci, linuxppc-dev, iommu
  Cc: alex.williamson, bhelgaas, aik, benh, paulus, mpe, joro, warrier,
	zhong, nikunj, eric.auger, will.deacon, gwshan, David.Laight,
	alistair, ruscur, Yongji Xie

We introduce a new pci_bus_flags, PCI_BUS_FLAGS_MSI_REMAP
which indicates all devices on the bus are protected by the
hardware which supports IRQ remapping(intel naming).

This flag will be used to know whether it's safe to expose
MSI-X tables of PCI BARs to userspace. Because the capability
of IRQ remapping can guarantee the PCI device cannot trigger
MSIs that correspond to interrupt IDs of other devices.

There is a existing flag for this in the IOMMU space:

enum iommu_cap {
	IOMMU_CAP_CACHE_COHERENCY,
--->	IOMMU_CAP_INTR_REMAP,
	IOMMU_CAP_NOEXEC,
};

and Eric also posted a patchset [1] to abstract this
capability on MSI controller side for ARM. But it would
make sense to have a more common flag like
PCI_BUS_FLAGS_MSI_REMAP in this patch so that we can use
a universal flag to test this capability on PCI side for
different archs.

[1] http://www.spinics.net/lists/kvm/msg130256.html

Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
---
 include/linux/pci.h |    1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/pci.h b/include/linux/pci.h
index 27df4a6..d619228 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -193,6 +193,7 @@ typedef unsigned short __bitwise pci_bus_flags_t;
 enum pci_bus_flags {
 	PCI_BUS_FLAGS_NO_MSI   = (__force pci_bus_flags_t) 1,
 	PCI_BUS_FLAGS_NO_MMRBC = (__force pci_bus_flags_t) 2,
+	PCI_BUS_FLAGS_MSI_REMAP = (__force pci_bus_flags_t) 4,
 };
 
 /* These values come from the PCI Express Spec */
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 2/5] iommu: Set PCI_BUS_FLAGS_MSI_REMAP if IOMMU have capability of IRQ remapping
  2016-04-27 12:43 [PATCH 0/5] vfio-pci: Add support for mmapping MSI-X table Yongji Xie
  2016-04-27 12:43 ` [PATCH 1/5] PCI: Add a new PCI_BUS_FLAGS_MSI_REMAP flag Yongji Xie
@ 2016-04-27 12:43 ` Yongji Xie
  2016-05-24 21:11   ` Bjorn Helgaas
  2016-04-27 12:43 ` [PATCH 3/5] PCI: Set PCI_BUS_FLAGS_MSI_REMAP if MSI controller supports " Yongji Xie
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 39+ messages in thread
From: Yongji Xie @ 2016-04-27 12:43 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-pci, linuxppc-dev, iommu
  Cc: alex.williamson, bhelgaas, aik, benh, paulus, mpe, joro, warrier,
	zhong, nikunj, eric.auger, will.deacon, gwshan, David.Laight,
	alistair, ruscur, Yongji Xie

The capability of IRQ remapping is abstracted on IOMMU side on
some archs. There is a existing flag IOMMU_CAP_INTR_REMAP for this.

To have a universal flag to test this capability for different
archs on PCI side, we set PCI_BUS_FLAGS_MSI_REMAP for PCI buses
when IOMMU_CAP_INTR_REMAP is set.

Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
---
 drivers/iommu/iommu.c |   15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 0e3b009..5d2b6f6 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -813,6 +813,16 @@ struct iommu_group *pci_device_group(struct device *dev)
 	return group;
 }
 
+static void pci_check_msi_remapping(struct pci_dev *pdev,
+					const struct iommu_ops *ops)
+{
+	struct pci_bus *bus = pdev->bus;
+
+	if (ops->capable(IOMMU_CAP_INTR_REMAP) &&
+		!(bus->bus_flags & PCI_BUS_FLAGS_MSI_REMAP))
+		bus->bus_flags |= PCI_BUS_FLAGS_MSI_REMAP;
+}
+
 /**
  * iommu_group_get_for_dev - Find or create the IOMMU group for a device
  * @dev: target device
@@ -871,6 +881,9 @@ static int add_iommu_group(struct device *dev, void *data)
 	const struct iommu_ops *ops = cb->ops;
 	int ret;
 
+	if (dev_is_pci(dev) && ops->capable)
+		pci_check_msi_remapping(to_pci_dev(dev), ops);
+
 	if (!ops->add_device)
 		return 0;
 
@@ -913,6 +926,8 @@ static int iommu_bus_notifier(struct notifier_block *nb,
 	 * result in ADD/DEL notifiers to group->notifier
 	 */
 	if (action == BUS_NOTIFY_ADD_DEVICE) {
+		if (dev_is_pci(dev) && ops->capable)
+			pci_check_msi_remapping(to_pci_dev(dev), ops);
 		if (ops->add_device)
 			return ops->add_device(dev);
 	} else if (action == BUS_NOTIFY_REMOVED_DEVICE) {
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 3/5] PCI: Set PCI_BUS_FLAGS_MSI_REMAP if MSI controller supports IRQ remapping
  2016-04-27 12:43 [PATCH 0/5] vfio-pci: Add support for mmapping MSI-X table Yongji Xie
  2016-04-27 12:43 ` [PATCH 1/5] PCI: Add a new PCI_BUS_FLAGS_MSI_REMAP flag Yongji Xie
  2016-04-27 12:43 ` [PATCH 2/5] iommu: Set PCI_BUS_FLAGS_MSI_REMAP if IOMMU have capability of IRQ remapping Yongji Xie
@ 2016-04-27 12:43 ` Yongji Xie
  2016-05-24 21:04   ` Bjorn Helgaas
  2016-04-27 12:43 ` [PATCH 4/5] pci-ioda: Set PCI_BUS_FLAGS_MSI_REMAP for IODA host bridge Yongji Xie
  2016-04-27 12:43 ` [PATCH 5/5] vfio-pci: Allow to mmap MSI-X table if interrupt remapping is supported Yongji Xie
  4 siblings, 1 reply; 39+ messages in thread
From: Yongji Xie @ 2016-04-27 12:43 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-pci, linuxppc-dev, iommu
  Cc: alex.williamson, bhelgaas, aik, benh, paulus, mpe, joro, warrier,
	zhong, nikunj, eric.auger, will.deacon, gwshan, David.Laight,
	alistair, ruscur, Yongji Xie

On ARM HW the capability of IRQ remapping is abstracted on
MSI controller side. MSI_FLAG_IRQ_REMAPPING is used to advertise
this [1].

To have a universal flag to test this capability for different
archs on PCI side, we set PCI_BUS_FLAGS_MSI_REMAP for PCI buses
when MSI_FLAG_IRQ_REMAPPING is set.

[1] http://www.spinics.net/lists/kvm/msg130256.html

Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
---
 drivers/pci/msi.c   |   12 ++++++++++++
 drivers/pci/probe.c |    3 +++
 include/linux/msi.h |    6 +++++-
 3 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index a080f44..1661cdf 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -1134,6 +1134,18 @@ void *msi_desc_to_pci_sysdata(struct msi_desc *desc)
 }
 EXPORT_SYMBOL_GPL(msi_desc_to_pci_sysdata);
 
+void pci_bus_check_msi_remapping(struct pci_bus *bus,
+				 struct irq_domain *domain)
+{
+#ifdef CONFIG_PCI_MSI_IRQ_DOMAIN
+	struct msi_domain_info *info;
+
+	info = msi_get_domain_info(domain);
+	if (info->flags & MSI_FLAG_IRQ_REMAPPING)
+		bus->bus_flags |= PCI_BUS_FLAGS_MSI_REMAP;
+#endif
+}
+
 #ifdef CONFIG_PCI_MSI_IRQ_DOMAIN
 /**
  * pci_msi_domain_write_msg - Helper to write MSI message to PCI config space
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 6d7ab9b..25cf1b1 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -696,6 +696,9 @@ static void pci_set_bus_msi_domain(struct pci_bus *bus)
 	if (!d)
 		d = pci_host_bridge_msi_domain(b);
 
+	if (d && b == bus)
+		pci_bus_check_msi_remapping(bus, d);
+
 	dev_set_msi_domain(&bus->dev, d);
 }
 
diff --git a/include/linux/msi.h b/include/linux/msi.h
index 03eda72..b4c649e 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -15,6 +15,8 @@ extern int pci_msi_ignore_mask;
 struct irq_data;
 struct msi_desc;
 struct pci_dev;
+struct pci_bus;
+struct irq_domain;
 struct platform_msi_priv_data;
 void __get_cached_msi_msg(struct msi_desc *entry, struct msi_msg *msg);
 void get_cached_msi_msg(unsigned int irq, struct msi_msg *msg);
@@ -155,6 +157,9 @@ void arch_restore_msi_irqs(struct pci_dev *dev);
 void default_teardown_msi_irqs(struct pci_dev *dev);
 void default_restore_msi_irqs(struct pci_dev *dev);
 
+void pci_bus_check_msi_remapping(struct pci_bus *bus,
+				 struct irq_domain *domain);
+
 struct msi_controller {
 	struct module *owner;
 	struct device *dev;
@@ -173,7 +178,6 @@ struct msi_controller {
 #include <linux/irqhandler.h>
 #include <asm/msi.h>
 
-struct irq_domain;
 struct irq_domain_ops;
 struct irq_chip;
 struct device_node;
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 4/5] pci-ioda: Set PCI_BUS_FLAGS_MSI_REMAP for IODA host bridge
  2016-04-27 12:43 [PATCH 0/5] vfio-pci: Add support for mmapping MSI-X table Yongji Xie
                   ` (2 preceding siblings ...)
  2016-04-27 12:43 ` [PATCH 3/5] PCI: Set PCI_BUS_FLAGS_MSI_REMAP if MSI controller supports " Yongji Xie
@ 2016-04-27 12:43 ` Yongji Xie
  2016-05-06  6:34   ` Alexey Kardashevskiy
  2016-04-27 12:43 ` [PATCH 5/5] vfio-pci: Allow to mmap MSI-X table if interrupt remapping is supported Yongji Xie
  4 siblings, 1 reply; 39+ messages in thread
From: Yongji Xie @ 2016-04-27 12:43 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-pci, linuxppc-dev, iommu
  Cc: alex.williamson, bhelgaas, aik, benh, paulus, mpe, joro, warrier,
	zhong, nikunj, eric.auger, will.deacon, gwshan, David.Laight,
	alistair, ruscur, Yongji Xie

Any IODA host bridge have the capability of IRQ remapping.
So we set PCI_BUS_FLAGS_MSI_REMAP when this kind of host birdge
is detected.

Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/powernv/pci-ioda.c |    8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index f90dc04..9557638 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -3080,6 +3080,12 @@ static void pnv_pci_ioda_fixup(void)
 	pnv_npu_ioda_fixup();
 }
 
+int pnv_pci_ioda_root_bridge_prepare(struct pci_host_bridge *bridge)
+{
+	bridge->bus->bus_flags |= PCI_BUS_FLAGS_MSI_REMAP;
+	return 0;
+}
+
 /*
  * Returns the alignment for I/O or memory windows for P2P
  * bridges. That actually depends on how PEs are segmented.
@@ -3364,6 +3370,8 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 	 */
 	ppc_md.pcibios_fixup = pnv_pci_ioda_fixup;
 
+	ppc_md.pcibios_root_bridge_prepare = pnv_pci_ioda_root_bridge_prepare;
+
 	if (phb->type == PNV_PHB_NPU)
 		hose->controller_ops = pnv_npu_ioda_controller_ops;
 	else
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 5/5] vfio-pci: Allow to mmap MSI-X table if interrupt remapping is supported
  2016-04-27 12:43 [PATCH 0/5] vfio-pci: Add support for mmapping MSI-X table Yongji Xie
                   ` (3 preceding siblings ...)
  2016-04-27 12:43 ` [PATCH 4/5] pci-ioda: Set PCI_BUS_FLAGS_MSI_REMAP for IODA host bridge Yongji Xie
@ 2016-04-27 12:43 ` Yongji Xie
  2016-05-03  5:34   ` Tian, Kevin
  4 siblings, 1 reply; 39+ messages in thread
From: Yongji Xie @ 2016-04-27 12:43 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-pci, linuxppc-dev, iommu
  Cc: alex.williamson, bhelgaas, aik, benh, paulus, mpe, joro, warrier,
	zhong, nikunj, eric.auger, will.deacon, gwshan, David.Laight,
	alistair, ruscur, Yongji Xie

This patch enables mmapping MSI-X tables if hardware supports
interrupt remapping which can ensure that a given pci device
can only shoot the MSIs assigned for it.

With MSI-X table mmapped, we also need to expose the
read/write interface which will be used to access MSI-X table.

Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
---
 drivers/vfio/pci/vfio_pci.c      |    7 +++++--
 drivers/vfio/pci/vfio_pci_rdwr.c |    3 ++-
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 98059df..7eba77d 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -591,7 +591,9 @@ static long vfio_pci_ioctl(void *device_data,
 			    pci_resource_flags(pdev, info.index) &
 			    IORESOURCE_MEM && info.size >= PAGE_SIZE) {
 				info.flags |= VFIO_REGION_INFO_FLAG_MMAP;
-				if (info.index == vdev->msix_bar) {
+				if (info.index == vdev->msix_bar &&
+					!(pdev->bus->bus_flags &
+					PCI_BUS_FLAGS_MSI_REMAP)) {
 					ret = msix_sparse_mmap_cap(vdev, &caps);
 					if (ret)
 						return ret;
@@ -1023,7 +1025,8 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
 	if (phys_len < PAGE_SIZE || req_start + req_len > phys_len)
 		return -EINVAL;
 
-	if (index == vdev->msix_bar) {
+	if (index == vdev->msix_bar &&
+		!(pdev->bus->bus_flags & PCI_BUS_FLAGS_MSI_REMAP)) {
 		/*
 		 * Disallow mmaps overlapping the MSI-X table; users don't
 		 * get to touch this directly.  We could find somewhere
diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
index 5ffd1d9..dbf9cd0 100644
--- a/drivers/vfio/pci/vfio_pci_rdwr.c
+++ b/drivers/vfio/pci/vfio_pci_rdwr.c
@@ -164,7 +164,8 @@ ssize_t vfio_pci_bar_rw(struct vfio_pci_device *vdev, char __user *buf,
 	} else
 		io = vdev->barmap[bar];
 
-	if (bar == vdev->msix_bar) {
+	if (bar == vdev->msix_bar &&
+		!(pdev->bus->bus_flags & PCI_BUS_FLAGS_MSI_REMAP)) {
 		x_start = vdev->msix_offset;
 		x_end = vdev->msix_offset + vdev->msix_size;
 	}
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* RE: [PATCH 5/5] vfio-pci: Allow to mmap MSI-X table if interrupt remapping is supported
  2016-04-27 12:43 ` [PATCH 5/5] vfio-pci: Allow to mmap MSI-X table if interrupt remapping is supported Yongji Xie
@ 2016-05-03  5:34   ` Tian, Kevin
  2016-05-03  6:08     ` Yongji Xie
  0 siblings, 1 reply; 39+ messages in thread
From: Tian, Kevin @ 2016-05-03  5:34 UTC (permalink / raw)
  To: Yongji Xie, kvm, linux-kernel, linux-pci, linuxppc-dev, iommu
  Cc: alex.williamson, bhelgaas, aik, benh, paulus, mpe, joro, warrier,
	zhong, nikunj, eric.auger, will.deacon, gwshan, David.Laight,
	alistair, ruscur

> From: Yongji Xie
> Sent: Wednesday, April 27, 2016 8:43 PM
> 
> This patch enables mmapping MSI-X tables if hardware supports
> interrupt remapping which can ensure that a given pci device
> can only shoot the MSIs assigned for it.
> 
> With MSI-X table mmapped, we also need to expose the
> read/write interface which will be used to access MSI-X table.
> 
> Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>

A curious question here. Does "allow to mmap MSI-X" essentially
mean that KVM guest can directly read/write physical MSI-X
structure then?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/5] vfio-pci: Allow to mmap MSI-X table if interrupt remapping is supported
  2016-05-03  5:34   ` Tian, Kevin
@ 2016-05-03  6:08     ` Yongji Xie
  2016-05-03  6:22       ` Tian, Kevin
  0 siblings, 1 reply; 39+ messages in thread
From: Yongji Xie @ 2016-05-03  6:08 UTC (permalink / raw)
  To: Tian, Kevin, kvm, linux-kernel, linux-pci, linuxppc-dev, iommu
  Cc: alex.williamson, bhelgaas, aik, benh, paulus, mpe, joro, warrier,
	zhong, nikunj, eric.auger, will.deacon, gwshan, David.Laight,
	alistair, ruscur

On 2016/5/3 13:34, Tian, Kevin wrote:

>> From: Yongji Xie
>> Sent: Wednesday, April 27, 2016 8:43 PM
>>
>> This patch enables mmapping MSI-X tables if hardware supports
>> interrupt remapping which can ensure that a given pci device
>> can only shoot the MSIs assigned for it.
>>
>> With MSI-X table mmapped, we also need to expose the
>> read/write interface which will be used to access MSI-X table.
>>
>> Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
> A curious question here. Does "allow to mmap MSI-X" essentially
> mean that KVM guest can directly read/write physical MSI-X
> structure then?
>
> Thanks
> Kevin
>

Here we just allow to mmap MSI-X table in kernel. It doesn't
mean all KVM guest can directly read/write physical MSI-X
structure. This should be decided by QEMU. For PPC64
platform, we would allow to passthrough the MSI-X table
because we know guest kernel would not write physical
MSI-X structure when enabling MSI.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: [PATCH 5/5] vfio-pci: Allow to mmap MSI-X table if interrupt remapping is supported
  2016-05-03  6:08     ` Yongji Xie
@ 2016-05-03  6:22       ` Tian, Kevin
  2016-05-03  7:34         ` Yongji Xie
  0 siblings, 1 reply; 39+ messages in thread
From: Tian, Kevin @ 2016-05-03  6:22 UTC (permalink / raw)
  To: Yongji Xie, kvm, linux-kernel, linux-pci, linuxppc-dev, iommu
  Cc: alex.williamson, bhelgaas, aik, benh, paulus, mpe, joro, warrier,
	zhong, nikunj, eric.auger, will.deacon, gwshan, David.Laight,
	alistair, ruscur

> From: Yongji Xie [mailto:xyjxie@linux.vnet.ibm.com]
> Sent: Tuesday, May 03, 2016 2:08 PM
> 
> On 2016/5/3 13:34, Tian, Kevin wrote:
> 
> >> From: Yongji Xie
> >> Sent: Wednesday, April 27, 2016 8:43 PM
> >>
> >> This patch enables mmapping MSI-X tables if hardware supports
> >> interrupt remapping which can ensure that a given pci device
> >> can only shoot the MSIs assigned for it.
> >>
> >> With MSI-X table mmapped, we also need to expose the
> >> read/write interface which will be used to access MSI-X table.
> >>
> >> Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
> > A curious question here. Does "allow to mmap MSI-X" essentially
> > mean that KVM guest can directly read/write physical MSI-X
> > structure then?
> >
> > Thanks
> > Kevin
> >
> 
> Here we just allow to mmap MSI-X table in kernel. It doesn't
> mean all KVM guest can directly read/write physical MSI-X
> structure. This should be decided by QEMU. For PPC64
> platform, we would allow to passthrough the MSI-X table
> because we know guest kernel would not write physical
> MSI-X structure when enabling MSI.
> 

A bit confused here. If guest kernel doesn't need to write
physical MSI-X structure, what's the point of passing through
the table then?

I think the key whether MSI-X table can be passed through
is related to where hypervisor control is deployed. At least
for x86:

- When irq remapping is not enabled, host/hypervisor needs
to control physical interrupt message including vector/dest/etc.
directly in MSI-X structure, so we cannot allow a guest to 
access it;

- when irq remapping is enabled, host/hypervisor can control
interrupt routing in irq remapping table. However MSI-X
also needs to be configured as remappable format. In this
manner we also cannot allow direct access from guest.

The only sane case to pass through MSI-X structure, is a
mechanism similar to irq remapping but w/o need to change
original MSI-X format so direct access from guest side is
safe. Is it the case in PPC64?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/5] vfio-pci: Allow to mmap MSI-X table if interrupt remapping is supported
  2016-05-03  6:22       ` Tian, Kevin
@ 2016-05-03  7:34         ` Yongji Xie
  2016-05-05  9:36           ` Tian, Kevin
  0 siblings, 1 reply; 39+ messages in thread
From: Yongji Xie @ 2016-05-03  7:34 UTC (permalink / raw)
  To: Tian, Kevin, kvm, linux-kernel, linux-pci, linuxppc-dev, iommu
  Cc: alex.williamson, bhelgaas, aik, benh, paulus, mpe, joro, warrier,
	zhong, nikunj, eric.auger, will.deacon, gwshan, David.Laight,
	alistair, ruscur

On 2016/5/3 14:22, Tian, Kevin wrote:

>> From: Yongji Xie [mailto:xyjxie@linux.vnet.ibm.com]
>> Sent: Tuesday, May 03, 2016 2:08 PM
>>
>> On 2016/5/3 13:34, Tian, Kevin wrote:
>>
>>>> From: Yongji Xie
>>>> Sent: Wednesday, April 27, 2016 8:43 PM
>>>>
>>>> This patch enables mmapping MSI-X tables if hardware supports
>>>> interrupt remapping which can ensure that a given pci device
>>>> can only shoot the MSIs assigned for it.
>>>>
>>>> With MSI-X table mmapped, we also need to expose the
>>>> read/write interface which will be used to access MSI-X table.
>>>>
>>>> Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
>>> A curious question here. Does "allow to mmap MSI-X" essentially
>>> mean that KVM guest can directly read/write physical MSI-X
>>> structure then?
>>>
>>> Thanks
>>> Kevin
>>>
>> Here we just allow to mmap MSI-X table in kernel. It doesn't
>> mean all KVM guest can directly read/write physical MSI-X
>> structure. This should be decided by QEMU. For PPC64
>> platform, we would allow to passthrough the MSI-X table
>> because we know guest kernel would not write physical
>> MSI-X structure when enabling MSI.
>>
> A bit confused here. If guest kernel doesn't need to write
> physical MSI-X structure, what's the point of passing through
> the table then?

We want to allow the MSI-X table because there may be
some critical registers in the same page as the MSI-X table.
We have to handle the mmio access to these register in QEMU
rather than in guest if mmapping MSI-X table is disallowed.

> I think the key whether MSI-X table can be passed through
> is related to where hypervisor control is deployed. At least
> for x86:
>
> - When irq remapping is not enabled, host/hypervisor needs
> to control physical interrupt message including vector/dest/etc.
> directly in MSI-X structure, so we cannot allow a guest to
> access it;
>
> - when irq remapping is enabled, host/hypervisor can control
> interrupt routing in irq remapping table. However MSI-X
> also needs to be configured as remappable format. In this
> manner we also cannot allow direct access from guest.
>
> The only sane case to pass through MSI-X structure, is a
> mechanism similar to irq remapping but w/o need to change
> original MSI-X format so direct access from guest side is
> safe. Is it the case in PPC64?
>
> Thanks
> Kevin

Acutually, we are not aimed at accessing MSI-X table from
guest. So I think it's safe to passthrough MSI-X table if we
can make sure guest kernel would not touch MSI-X table in
normal code path such as para-virtualized guest kernel on PPC64.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: [PATCH 5/5] vfio-pci: Allow to mmap MSI-X table if interrupt remapping is supported
  2016-05-03  7:34         ` Yongji Xie
@ 2016-05-05  9:36           ` Tian, Kevin
  2016-05-05  9:54             ` David Laight
  2016-05-05 11:44             ` Yongji Xie
  0 siblings, 2 replies; 39+ messages in thread
From: Tian, Kevin @ 2016-05-05  9:36 UTC (permalink / raw)
  To: Yongji Xie, kvm, linux-kernel, linux-pci, linuxppc-dev, iommu
  Cc: alex.williamson, bhelgaas, aik, benh, paulus, mpe, joro, warrier,
	zhong, nikunj, eric.auger, will.deacon, gwshan, David.Laight,
	alistair, ruscur

> From: Yongji Xie
> Sent: Tuesday, May 03, 2016 3:34 PM
> 
> On 2016/5/3 14:22, Tian, Kevin wrote:
> 
> >> From: Yongji Xie [mailto:xyjxie@linux.vnet.ibm.com]
> >> Sent: Tuesday, May 03, 2016 2:08 PM
> >>
> >> On 2016/5/3 13:34, Tian, Kevin wrote:
> >>
> >>>> From: Yongji Xie
> >>>> Sent: Wednesday, April 27, 2016 8:43 PM
> >>>>
> >>>> This patch enables mmapping MSI-X tables if hardware supports
> >>>> interrupt remapping which can ensure that a given pci device
> >>>> can only shoot the MSIs assigned for it.
> >>>>
> >>>> With MSI-X table mmapped, we also need to expose the
> >>>> read/write interface which will be used to access MSI-X table.
> >>>>
> >>>> Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
> >>> A curious question here. Does "allow to mmap MSI-X" essentially
> >>> mean that KVM guest can directly read/write physical MSI-X
> >>> structure then?
> >>>
> >>> Thanks
> >>> Kevin
> >>>
> >> Here we just allow to mmap MSI-X table in kernel. It doesn't
> >> mean all KVM guest can directly read/write physical MSI-X
> >> structure. This should be decided by QEMU. For PPC64
> >> platform, we would allow to passthrough the MSI-X table
> >> because we know guest kernel would not write physical
> >> MSI-X structure when enabling MSI.
> >>
> > A bit confused here. If guest kernel doesn't need to write
> > physical MSI-X structure, what's the point of passing through
> > the table then?
> 
> We want to allow the MSI-X table because there may be
> some critical registers in the same page as the MSI-X table.
> We have to handle the mmio access to these register in QEMU
> rather than in guest if mmapping MSI-X table is disallowed.

So you mean critical registers in same MMIO BAR as MSI-X
table, instead of two MMIO BARs in same page (the latter I
suppose with your whole patchset it won't happen then)?

> 
> > I think the key whether MSI-X table can be passed through
> > is related to where hypervisor control is deployed. At least
> > for x86:
> >
> > - When irq remapping is not enabled, host/hypervisor needs
> > to control physical interrupt message including vector/dest/etc.
> > directly in MSI-X structure, so we cannot allow a guest to
> > access it;
> >
> > - when irq remapping is enabled, host/hypervisor can control
> > interrupt routing in irq remapping table. However MSI-X
> > also needs to be configured as remappable format. In this
> > manner we also cannot allow direct access from guest.
> >
> > The only sane case to pass through MSI-X structure, is a
> > mechanism similar to irq remapping but w/o need to change
> > original MSI-X format so direct access from guest side is
> > safe. Is it the case in PPC64?
> >
> > Thanks
> > Kevin
> 
> Acutually, we are not aimed at accessing MSI-X table from
> guest. So I think it's safe to passthrough MSI-X table if we
> can make sure guest kernel would not touch MSI-X table in
> normal code path such as para-virtualized guest kernel on PPC64.
> 

Then how do you prevent malicious guest kernel accessing it?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: [PATCH 5/5] vfio-pci: Allow to mmap MSI-X table if interrupt remapping is supported
  2016-05-05  9:36           ` Tian, Kevin
@ 2016-05-05  9:54             ` David Laight
  2016-05-05 11:42               ` Yongji Xie
  2016-05-05 11:44             ` Yongji Xie
  1 sibling, 1 reply; 39+ messages in thread
From: David Laight @ 2016-05-05  9:54 UTC (permalink / raw)
  To: 'Tian, Kevin',
	Yongji Xie, kvm, linux-kernel, linux-pci, linuxppc-dev, iommu
  Cc: alex.williamson, bhelgaas, aik, benh, paulus, mpe, joro, warrier,
	zhong, nikunj, eric.auger, will.deacon, gwshan, alistair, ruscur

From: Tian, Kevin
> Sent: 05 May 2016 10:37
...
> > Acutually, we are not aimed at accessing MSI-X table from
> > guest. So I think it's safe to passthrough MSI-X table if we
> > can make sure guest kernel would not touch MSI-X table in
> > normal code path such as para-virtualized guest kernel on PPC64.
> >
> 
> Then how do you prevent malicious guest kernel accessing it?

Or a malicious guest driver for an ethernet card setting up
the receive buffer ring to contain a single word entry that
contains the address associated with an MSI-X interrupt and
then using a loopback mode to cause a specific packet be
received that writes the required word through that address.

Remember the PCIe cycle for an interrupt is a normal memory write
cycle.

	David

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/5] vfio-pci: Allow to mmap MSI-X table if interrupt remapping is supported
  2016-05-05  9:54             ` David Laight
@ 2016-05-05 11:42               ` Yongji Xie
  2016-05-05 12:15                 ` Tian, Kevin
  0 siblings, 1 reply; 39+ messages in thread
From: Yongji Xie @ 2016-05-05 11:42 UTC (permalink / raw)
  To: David Laight, 'Tian, Kevin',
	kvm, linux-kernel, linux-pci, linuxppc-dev, iommu
  Cc: alex.williamson, bhelgaas, aik, benh, paulus, mpe, joro, warrier,
	zhong, nikunj, eric.auger, will.deacon, gwshan, alistair, ruscur

Hi David and Kevin,

On 2016/5/5 17:54, David Laight wrote:

> From: Tian, Kevin
>> Sent: 05 May 2016 10:37
> ...
>>> Acutually, we are not aimed at accessing MSI-X table from
>>> guest. So I think it's safe to passthrough MSI-X table if we
>>> can make sure guest kernel would not touch MSI-X table in
>>> normal code path such as para-virtualized guest kernel on PPC64.
>>>
>> Then how do you prevent malicious guest kernel accessing it?
> Or a malicious guest driver for an ethernet card setting up
> the receive buffer ring to contain a single word entry that
> contains the address associated with an MSI-X interrupt and
> then using a loopback mode to cause a specific packet be
> received that writes the required word through that address.
>
> Remember the PCIe cycle for an interrupt is a normal memory write
> cycle.
>
> 	David
>

If we have enough permission to load a malicious driver or
kernel, we can easily break the guest without exposed
MSI-X table.

I think it should be safe to expose MSI-X table if we can
make sure that malicious guest driver/kernel can't use
the MSI-X table to break other guest or host. The
capability of IRQ remapping could provide this
kind of protection.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/5] vfio-pci: Allow to mmap MSI-X table if interrupt remapping is supported
  2016-05-05  9:36           ` Tian, Kevin
  2016-05-05  9:54             ` David Laight
@ 2016-05-05 11:44             ` Yongji Xie
  1 sibling, 0 replies; 39+ messages in thread
From: Yongji Xie @ 2016-05-05 11:44 UTC (permalink / raw)
  To: Tian, Kevin, kvm, linux-kernel, linux-pci, linuxppc-dev, iommu
  Cc: alex.williamson, bhelgaas, aik, benh, paulus, mpe, joro, warrier,
	zhong, nikunj, eric.auger, will.deacon, gwshan, David.Laight,
	alistair, ruscur

On 2016/5/5 17:36, Tian, Kevin wrote:

>> From: Yongji Xie
>> Sent: Tuesday, May 03, 2016 3:34 PM
>>
>> On 2016/5/3 14:22, Tian, Kevin wrote:
>>
>>>> From: Yongji Xie [mailto:xyjxie@linux.vnet.ibm.com]
>>>> Sent: Tuesday, May 03, 2016 2:08 PM
>>>>
>>>> On 2016/5/3 13:34, Tian, Kevin wrote:
>>>>
>>>>>> From: Yongji Xie
>>>>>> Sent: Wednesday, April 27, 2016 8:43 PM
>>>>>>
>>>>>> This patch enables mmapping MSI-X tables if hardware supports
>>>>>> interrupt remapping which can ensure that a given pci device
>>>>>> can only shoot the MSIs assigned for it.
>>>>>>
>>>>>> With MSI-X table mmapped, we also need to expose the
>>>>>> read/write interface which will be used to access MSI-X table.
>>>>>>
>>>>>> Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
>>>>> A curious question here. Does "allow to mmap MSI-X" essentially
>>>>> mean that KVM guest can directly read/write physical MSI-X
>>>>> structure then?
>>>>>
>>>>> Thanks
>>>>> Kevin
>>>>>
>>>> Here we just allow to mmap MSI-X table in kernel. It doesn't
>>>> mean all KVM guest can directly read/write physical MSI-X
>>>> structure. This should be decided by QEMU. For PPC64
>>>> platform, we would allow to passthrough the MSI-X table
>>>> because we know guest kernel would not write physical
>>>> MSI-X structure when enabling MSI.
>>>>
>>> A bit confused here. If guest kernel doesn't need to write
>>> physical MSI-X structure, what's the point of passing through
>>> the table then?
>> We want to allow the MSI-X table because there may be
>> some critical registers in the same page as the MSI-X table.
>> We have to handle the mmio access to these register in QEMU
>> rather than in guest if mmapping MSI-X table is disallowed.
> So you mean critical registers in same MMIO BAR as MSI-X
> table, instead of two MMIO BARs in same page (the latter I
> suppose with your whole patchset it won't happen then)?

Yes. That's what I mean!

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: [PATCH 5/5] vfio-pci: Allow to mmap MSI-X table if interrupt remapping is supported
  2016-05-05 11:42               ` Yongji Xie
@ 2016-05-05 12:15                 ` Tian, Kevin
  2016-05-05 13:28                   ` Yongji Xie
  2016-05-05 15:05                   ` Alex Williamson
  0 siblings, 2 replies; 39+ messages in thread
From: Tian, Kevin @ 2016-05-05 12:15 UTC (permalink / raw)
  To: Yongji Xie, David Laight, kvm, linux-kernel, linux-pci,
	linuxppc-dev, iommu
  Cc: alex.williamson, bhelgaas, aik, benh, paulus, mpe, joro, warrier,
	zhong, nikunj, eric.auger, will.deacon, gwshan, alistair, ruscur

> From: Yongji Xie [mailto:xyjxie@linux.vnet.ibm.com]
> Sent: Thursday, May 05, 2016 7:43 PM
> 
> Hi David and Kevin,
> 
> On 2016/5/5 17:54, David Laight wrote:
> 
> > From: Tian, Kevin
> >> Sent: 05 May 2016 10:37
> > ...
> >>> Acutually, we are not aimed at accessing MSI-X table from
> >>> guest. So I think it's safe to passthrough MSI-X table if we
> >>> can make sure guest kernel would not touch MSI-X table in
> >>> normal code path such as para-virtualized guest kernel on PPC64.
> >>>
> >> Then how do you prevent malicious guest kernel accessing it?
> > Or a malicious guest driver for an ethernet card setting up
> > the receive buffer ring to contain a single word entry that
> > contains the address associated with an MSI-X interrupt and
> > then using a loopback mode to cause a specific packet be
> > received that writes the required word through that address.
> >
> > Remember the PCIe cycle for an interrupt is a normal memory write
> > cycle.
> >
> > 	David
> >
> 
> If we have enough permission to load a malicious driver or
> kernel, we can easily break the guest without exposed
> MSI-X table.
> 
> I think it should be safe to expose MSI-X table if we can
> make sure that malicious guest driver/kernel can't use
> the MSI-X table to break other guest or host. The
> capability of IRQ remapping could provide this
> kind of protection.
> 

With IRQ remapping it doesn't mean you can pass through MSI-X
structure to guest. I know actual IRQ remapping might be platform
specific, but at least for Intel VT-d specification, MSI-X entry must
be configured with a remappable format by host kernel which
contains an index into IRQ remapping table. The index will find a
IRQ remapping entry which controls interrupt routing for a specific
device. If you allow a malicious program random index into MSI-X 
entry of assigned device, the hole is obvious...

Above might make sense only for a IRQ remapping implementation 
which doesn't rely on extended MSI-X format (e.g. simply based on 
BDF). If that's the case for PPC, then you should build MSI-X 
passthrough based on this fact instead of general IRQ remapping 
enabled or not.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/5] vfio-pci: Allow to mmap MSI-X table if interrupt remapping is supported
  2016-05-05 12:15                 ` Tian, Kevin
@ 2016-05-05 13:28                   ` Yongji Xie
  2016-05-05 15:05                   ` Alex Williamson
  1 sibling, 0 replies; 39+ messages in thread
From: Yongji Xie @ 2016-05-05 13:28 UTC (permalink / raw)
  To: Tian, Kevin, David Laight, kvm, linux-kernel, linux-pci,
	linuxppc-dev, iommu
  Cc: alex.williamson, bhelgaas, aik, benh, paulus, mpe, joro, warrier,
	zhong, nikunj, eric.auger, will.deacon, gwshan, alistair, ruscur

On 2016/5/5 20:15, Tian, Kevin wrote:

>> From: Yongji Xie [mailto:xyjxie@linux.vnet.ibm.com]
>> Sent: Thursday, May 05, 2016 7:43 PM
>>
>> Hi David and Kevin,
>>
>> On 2016/5/5 17:54, David Laight wrote:
>>
>>> From: Tian, Kevin
>>>> Sent: 05 May 2016 10:37
>>> ...
>>>>> Acutually, we are not aimed at accessing MSI-X table from
>>>>> guest. So I think it's safe to passthrough MSI-X table if we
>>>>> can make sure guest kernel would not touch MSI-X table in
>>>>> normal code path such as para-virtualized guest kernel on PPC64.
>>>>>
>>>> Then how do you prevent malicious guest kernel accessing it?
>>> Or a malicious guest driver for an ethernet card setting up
>>> the receive buffer ring to contain a single word entry that
>>> contains the address associated with an MSI-X interrupt and
>>> then using a loopback mode to cause a specific packet be
>>> received that writes the required word through that address.
>>>
>>> Remember the PCIe cycle for an interrupt is a normal memory write
>>> cycle.
>>>
>>> 	David
>>>
>> If we have enough permission to load a malicious driver or
>> kernel, we can easily break the guest without exposed
>> MSI-X table.
>>
>> I think it should be safe to expose MSI-X table if we can
>> make sure that malicious guest driver/kernel can't use
>> the MSI-X table to break other guest or host. The
>> capability of IRQ remapping could provide this
>> kind of protection.
>>
> With IRQ remapping it doesn't mean you can pass through MSI-X
> structure to guest. I know actual IRQ remapping might be platform
> specific, but at least for Intel VT-d specification, MSI-X entry must
> be configured with a remappable format by host kernel which
> contains an index into IRQ remapping table. The index will find a
> IRQ remapping entry which controls interrupt routing for a specific
> device. If you allow a malicious program random index into MSI-X
> entry of assigned device, the hole is obvious...

Do you mean we can trigger MSIs that correspond to interrupt
IDs of other devices by writing to MSI-X table although IRQ
remapping is enabled?

On PPC64, there is a mapping between MSIs and PE num
which can be used to identify a PCI device on PHB. So the
hardware can ensure a given pci device can only shoot the
MSIs assigned for it.  Isn't there a similar mapping in IRQ
remapping table on Intel.

Thanks,
Yongji

> Above might make sense only for a IRQ remapping implementation
> which doesn't rely on extended MSI-X format (e.g. simply based on
> BDF). If that's the case for PPC, then you should build MSI-X
> passthrough based on this fact instead of general IRQ remapping
> enabled or not.
>
> Thanks
> Kevin

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/5] vfio-pci: Allow to mmap MSI-X table if interrupt remapping is supported
  2016-05-05 12:15                 ` Tian, Kevin
  2016-05-05 13:28                   ` Yongji Xie
@ 2016-05-05 15:05                   ` Alex Williamson
  2016-05-06  6:35                     ` Alexey Kardashevskiy
  2016-05-11  6:29                     ` Tian, Kevin
  1 sibling, 2 replies; 39+ messages in thread
From: Alex Williamson @ 2016-05-05 15:05 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Yongji Xie, David Laight, kvm, linux-kernel, linux-pci,
	linuxppc-dev, iommu, bhelgaas, aik, benh, paulus, mpe, joro,
	warrier, zhong, nikunj, eric.auger, will.deacon, gwshan,
	alistair, ruscur

On Thu, 5 May 2016 12:15:46 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Yongji Xie [mailto:xyjxie@linux.vnet.ibm.com]
> > Sent: Thursday, May 05, 2016 7:43 PM
> > 
> > Hi David and Kevin,
> > 
> > On 2016/5/5 17:54, David Laight wrote:
> >   
> > > From: Tian, Kevin  
> > >> Sent: 05 May 2016 10:37  
> > > ...  
> > >>> Acutually, we are not aimed at accessing MSI-X table from
> > >>> guest. So I think it's safe to passthrough MSI-X table if we
> > >>> can make sure guest kernel would not touch MSI-X table in
> > >>> normal code path such as para-virtualized guest kernel on PPC64.
> > >>>  
> > >> Then how do you prevent malicious guest kernel accessing it?  
> > > Or a malicious guest driver for an ethernet card setting up
> > > the receive buffer ring to contain a single word entry that
> > > contains the address associated with an MSI-X interrupt and
> > > then using a loopback mode to cause a specific packet be
> > > received that writes the required word through that address.
> > >
> > > Remember the PCIe cycle for an interrupt is a normal memory write
> > > cycle.
> > >
> > > 	David
> > >  
> > 
> > If we have enough permission to load a malicious driver or
> > kernel, we can easily break the guest without exposed
> > MSI-X table.
> > 
> > I think it should be safe to expose MSI-X table if we can
> > make sure that malicious guest driver/kernel can't use
> > the MSI-X table to break other guest or host. The
> > capability of IRQ remapping could provide this
> > kind of protection.
> >   
> 
> With IRQ remapping it doesn't mean you can pass through MSI-X
> structure to guest. I know actual IRQ remapping might be platform
> specific, but at least for Intel VT-d specification, MSI-X entry must
> be configured with a remappable format by host kernel which
> contains an index into IRQ remapping table. The index will find a
> IRQ remapping entry which controls interrupt routing for a specific
> device. If you allow a malicious program random index into MSI-X 
> entry of assigned device, the hole is obvious...
> 
> Above might make sense only for a IRQ remapping implementation 
> which doesn't rely on extended MSI-X format (e.g. simply based on 
> BDF). If that's the case for PPC, then you should build MSI-X 
> passthrough based on this fact instead of general IRQ remapping 
> enabled or not.

I don't think anyone is expecting that we can expose the MSI-X vector
table to the guest and the guest can make direct use of it.  The end
goal here is that the guest on a power system is already
paravirtualized to not program the device MSI-X by directly writing to
the MSI-X vector table.  They have hypercalls for this since they
always run virtualized.  Therefore a) they never intend to touch the
MSI-X vector table and b) they have sufficient isolation that a guest
can only hurt itself by doing so.

On x86 we don't have a), our method of programming the MSI-X vector
table is to directly write to it. Therefore we will always require QEMU
to place a MemoryRegion over the vector table to intercept those
accesses.  However with interrupt remapping, we do have b) on x86, which
means that we don't need to be so strict in disallowing user accesses
to the MSI-X vector table.  It's not useful for configuring MSI-X on
the device, but the user should only be able to hurt themselves by
writing it directly.  x86 doesn't really get anything out of this
change, but it helps this special case on power pretty significantly
aiui.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/5] pci-ioda: Set PCI_BUS_FLAGS_MSI_REMAP for IODA host bridge
  2016-04-27 12:43 ` [PATCH 4/5] pci-ioda: Set PCI_BUS_FLAGS_MSI_REMAP for IODA host bridge Yongji Xie
@ 2016-05-06  6:34   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 39+ messages in thread
From: Alexey Kardashevskiy @ 2016-05-06  6:34 UTC (permalink / raw)
  To: Yongji Xie, kvm, linux-kernel, linux-pci, linuxppc-dev, iommu
  Cc: alex.williamson, bhelgaas, benh, paulus, mpe, joro, warrier,
	zhong, nikunj, eric.auger, will.deacon, gwshan, David.Laight,
	alistair, ruscur

On 04/27/2016 10:43 PM, Yongji Xie wrote:
> Any IODA host bridge have the capability of IRQ remapping.
> So we set PCI_BUS_FLAGS_MSI_REMAP when this kind of host birdge
> is detected.
>
> Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>


Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>


> ---
>  arch/powerpc/platforms/powernv/pci-ioda.c |    8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index f90dc04..9557638 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -3080,6 +3080,12 @@ static void pnv_pci_ioda_fixup(void)
>  	pnv_npu_ioda_fixup();
>  }
>
> +int pnv_pci_ioda_root_bridge_prepare(struct pci_host_bridge *bridge)
> +{
> +	bridge->bus->bus_flags |= PCI_BUS_FLAGS_MSI_REMAP;
> +	return 0;
> +}
> +
>  /*
>   * Returns the alignment for I/O or memory windows for P2P
>   * bridges. That actually depends on how PEs are segmented.
> @@ -3364,6 +3370,8 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
>  	 */
>  	ppc_md.pcibios_fixup = pnv_pci_ioda_fixup;
>
> +	ppc_md.pcibios_root_bridge_prepare = pnv_pci_ioda_root_bridge_prepare;
> +
>  	if (phb->type == PNV_PHB_NPU)
>  		hose->controller_ops = pnv_npu_ioda_controller_ops;
>  	else
>


-- 
Alexey

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/5] vfio-pci: Allow to mmap MSI-X table if interrupt remapping is supported
  2016-05-05 15:05                   ` Alex Williamson
@ 2016-05-06  6:35                     ` Alexey Kardashevskiy
  2016-05-06 16:54                       ` Alex Williamson
  2016-05-11  6:29                     ` Tian, Kevin
  1 sibling, 1 reply; 39+ messages in thread
From: Alexey Kardashevskiy @ 2016-05-06  6:35 UTC (permalink / raw)
  To: Alex Williamson, Tian, Kevin
  Cc: Yongji Xie, David Laight, kvm, linux-kernel, linux-pci,
	linuxppc-dev, iommu, bhelgaas, benh, paulus, mpe, joro, warrier,
	zhong, nikunj, eric.auger, will.deacon, gwshan, alistair, ruscur

On 05/06/2016 01:05 AM, Alex Williamson wrote:
> On Thu, 5 May 2016 12:15:46 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
>
>>> From: Yongji Xie [mailto:xyjxie@linux.vnet.ibm.com]
>>> Sent: Thursday, May 05, 2016 7:43 PM
>>>
>>> Hi David and Kevin,
>>>
>>> On 2016/5/5 17:54, David Laight wrote:
>>>
>>>> From: Tian, Kevin
>>>>> Sent: 05 May 2016 10:37
>>>> ...
>>>>>> Acutually, we are not aimed at accessing MSI-X table from
>>>>>> guest. So I think it's safe to passthrough MSI-X table if we
>>>>>> can make sure guest kernel would not touch MSI-X table in
>>>>>> normal code path such as para-virtualized guest kernel on PPC64.
>>>>>>
>>>>> Then how do you prevent malicious guest kernel accessing it?
>>>> Or a malicious guest driver for an ethernet card setting up
>>>> the receive buffer ring to contain a single word entry that
>>>> contains the address associated with an MSI-X interrupt and
>>>> then using a loopback mode to cause a specific packet be
>>>> received that writes the required word through that address.
>>>>
>>>> Remember the PCIe cycle for an interrupt is a normal memory write
>>>> cycle.
>>>>
>>>> 	David
>>>>
>>>
>>> If we have enough permission to load a malicious driver or
>>> kernel, we can easily break the guest without exposed
>>> MSI-X table.
>>>
>>> I think it should be safe to expose MSI-X table if we can
>>> make sure that malicious guest driver/kernel can't use
>>> the MSI-X table to break other guest or host. The
>>> capability of IRQ remapping could provide this
>>> kind of protection.
>>>
>>
>> With IRQ remapping it doesn't mean you can pass through MSI-X
>> structure to guest. I know actual IRQ remapping might be platform
>> specific, but at least for Intel VT-d specification, MSI-X entry must
>> be configured with a remappable format by host kernel which
>> contains an index into IRQ remapping table. The index will find a
>> IRQ remapping entry which controls interrupt routing for a specific
>> device. If you allow a malicious program random index into MSI-X
>> entry of assigned device, the hole is obvious...
>>
>> Above might make sense only for a IRQ remapping implementation
>> which doesn't rely on extended MSI-X format (e.g. simply based on
>> BDF). If that's the case for PPC, then you should build MSI-X
>> passthrough based on this fact instead of general IRQ remapping
>> enabled or not.
>
> I don't think anyone is expecting that we can expose the MSI-X vector
> table to the guest and the guest can make direct use of it.  The end
> goal here is that the guest on a power system is already
> paravirtualized to not program the device MSI-X by directly writing to
> the MSI-X vector table.  They have hypercalls for this since they
> always run virtualized.  Therefore a) they never intend to touch the
> MSI-X vector table and b) they have sufficient isolation that a guest
> can only hurt itself by doing so.
>
> On x86 we don't have a), our method of programming the MSI-X vector
> table is to directly write to it. Therefore we will always require QEMU
> to place a MemoryRegion over the vector table to intercept those
> accesses.  However with interrupt remapping, we do have b) on x86, which
> means that we don't need to be so strict in disallowing user accesses
> to the MSI-X vector table.  It's not useful for configuring MSI-X on
> the device, but the user should only be able to hurt themselves by
> writing it directly.  x86 doesn't really get anything out of this
> change, but it helps this special case on power pretty significantly
> aiui.  Thanks,

Excellent short overview, saved :)

How do we proceed with these patches? Nobody seems objecting them but also 
nobody seems taking them either...




-- 
Alexey

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/5] vfio-pci: Allow to mmap MSI-X table if interrupt remapping is supported
  2016-05-06  6:35                     ` Alexey Kardashevskiy
@ 2016-05-06 16:54                       ` Alex Williamson
  0 siblings, 0 replies; 39+ messages in thread
From: Alex Williamson @ 2016-05-06 16:54 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Tian, Kevin, Yongji Xie, David Laight, kvm, linux-kernel,
	linux-pci, linuxppc-dev, iommu, bhelgaas, benh, paulus, mpe,
	joro, warrier, zhong, nikunj, eric.auger, will.deacon, gwshan,
	alistair, ruscur

On Fri, 6 May 2016 16:35:38 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 05/06/2016 01:05 AM, Alex Williamson wrote:
> > On Thu, 5 May 2016 12:15:46 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >  
> >>> From: Yongji Xie [mailto:xyjxie@linux.vnet.ibm.com]
> >>> Sent: Thursday, May 05, 2016 7:43 PM
> >>>
> >>> Hi David and Kevin,
> >>>
> >>> On 2016/5/5 17:54, David Laight wrote:
> >>>  
> >>>> From: Tian, Kevin  
> >>>>> Sent: 05 May 2016 10:37  
> >>>> ...  
> >>>>>> Acutually, we are not aimed at accessing MSI-X table from
> >>>>>> guest. So I think it's safe to passthrough MSI-X table if we
> >>>>>> can make sure guest kernel would not touch MSI-X table in
> >>>>>> normal code path such as para-virtualized guest kernel on PPC64.
> >>>>>>  
> >>>>> Then how do you prevent malicious guest kernel accessing it?  
> >>>> Or a malicious guest driver for an ethernet card setting up
> >>>> the receive buffer ring to contain a single word entry that
> >>>> contains the address associated with an MSI-X interrupt and
> >>>> then using a loopback mode to cause a specific packet be
> >>>> received that writes the required word through that address.
> >>>>
> >>>> Remember the PCIe cycle for an interrupt is a normal memory write
> >>>> cycle.
> >>>>
> >>>> 	David
> >>>>  
> >>>
> >>> If we have enough permission to load a malicious driver or
> >>> kernel, we can easily break the guest without exposed
> >>> MSI-X table.
> >>>
> >>> I think it should be safe to expose MSI-X table if we can
> >>> make sure that malicious guest driver/kernel can't use
> >>> the MSI-X table to break other guest or host. The
> >>> capability of IRQ remapping could provide this
> >>> kind of protection.
> >>>  
> >>
> >> With IRQ remapping it doesn't mean you can pass through MSI-X
> >> structure to guest. I know actual IRQ remapping might be platform
> >> specific, but at least for Intel VT-d specification, MSI-X entry must
> >> be configured with a remappable format by host kernel which
> >> contains an index into IRQ remapping table. The index will find a
> >> IRQ remapping entry which controls interrupt routing for a specific
> >> device. If you allow a malicious program random index into MSI-X
> >> entry of assigned device, the hole is obvious...
> >>
> >> Above might make sense only for a IRQ remapping implementation
> >> which doesn't rely on extended MSI-X format (e.g. simply based on
> >> BDF). If that's the case for PPC, then you should build MSI-X
> >> passthrough based on this fact instead of general IRQ remapping
> >> enabled or not.  
> >
> > I don't think anyone is expecting that we can expose the MSI-X vector
> > table to the guest and the guest can make direct use of it.  The end
> > goal here is that the guest on a power system is already
> > paravirtualized to not program the device MSI-X by directly writing to
> > the MSI-X vector table.  They have hypercalls for this since they
> > always run virtualized.  Therefore a) they never intend to touch the
> > MSI-X vector table and b) they have sufficient isolation that a guest
> > can only hurt itself by doing so.
> >
> > On x86 we don't have a), our method of programming the MSI-X vector
> > table is to directly write to it. Therefore we will always require QEMU
> > to place a MemoryRegion over the vector table to intercept those
> > accesses.  However with interrupt remapping, we do have b) on x86, which
> > means that we don't need to be so strict in disallowing user accesses
> > to the MSI-X vector table.  It's not useful for configuring MSI-X on
> > the device, but the user should only be able to hurt themselves by
> > writing it directly.  x86 doesn't really get anything out of this
> > change, but it helps this special case on power pretty significantly
> > aiui.  Thanks,  
> 
> Excellent short overview, saved :)
> 
> How do we proceed with these patches? Nobody seems objecting them but also 
> nobody seems taking them either...

Well, this series is still based on some non-upstream patches, so...
Once that dependency is resolved this series should probably be split
into functional areas for acceptance by the appropriate subsystem
maintainers.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: [PATCH 5/5] vfio-pci: Allow to mmap MSI-X table if interrupt remapping is supported
  2016-05-05 15:05                   ` Alex Williamson
  2016-05-06  6:35                     ` Alexey Kardashevskiy
@ 2016-05-11  6:29                     ` Tian, Kevin
  2016-05-11 15:53                       ` Alex Williamson
  1 sibling, 1 reply; 39+ messages in thread
From: Tian, Kevin @ 2016-05-11  6:29 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yongji Xie, David Laight, kvm, linux-kernel, linux-pci,
	linuxppc-dev, iommu, bhelgaas, aik, benh, paulus, mpe, joro,
	warrier, zhong, nikunj, eric.auger, will.deacon, gwshan,
	alistair, ruscur

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Thursday, May 05, 2016 11:05 PM
> 
> On Thu, 5 May 2016 12:15:46 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Yongji Xie [mailto:xyjxie@linux.vnet.ibm.com]
> > > Sent: Thursday, May 05, 2016 7:43 PM
> > >
> > > Hi David and Kevin,
> > >
> > > On 2016/5/5 17:54, David Laight wrote:
> > >
> > > > From: Tian, Kevin
> > > >> Sent: 05 May 2016 10:37
> > > > ...
> > > >>> Acutually, we are not aimed at accessing MSI-X table from
> > > >>> guest. So I think it's safe to passthrough MSI-X table if we
> > > >>> can make sure guest kernel would not touch MSI-X table in
> > > >>> normal code path such as para-virtualized guest kernel on PPC64.
> > > >>>
> > > >> Then how do you prevent malicious guest kernel accessing it?
> > > > Or a malicious guest driver for an ethernet card setting up
> > > > the receive buffer ring to contain a single word entry that
> > > > contains the address associated with an MSI-X interrupt and
> > > > then using a loopback mode to cause a specific packet be
> > > > received that writes the required word through that address.
> > > >
> > > > Remember the PCIe cycle for an interrupt is a normal memory write
> > > > cycle.
> > > >
> > > > 	David
> > > >
> > >
> > > If we have enough permission to load a malicious driver or
> > > kernel, we can easily break the guest without exposed
> > > MSI-X table.
> > >
> > > I think it should be safe to expose MSI-X table if we can
> > > make sure that malicious guest driver/kernel can't use
> > > the MSI-X table to break other guest or host. The
> > > capability of IRQ remapping could provide this
> > > kind of protection.
> > >
> >
> > With IRQ remapping it doesn't mean you can pass through MSI-X
> > structure to guest. I know actual IRQ remapping might be platform
> > specific, but at least for Intel VT-d specification, MSI-X entry must
> > be configured with a remappable format by host kernel which
> > contains an index into IRQ remapping table. The index will find a
> > IRQ remapping entry which controls interrupt routing for a specific
> > device. If you allow a malicious program random index into MSI-X
> > entry of assigned device, the hole is obvious...
> >
> > Above might make sense only for a IRQ remapping implementation
> > which doesn't rely on extended MSI-X format (e.g. simply based on
> > BDF). If that's the case for PPC, then you should build MSI-X
> > passthrough based on this fact instead of general IRQ remapping
> > enabled or not.
> 
> I don't think anyone is expecting that we can expose the MSI-X vector
> table to the guest and the guest can make direct use of it.  The end
> goal here is that the guest on a power system is already
> paravirtualized to not program the device MSI-X by directly writing to
> the MSI-X vector table.  They have hypercalls for this since they
> always run virtualized.  Therefore a) they never intend to touch the
> MSI-X vector table and b) they have sufficient isolation that a guest
> can only hurt itself by doing so.
> 
> On x86 we don't have a), our method of programming the MSI-X vector
> table is to directly write to it. Therefore we will always require QEMU
> to place a MemoryRegion over the vector table to intercept those
> accesses.  However with interrupt remapping, we do have b) on x86, which
> means that we don't need to be so strict in disallowing user accesses
> to the MSI-X vector table.  It's not useful for configuring MSI-X on
> the device, but the user should only be able to hurt themselves by
> writing it directly.  x86 doesn't really get anything out of this
> change, but it helps this special case on power pretty significantly
> aiui.  Thanks,
> 

Allowing guest direct write to MSI-x table has system-wide impact.
As I explained earlier, hypervisor needs to control "interrupt_index"
programmed in MSI-X entry, which is used to associate a specific
IRQ remapping entry. Now if you expose whole MSI-x table to guest, 
it can program random index into MSI-X entry to associate with 
any IRQ remapping entry and then there won't be any isolation per se.

You can check "5.5.2 MSI and MSI-X Register Programming" in VT-d
spec.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/5] vfio-pci: Allow to mmap MSI-X table if interrupt remapping is supported
  2016-05-11  6:29                     ` Tian, Kevin
@ 2016-05-11 15:53                       ` Alex Williamson
  2016-05-12  1:19                         ` Tian, Kevin
  0 siblings, 1 reply; 39+ messages in thread
From: Alex Williamson @ 2016-05-11 15:53 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Yongji Xie, David Laight, kvm, linux-kernel, linux-pci,
	linuxppc-dev, iommu, bhelgaas, aik, benh, paulus, mpe, joro,
	warrier, zhong, nikunj, eric.auger, will.deacon, gwshan,
	alistair, ruscur

On Wed, 11 May 2016 06:29:06 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Thursday, May 05, 2016 11:05 PM
> > 
> > On Thu, 5 May 2016 12:15:46 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >   
> > > > From: Yongji Xie [mailto:xyjxie@linux.vnet.ibm.com]
> > > > Sent: Thursday, May 05, 2016 7:43 PM
> > > >
> > > > Hi David and Kevin,
> > > >
> > > > On 2016/5/5 17:54, David Laight wrote:
> > > >  
> > > > > From: Tian, Kevin  
> > > > >> Sent: 05 May 2016 10:37  
> > > > > ...  
> > > > >>> Acutually, we are not aimed at accessing MSI-X table from
> > > > >>> guest. So I think it's safe to passthrough MSI-X table if we
> > > > >>> can make sure guest kernel would not touch MSI-X table in
> > > > >>> normal code path such as para-virtualized guest kernel on PPC64.
> > > > >>>  
> > > > >> Then how do you prevent malicious guest kernel accessing it?  
> > > > > Or a malicious guest driver for an ethernet card setting up
> > > > > the receive buffer ring to contain a single word entry that
> > > > > contains the address associated with an MSI-X interrupt and
> > > > > then using a loopback mode to cause a specific packet be
> > > > > received that writes the required word through that address.
> > > > >
> > > > > Remember the PCIe cycle for an interrupt is a normal memory write
> > > > > cycle.
> > > > >
> > > > > 	David
> > > > >  
> > > >
> > > > If we have enough permission to load a malicious driver or
> > > > kernel, we can easily break the guest without exposed
> > > > MSI-X table.
> > > >
> > > > I think it should be safe to expose MSI-X table if we can
> > > > make sure that malicious guest driver/kernel can't use
> > > > the MSI-X table to break other guest or host. The
> > > > capability of IRQ remapping could provide this
> > > > kind of protection.
> > > >  
> > >
> > > With IRQ remapping it doesn't mean you can pass through MSI-X
> > > structure to guest. I know actual IRQ remapping might be platform
> > > specific, but at least for Intel VT-d specification, MSI-X entry must
> > > be configured with a remappable format by host kernel which
> > > contains an index into IRQ remapping table. The index will find a
> > > IRQ remapping entry which controls interrupt routing for a specific
> > > device. If you allow a malicious program random index into MSI-X
> > > entry of assigned device, the hole is obvious...
> > >
> > > Above might make sense only for a IRQ remapping implementation
> > > which doesn't rely on extended MSI-X format (e.g. simply based on
> > > BDF). If that's the case for PPC, then you should build MSI-X
> > > passthrough based on this fact instead of general IRQ remapping
> > > enabled or not.  
> > 
> > I don't think anyone is expecting that we can expose the MSI-X vector
> > table to the guest and the guest can make direct use of it.  The end
> > goal here is that the guest on a power system is already
> > paravirtualized to not program the device MSI-X by directly writing to
> > the MSI-X vector table.  They have hypercalls for this since they
> > always run virtualized.  Therefore a) they never intend to touch the
> > MSI-X vector table and b) they have sufficient isolation that a guest
> > can only hurt itself by doing so.
> > 
> > On x86 we don't have a), our method of programming the MSI-X vector
> > table is to directly write to it. Therefore we will always require QEMU
> > to place a MemoryRegion over the vector table to intercept those
> > accesses.  However with interrupt remapping, we do have b) on x86, which
> > means that we don't need to be so strict in disallowing user accesses
> > to the MSI-X vector table.  It's not useful for configuring MSI-X on
> > the device, but the user should only be able to hurt themselves by
> > writing it directly.  x86 doesn't really get anything out of this
> > change, but it helps this special case on power pretty significantly
> > aiui.  Thanks,
> >   
> 
> Allowing guest direct write to MSI-x table has system-wide impact.
> As I explained earlier, hypervisor needs to control "interrupt_index"
> programmed in MSI-X entry, which is used to associate a specific
> IRQ remapping entry. Now if you expose whole MSI-x table to guest, 
> it can program random index into MSI-X entry to associate with 
> any IRQ remapping entry and then there won't be any isolation per se.
> 
> You can check "5.5.2 MSI and MSI-X Register Programming" in VT-d
> spec.

I think you're extrapolating beyond the vfio interface.  The change
here is to remove the vfio protection of the MSI-X vector table when
the system provides interrupt isolation.  The argument is that this is
safe to do because the hardware protects the host from erroneous and
malicious user programming, but it is not meant to provide a means to
program MSI-X directly through the vector table.  This is effectively
the same as general DMA programming, if the vfio programming model is
not followed the device generates iommu faults.  I do have a concern
that userspace driver writers are going to more often presume they can
use the vector table directly because of this change, but I don't know
that that is sufficient reason to prevent such a change.  They'll
quickly discover the device generates faults on interrupt rather than
working as expected.

The question of how this affects the hypervisor is completely
separate.  Vfio in the kernel is a userspace driver interface, not a
hypervisor.  QEMU is the hypervisor.  We have no plans to provide the VM
with direct access to the MSI-X vector table for x86 guests on QEMU.
There will still be a MemoryRegion emulating access to the vector table
in order to translate writes into vfio interrupt ioctls.  POWER would
drop the MemoryRegion so that the full page is mapped to the guest,
with the expectation that the guest never makes use of it since MSI-X
is always configured via hypercalls on POWER systems.  Likewise I
expect ARM will still make use of the MemoryRegion emulating the vector
table, which leaves them exposed to the performance issue POWER is
trying to solve here since ARM also has 64k page support and has no
paravirtualized MSI-X programming interface afaik.  x86 is not
impervious to this issue either, but a 4k page size falls within the
PCI spec recommendations for MSI-X structure alignment, so it's much
more rare to have issues. We have certainly seen hardware vendors that
ignore the PCI spec alignment recommendations, but so far only for
placing device registers within the same page as the PBA, which is an
easier problem to deal with since the PBA is relatively unused by
drivers.  This may be an area where we need to develop a paravirt
interface for MSI-X programming which disable the MemoryRegion
emulating the vector table when used.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: [PATCH 5/5] vfio-pci: Allow to mmap MSI-X table if interrupt remapping is supported
  2016-05-11 15:53                       ` Alex Williamson
@ 2016-05-12  1:19                         ` Tian, Kevin
  2016-05-12  2:20                           ` Alex Williamson
  0 siblings, 1 reply; 39+ messages in thread
From: Tian, Kevin @ 2016-05-12  1:19 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yongji Xie, David Laight, kvm, linux-kernel, linux-pci,
	linuxppc-dev, iommu, bhelgaas, aik, benh, paulus, mpe, joro,
	warrier, zhong, nikunj, eric.auger, will.deacon, gwshan,
	alistair, ruscur

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Wednesday, May 11, 2016 11:54 PM
> 
> On Wed, 11 May 2016 06:29:06 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Thursday, May 05, 2016 11:05 PM
> > >
> > > On Thu, 5 May 2016 12:15:46 +0000
> > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > >
> > > > > From: Yongji Xie [mailto:xyjxie@linux.vnet.ibm.com]
> > > > > Sent: Thursday, May 05, 2016 7:43 PM
> > > > >
> > > > > Hi David and Kevin,
> > > > >
> > > > > On 2016/5/5 17:54, David Laight wrote:
> > > > >
> > > > > > From: Tian, Kevin
> > > > > >> Sent: 05 May 2016 10:37
> > > > > > ...
> > > > > >>> Acutually, we are not aimed at accessing MSI-X table from
> > > > > >>> guest. So I think it's safe to passthrough MSI-X table if we
> > > > > >>> can make sure guest kernel would not touch MSI-X table in
> > > > > >>> normal code path such as para-virtualized guest kernel on PPC64.
> > > > > >>>
> > > > > >> Then how do you prevent malicious guest kernel accessing it?
> > > > > > Or a malicious guest driver for an ethernet card setting up
> > > > > > the receive buffer ring to contain a single word entry that
> > > > > > contains the address associated with an MSI-X interrupt and
> > > > > > then using a loopback mode to cause a specific packet be
> > > > > > received that writes the required word through that address.
> > > > > >
> > > > > > Remember the PCIe cycle for an interrupt is a normal memory write
> > > > > > cycle.
> > > > > >
> > > > > > 	David
> > > > > >
> > > > >
> > > > > If we have enough permission to load a malicious driver or
> > > > > kernel, we can easily break the guest without exposed
> > > > > MSI-X table.
> > > > >
> > > > > I think it should be safe to expose MSI-X table if we can
> > > > > make sure that malicious guest driver/kernel can't use
> > > > > the MSI-X table to break other guest or host. The
> > > > > capability of IRQ remapping could provide this
> > > > > kind of protection.
> > > > >
> > > >
> > > > With IRQ remapping it doesn't mean you can pass through MSI-X
> > > > structure to guest. I know actual IRQ remapping might be platform
> > > > specific, but at least for Intel VT-d specification, MSI-X entry must
> > > > be configured with a remappable format by host kernel which
> > > > contains an index into IRQ remapping table. The index will find a
> > > > IRQ remapping entry which controls interrupt routing for a specific
> > > > device. If you allow a malicious program random index into MSI-X
> > > > entry of assigned device, the hole is obvious...
> > > >
> > > > Above might make sense only for a IRQ remapping implementation
> > > > which doesn't rely on extended MSI-X format (e.g. simply based on
> > > > BDF). If that's the case for PPC, then you should build MSI-X
> > > > passthrough based on this fact instead of general IRQ remapping
> > > > enabled or not.
> > >
> > > I don't think anyone is expecting that we can expose the MSI-X vector
> > > table to the guest and the guest can make direct use of it.  The end
> > > goal here is that the guest on a power system is already
> > > paravirtualized to not program the device MSI-X by directly writing to
> > > the MSI-X vector table.  They have hypercalls for this since they
> > > always run virtualized.  Therefore a) they never intend to touch the
> > > MSI-X vector table and b) they have sufficient isolation that a guest
> > > can only hurt itself by doing so.
> > >
> > > On x86 we don't have a), our method of programming the MSI-X vector
> > > table is to directly write to it. Therefore we will always require QEMU
> > > to place a MemoryRegion over the vector table to intercept those
> > > accesses.  However with interrupt remapping, we do have b) on x86, which
> > > means that we don't need to be so strict in disallowing user accesses
> > > to the MSI-X vector table.  It's not useful for configuring MSI-X on
> > > the device, but the user should only be able to hurt themselves by
> > > writing it directly.  x86 doesn't really get anything out of this
> > > change, but it helps this special case on power pretty significantly
> > > aiui.  Thanks,
> > >
> >
> > Allowing guest direct write to MSI-x table has system-wide impact.
> > As I explained earlier, hypervisor needs to control "interrupt_index"
> > programmed in MSI-X entry, which is used to associate a specific
> > IRQ remapping entry. Now if you expose whole MSI-x table to guest,
> > it can program random index into MSI-X entry to associate with
> > any IRQ remapping entry and then there won't be any isolation per se.
> >
> > You can check "5.5.2 MSI and MSI-X Register Programming" in VT-d
> > spec.
> 
> I think you're extrapolating beyond the vfio interface.  The change
> here is to remove the vfio protection of the MSI-X vector table when
> the system provides interrupt isolation.  The argument is that this is
> safe to do because the hardware protects the host from erroneous and
> malicious user programming, but it is not meant to provide a means to
> program MSI-X directly through the vector table.  This is effectively

Sorry I didn't get this point. Once we allow userspace to mmap MSI-X
table, isn't it equivalent to allowing userspace directly program vector
table? Or is there other mechanism to prevent direct programming?

> the same as general DMA programming, if the vfio programming model is
> not followed the device generates iommu faults.  I do have a concern
> that userspace driver writers are going to more often presume they can
> use the vector table directly because of this change, but I don't know
> that that is sufficient reason to prevent such a change.  They'll
> quickly discover the device generates faults on interrupt rather than
> working as expected.

If userspace can actually program vector table directly, there is not
always fault triggered. As long as MSI-X table is fully under control
of userspace, any interrupt index can be used here which may link
to a IRQ remapping entry allocated for other devices.

> 
> The question of how this affects the hypervisor is completely
> separate.  Vfio in the kernel is a userspace driver interface, not a
> hypervisor.  QEMU is the hypervisor.  We have no plans to provide the VM
> with direct access to the MSI-X vector table for x86 guests on QEMU.
> There will still be a MemoryRegion emulating access to the vector table
> in order to translate writes into vfio interrupt ioctls.  POWER would
> drop the MemoryRegion so that the full page is mapped to the guest,
> with the expectation that the guest never makes use of it since MSI-X
> is always configured via hypercalls on POWER systems.  Likewise I
> expect ARM will still make use of the MemoryRegion emulating the vector
> table, which leaves them exposed to the performance issue POWER is
> trying to solve here since ARM also has 64k page support and has no
> paravirtualized MSI-X programming interface afaik.  x86 is not
> impervious to this issue either, but a 4k page size falls within the
> PCI spec recommendations for MSI-X structure alignment, so it's much
> more rare to have issues. We have certainly seen hardware vendors that
> ignore the PCI spec alignment recommendations, but so far only for
> placing device registers within the same page as the PBA, which is an
> easier problem to deal with since the PBA is relatively unused by
> drivers.  This may be an area where we need to develop a paravirt
> interface for MSI-X programming which disable the MemoryRegion
> emulating the vector table when used.  Thanks,
> 

I get this point. I incorrectly line allowing userspace mmap MSI-X
table to allowing guest direct access to MSI-X.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/5] vfio-pci: Allow to mmap MSI-X table if interrupt remapping is supported
  2016-05-12  1:19                         ` Tian, Kevin
@ 2016-05-12  2:20                           ` Alex Williamson
  2016-05-12  4:53                             ` Tian, Kevin
  0 siblings, 1 reply; 39+ messages in thread
From: Alex Williamson @ 2016-05-12  2:20 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Yongji Xie, David Laight, kvm, linux-kernel, linux-pci,
	linuxppc-dev, iommu, bhelgaas, aik, benh, paulus, mpe, joro,
	warrier, zhong, nikunj, eric.auger, will.deacon, gwshan,
	alistair, ruscur

On Thu, 12 May 2016 01:19:44 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Wednesday, May 11, 2016 11:54 PM
> > 
> > On Wed, 11 May 2016 06:29:06 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >   
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > Sent: Thursday, May 05, 2016 11:05 PM
> > > >
> > > > On Thu, 5 May 2016 12:15:46 +0000
> > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > >  
> > > > > > From: Yongji Xie [mailto:xyjxie@linux.vnet.ibm.com]
> > > > > > Sent: Thursday, May 05, 2016 7:43 PM
> > > > > >
> > > > > > Hi David and Kevin,
> > > > > >
> > > > > > On 2016/5/5 17:54, David Laight wrote:
> > > > > >  
> > > > > > > From: Tian, Kevin  
> > > > > > >> Sent: 05 May 2016 10:37  
> > > > > > > ...  
> > > > > > >>> Acutually, we are not aimed at accessing MSI-X table from
> > > > > > >>> guest. So I think it's safe to passthrough MSI-X table if we
> > > > > > >>> can make sure guest kernel would not touch MSI-X table in
> > > > > > >>> normal code path such as para-virtualized guest kernel on PPC64.
> > > > > > >>>  
> > > > > > >> Then how do you prevent malicious guest kernel accessing it?  
> > > > > > > Or a malicious guest driver for an ethernet card setting up
> > > > > > > the receive buffer ring to contain a single word entry that
> > > > > > > contains the address associated with an MSI-X interrupt and
> > > > > > > then using a loopback mode to cause a specific packet be
> > > > > > > received that writes the required word through that address.
> > > > > > >
> > > > > > > Remember the PCIe cycle for an interrupt is a normal memory write
> > > > > > > cycle.
> > > > > > >
> > > > > > > 	David
> > > > > > >  
> > > > > >
> > > > > > If we have enough permission to load a malicious driver or
> > > > > > kernel, we can easily break the guest without exposed
> > > > > > MSI-X table.
> > > > > >
> > > > > > I think it should be safe to expose MSI-X table if we can
> > > > > > make sure that malicious guest driver/kernel can't use
> > > > > > the MSI-X table to break other guest or host. The
> > > > > > capability of IRQ remapping could provide this
> > > > > > kind of protection.
> > > > > >  
> > > > >
> > > > > With IRQ remapping it doesn't mean you can pass through MSI-X
> > > > > structure to guest. I know actual IRQ remapping might be platform
> > > > > specific, but at least for Intel VT-d specification, MSI-X entry must
> > > > > be configured with a remappable format by host kernel which
> > > > > contains an index into IRQ remapping table. The index will find a
> > > > > IRQ remapping entry which controls interrupt routing for a specific
> > > > > device. If you allow a malicious program random index into MSI-X
> > > > > entry of assigned device, the hole is obvious...
> > > > >
> > > > > Above might make sense only for a IRQ remapping implementation
> > > > > which doesn't rely on extended MSI-X format (e.g. simply based on
> > > > > BDF). If that's the case for PPC, then you should build MSI-X
> > > > > passthrough based on this fact instead of general IRQ remapping
> > > > > enabled or not.  
> > > >
> > > > I don't think anyone is expecting that we can expose the MSI-X vector
> > > > table to the guest and the guest can make direct use of it.  The end
> > > > goal here is that the guest on a power system is already
> > > > paravirtualized to not program the device MSI-X by directly writing to
> > > > the MSI-X vector table.  They have hypercalls for this since they
> > > > always run virtualized.  Therefore a) they never intend to touch the
> > > > MSI-X vector table and b) they have sufficient isolation that a guest
> > > > can only hurt itself by doing so.
> > > >
> > > > On x86 we don't have a), our method of programming the MSI-X vector
> > > > table is to directly write to it. Therefore we will always require QEMU
> > > > to place a MemoryRegion over the vector table to intercept those
> > > > accesses.  However with interrupt remapping, we do have b) on x86, which
> > > > means that we don't need to be so strict in disallowing user accesses
> > > > to the MSI-X vector table.  It's not useful for configuring MSI-X on
> > > > the device, but the user should only be able to hurt themselves by
> > > > writing it directly.  x86 doesn't really get anything out of this
> > > > change, but it helps this special case on power pretty significantly
> > > > aiui.  Thanks,
> > > >  
> > >
> > > Allowing guest direct write to MSI-x table has system-wide impact.
> > > As I explained earlier, hypervisor needs to control "interrupt_index"
> > > programmed in MSI-X entry, which is used to associate a specific
> > > IRQ remapping entry. Now if you expose whole MSI-x table to guest,
> > > it can program random index into MSI-X entry to associate with
> > > any IRQ remapping entry and then there won't be any isolation per se.
> > >
> > > You can check "5.5.2 MSI and MSI-X Register Programming" in VT-d
> > > spec.  
> > 
> > I think you're extrapolating beyond the vfio interface.  The change
> > here is to remove the vfio protection of the MSI-X vector table when
> > the system provides interrupt isolation.  The argument is that this is
> > safe to do because the hardware protects the host from erroneous and
> > malicious user programming, but it is not meant to provide a means to
> > program MSI-X directly through the vector table.  This is effectively  
> 
> Sorry I didn't get this point. Once we allow userspace to mmap MSI-X
> table, isn't it equivalent to allowing userspace directly program vector
> table? Or is there other mechanism to prevent direct programming?

This would remove the mechanism that prevents direct programming.
Users can configure the device to perform DMA w/o setting up an IOMMU
mapping, which generates a fault.  Users can incorrectly manipulate the
MSI-X vector table instead of using the proper vfio programming
interface, which generates a fault.

> > the same as general DMA programming, if the vfio programming model is
> > not followed the device generates iommu faults.  I do have a concern
> > that userspace driver writers are going to more often presume they can
> > use the vector table directly because of this change, but I don't know
> > that that is sufficient reason to prevent such a change.  They'll
> > quickly discover the device generates faults on interrupt rather than
> > working as expected.  
> 
> If userspace can actually program vector table directly, there is not
> always fault triggered. As long as MSI-X table is fully under control
> of userspace, any interrupt index can be used here which may link
> to a IRQ remapping entry allocated for other devices.

Is not part of the vector lookup comparing the source ID of the DMA
write?  If not then VT-d interrupt remapping offers us no protection
from a malicious guest performing DMA to the same address.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: [PATCH 5/5] vfio-pci: Allow to mmap MSI-X table if interrupt remapping is supported
  2016-05-12  2:20                           ` Alex Williamson
@ 2016-05-12  4:53                             ` Tian, Kevin
  2016-05-12 17:47                               ` Alex Williamson
  0 siblings, 1 reply; 39+ messages in thread
From: Tian, Kevin @ 2016-05-12  4:53 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yongji Xie, David Laight, kvm, linux-kernel, linux-pci,
	linuxppc-dev, iommu, bhelgaas, aik, benh, paulus, mpe, joro,
	warrier, zhong, nikunj, eric.auger, will.deacon, gwshan,
	alistair, ruscur

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Thursday, May 12, 2016 10:21 AM
> 
> On Thu, 12 May 2016 01:19:44 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Wednesday, May 11, 2016 11:54 PM
> > >
> > > On Wed, 11 May 2016 06:29:06 +0000
> > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > >
> > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > Sent: Thursday, May 05, 2016 11:05 PM
> > > > >
> > > > > On Thu, 5 May 2016 12:15:46 +0000
> > > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > > >
> > > > > > > From: Yongji Xie [mailto:xyjxie@linux.vnet.ibm.com]
> > > > > > > Sent: Thursday, May 05, 2016 7:43 PM
> > > > > > >
> > > > > > > Hi David and Kevin,
> > > > > > >
> > > > > > > On 2016/5/5 17:54, David Laight wrote:
> > > > > > >
> > > > > > > > From: Tian, Kevin
> > > > > > > >> Sent: 05 May 2016 10:37
> > > > > > > > ...
> > > > > > > >>> Acutually, we are not aimed at accessing MSI-X table from
> > > > > > > >>> guest. So I think it's safe to passthrough MSI-X table if we
> > > > > > > >>> can make sure guest kernel would not touch MSI-X table in
> > > > > > > >>> normal code path such as para-virtualized guest kernel on PPC64.
> > > > > > > >>>
> > > > > > > >> Then how do you prevent malicious guest kernel accessing it?
> > > > > > > > Or a malicious guest driver for an ethernet card setting up
> > > > > > > > the receive buffer ring to contain a single word entry that
> > > > > > > > contains the address associated with an MSI-X interrupt and
> > > > > > > > then using a loopback mode to cause a specific packet be
> > > > > > > > received that writes the required word through that address.
> > > > > > > >
> > > > > > > > Remember the PCIe cycle for an interrupt is a normal memory write
> > > > > > > > cycle.
> > > > > > > >
> > > > > > > > 	David
> > > > > > > >
> > > > > > >
> > > > > > > If we have enough permission to load a malicious driver or
> > > > > > > kernel, we can easily break the guest without exposed
> > > > > > > MSI-X table.
> > > > > > >
> > > > > > > I think it should be safe to expose MSI-X table if we can
> > > > > > > make sure that malicious guest driver/kernel can't use
> > > > > > > the MSI-X table to break other guest or host. The
> > > > > > > capability of IRQ remapping could provide this
> > > > > > > kind of protection.
> > > > > > >
> > > > > >
> > > > > > With IRQ remapping it doesn't mean you can pass through MSI-X
> > > > > > structure to guest. I know actual IRQ remapping might be platform
> > > > > > specific, but at least for Intel VT-d specification, MSI-X entry must
> > > > > > be configured with a remappable format by host kernel which
> > > > > > contains an index into IRQ remapping table. The index will find a
> > > > > > IRQ remapping entry which controls interrupt routing for a specific
> > > > > > device. If you allow a malicious program random index into MSI-X
> > > > > > entry of assigned device, the hole is obvious...
> > > > > >
> > > > > > Above might make sense only for a IRQ remapping implementation
> > > > > > which doesn't rely on extended MSI-X format (e.g. simply based on
> > > > > > BDF). If that's the case for PPC, then you should build MSI-X
> > > > > > passthrough based on this fact instead of general IRQ remapping
> > > > > > enabled or not.
> > > > >
> > > > > I don't think anyone is expecting that we can expose the MSI-X vector
> > > > > table to the guest and the guest can make direct use of it.  The end
> > > > > goal here is that the guest on a power system is already
> > > > > paravirtualized to not program the device MSI-X by directly writing to
> > > > > the MSI-X vector table.  They have hypercalls for this since they
> > > > > always run virtualized.  Therefore a) they never intend to touch the
> > > > > MSI-X vector table and b) they have sufficient isolation that a guest
> > > > > can only hurt itself by doing so.
> > > > >
> > > > > On x86 we don't have a), our method of programming the MSI-X vector
> > > > > table is to directly write to it. Therefore we will always require QEMU
> > > > > to place a MemoryRegion over the vector table to intercept those
> > > > > accesses.  However with interrupt remapping, we do have b) on x86, which
> > > > > means that we don't need to be so strict in disallowing user accesses
> > > > > to the MSI-X vector table.  It's not useful for configuring MSI-X on
> > > > > the device, but the user should only be able to hurt themselves by
> > > > > writing it directly.  x86 doesn't really get anything out of this
> > > > > change, but it helps this special case on power pretty significantly
> > > > > aiui.  Thanks,
> > > > >
> > > >
> > > > Allowing guest direct write to MSI-x table has system-wide impact.
> > > > As I explained earlier, hypervisor needs to control "interrupt_index"
> > > > programmed in MSI-X entry, which is used to associate a specific
> > > > IRQ remapping entry. Now if you expose whole MSI-x table to guest,
> > > > it can program random index into MSI-X entry to associate with
> > > > any IRQ remapping entry and then there won't be any isolation per se.
> > > >
> > > > You can check "5.5.2 MSI and MSI-X Register Programming" in VT-d
> > > > spec.
> > >
> > > I think you're extrapolating beyond the vfio interface.  The change
> > > here is to remove the vfio protection of the MSI-X vector table when
> > > the system provides interrupt isolation.  The argument is that this is
> > > safe to do because the hardware protects the host from erroneous and
> > > malicious user programming, but it is not meant to provide a means to
> > > program MSI-X directly through the vector table.  This is effectively
> >
> > Sorry I didn't get this point. Once we allow userspace to mmap MSI-X
> > table, isn't it equivalent to allowing userspace directly program vector
> > table? Or is there other mechanism to prevent direct programming?
> 
> This would remove the mechanism that prevents direct programming.
> Users can configure the device to perform DMA w/o setting up an IOMMU
> mapping, which generates a fault.  Users can incorrectly manipulate the
> MSI-X vector table instead of using the proper vfio programming
> interface, which generates a fault.
> 
> > > the same as general DMA programming, if the vfio programming model is
> > > not followed the device generates iommu faults.  I do have a concern
> > > that userspace driver writers are going to more often presume they can
> > > use the vector table directly because of this change, but I don't know
> > > that that is sufficient reason to prevent such a change.  They'll
> > > quickly discover the device generates faults on interrupt rather than
> > > working as expected.
> >
> > If userspace can actually program vector table directly, there is not
> > always fault triggered. As long as MSI-X table is fully under control
> > of userspace, any interrupt index can be used here which may link
> > to a IRQ remapping entry allocated for other devices.
> 
> Is not part of the vector lookup comparing the source ID of the DMA
> write?  If not then VT-d interrupt remapping offers us no protection
> from a malicious guest performing DMA to the same address.  Thanks,
> 

For x86 IRQ remapping doesn't rely on existing DMAR remapping table for
DMA. Instead IRQ remapping table is a global table per VT-d engine, shared
by all devices behind this VT-d engine. Each IRQ remapping table entry 
(IRTE) can specify:

- whether to validate source-id of the interrupt requests;
- whether to verify the whole source id;
- whether to verify some bits of the source id;
...
(section 9.10 Interrupt Remapping Table Entry (IRTE) in VT-d spec)

IRTE index is programmed into corresponding MSI-X entry of a device.
When an interrupt message is triggered from that device, IRQ remapping
hardware will use the index to find the IRTE entry.

As long as host kernel has strict control of MSI-X table, it can choose to
not validate source-id as long as IRTE index is trusted.

That is why I concerned you cannot do this simply based on whether 
IRQ remapping is available on x86. If such usage is really required,
you'll need some more accurate indicator per iommu structure to tell
whether safe to allow mmap MSI-X to user space.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/5] vfio-pci: Allow to mmap MSI-X table if interrupt remapping is supported
  2016-05-12  4:53                             ` Tian, Kevin
@ 2016-05-12 17:47                               ` Alex Williamson
  2016-05-13  2:33                                 ` Tian, Kevin
  2016-05-13  2:36                                 ` Tian, Kevin
  0 siblings, 2 replies; 39+ messages in thread
From: Alex Williamson @ 2016-05-12 17:47 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Yongji Xie, David Laight, kvm, linux-kernel, linux-pci,
	linuxppc-dev, iommu, bhelgaas, aik, benh, paulus, mpe, joro,
	warrier, zhong, nikunj, eric.auger, will.deacon, gwshan,
	alistair, ruscur

On Thu, 12 May 2016 04:53:19 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Thursday, May 12, 2016 10:21 AM
> > 
> > On Thu, 12 May 2016 01:19:44 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >   
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > Sent: Wednesday, May 11, 2016 11:54 PM
> > > >
> > > > On Wed, 11 May 2016 06:29:06 +0000
> > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > >  
> > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > Sent: Thursday, May 05, 2016 11:05 PM
> > > > > >
> > > > > > On Thu, 5 May 2016 12:15:46 +0000
> > > > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > > > >  
> > > > > > > > From: Yongji Xie [mailto:xyjxie@linux.vnet.ibm.com]
> > > > > > > > Sent: Thursday, May 05, 2016 7:43 PM
> > > > > > > >
> > > > > > > > Hi David and Kevin,
> > > > > > > >
> > > > > > > > On 2016/5/5 17:54, David Laight wrote:
> > > > > > > >  
> > > > > > > > > From: Tian, Kevin  
> > > > > > > > >> Sent: 05 May 2016 10:37  
> > > > > > > > > ...  
> > > > > > > > >>> Acutually, we are not aimed at accessing MSI-X table from
> > > > > > > > >>> guest. So I think it's safe to passthrough MSI-X table if we
> > > > > > > > >>> can make sure guest kernel would not touch MSI-X table in
> > > > > > > > >>> normal code path such as para-virtualized guest kernel on PPC64.
> > > > > > > > >>>  
> > > > > > > > >> Then how do you prevent malicious guest kernel accessing it?  
> > > > > > > > > Or a malicious guest driver for an ethernet card setting up
> > > > > > > > > the receive buffer ring to contain a single word entry that
> > > > > > > > > contains the address associated with an MSI-X interrupt and
> > > > > > > > > then using a loopback mode to cause a specific packet be
> > > > > > > > > received that writes the required word through that address.
> > > > > > > > >
> > > > > > > > > Remember the PCIe cycle for an interrupt is a normal memory write
> > > > > > > > > cycle.
> > > > > > > > >
> > > > > > > > > 	David
> > > > > > > > >  
> > > > > > > >
> > > > > > > > If we have enough permission to load a malicious driver or
> > > > > > > > kernel, we can easily break the guest without exposed
> > > > > > > > MSI-X table.
> > > > > > > >
> > > > > > > > I think it should be safe to expose MSI-X table if we can
> > > > > > > > make sure that malicious guest driver/kernel can't use
> > > > > > > > the MSI-X table to break other guest or host. The
> > > > > > > > capability of IRQ remapping could provide this
> > > > > > > > kind of protection.
> > > > > > > >  
> > > > > > >
> > > > > > > With IRQ remapping it doesn't mean you can pass through MSI-X
> > > > > > > structure to guest. I know actual IRQ remapping might be platform
> > > > > > > specific, but at least for Intel VT-d specification, MSI-X entry must
> > > > > > > be configured with a remappable format by host kernel which
> > > > > > > contains an index into IRQ remapping table. The index will find a
> > > > > > > IRQ remapping entry which controls interrupt routing for a specific
> > > > > > > device. If you allow a malicious program random index into MSI-X
> > > > > > > entry of assigned device, the hole is obvious...
> > > > > > >
> > > > > > > Above might make sense only for a IRQ remapping implementation
> > > > > > > which doesn't rely on extended MSI-X format (e.g. simply based on
> > > > > > > BDF). If that's the case for PPC, then you should build MSI-X
> > > > > > > passthrough based on this fact instead of general IRQ remapping
> > > > > > > enabled or not.  
> > > > > >
> > > > > > I don't think anyone is expecting that we can expose the MSI-X vector
> > > > > > table to the guest and the guest can make direct use of it.  The end
> > > > > > goal here is that the guest on a power system is already
> > > > > > paravirtualized to not program the device MSI-X by directly writing to
> > > > > > the MSI-X vector table.  They have hypercalls for this since they
> > > > > > always run virtualized.  Therefore a) they never intend to touch the
> > > > > > MSI-X vector table and b) they have sufficient isolation that a guest
> > > > > > can only hurt itself by doing so.
> > > > > >
> > > > > > On x86 we don't have a), our method of programming the MSI-X vector
> > > > > > table is to directly write to it. Therefore we will always require QEMU
> > > > > > to place a MemoryRegion over the vector table to intercept those
> > > > > > accesses.  However with interrupt remapping, we do have b) on x86, which
> > > > > > means that we don't need to be so strict in disallowing user accesses
> > > > > > to the MSI-X vector table.  It's not useful for configuring MSI-X on
> > > > > > the device, but the user should only be able to hurt themselves by
> > > > > > writing it directly.  x86 doesn't really get anything out of this
> > > > > > change, but it helps this special case on power pretty significantly
> > > > > > aiui.  Thanks,
> > > > > >  
> > > > >
> > > > > Allowing guest direct write to MSI-x table has system-wide impact.
> > > > > As I explained earlier, hypervisor needs to control "interrupt_index"
> > > > > programmed in MSI-X entry, which is used to associate a specific
> > > > > IRQ remapping entry. Now if you expose whole MSI-x table to guest,
> > > > > it can program random index into MSI-X entry to associate with
> > > > > any IRQ remapping entry and then there won't be any isolation per se.
> > > > >
> > > > > You can check "5.5.2 MSI and MSI-X Register Programming" in VT-d
> > > > > spec.  
> > > >
> > > > I think you're extrapolating beyond the vfio interface.  The change
> > > > here is to remove the vfio protection of the MSI-X vector table when
> > > > the system provides interrupt isolation.  The argument is that this is
> > > > safe to do because the hardware protects the host from erroneous and
> > > > malicious user programming, but it is not meant to provide a means to
> > > > program MSI-X directly through the vector table.  This is effectively  
> > >
> > > Sorry I didn't get this point. Once we allow userspace to mmap MSI-X
> > > table, isn't it equivalent to allowing userspace directly program vector
> > > table? Or is there other mechanism to prevent direct programming?  
> > 
> > This would remove the mechanism that prevents direct programming.
> > Users can configure the device to perform DMA w/o setting up an IOMMU
> > mapping, which generates a fault.  Users can incorrectly manipulate the
> > MSI-X vector table instead of using the proper vfio programming
> > interface, which generates a fault.
> >   
> > > > the same as general DMA programming, if the vfio programming model is
> > > > not followed the device generates iommu faults.  I do have a concern
> > > > that userspace driver writers are going to more often presume they can
> > > > use the vector table directly because of this change, but I don't know
> > > > that that is sufficient reason to prevent such a change.  They'll
> > > > quickly discover the device generates faults on interrupt rather than
> > > > working as expected.  
> > >
> > > If userspace can actually program vector table directly, there is not
> > > always fault triggered. As long as MSI-X table is fully under control
> > > of userspace, any interrupt index can be used here which may link
> > > to a IRQ remapping entry allocated for other devices.  
> > 
> > Is not part of the vector lookup comparing the source ID of the DMA
> > write?  If not then VT-d interrupt remapping offers us no protection
> > from a malicious guest performing DMA to the same address.  Thanks,
> >   
> 
> For x86 IRQ remapping doesn't rely on existing DMAR remapping table for
> DMA. Instead IRQ remapping table is a global table per VT-d engine, shared
> by all devices behind this VT-d engine. Each IRQ remapping table entry 
> (IRTE) can specify:
> 
> - whether to validate source-id of the interrupt requests;
> - whether to verify the whole source id;
> - whether to verify some bits of the source id;
> ...
> (section 9.10 Interrupt Remapping Table Entry (IRTE) in VT-d spec)
> 
> IRTE index is programmed into corresponding MSI-X entry of a device.
> When an interrupt message is triggered from that device, IRQ remapping
> hardware will use the index to find the IRTE entry.
> 
> As long as host kernel has strict control of MSI-X table, it can choose to
> not validate source-id as long as IRTE index is trusted.
> 
> That is why I concerned you cannot do this simply based on whether 
> IRQ remapping is available on x86. If such usage is really required,
> you'll need some more accurate indicator per iommu structure to tell
> whether safe to allow mmap MSI-X to user space.

As argued previously in this thread, there's nothing special about a
DMA write to memory versus a DMA write to a special address that
triggers an MSI vector.  If the device is DMA capable, which we assume
all are, it can be fooled into generating those DMA writes regardless
of whether we actively block access to the MSI-X vector table itself.

With respect to the IRTE, these entries are always under host control,
and aside from the potential opt-in gap when using the intremap=nosid
boot option, they always make use of source-id validation as Linux is
written today.  Compatibility mode support is also disabled.  Nothing
here changes that.  In fact, I suspect the only advantage to blocking
MSI-X vector table access w/o interrupt remapping is to avoid obvious
collisions if it were to be programmed directly, it doesn't actually
prevent an identical DMA transaction from being generated by other
means.  The MSI-X vector table of a device is always considered
untrusted which is why we require user opt-ins to subvert that
protection.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: [PATCH 5/5] vfio-pci: Allow to mmap MSI-X table if interrupt remapping is supported
  2016-05-12 17:47                               ` Alex Williamson
@ 2016-05-13  2:33                                 ` Tian, Kevin
  2016-05-13  5:32                                   ` Alex Williamson
  2016-05-13  2:36                                 ` Tian, Kevin
  1 sibling, 1 reply; 39+ messages in thread
From: Tian, Kevin @ 2016-05-13  2:33 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yongji Xie, David Laight, kvm, linux-kernel, linux-pci,
	linuxppc-dev, iommu, bhelgaas, aik, benh, paulus, mpe, joro,
	warrier, zhong, nikunj, eric.auger, will.deacon, gwshan,
	alistair, ruscur

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Friday, May 13, 2016 1:48 AM
> 
> On Thu, 12 May 2016 04:53:19 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Thursday, May 12, 2016 10:21 AM
> > >
> > > On Thu, 12 May 2016 01:19:44 +0000
> > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > >
> > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > Sent: Wednesday, May 11, 2016 11:54 PM
> > > > >
> > > > > On Wed, 11 May 2016 06:29:06 +0000
> > > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > > >
> > > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > > Sent: Thursday, May 05, 2016 11:05 PM
> > > > > > >
> > > > > > > On Thu, 5 May 2016 12:15:46 +0000
> > > > > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > > > > >
> > > > > > > > > From: Yongji Xie [mailto:xyjxie@linux.vnet.ibm.com]
> > > > > > > > > Sent: Thursday, May 05, 2016 7:43 PM
> > > > > > > > >
> > > > > > > > > Hi David and Kevin,
> > > > > > > > >
> > > > > > > > > On 2016/5/5 17:54, David Laight wrote:
> > > > > > > > >
> > > > > > > > > > From: Tian, Kevin
> > > > > > > > > >> Sent: 05 May 2016 10:37
> > > > > > > > > > ...
> > > > > > > > > >>> Acutually, we are not aimed at accessing MSI-X table from
> > > > > > > > > >>> guest. So I think it's safe to passthrough MSI-X table if we
> > > > > > > > > >>> can make sure guest kernel would not touch MSI-X table in
> > > > > > > > > >>> normal code path such as para-virtualized guest kernel on PPC64.
> > > > > > > > > >>>
> > > > > > > > > >> Then how do you prevent malicious guest kernel accessing it?
> > > > > > > > > > Or a malicious guest driver for an ethernet card setting up
> > > > > > > > > > the receive buffer ring to contain a single word entry that
> > > > > > > > > > contains the address associated with an MSI-X interrupt and
> > > > > > > > > > then using a loopback mode to cause a specific packet be
> > > > > > > > > > received that writes the required word through that address.
> > > > > > > > > >
> > > > > > > > > > Remember the PCIe cycle for an interrupt is a normal memory write
> > > > > > > > > > cycle.
> > > > > > > > > >
> > > > > > > > > > 	David
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > If we have enough permission to load a malicious driver or
> > > > > > > > > kernel, we can easily break the guest without exposed
> > > > > > > > > MSI-X table.
> > > > > > > > >
> > > > > > > > > I think it should be safe to expose MSI-X table if we can
> > > > > > > > > make sure that malicious guest driver/kernel can't use
> > > > > > > > > the MSI-X table to break other guest or host. The
> > > > > > > > > capability of IRQ remapping could provide this
> > > > > > > > > kind of protection.
> > > > > > > > >
> > > > > > > >
> > > > > > > > With IRQ remapping it doesn't mean you can pass through MSI-X
> > > > > > > > structure to guest. I know actual IRQ remapping might be platform
> > > > > > > > specific, but at least for Intel VT-d specification, MSI-X entry must
> > > > > > > > be configured with a remappable format by host kernel which
> > > > > > > > contains an index into IRQ remapping table. The index will find a
> > > > > > > > IRQ remapping entry which controls interrupt routing for a specific
> > > > > > > > device. If you allow a malicious program random index into MSI-X
> > > > > > > > entry of assigned device, the hole is obvious...
> > > > > > > >
> > > > > > > > Above might make sense only for a IRQ remapping implementation
> > > > > > > > which doesn't rely on extended MSI-X format (e.g. simply based on
> > > > > > > > BDF). If that's the case for PPC, then you should build MSI-X
> > > > > > > > passthrough based on this fact instead of general IRQ remapping
> > > > > > > > enabled or not.
> > > > > > >
> > > > > > > I don't think anyone is expecting that we can expose the MSI-X vector
> > > > > > > table to the guest and the guest can make direct use of it.  The end
> > > > > > > goal here is that the guest on a power system is already
> > > > > > > paravirtualized to not program the device MSI-X by directly writing to
> > > > > > > the MSI-X vector table.  They have hypercalls for this since they
> > > > > > > always run virtualized.  Therefore a) they never intend to touch the
> > > > > > > MSI-X vector table and b) they have sufficient isolation that a guest
> > > > > > > can only hurt itself by doing so.
> > > > > > >
> > > > > > > On x86 we don't have a), our method of programming the MSI-X vector
> > > > > > > table is to directly write to it. Therefore we will always require QEMU
> > > > > > > to place a MemoryRegion over the vector table to intercept those
> > > > > > > accesses.  However with interrupt remapping, we do have b) on x86, which
> > > > > > > means that we don't need to be so strict in disallowing user accesses
> > > > > > > to the MSI-X vector table.  It's not useful for configuring MSI-X on
> > > > > > > the device, but the user should only be able to hurt themselves by
> > > > > > > writing it directly.  x86 doesn't really get anything out of this
> > > > > > > change, but it helps this special case on power pretty significantly
> > > > > > > aiui.  Thanks,
> > > > > > >
> > > > > >
> > > > > > Allowing guest direct write to MSI-x table has system-wide impact.
> > > > > > As I explained earlier, hypervisor needs to control "interrupt_index"
> > > > > > programmed in MSI-X entry, which is used to associate a specific
> > > > > > IRQ remapping entry. Now if you expose whole MSI-x table to guest,
> > > > > > it can program random index into MSI-X entry to associate with
> > > > > > any IRQ remapping entry and then there won't be any isolation per se.
> > > > > >
> > > > > > You can check "5.5.2 MSI and MSI-X Register Programming" in VT-d
> > > > > > spec.
> > > > >
> > > > > I think you're extrapolating beyond the vfio interface.  The change
> > > > > here is to remove the vfio protection of the MSI-X vector table when
> > > > > the system provides interrupt isolation.  The argument is that this is
> > > > > safe to do because the hardware protects the host from erroneous and
> > > > > malicious user programming, but it is not meant to provide a means to
> > > > > program MSI-X directly through the vector table.  This is effectively
> > > >
> > > > Sorry I didn't get this point. Once we allow userspace to mmap MSI-X
> > > > table, isn't it equivalent to allowing userspace directly program vector
> > > > table? Or is there other mechanism to prevent direct programming?
> > >
> > > This would remove the mechanism that prevents direct programming.
> > > Users can configure the device to perform DMA w/o setting up an IOMMU
> > > mapping, which generates a fault.  Users can incorrectly manipulate the
> > > MSI-X vector table instead of using the proper vfio programming
> > > interface, which generates a fault.
> > >
> > > > > the same as general DMA programming, if the vfio programming model is
> > > > > not followed the device generates iommu faults.  I do have a concern
> > > > > that userspace driver writers are going to more often presume they can
> > > > > use the vector table directly because of this change, but I don't know
> > > > > that that is sufficient reason to prevent such a change.  They'll
> > > > > quickly discover the device generates faults on interrupt rather than
> > > > > working as expected.
> > > >
> > > > If userspace can actually program vector table directly, there is not
> > > > always fault triggered. As long as MSI-X table is fully under control
> > > > of userspace, any interrupt index can be used here which may link
> > > > to a IRQ remapping entry allocated for other devices.
> > >
> > > Is not part of the vector lookup comparing the source ID of the DMA
> > > write?  If not then VT-d interrupt remapping offers us no protection
> > > from a malicious guest performing DMA to the same address.  Thanks,
> > >
> >
> > For x86 IRQ remapping doesn't rely on existing DMAR remapping table for
> > DMA. Instead IRQ remapping table is a global table per VT-d engine, shared
> > by all devices behind this VT-d engine. Each IRQ remapping table entry
> > (IRTE) can specify:
> >
> > - whether to validate source-id of the interrupt requests;
> > - whether to verify the whole source id;
> > - whether to verify some bits of the source id;
> > ...
> > (section 9.10 Interrupt Remapping Table Entry (IRTE) in VT-d spec)
> >
> > IRTE index is programmed into corresponding MSI-X entry of a device.
> > When an interrupt message is triggered from that device, IRQ remapping
> > hardware will use the index to find the IRTE entry.
> >
> > As long as host kernel has strict control of MSI-X table, it can choose to
> > not validate source-id as long as IRTE index is trusted.
> >
> > That is why I concerned you cannot do this simply based on whether
> > IRQ remapping is available on x86. If such usage is really required,
> > you'll need some more accurate indicator per iommu structure to tell
> > whether safe to allow mmap MSI-X to user space.
> 
> As argued previously in this thread, there's nothing special about a
> DMA write to memory versus a DMA write to a special address that
> triggers an MSI vector.  If the device is DMA capable, which we assume
> all are, it can be fooled into generating those DMA writes regardless
> of whether we actively block access to the MSI-X vector table itself.

But with DMA remapping above can be blocked.

> 
> With respect to the IRTE, these entries are always under host control,
> and aside from the potential opt-in gap when using the intremap=nosid
> boot option, they always make use of source-id validation as Linux is
> written today.  Compatibility mode support is also disabled.  Nothing
> here changes that.  In fact, I suspect the only advantage to blocking

VFIO as a standalone module shouldn't make assumption on how
IRQ remapping is actually implemented in kernel (especially when
VT-d spec allows no source ID verification).

> MSI-X vector table access w/o interrupt remapping is to avoid obvious
> collisions if it were to be programmed directly, it doesn't actually
> prevent an identical DMA transaction from being generated by other

Kernel can enable DMA remapping but disable IRQ remapping. In such
case identical DMA transaction can be prevented.

> means.  The MSI-X vector table of a device is always considered
> untrusted which is why we require user opt-ins to subvert that
> protection.  Thanks,
> 

I only partially agree with this statement since there is different
trust level for kernel space driver and user space driver.

OS may choose to trust kernel space driver then it can enable IRQ
remapping only for load balance purpose w/o source id verfification. 
In such case MSI-X vector table is trusted and fully under kernel 
control. Allowing to mmap MSI-X table in user space definitely 
breaks that boundary.

Anyway my point is simple. Let's ignore how Linux kernel implements
IRQ remapping on x86 (which may change time to time), and just
focus on architectural possibility. Non-x86 platform may implement
IRQ remapping completely separate from device side, then checking
availability of IRQ remapping is enough to decide whether mmap
MSI-X table is safe. x86 with VT-d can be configured to a mode
requiring host control of both MSI-X entry and IRQ remapping hardware 
(without source id check). In such case it's insufficient to make 
decision simply based on IRQ remapping availability. We need a way 
to query from IRQ remapping module whether it's actually safe to 
mmap MSI-X.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: [PATCH 5/5] vfio-pci: Allow to mmap MSI-X table if interrupt remapping is supported
  2016-05-12 17:47                               ` Alex Williamson
  2016-05-13  2:33                                 ` Tian, Kevin
@ 2016-05-13  2:36                                 ` Tian, Kevin
  1 sibling, 0 replies; 39+ messages in thread
From: Tian, Kevin @ 2016-05-13  2:36 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yongji Xie, David Laight, kvm, linux-kernel, linux-pci,
	linuxppc-dev, iommu, bhelgaas, aik, benh, paulus, mpe, joro,
	warrier, zhong, nikunj, eric.auger, will.deacon, gwshan,
	alistair, ruscur

> From: Tian, Kevin
> Sent: Friday, May 13, 2016 10:33 AM
> 
> > means.  The MSI-X vector table of a device is always considered
> > untrusted which is why we require user opt-ins to subvert that
> > protection.  Thanks,
> >
> 
> I only partially agree with this statement since there is different
> trust level for kernel space driver and user space driver.
> 
> OS may choose to trust kernel space driver then it can enable IRQ
> remapping only for load balance purpose w/o source id verfification.
> In such case MSI-X vector table is trusted and fully under kernel
> control. Allowing to mmap MSI-X table in user space definitely
> breaks that boundary.
> 
> Anyway my point is simple. Let's ignore how Linux kernel implements
> IRQ remapping on x86 (which may change time to time), and just
> focus on architectural possibility. Non-x86 platform may implement
> IRQ remapping completely separate from device side, then checking
> availability of IRQ remapping is enough to decide whether mmap
> MSI-X table is safe. x86 with VT-d can be configured to a mode
> requiring host control of both MSI-X entry and IRQ remapping hardware
> (without source id check). In such case it's insufficient to make
> decision simply based on IRQ remapping availability. We need a way
> to query from IRQ remapping module whether it's actually safe to
> mmap MSI-X.
> 

Or another way is to have VFIO explicitly request intel-iommu to
enforce sid check when IRQ remapping is enabled...

Thanks
Kevin

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/5] vfio-pci: Allow to mmap MSI-X table if interrupt remapping is supported
  2016-05-13  2:33                                 ` Tian, Kevin
@ 2016-05-13  5:32                                   ` Alex Williamson
  2016-05-13  6:50                                     ` Tian, Kevin
  2016-05-13  9:16                                     ` David Laight
  0 siblings, 2 replies; 39+ messages in thread
From: Alex Williamson @ 2016-05-13  5:32 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Yongji Xie, David Laight, kvm, linux-kernel, linux-pci,
	linuxppc-dev, iommu, bhelgaas, aik, benh, paulus, mpe, joro,
	warrier, zhong, nikunj, eric.auger, will.deacon, gwshan,
	alistair, ruscur

On Fri, 13 May 2016 02:33:18 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Friday, May 13, 2016 1:48 AM
> > 
> > On Thu, 12 May 2016 04:53:19 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >   
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > Sent: Thursday, May 12, 2016 10:21 AM
> > > >
> > > > On Thu, 12 May 2016 01:19:44 +0000
> > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > >  
> > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > Sent: Wednesday, May 11, 2016 11:54 PM
> > > > > >
> > > > > > On Wed, 11 May 2016 06:29:06 +0000
> > > > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > > > >  
> > > > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > > > Sent: Thursday, May 05, 2016 11:05 PM
> > > > > > > >
> > > > > > > > On Thu, 5 May 2016 12:15:46 +0000
> > > > > > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > > > > > >  
> > > > > > > > > > From: Yongji Xie [mailto:xyjxie@linux.vnet.ibm.com]
> > > > > > > > > > Sent: Thursday, May 05, 2016 7:43 PM
> > > > > > > > > >
> > > > > > > > > > Hi David and Kevin,
> > > > > > > > > >
> > > > > > > > > > On 2016/5/5 17:54, David Laight wrote:
> > > > > > > > > >  
> > > > > > > > > > > From: Tian, Kevin  
> > > > > > > > > > >> Sent: 05 May 2016 10:37  
> > > > > > > > > > > ...  
> > > > > > > > > > >>> Acutually, we are not aimed at accessing MSI-X table from
> > > > > > > > > > >>> guest. So I think it's safe to passthrough MSI-X table if we
> > > > > > > > > > >>> can make sure guest kernel would not touch MSI-X table in
> > > > > > > > > > >>> normal code path such as para-virtualized guest kernel on PPC64.
> > > > > > > > > > >>>  
> > > > > > > > > > >> Then how do you prevent malicious guest kernel accessing it?  
> > > > > > > > > > > Or a malicious guest driver for an ethernet card setting up
> > > > > > > > > > > the receive buffer ring to contain a single word entry that
> > > > > > > > > > > contains the address associated with an MSI-X interrupt and
> > > > > > > > > > > then using a loopback mode to cause a specific packet be
> > > > > > > > > > > received that writes the required word through that address.
> > > > > > > > > > >
> > > > > > > > > > > Remember the PCIe cycle for an interrupt is a normal memory write
> > > > > > > > > > > cycle.
> > > > > > > > > > >
> > > > > > > > > > > 	David
> > > > > > > > > > >  
> > > > > > > > > >
> > > > > > > > > > If we have enough permission to load a malicious driver or
> > > > > > > > > > kernel, we can easily break the guest without exposed
> > > > > > > > > > MSI-X table.
> > > > > > > > > >
> > > > > > > > > > I think it should be safe to expose MSI-X table if we can
> > > > > > > > > > make sure that malicious guest driver/kernel can't use
> > > > > > > > > > the MSI-X table to break other guest or host. The
> > > > > > > > > > capability of IRQ remapping could provide this
> > > > > > > > > > kind of protection.
> > > > > > > > > >  
> > > > > > > > >
> > > > > > > > > With IRQ remapping it doesn't mean you can pass through MSI-X
> > > > > > > > > structure to guest. I know actual IRQ remapping might be platform
> > > > > > > > > specific, but at least for Intel VT-d specification, MSI-X entry must
> > > > > > > > > be configured with a remappable format by host kernel which
> > > > > > > > > contains an index into IRQ remapping table. The index will find a
> > > > > > > > > IRQ remapping entry which controls interrupt routing for a specific
> > > > > > > > > device. If you allow a malicious program random index into MSI-X
> > > > > > > > > entry of assigned device, the hole is obvious...
> > > > > > > > >
> > > > > > > > > Above might make sense only for a IRQ remapping implementation
> > > > > > > > > which doesn't rely on extended MSI-X format (e.g. simply based on
> > > > > > > > > BDF). If that's the case for PPC, then you should build MSI-X
> > > > > > > > > passthrough based on this fact instead of general IRQ remapping
> > > > > > > > > enabled or not.  
> > > > > > > >
> > > > > > > > I don't think anyone is expecting that we can expose the MSI-X vector
> > > > > > > > table to the guest and the guest can make direct use of it.  The end
> > > > > > > > goal here is that the guest on a power system is already
> > > > > > > > paravirtualized to not program the device MSI-X by directly writing to
> > > > > > > > the MSI-X vector table.  They have hypercalls for this since they
> > > > > > > > always run virtualized.  Therefore a) they never intend to touch the
> > > > > > > > MSI-X vector table and b) they have sufficient isolation that a guest
> > > > > > > > can only hurt itself by doing so.
> > > > > > > >
> > > > > > > > On x86 we don't have a), our method of programming the MSI-X vector
> > > > > > > > table is to directly write to it. Therefore we will always require QEMU
> > > > > > > > to place a MemoryRegion over the vector table to intercept those
> > > > > > > > accesses.  However with interrupt remapping, we do have b) on x86, which
> > > > > > > > means that we don't need to be so strict in disallowing user accesses
> > > > > > > > to the MSI-X vector table.  It's not useful for configuring MSI-X on
> > > > > > > > the device, but the user should only be able to hurt themselves by
> > > > > > > > writing it directly.  x86 doesn't really get anything out of this
> > > > > > > > change, but it helps this special case on power pretty significantly
> > > > > > > > aiui.  Thanks,
> > > > > > > >  
> > > > > > >
> > > > > > > Allowing guest direct write to MSI-x table has system-wide impact.
> > > > > > > As I explained earlier, hypervisor needs to control "interrupt_index"
> > > > > > > programmed in MSI-X entry, which is used to associate a specific
> > > > > > > IRQ remapping entry. Now if you expose whole MSI-x table to guest,
> > > > > > > it can program random index into MSI-X entry to associate with
> > > > > > > any IRQ remapping entry and then there won't be any isolation per se.
> > > > > > >
> > > > > > > You can check "5.5.2 MSI and MSI-X Register Programming" in VT-d
> > > > > > > spec.  
> > > > > >
> > > > > > I think you're extrapolating beyond the vfio interface.  The change
> > > > > > here is to remove the vfio protection of the MSI-X vector table when
> > > > > > the system provides interrupt isolation.  The argument is that this is
> > > > > > safe to do because the hardware protects the host from erroneous and
> > > > > > malicious user programming, but it is not meant to provide a means to
> > > > > > program MSI-X directly through the vector table.  This is effectively  
> > > > >
> > > > > Sorry I didn't get this point. Once we allow userspace to mmap MSI-X
> > > > > table, isn't it equivalent to allowing userspace directly program vector
> > > > > table? Or is there other mechanism to prevent direct programming?  
> > > >
> > > > This would remove the mechanism that prevents direct programming.
> > > > Users can configure the device to perform DMA w/o setting up an IOMMU
> > > > mapping, which generates a fault.  Users can incorrectly manipulate the
> > > > MSI-X vector table instead of using the proper vfio programming
> > > > interface, which generates a fault.
> > > >  
> > > > > > the same as general DMA programming, if the vfio programming model is
> > > > > > not followed the device generates iommu faults.  I do have a concern
> > > > > > that userspace driver writers are going to more often presume they can
> > > > > > use the vector table directly because of this change, but I don't know
> > > > > > that that is sufficient reason to prevent such a change.  They'll
> > > > > > quickly discover the device generates faults on interrupt rather than
> > > > > > working as expected.  
> > > > >
> > > > > If userspace can actually program vector table directly, there is not
> > > > > always fault triggered. As long as MSI-X table is fully under control
> > > > > of userspace, any interrupt index can be used here which may link
> > > > > to a IRQ remapping entry allocated for other devices.  
> > > >
> > > > Is not part of the vector lookup comparing the source ID of the DMA
> > > > write?  If not then VT-d interrupt remapping offers us no protection
> > > > from a malicious guest performing DMA to the same address.  Thanks,
> > > >  
> > >
> > > For x86 IRQ remapping doesn't rely on existing DMAR remapping table for
> > > DMA. Instead IRQ remapping table is a global table per VT-d engine, shared
> > > by all devices behind this VT-d engine. Each IRQ remapping table entry
> > > (IRTE) can specify:
> > >
> > > - whether to validate source-id of the interrupt requests;
> > > - whether to verify the whole source id;
> > > - whether to verify some bits of the source id;
> > > ...
> > > (section 9.10 Interrupt Remapping Table Entry (IRTE) in VT-d spec)
> > >
> > > IRTE index is programmed into corresponding MSI-X entry of a device.
> > > When an interrupt message is triggered from that device, IRQ remapping
> > > hardware will use the index to find the IRTE entry.
> > >
> > > As long as host kernel has strict control of MSI-X table, it can choose to
> > > not validate source-id as long as IRTE index is trusted.
> > >
> > > That is why I concerned you cannot do this simply based on whether
> > > IRQ remapping is available on x86. If such usage is really required,
> > > you'll need some more accurate indicator per iommu structure to tell
> > > whether safe to allow mmap MSI-X to user space.  
> > 
> > As argued previously in this thread, there's nothing special about a
> > DMA write to memory versus a DMA write to a special address that
> > triggers an MSI vector.  If the device is DMA capable, which we assume
> > all are, it can be fooled into generating those DMA writes regardless
> > of whether we actively block access to the MSI-X vector table itself.  
> 
> But with DMA remapping above can be blocked.

How?  VT-d explicitly ignores DMA writes to 0xFEEx_xxxx, section 3.13:

  Write requests without PASID of DWORD length are treated as interrupt
  requests. Interrupt requests are not subjected to DMA remapping[...]
  Instead, remapping hardware can be enabled to subject such interrupt
  requests to interrupt remapping.

> > With respect to the IRTE, these entries are always under host control,
> > and aside from the potential opt-in gap when using the intremap=nosid
> > boot option, they always make use of source-id validation as Linux is
> > written today.  Compatibility mode support is also disabled.  Nothing
> > here changes that.  In fact, I suspect the only advantage to blocking  
> 
> VFIO as a standalone module shouldn't make assumption on how
> IRQ remapping is actually implemented in kernel (especially when
> VT-d spec allows no source ID verification).

vfio is NOT a stand alone module, it is part of the kernel that happens
to be configurable as a module.  We have the ability to change with the
kernel and influence how the kernel behaves.

> > MSI-X vector table access w/o interrupt remapping is to avoid obvious
> > collisions if it were to be programmed directly, it doesn't actually
> > prevent an identical DMA transaction from being generated by other  
> 
> Kernel can enable DMA remapping but disable IRQ remapping. In such
> case identical DMA transaction can be prevented.

Not according to the VT-d spec as quoted above.  If so, how?

> > means.  The MSI-X vector table of a device is always considered
> > untrusted which is why we require user opt-ins to subvert that
> > protection.  Thanks,
> >   
> 
> I only partially agree with this statement since there is different
> trust level for kernel space driver and user space driver.
> 
> OS may choose to trust kernel space driver then it can enable IRQ
> remapping only for load balance purpose w/o source id verfification. 
> In such case MSI-X vector table is trusted and fully under kernel 
> control. Allowing to mmap MSI-X table in user space definitely 
> breaks that boundary.

The OS may choose to do this, but we're part of the OS and it doesn't
do this and has no reason to do this.

> Anyway my point is simple. Let's ignore how Linux kernel implements
> IRQ remapping on x86 (which may change time to time), and just
> focus on architectural possibility. Non-x86 platform may implement
> IRQ remapping completely separate from device side, then checking
> availability of IRQ remapping is enough to decide whether mmap
> MSI-X table is safe. x86 with VT-d can be configured to a mode
> requiring host control of both MSI-X entry and IRQ remapping hardware 
> (without source id check). In such case it's insufficient to make 
> decision simply based on IRQ remapping availability. We need a way 
> to query from IRQ remapping module whether it's actually safe to 
> mmap MSI-X.

We're going in circles here.  This patch is attempting to remove
protection from the MSI-X vector table that is really nothing more than
security theater already.  That "protection" only actually prevents
casual misuse of the API which is really only a problem when the
platform offers no form of interrupt isolation, such as VT-d with DMA
remapping but not interrupt remapping.  Disabling source-id checking in
VT-d should be handled as the equivalent of disabling interrupt
remapping altogether as far as the IOMMU API is concerned.  That's
a trivial gap that should be fixed.  There is no such thing as a secure
MSI-X vector table when untrusted userspace drivers are involved, we
must always assume that a device can generate DMA writes that are
indistinguishable from actual interrupt requests and if the platform
does not provide interrupt isolation we should require the user to
opt-in to an unsafe mode.

Simply denying direct writes to the vector table or preventing mapping
of the vector table into the user address space does not provide any
tangible form of protection.  Many devices make use of window registers
that allow backdoors to arbitrary device registers.  Some drivers even
use this as the primary means for configuring MSI-X, which makes them
incompatible with device assignment without device specific quirks to
enable virtualization of these paths.

If you have an objection to this patch, please show me how preventing
direct CPU access to the MSI-X vector table provides any kind of
security guarantee of the contents of the vector table and also prove
to me that a device cannot spoof a DMA write that is indistinguishable
from one associated with an actual interrupt, regardless of the
contents of the MSI-X vector table.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: [PATCH 5/5] vfio-pci: Allow to mmap MSI-X table if interrupt remapping is supported
  2016-05-13  5:32                                   ` Alex Williamson
@ 2016-05-13  6:50                                     ` Tian, Kevin
  2016-05-13 16:42                                       ` Alex Williamson
  2016-05-13  9:16                                     ` David Laight
  1 sibling, 1 reply; 39+ messages in thread
From: Tian, Kevin @ 2016-05-13  6:50 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yongji Xie, David Laight, kvm, linux-kernel, linux-pci,
	linuxppc-dev, iommu, bhelgaas, aik, benh, paulus, mpe, joro,
	warrier, zhong, nikunj, eric.auger, will.deacon, gwshan,
	alistair, ruscur

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Friday, May 13, 2016 1:33 PM
> > >
> > > As argued previously in this thread, there's nothing special about a
> > > DMA write to memory versus a DMA write to a special address that
> > > triggers an MSI vector.  If the device is DMA capable, which we assume
> > > all are, it can be fooled into generating those DMA writes regardless
> > > of whether we actively block access to the MSI-X vector table itself.
> >
> > But with DMA remapping above can be blocked.
> 
> How?  VT-d explicitly ignores DMA writes to 0xFEEx_xxxx, section 3.13:
> 
>   Write requests without PASID of DWORD length are treated as interrupt
>   requests. Interrupt requests are not subjected to DMA remapping[...]
>   Instead, remapping hardware can be enabled to subject such interrupt
>   requests to interrupt remapping.

Thanks for catching this!

> 
> > > MSI-X vector table access w/o interrupt remapping is to avoid obvious
> > > collisions if it were to be programmed directly, it doesn't actually
> > > prevent an identical DMA transaction from being generated by other
> >
> > Kernel can enable DMA remapping but disable IRQ remapping. In such
> > case identical DMA transaction can be prevented.
> 
> Not according to the VT-d spec as quoted above.  If so, how?

So my argument on this is wrong. sorry.

> 
> > Anyway my point is simple. Let's ignore how Linux kernel implements
> > IRQ remapping on x86 (which may change time to time), and just
> > focus on architectural possibility. Non-x86 platform may implement
> > IRQ remapping completely separate from device side, then checking
> > availability of IRQ remapping is enough to decide whether mmap
> > MSI-X table is safe. x86 with VT-d can be configured to a mode
> > requiring host control of both MSI-X entry and IRQ remapping hardware
> > (without source id check). In such case it's insufficient to make
> > decision simply based on IRQ remapping availability. We need a way
> > to query from IRQ remapping module whether it's actually safe to
> > mmap MSI-X.
> 
> We're going in circles here.  This patch is attempting to remove
> protection from the MSI-X vector table that is really nothing more than
> security theater already.  That "protection" only actually prevents
> casual misuse of the API which is really only a problem when the
> platform offers no form of interrupt isolation, such as VT-d with DMA
> remapping but not interrupt remapping.  Disabling source-id checking in
> VT-d should be handled as the equivalent of disabling interrupt
> remapping altogether as far as the IOMMU API is concerned.  That's
> a trivial gap that should be fixed.  There is no such thing as a secure

That is the main change I'm asking against original patch, which has:

+static void pci_check_msi_remapping(struct pci_dev *pdev,
+					const struct iommu_ops *ops)
+{
+	struct pci_bus *bus = pdev->bus;
+
+	if (ops->capable(IOMMU_CAP_INTR_REMAP) &&
+		!(bus->bus_flags & PCI_BUS_FLAGS_MSI_REMAP))
+		bus->bus_flags |= PCI_BUS_FLAGS_MSI_REMAP;
+}
+

Above flag should be cleared when source-id checking is disabled on x86. 
Yes, VFIO is part of OS but any assumption we made about other parts
needs to be reflected accurately in the code.

> MSI-X vector table when untrusted userspace drivers are involved, we
> must always assume that a device can generate DMA writes that are
> indistinguishable from actual interrupt requests and if the platform
> does not provide interrupt isolation we should require the user to
> opt-in to an unsafe mode.
> 
> Simply denying direct writes to the vector table or preventing mapping
> of the vector table into the user address space does not provide any
> tangible form of protection.  Many devices make use of window registers
> that allow backdoors to arbitrary device registers.  Some drivers even
> use this as the primary means for configuring MSI-X, which makes them
> incompatible with device assignment without device specific quirks to
> enable virtualization of these paths.
> 
> If you have an objection to this patch, please show me how preventing
> direct CPU access to the MSI-X vector table provides any kind of
> security guarantee of the contents of the vector table and also prove
> to me that a device cannot spoof a DMA write that is indistinguishable
> from one associated with an actual interrupt, regardless of the
> contents of the MSI-X vector table.  Thanks,
> 

I'm not object to the whole patch series. As replied above, my point
is just that current condition of allowing mmap MSI-X in this patch is not 
accurate, but my argument on security manner is not correct. Thanks
for your elaboration to make it clear.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: [PATCH 5/5] vfio-pci: Allow to mmap MSI-X table if interrupt remapping is supported
  2016-05-13  5:32                                   ` Alex Williamson
  2016-05-13  6:50                                     ` Tian, Kevin
@ 2016-05-13  9:16                                     ` David Laight
  1 sibling, 0 replies; 39+ messages in thread
From: David Laight @ 2016-05-13  9:16 UTC (permalink / raw)
  To: 'Alex Williamson', Tian, Kevin
  Cc: Yongji Xie, kvm, linux-kernel, linux-pci, linuxppc-dev, iommu,
	bhelgaas, aik, benh, paulus, mpe, joro, warrier, zhong, nikunj,
	eric.auger, will.deacon, gwshan, alistair, ruscur

From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: 13 May 2016 06:33
...
> Simply denying direct writes to the vector table or preventing mapping
> of the vector table into the user address space does not provide any
> tangible form of protection.  Many devices make use of window registers
> that allow backdoors to arbitrary device registers.  Some drivers even
> use this as the primary means for configuring MSI-X, which makes them
> incompatible with device assignment without device specific quirks to
> enable virtualization of these paths.

We have one fgpa based PCIe slave where the device driver has to read
the MSI-X table and then write the value to other fpga registers so
that the logic can generate the correct PCIe write cycle when an
interrupt is requested.
The MSI-X table itself is only as a PCIe slave.

We also have host accessible DMA controllers that the device driver
uses to copy data to kernel memory.
These could easily be used to generate arbitrary MSI-X requests.
As I've said earlier it is almost certainly possible to get any
ethernet hardware to perform something similar.

So without hardware that is able to limit the memory and MSI-X
that each PCIe endpoint can access I believe that if a virtualisation
system gives a guest kernel direct access to a PCIe devices it gives
the guest kernel the ability to raise and MSI-X interrupt and read/write
any physical memory.
(I've not looked at the cpu virtualisation support, but do know what
the PCIe devices can do.)

More interestingly, probably the 'worst' thing (from a security point of view)
that changing the MSI-X table lets you do is a write to an arbitrary
physical memory address.

	David

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/5] vfio-pci: Allow to mmap MSI-X table if interrupt remapping is supported
  2016-05-13  6:50                                     ` Tian, Kevin
@ 2016-05-13 16:42                                       ` Alex Williamson
  0 siblings, 0 replies; 39+ messages in thread
From: Alex Williamson @ 2016-05-13 16:42 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Yongji Xie, David Laight, kvm, linux-kernel, linux-pci,
	linuxppc-dev, iommu, bhelgaas, aik, benh, paulus, mpe, joro,
	warrier, zhong, nikunj, eric.auger, will.deacon, gwshan,
	alistair, ruscur

On Fri, 13 May 2016 06:50:25 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Friday, May 13, 2016 1:33 PM  
> > > >
> > > > As argued previously in this thread, there's nothing special about a
> > > > DMA write to memory versus a DMA write to a special address that
> > > > triggers an MSI vector.  If the device is DMA capable, which we assume
> > > > all are, it can be fooled into generating those DMA writes regardless
> > > > of whether we actively block access to the MSI-X vector table itself.  
> > >
> > > But with DMA remapping above can be blocked.  
> > 
> > How?  VT-d explicitly ignores DMA writes to 0xFEEx_xxxx, section 3.13:
> > 
> >   Write requests without PASID of DWORD length are treated as interrupt
> >   requests. Interrupt requests are not subjected to DMA remapping[...]
> >   Instead, remapping hardware can be enabled to subject such interrupt
> >   requests to interrupt remapping.  
> 
> Thanks for catching this!
> 
> >   
> > > > MSI-X vector table access w/o interrupt remapping is to avoid obvious
> > > > collisions if it were to be programmed directly, it doesn't actually
> > > > prevent an identical DMA transaction from being generated by other  
> > >
> > > Kernel can enable DMA remapping but disable IRQ remapping. In such
> > > case identical DMA transaction can be prevented.  
> > 
> > Not according to the VT-d spec as quoted above.  If so, how?  
> 
> So my argument on this is wrong. sorry.
> 
> >   
> > > Anyway my point is simple. Let's ignore how Linux kernel implements
> > > IRQ remapping on x86 (which may change time to time), and just
> > > focus on architectural possibility. Non-x86 platform may implement
> > > IRQ remapping completely separate from device side, then checking
> > > availability of IRQ remapping is enough to decide whether mmap
> > > MSI-X table is safe. x86 with VT-d can be configured to a mode
> > > requiring host control of both MSI-X entry and IRQ remapping hardware
> > > (without source id check). In such case it's insufficient to make
> > > decision simply based on IRQ remapping availability. We need a way
> > > to query from IRQ remapping module whether it's actually safe to
> > > mmap MSI-X.  
> > 
> > We're going in circles here.  This patch is attempting to remove
> > protection from the MSI-X vector table that is really nothing more than
> > security theater already.  That "protection" only actually prevents
> > casual misuse of the API which is really only a problem when the
> > platform offers no form of interrupt isolation, such as VT-d with DMA
> > remapping but not interrupt remapping.  Disabling source-id checking in
> > VT-d should be handled as the equivalent of disabling interrupt
> > remapping altogether as far as the IOMMU API is concerned.  That's
> > a trivial gap that should be fixed.  There is no such thing as a secure  
> 
> That is the main change I'm asking against original patch, which has:
> 
> +static void pci_check_msi_remapping(struct pci_dev *pdev,
> +					const struct iommu_ops *ops)
> +{
> +	struct pci_bus *bus = pdev->bus;
> +
> +	if (ops->capable(IOMMU_CAP_INTR_REMAP) &&
> +		!(bus->bus_flags & PCI_BUS_FLAGS_MSI_REMAP))
> +		bus->bus_flags |= PCI_BUS_FLAGS_MSI_REMAP;
> +}
> +
> 
> Above flag should be cleared when source-id checking is disabled on x86. 
> Yes, VFIO is part of OS but any assumption we made about other parts
> needs to be reflected accurately in the code.

I would say this is an independent bug which should be fixed simply as:

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index e1852e8..60d55c0 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -4948,7 +4948,7 @@ static bool intel_iommu_capable(enum iommu_cap cap)
        if (cap == IOMMU_CAP_CACHE_COHERENCY)
                return domain_update_iommu_snooping(NULL) == 1;
        if (cap == IOMMU_CAP_INTR_REMAP)
-               return irq_remapping_enabled == 1;
+               return irq_remapping_enabled == 1 && !disable_sourceid_checking;
 
        return false;
 }

I believe the intent of the IOMMU_CAP_INTR_REMAP flag is simply to
indicate interrupt isolation is provided through the IOMMU.  Nobody
cares about the interrupt remapping support beyond that.  If source-id
checking is disabled, the remainder of interrupt remapping is
irrelevant as far as this capability is concerned imho.  Thanks,

Alex

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/5] PCI: Add a new PCI_BUS_FLAGS_MSI_REMAP flag
  2016-04-27 12:43 ` [PATCH 1/5] PCI: Add a new PCI_BUS_FLAGS_MSI_REMAP flag Yongji Xie
@ 2016-05-24 20:55   ` Bjorn Helgaas
  2016-05-25  5:46     ` Yongji Xie
  0 siblings, 1 reply; 39+ messages in thread
From: Bjorn Helgaas @ 2016-05-24 20:55 UTC (permalink / raw)
  To: Yongji Xie
  Cc: kvm, linux-kernel, linux-pci, linuxppc-dev, iommu,
	alex.williamson, bhelgaas, aik, benh, paulus, mpe, joro, warrier,
	zhong, nikunj, eric.auger, will.deacon, gwshan, David.Laight,
	alistair, ruscur

On Wed, Apr 27, 2016 at 08:43:26PM +0800, Yongji Xie wrote:
> We introduce a new pci_bus_flags, PCI_BUS_FLAGS_MSI_REMAP
> which indicates all devices on the bus are protected by the
> hardware which supports IRQ remapping(intel naming).

This changelog is ambiguous.  It's possible that there is hardware
that *supports* IRQ remapping, but does not actually *do* IRQ
remapping.  For example, an IRQ remapping capability may be present
but not enabled.

I think your intent is to set this flag only when MSI remapping is
actually *enabled* for all devices on the bus.

I'd also like to know exactly what protection is implied by
PCI_BUS_FLAGS_MSI_REMAP and IOMMU_CAP_INTR_REMAP.  I guess it means a
device can only generate MSIs to a certain set of CPUs?  I assume the
remapping hardware only checks the target address, not the data being
written?

> This flag will be used to know whether it's safe to expose
> MSI-X tables of PCI BARs to userspace. Because the capability
> of IRQ remapping can guarantee the PCI device cannot trigger
> MSIs that correspond to interrupt IDs of other devices.
> 
> There is a existing flag for this in the IOMMU space:
> 
> enum iommu_cap {
> 	IOMMU_CAP_CACHE_COHERENCY,
> --->	IOMMU_CAP_INTR_REMAP,
> 	IOMMU_CAP_NOEXEC,
> };
> 
> and Eric also posted a patchset [1] to abstract this
> capability on MSI controller side for ARM. But it would
> make sense to have a more common flag like
> PCI_BUS_FLAGS_MSI_REMAP in this patch so that we can use
> a universal flag to test this capability on PCI side for
> different archs.
> 
> [1] http://www.spinics.net/lists/kvm/msg130256.html
> 
> Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
> ---
>  include/linux/pci.h |    1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 27df4a6..d619228 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -193,6 +193,7 @@ typedef unsigned short __bitwise pci_bus_flags_t;
>  enum pci_bus_flags {
>  	PCI_BUS_FLAGS_NO_MSI   = (__force pci_bus_flags_t) 1,
>  	PCI_BUS_FLAGS_NO_MMRBC = (__force pci_bus_flags_t) 2,
> +	PCI_BUS_FLAGS_MSI_REMAP = (__force pci_bus_flags_t) 4,
>  };
>  
>  /* These values come from the PCI Express Spec */
> -- 
> 1.7.9.5
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/5] PCI: Set PCI_BUS_FLAGS_MSI_REMAP if MSI controller supports IRQ remapping
  2016-04-27 12:43 ` [PATCH 3/5] PCI: Set PCI_BUS_FLAGS_MSI_REMAP if MSI controller supports " Yongji Xie
@ 2016-05-24 21:04   ` Bjorn Helgaas
  2016-05-25  5:48     ` Yongji Xie
  0 siblings, 1 reply; 39+ messages in thread
From: Bjorn Helgaas @ 2016-05-24 21:04 UTC (permalink / raw)
  To: Yongji Xie
  Cc: kvm, linux-kernel, linux-pci, linuxppc-dev, iommu,
	alex.williamson, bhelgaas, aik, benh, paulus, mpe, joro, warrier,
	zhong, nikunj, eric.auger, will.deacon, gwshan, David.Laight,
	alistair, ruscur

On Wed, Apr 27, 2016 at 08:43:28PM +0800, Yongji Xie wrote:
> On ARM HW the capability of IRQ remapping is abstracted on
> MSI controller side. MSI_FLAG_IRQ_REMAPPING is used to advertise
> this [1].
> 
> To have a universal flag to test this capability for different
> archs on PCI side, we set PCI_BUS_FLAGS_MSI_REMAP for PCI buses
> when MSI_FLAG_IRQ_REMAPPING is set.
> 
> [1] http://www.spinics.net/lists/kvm/msg130256.html
> 
> Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
> ---
>  drivers/pci/msi.c   |   12 ++++++++++++
>  drivers/pci/probe.c |    3 +++
>  include/linux/msi.h |    6 +++++-
>  3 files changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index a080f44..1661cdf 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -1134,6 +1134,18 @@ void *msi_desc_to_pci_sysdata(struct msi_desc *desc)
>  }
>  EXPORT_SYMBOL_GPL(msi_desc_to_pci_sysdata);
>  
> +void pci_bus_check_msi_remapping(struct pci_bus *bus,
> +				 struct irq_domain *domain)
> +{
> +#ifdef CONFIG_PCI_MSI_IRQ_DOMAIN
> +	struct msi_domain_info *info;
> +
> +	info = msi_get_domain_info(domain);
> +	if (info->flags & MSI_FLAG_IRQ_REMAPPING)
> +		bus->bus_flags |= PCI_BUS_FLAGS_MSI_REMAP;
> +#endif
> +}

Functions named "check_foo" are a pet peeve of mine because the name
doesn't tell us anything about what the function *does*.  In this
case, we know it checks something about MSI remapping, but we don't
know whether we're checking whether it's enabled, disabled, or some
other property.

I'd prefer something like:

  int pci_bus_msi_isolated(struct pci_bus *bus, struct irq_domain *domain)
  {
    struct msi_domain_info *info;

    if (!domain)
      return 0;

    info = msi_get_domain_info(domain);
    if (info->flags & MSI_FLAG_IRQ_REMAPPING)
      return 1;

    return 0;
  }

  void pci_set_bus_msi_domain(struct pci_bus *bus)
  {
    ...
    if (b == bus && pci_bus_msi_isolated(bus, d))
      bus->bus_flags |= PCI_BUS_FLAGS_MSI_REMAP;

>  #ifdef CONFIG_PCI_MSI_IRQ_DOMAIN
>  /**
>   * pci_msi_domain_write_msg - Helper to write MSI message to PCI config space
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 6d7ab9b..25cf1b1 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -696,6 +696,9 @@ static void pci_set_bus_msi_domain(struct pci_bus *bus)
>  	if (!d)
>  		d = pci_host_bridge_msi_domain(b);
>  
> +	if (d && b == bus)
> +		pci_bus_check_msi_remapping(bus, d);
> +
>  	dev_set_msi_domain(&bus->dev, d);
>  }
>  
> diff --git a/include/linux/msi.h b/include/linux/msi.h
> index 03eda72..b4c649e 100644
> --- a/include/linux/msi.h
> +++ b/include/linux/msi.h
> @@ -15,6 +15,8 @@ extern int pci_msi_ignore_mask;
>  struct irq_data;
>  struct msi_desc;
>  struct pci_dev;
> +struct pci_bus;
> +struct irq_domain;
>  struct platform_msi_priv_data;
>  void __get_cached_msi_msg(struct msi_desc *entry, struct msi_msg *msg);
>  void get_cached_msi_msg(unsigned int irq, struct msi_msg *msg);
> @@ -155,6 +157,9 @@ void arch_restore_msi_irqs(struct pci_dev *dev);
>  void default_teardown_msi_irqs(struct pci_dev *dev);
>  void default_restore_msi_irqs(struct pci_dev *dev);
>  
> +void pci_bus_check_msi_remapping(struct pci_bus *bus,
> +				 struct irq_domain *domain);
> +
>  struct msi_controller {
>  	struct module *owner;
>  	struct device *dev;
> @@ -173,7 +178,6 @@ struct msi_controller {
>  #include <linux/irqhandler.h>
>  #include <asm/msi.h>
>  
> -struct irq_domain;
>  struct irq_domain_ops;
>  struct irq_chip;
>  struct device_node;
> -- 
> 1.7.9.5
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 2/5] iommu: Set PCI_BUS_FLAGS_MSI_REMAP if IOMMU have capability of IRQ remapping
  2016-04-27 12:43 ` [PATCH 2/5] iommu: Set PCI_BUS_FLAGS_MSI_REMAP if IOMMU have capability of IRQ remapping Yongji Xie
@ 2016-05-24 21:11   ` Bjorn Helgaas
  2016-05-25  5:54     ` Yongji Xie
       [not found]     ` <201605250554.u4P5sRqv014439@mx0a-001b2d01.pphosted.com>
  0 siblings, 2 replies; 39+ messages in thread
From: Bjorn Helgaas @ 2016-05-24 21:11 UTC (permalink / raw)
  To: Yongji Xie
  Cc: kvm, linux-kernel, linux-pci, linuxppc-dev, iommu,
	alex.williamson, bhelgaas, aik, benh, paulus, mpe, joro, warrier,
	zhong, nikunj, eric.auger, will.deacon, gwshan, David.Laight,
	alistair, ruscur

On Wed, Apr 27, 2016 at 08:43:27PM +0800, Yongji Xie wrote:
> The capability of IRQ remapping is abstracted on IOMMU side on
> some archs. There is a existing flag IOMMU_CAP_INTR_REMAP for this.
> 
> To have a universal flag to test this capability for different
> archs on PCI side, we set PCI_BUS_FLAGS_MSI_REMAP for PCI buses
> when IOMMU_CAP_INTR_REMAP is set.
> 
> Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
> ---
>  drivers/iommu/iommu.c |   15 +++++++++++++++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 0e3b009..5d2b6f6 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -813,6 +813,16 @@ struct iommu_group *pci_device_group(struct device *dev)
>  	return group;
>  }
>  
> +static void pci_check_msi_remapping(struct pci_dev *pdev,
> +					const struct iommu_ops *ops)
> +{
> +	struct pci_bus *bus = pdev->bus;
> +
> +	if (ops->capable(IOMMU_CAP_INTR_REMAP) &&
> +		!(bus->bus_flags & PCI_BUS_FLAGS_MSI_REMAP))
> +		bus->bus_flags |= PCI_BUS_FLAGS_MSI_REMAP;
> +}

This looks an awful lot like the pci_bus_check_msi_remapping() you add
elsewhere.  Why do we need both?

>  /**
>   * iommu_group_get_for_dev - Find or create the IOMMU group for a device
>   * @dev: target device
> @@ -871,6 +881,9 @@ static int add_iommu_group(struct device *dev, void *data)
>  	const struct iommu_ops *ops = cb->ops;
>  	int ret;
>  
> +	if (dev_is_pci(dev) && ops->capable)
> +		pci_check_msi_remapping(to_pci_dev(dev), ops);
> +
>  	if (!ops->add_device)
>  		return 0;
>  
> @@ -913,6 +926,8 @@ static int iommu_bus_notifier(struct notifier_block *nb,
>  	 * result in ADD/DEL notifiers to group->notifier
>  	 */
>  	if (action == BUS_NOTIFY_ADD_DEVICE) {
> +		if (dev_is_pci(dev) && ops->capable)
> +			pci_check_msi_remapping(to_pci_dev(dev), ops);

These calls don't smell right either.  Why do we need dev_is_pci()
checks here?  Can't this be done in the PCI probe path somehow, e.g.,
in pci_set_bus_msi_domain() or something?

>  		if (ops->add_device)
>  			return ops->add_device(dev);
>  	} else if (action == BUS_NOTIFY_REMOVED_DEVICE) {
> -- 
> 1.7.9.5
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/5] PCI: Add a new PCI_BUS_FLAGS_MSI_REMAP flag
  2016-05-24 20:55   ` Bjorn Helgaas
@ 2016-05-25  5:46     ` Yongji Xie
  0 siblings, 0 replies; 39+ messages in thread
From: Yongji Xie @ 2016-05-25  5:46 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: kvm, linux-kernel, linux-pci, linuxppc-dev, iommu,
	alex.williamson, bhelgaas, aik, benh, paulus, mpe, joro, warrier,
	zhong, nikunj, eric.auger, will.deacon, gwshan, David.Laight,
	alistair, ruscur

On 2016/5/25 4:55, Bjorn Helgaas wrote:

> On Wed, Apr 27, 2016 at 08:43:26PM +0800, Yongji Xie wrote:
>> We introduce a new pci_bus_flags, PCI_BUS_FLAGS_MSI_REMAP
>> which indicates all devices on the bus are protected by the
>> hardware which supports IRQ remapping(intel naming).
> This changelog is ambiguous.  It's possible that there is hardware
> that *supports* IRQ remapping, but does not actually *do* IRQ
> remapping.  For example, an IRQ remapping capability may be present
> but not enabled.
>
> I think your intent is to set this flag only when MSI remapping is
> actually *enabled* for all devices on the bus.

Yes. This is exactly my intent. Thank you for the correction!

> I'd also like to know exactly what protection is implied by
> PCI_BUS_FLAGS_MSI_REMAP and IOMMU_CAP_INTR_REMAP.  I guess it means a
> device can only generate MSIs to a certain set of CPUs?  I assume the
> remapping hardware only checks the target address, not the data being
> written?

When IRQ remapping is enabled, the hardware will check both target
address and data, then compute the interrupt_index from them.
Interrupt_index will be used to find a specific Interrupt Remapping Table
Entry containing some fields which could be used to identify a device or
a group of devices(these devices should be in the same isolation domain).
Then hardware can use this to verify the interrupt request. If the interrupt
request is not from the specific devices, it will be blocked.

So this flag indicate that the hardware can ensure that a given PCI device
can only shoot the MSIs assigned for it. When there is something wrong
with MSI in device or device driver, this can prevent all damage from it.

Regards,
Yongji

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/5] PCI: Set PCI_BUS_FLAGS_MSI_REMAP if MSI controller supports IRQ remapping
  2016-05-24 21:04   ` Bjorn Helgaas
@ 2016-05-25  5:48     ` Yongji Xie
  0 siblings, 0 replies; 39+ messages in thread
From: Yongji Xie @ 2016-05-25  5:48 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: kvm, linux-kernel, linux-pci, linuxppc-dev, iommu,
	alex.williamson, bhelgaas, aik, benh, paulus, mpe, joro, warrier,
	zhong, nikunj, eric.auger, will.deacon, gwshan, David.Laight,
	alistair, ruscur

On 2016/5/25 5:04, Bjorn Helgaas wrote:

> On Wed, Apr 27, 2016 at 08:43:28PM +0800, Yongji Xie wrote:
>> On ARM HW the capability of IRQ remapping is abstracted on
>> MSI controller side. MSI_FLAG_IRQ_REMAPPING is used to advertise
>> this [1].
>>
>> To have a universal flag to test this capability for different
>> archs on PCI side, we set PCI_BUS_FLAGS_MSI_REMAP for PCI buses
>> when MSI_FLAG_IRQ_REMAPPING is set.
>>
>> [1] http://www.spinics.net/lists/kvm/msg130256.html
>>
>> Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
>> ---
>>   drivers/pci/msi.c   |   12 ++++++++++++
>>   drivers/pci/probe.c |    3 +++
>>   include/linux/msi.h |    6 +++++-
>>   3 files changed, 20 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
>> index a080f44..1661cdf 100644
>> --- a/drivers/pci/msi.c
>> +++ b/drivers/pci/msi.c
>> @@ -1134,6 +1134,18 @@ void *msi_desc_to_pci_sysdata(struct msi_desc *desc)
>>   }
>>   EXPORT_SYMBOL_GPL(msi_desc_to_pci_sysdata);
>>   
>> +void pci_bus_check_msi_remapping(struct pci_bus *bus,
>> +				 struct irq_domain *domain)
>> +{
>> +#ifdef CONFIG_PCI_MSI_IRQ_DOMAIN
>> +	struct msi_domain_info *info;
>> +
>> +	info = msi_get_domain_info(domain);
>> +	if (info->flags & MSI_FLAG_IRQ_REMAPPING)
>> +		bus->bus_flags |= PCI_BUS_FLAGS_MSI_REMAP;
>> +#endif
>> +}
> Functions named "check_foo" are a pet peeve of mine because the name
> doesn't tell us anything about what the function *does*.  In this
> case, we know it checks something about MSI remapping, but we don't
> know whether we're checking whether it's enabled, disabled, or some
> other property.
>
> I'd prefer something like:
>
>    int pci_bus_msi_isolated(struct pci_bus *bus, struct irq_domain *domain)
>    {
>      struct msi_domain_info *info;
>
>      if (!domain)
>        return 0;
>
>      info = msi_get_domain_info(domain);
>      if (info->flags & MSI_FLAG_IRQ_REMAPPING)
>        return 1;
>
>      return 0;
>    }
>
>    void pci_set_bus_msi_domain(struct pci_bus *bus)
>    {
>      ...
>      if (b == bus && pci_bus_msi_isolated(bus, d))
>        bus->bus_flags |= PCI_BUS_FLAGS_MSI_REMAP;

Yes. This looks more reasonable. Thank you!

Regards,
Yongji

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 2/5] iommu: Set PCI_BUS_FLAGS_MSI_REMAP if IOMMU have capability of IRQ remapping
  2016-05-24 21:11   ` Bjorn Helgaas
@ 2016-05-25  5:54     ` Yongji Xie
       [not found]     ` <201605250554.u4P5sRqv014439@mx0a-001b2d01.pphosted.com>
  1 sibling, 0 replies; 39+ messages in thread
From: Yongji Xie @ 2016-05-25  5:54 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: kvm, linux-kernel, linux-pci, linuxppc-dev, iommu,
	alex.williamson, bhelgaas, aik, benh, paulus, mpe, joro, warrier,
	zhong, nikunj, eric.auger, will.deacon, gwshan, David.Laight,
	alistair, ruscur

On 2016/5/25 5:11, Bjorn Helgaas wrote:

> On Wed, Apr 27, 2016 at 08:43:27PM +0800, Yongji Xie wrote:
>> The capability of IRQ remapping is abstracted on IOMMU side on
>> some archs. There is a existing flag IOMMU_CAP_INTR_REMAP for this.
>>
>> To have a universal flag to test this capability for different
>> archs on PCI side, we set PCI_BUS_FLAGS_MSI_REMAP for PCI buses
>> when IOMMU_CAP_INTR_REMAP is set.
>>
>> Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
>> ---
>>   drivers/iommu/iommu.c |   15 +++++++++++++++
>>   1 file changed, 15 insertions(+)
>>
>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>> index 0e3b009..5d2b6f6 100644
>> --- a/drivers/iommu/iommu.c
>> +++ b/drivers/iommu/iommu.c
>> @@ -813,6 +813,16 @@ struct iommu_group *pci_device_group(struct device *dev)
>>   	return group;
>>   }
>>   
>> +static void pci_check_msi_remapping(struct pci_dev *pdev,
>> +					const struct iommu_ops *ops)
>> +{
>> +	struct pci_bus *bus = pdev->bus;
>> +
>> +	if (ops->capable(IOMMU_CAP_INTR_REMAP) &&
>> +		!(bus->bus_flags & PCI_BUS_FLAGS_MSI_REMAP))
>> +		bus->bus_flags |= PCI_BUS_FLAGS_MSI_REMAP;
>> +}
> This looks an awful lot like the pci_bus_check_msi_remapping() you add
> elsewhere.  Why do we need both?

I will modify this function as you suggested.  And we add this function
here because some iommu drivers would be initialed after PCI probing.

>>   /**
>>    * iommu_group_get_for_dev - Find or create the IOMMU group for a device
>>    * @dev: target device
>> @@ -871,6 +881,9 @@ static int add_iommu_group(struct device *dev, void *data)
>>   	const struct iommu_ops *ops = cb->ops;
>>   	int ret;
>>   
>> +	if (dev_is_pci(dev) && ops->capable)
>> +		pci_check_msi_remapping(to_pci_dev(dev), ops);
>> +
>>   	if (!ops->add_device)
>>   		return 0;
>>   
>> @@ -913,6 +926,8 @@ static int iommu_bus_notifier(struct notifier_block *nb,
>>   	 * result in ADD/DEL notifiers to group->notifier
>>   	 */
>>   	if (action == BUS_NOTIFY_ADD_DEVICE) {
>> +		if (dev_is_pci(dev) && ops->capable)
>> +			pci_check_msi_remapping(to_pci_dev(dev), ops);
> These calls don't smell right either.  Why do we need dev_is_pci()
> checks here?

Some platform devices may also call this.

> Can't this be done in the PCI probe path somehow, e.g.,
> in pci_set_bus_msi_domain() or something?
>

Yes, this can be done in pci_create_root_bus(). But it could only
handle the case that iommu drivers are initialed before PCI probing.

Regards,
Yongji

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 2/5] iommu: Set PCI_BUS_FLAGS_MSI_REMAP if IOMMU have capability of IRQ remapping
       [not found]     ` <201605250554.u4P5sRqv014439@mx0a-001b2d01.pphosted.com>
@ 2016-05-26  3:48       ` Bjorn Helgaas
  0 siblings, 0 replies; 39+ messages in thread
From: Bjorn Helgaas @ 2016-05-26  3:48 UTC (permalink / raw)
  To: Yongji Xie
  Cc: kvm, linux-kernel, linux-pci, linuxppc-dev, iommu,
	alex.williamson, bhelgaas, aik, benh, paulus, mpe, joro, warrier,
	zhong, nikunj, eric.auger, will.deacon, gwshan, David.Laight,
	alistair, ruscur

On Wed, May 25, 2016 at 01:54:23PM +0800, Yongji Xie wrote:
> On 2016/5/25 5:11, Bjorn Helgaas wrote:
> >On Wed, Apr 27, 2016 at 08:43:27PM +0800, Yongji Xie wrote:
> >>The capability of IRQ remapping is abstracted on IOMMU side on
> >>some archs. There is a existing flag IOMMU_CAP_INTR_REMAP for this.
> >>
> >>To have a universal flag to test this capability for different
> >>archs on PCI side, we set PCI_BUS_FLAGS_MSI_REMAP for PCI buses
> >>when IOMMU_CAP_INTR_REMAP is set.
> >>
> >>Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
> >>---
> >>  drivers/iommu/iommu.c |   15 +++++++++++++++
> >>  1 file changed, 15 insertions(+)
> >>
> >>diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> >>index 0e3b009..5d2b6f6 100644
> >>--- a/drivers/iommu/iommu.c
> >>+++ b/drivers/iommu/iommu.c
> >>@@ -813,6 +813,16 @@ struct iommu_group *pci_device_group(struct device *dev)
> >>  	return group;
> >>  }
> >>+static void pci_check_msi_remapping(struct pci_dev *pdev,
> >>+					const struct iommu_ops *ops)
> >>+{
> >>+	struct pci_bus *bus = pdev->bus;
> >>+
> >>+	if (ops->capable(IOMMU_CAP_INTR_REMAP) &&
> >>+		!(bus->bus_flags & PCI_BUS_FLAGS_MSI_REMAP))
> >>+		bus->bus_flags |= PCI_BUS_FLAGS_MSI_REMAP;
> >>+}
> >This looks an awful lot like the pci_bus_check_msi_remapping() you add
> >elsewhere.  Why do we need both?
> 
> I will modify this function as you suggested.  And we add this function
> here because some iommu drivers would be initialed after PCI probing.
>
> >>  /**
> >>   * iommu_group_get_for_dev - Find or create the IOMMU group for a device
> >>   * @dev: target device
> >>@@ -871,6 +881,9 @@ static int add_iommu_group(struct device *dev, void *data)
> >>  	const struct iommu_ops *ops = cb->ops;
> >>  	int ret;
> >>+	if (dev_is_pci(dev) && ops->capable)
> >>+		pci_check_msi_remapping(to_pci_dev(dev), ops);
> >>+
> >>  	if (!ops->add_device)
> >>  		return 0;
> >>@@ -913,6 +926,8 @@ static int iommu_bus_notifier(struct notifier_block *nb,
> >>  	 * result in ADD/DEL notifiers to group->notifier
> >>  	 */
> >>  	if (action == BUS_NOTIFY_ADD_DEVICE) {
> >>+		if (dev_is_pci(dev) && ops->capable)
> >>+			pci_check_msi_remapping(to_pci_dev(dev), ops);
> >These calls don't smell right either.  Why do we need dev_is_pci()
> >checks here?
> 
> Some platform devices may also call this.
> 
> >Can't this be done in the PCI probe path somehow, e.g.,
> >in pci_set_bus_msi_domain() or something?
> 
> Yes, this can be done in pci_create_root_bus(). But it could only
> handle the case that iommu drivers are initialed before PCI probing.

Hmm, that's sort of what I expected, but I don't like it.  I don't
think initializing IOMMU drivers after probing PCI devices is the
optimal design.  As soon as we add a PCI device, a driver can bind to
it and start using it.  If we set up an IOMMU after a driver has
claimed the device, something is going to break.  The driver may have
already made DMA mappings that would now be invalid.

IOMMU drivers are logically between the CPU and the device, so they
should be initialized before the device is enumerated.  I know there
are probably some legacy issues behind this design, and I know it has
nothing to do with your patch, so this is not a very constructive
comment.

I just want to be on record that I'm not a fan of this out-of-order
initialization, and I don't like the notification scheme for handling
device hotplug either.  Notifiers make the code hard to read, and you
can't tell what order things happen in.  It just seems like a hack to
connect things together when they should really have been designed to
be more tightly integrated from the beginning.

Bjorn

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2016-05-26  3:49 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-04-27 12:43 [PATCH 0/5] vfio-pci: Add support for mmapping MSI-X table Yongji Xie
2016-04-27 12:43 ` [PATCH 1/5] PCI: Add a new PCI_BUS_FLAGS_MSI_REMAP flag Yongji Xie
2016-05-24 20:55   ` Bjorn Helgaas
2016-05-25  5:46     ` Yongji Xie
2016-04-27 12:43 ` [PATCH 2/5] iommu: Set PCI_BUS_FLAGS_MSI_REMAP if IOMMU have capability of IRQ remapping Yongji Xie
2016-05-24 21:11   ` Bjorn Helgaas
2016-05-25  5:54     ` Yongji Xie
     [not found]     ` <201605250554.u4P5sRqv014439@mx0a-001b2d01.pphosted.com>
2016-05-26  3:48       ` Bjorn Helgaas
2016-04-27 12:43 ` [PATCH 3/5] PCI: Set PCI_BUS_FLAGS_MSI_REMAP if MSI controller supports " Yongji Xie
2016-05-24 21:04   ` Bjorn Helgaas
2016-05-25  5:48     ` Yongji Xie
2016-04-27 12:43 ` [PATCH 4/5] pci-ioda: Set PCI_BUS_FLAGS_MSI_REMAP for IODA host bridge Yongji Xie
2016-05-06  6:34   ` Alexey Kardashevskiy
2016-04-27 12:43 ` [PATCH 5/5] vfio-pci: Allow to mmap MSI-X table if interrupt remapping is supported Yongji Xie
2016-05-03  5:34   ` Tian, Kevin
2016-05-03  6:08     ` Yongji Xie
2016-05-03  6:22       ` Tian, Kevin
2016-05-03  7:34         ` Yongji Xie
2016-05-05  9:36           ` Tian, Kevin
2016-05-05  9:54             ` David Laight
2016-05-05 11:42               ` Yongji Xie
2016-05-05 12:15                 ` Tian, Kevin
2016-05-05 13:28                   ` Yongji Xie
2016-05-05 15:05                   ` Alex Williamson
2016-05-06  6:35                     ` Alexey Kardashevskiy
2016-05-06 16:54                       ` Alex Williamson
2016-05-11  6:29                     ` Tian, Kevin
2016-05-11 15:53                       ` Alex Williamson
2016-05-12  1:19                         ` Tian, Kevin
2016-05-12  2:20                           ` Alex Williamson
2016-05-12  4:53                             ` Tian, Kevin
2016-05-12 17:47                               ` Alex Williamson
2016-05-13  2:33                                 ` Tian, Kevin
2016-05-13  5:32                                   ` Alex Williamson
2016-05-13  6:50                                     ` Tian, Kevin
2016-05-13 16:42                                       ` Alex Williamson
2016-05-13  9:16                                     ` David Laight
2016-05-13  2:36                                 ` Tian, Kevin
2016-05-05 11:44             ` Yongji Xie

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).