* [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table
@ 2017-08-07  7:25 ` Alexey Kardashevskiy
  0 siblings, 0 replies; 82+ messages in thread
From: Alexey Kardashevskiy @ 2017-08-07  7:25 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, kvm-ppc, kvm, Yongji Xie,
	Eric Auger, Kyle Mahlkuch, Alex Williamson, Jike Song,
	Bjorn Helgaas, Robin Murphy, Joerg Roedel, Arvind Yadav,
	Benjamin Herrenschmidt, David Woodhouse, Kirti Wankhede,
	Mauricio Faria de Oliveira, Neo Jia, Paul Mackerras,
	Vlad Tsyrklevich, iommu, linux-kernel


This is a followup for "[PATCH kernel v4 0/6] vfio-pci: Add support for mmapping MSI-X table"
http://www.spinics.net/lists/kvm/msg152232.html

This time it uses "caps" in IOMMU groups. The main open question is whether PCI
bus flags or IOMMU domains would still be a better fit (and if so, which one).



Here is some background:

The current vfio-pci implementation disallows mmapping the page
containing the MSI-X table, since otherwise users could write directly
to the MSI-X table and generate incorrect MSIs.

However, this causes a performance problem when performance-critical
device registers live in the same page as the MSI-X table: MMIO accesses
to those registers then have to be handled by QEMU emulation rather than
performed directly by the guest.

To solve this, the series allows the MSI-X table to be exposed to
userspace when the hardware supports interrupt remapping, which ensures
that a given PCI device can only trigger the MSIs assigned to it.
Earlier versions introduced a new bus flag, PCI_BUS_FLAGS_MSI_REMAP,
to test this capability on the PCI side across different archs; this
version tests an IOMMU group capability instead.
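
The semantics of the new group capability mask can be sketched in plain C.
The following is a standalone userspace model of the helpers added in
patch 1/5, for illustration only: the struct here is simplified and is
not the kernel's real struct iommu_group.

```c
#include <assert.h>
#include <stdbool.h>

/* Standalone model of the group capability helpers from patch 1/5.
 * Only the caps mask is modelled; everything else is omitted. */
#define IOMMU_GROUP_CAP_ISOLATE_MSIX	(1U)

struct iommu_group {
	unsigned long caps;
};

/* Clear the bits in @clearcaps, then set the bits in @setcaps. */
static void iommu_group_set_caps(struct iommu_group *group,
				 unsigned long clearcaps,
				 unsigned long setcaps)
{
	group->caps &= ~clearcaps;
	group->caps |= setcaps;
}

/* Return true if any of the bits in @cap are set for the group. */
static bool iommu_group_is_capable(struct iommu_group *group,
				   unsigned long cap)
{
	return !!(group->caps & cap);
}
```

With this in place, vfio-pci can gate the MSI-X table mmap decision on
iommu_group_is_capable(group, IOMMU_GROUP_CAP_ISOLATE_MSIX), while each
arch decides when to set the capability.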


This is based on sha1
26c5cebfdb6c "Merge branch 'parisc-4.13-4' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux"

Please comment. Thanks.

Changelog:

v5:
* redid the whole thing via so-called IOMMU group capabilities

v4:
* rebased on recent upstream
* got all 6 patches from v2 (v3 was missing some)




Alexey Kardashevskiy (5):
  iommu: Add capabilities to a group
  iommu: Set IOMMU_GROUP_CAP_ISOLATE_MSIX if MSI controller enables IRQ
    remapping
  iommu/intel/amd: Set IOMMU_GROUP_CAP_ISOLATE_MSIX if IRQ remapping is
    enabled
  powerpc/iommu: Set IOMMU_GROUP_CAP_ISOLATE_MSIX
  vfio-pci: Allow to expose MSI-X table to userspace when safe

 include/linux/iommu.h            | 20 ++++++++++++++++++++
 include/linux/vfio.h             |  1 +
 arch/powerpc/kernel/iommu.c      |  1 +
 drivers/iommu/amd_iommu.c        |  3 +++
 drivers/iommu/intel-iommu.c      |  3 +++
 drivers/iommu/iommu.c            | 35 +++++++++++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci.c      | 20 +++++++++++++++++---
 drivers/vfio/pci/vfio_pci_rdwr.c |  5 ++++-
 drivers/vfio/vfio.c              | 15 +++++++++++++++
 9 files changed, 99 insertions(+), 4 deletions(-)

-- 
2.11.0


* [RFC PATCH v5 1/5] iommu: Add capabilities to a group
@ 2017-08-07  7:25   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 82+ messages in thread
From: Alexey Kardashevskiy @ 2017-08-07  7:25 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, kvm-ppc, kvm, Yongji Xie,
	Eric Auger, Kyle Mahlkuch, Alex Williamson, Jike Song,
	Bjorn Helgaas, Robin Murphy, Joerg Roedel, Arvind Yadav,
	Benjamin Herrenschmidt, David Woodhouse, Kirti Wankhede,
	Mauricio Faria de Oliveira, Neo Jia, Paul Mackerras,
	Vlad Tsyrklevich, iommu, linux-kernel

This introduces capabilities for IOMMU groups. The first defined
capability is IOMMU_GROUP_CAP_ISOLATE_MSIX, which tells IOMMU group
users that a particular group is capable of MSI-X message filtering;
this is useful when deciding whether or not to allow mapping of the
MSI-X table to userspace. Various architectures will enable it when
they decide that it is safe to do so.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 include/linux/iommu.h | 20 ++++++++++++++++++++
 drivers/iommu/iommu.c | 28 ++++++++++++++++++++++++++++
 2 files changed, 48 insertions(+)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 2cb54adc4a33..6b6f3c2f4904 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -155,6 +155,9 @@ struct iommu_resv_region {
 	enum iommu_resv_type	type;
 };
 
+/* IOMMU group capabilities */
+#define IOMMU_GROUP_CAP_ISOLATE_MSIX		(1U)
+
 #ifdef CONFIG_IOMMU_API
 
 /**
@@ -312,6 +315,11 @@ extern void *iommu_group_get_iommudata(struct iommu_group *group);
 extern void iommu_group_set_iommudata(struct iommu_group *group,
 				      void *iommu_data,
 				      void (*release)(void *iommu_data));
+extern void iommu_group_set_caps(struct iommu_group *group,
+				 unsigned long clearcaps,
+				 unsigned long setcaps);
+extern bool iommu_group_is_capable(struct iommu_group *group,
+				   unsigned long cap);
 extern int iommu_group_set_name(struct iommu_group *group, const char *name);
 extern int iommu_group_add_device(struct iommu_group *group,
 				  struct device *dev);
@@ -513,6 +521,18 @@ static inline void iommu_group_set_iommudata(struct iommu_group *group,
 {
 }
 
+static inline void iommu_group_set_caps(struct iommu_group *group,
+					unsigned long clearcaps,
+					unsigned long setcaps)
+{
+}
+
+static inline bool iommu_group_is_capable(struct iommu_group *group,
+					  unsigned long cap)
+{
+	return false;
+}
+
 static inline int iommu_group_set_name(struct iommu_group *group,
 				       const char *name)
 {
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 3f6ea160afed..6b2c34fe2c3d 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -52,6 +52,7 @@ struct iommu_group {
 	void (*iommu_data_release)(void *iommu_data);
 	char *name;
 	int id;
+	unsigned long caps;
 	struct iommu_domain *default_domain;
 	struct iommu_domain *domain;
 };
@@ -447,6 +448,33 @@ void iommu_group_set_iommudata(struct iommu_group *group, void *iommu_data,
 EXPORT_SYMBOL_GPL(iommu_group_set_iommudata);
 
 /**
+ * iommu_group_set_caps - Change the group capabilities
+ * @group: the group
+ * @clearcaps: capabilities mask to remove
+ * @setcaps: capabilities mask to add
+ *
+ * IOMMU groups can be capable of various features which device drivers
+ * may query and adjust their behavior accordingly.
+ */
+void iommu_group_set_caps(struct iommu_group *group,
+		unsigned long clearcaps, unsigned long setcaps)
+{
+	group->caps &= ~clearcaps;
+	group->caps |= setcaps;
+}
+EXPORT_SYMBOL_GPL(iommu_group_set_caps);
+
+/**
+ * iommu_group_is_capable - Returns whether a group capability is present
+ * @group: the group
+ */
+bool iommu_group_is_capable(struct iommu_group *group, unsigned long cap)
+{
+	return !!(group->caps & cap);
+}
+EXPORT_SYMBOL_GPL(iommu_group_is_capable);
+
+/**
  * iommu_group_set_name - set name for a group
  * @group: the group
  * @name: name
-- 
2.11.0



* [RFC PATCH v5 2/5] iommu: Set IOMMU_GROUP_CAP_ISOLATE_MSIX if MSI controller enables IRQ remapping
@ 2017-08-07  7:25   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 82+ messages in thread
From: Alexey Kardashevskiy @ 2017-08-07  7:25 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, kvm-ppc, kvm, Yongji Xie,
	Eric Auger, Kyle Mahlkuch, Alex Williamson, Jike Song,
	Bjorn Helgaas, Robin Murphy, Joerg Roedel, Arvind Yadav,
	Benjamin Herrenschmidt, David Woodhouse, Kirti Wankhede,
	Mauricio Faria de Oliveira, Neo Jia, Paul Mackerras,
	Vlad Tsyrklevich, iommu, linux-kernel

This sets IOMMU_GROUP_CAP_ISOLATE_MSIX on a group if MSI remapping is
enabled on the device's IRQ domain; in practice this is expected to set
the capability on ARM.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 drivers/iommu/iommu.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 6b2c34fe2c3d..e720e90fa93c 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -32,6 +32,7 @@
 #include <linux/pci.h>
 #include <linux/bitops.h>
 #include <linux/property.h>
+#include <linux/irqdomain.h>
 #include <trace/events/iommu.h>
 
 static struct kset *iommu_group_kset;
@@ -1028,6 +1029,7 @@ struct iommu_group *iommu_group_get_for_dev(struct device *dev)
 	const struct iommu_ops *ops = dev->bus->iommu_ops;
 	struct iommu_group *group;
 	int ret;
+	struct irq_domain *d = dev_get_msi_domain(dev);
 
 	group = iommu_group_get(dev);
 	if (group)
@@ -1070,6 +1072,11 @@ struct iommu_group *iommu_group_get_for_dev(struct device *dev)
 		return ERR_PTR(ret);
 	}
 
+	if (d && irq_domain_is_msi(d) &&
+			irq_domain_hierarchical_is_msi_remap(d))
+		iommu_group_set_caps(group, 0,
+				IOMMU_GROUP_CAP_ISOLATE_MSIX);
+
 	return group;
 }
 
-- 
2.11.0



* [RFC PATCH v5 3/5] iommu/intel/amd: Set IOMMU_GROUP_CAP_ISOLATE_MSIX if IRQ remapping is enabled
@ 2017-08-07  7:25   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 82+ messages in thread
From: Alexey Kardashevskiy @ 2017-08-07  7:25 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, kvm-ppc, kvm, Yongji Xie,
	Eric Auger, Kyle Mahlkuch, Alex Williamson, Jike Song,
	Bjorn Helgaas, Robin Murphy, Joerg Roedel, Arvind Yadav,
	Benjamin Herrenschmidt, David Woodhouse, Kirti Wankhede,
	Mauricio Faria de Oliveira, Neo Jia, Paul Mackerras,
	Vlad Tsyrklevich, iommu, linux-kernel

This sets IOMMU_GROUP_CAP_ISOLATE_MSIX on a group if IRQ remapping
is enabled. On Intel, this additionally checks disable_sourceid_checking;
AMD ignores the "nosid" kernel parameter.

Here is some background on how the isolation works:

On Intel VT-d [1], there is an Interrupt Remapping Table with one entry
per interrupt; each entry carries the source-id (i.e. BDFN) of the device
allowed to signal that interrupt.

On AMD IOMMU [2], there is a Device Table whose entries are indexed by
DeviceID, which is the BDFN.

[1] 9.10 Interrupt Remapping Table Entry (IRTE) for Remapped Interrupts
https://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/vt-directed-io-spec.pdf

[2] "2.2 Data Structures" and "2.2.5 Interrupt Remapping Tables"
https://support.amd.com/TechDocs/48882_IOMMU.pdf

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 drivers/iommu/amd_iommu.c   | 3 +++
 drivers/iommu/intel-iommu.c | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 688e77576e5a..d55fcbf3267e 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -470,6 +470,9 @@ static void init_iommu_group(struct device *dev)
 	if (IS_ERR(group))
 		return;
 
+	if (irq_remapping_enabled)
+		iommu_group_set_caps(group, 0, IOMMU_GROUP_CAP_ISOLATE_MSIX);
+
 	iommu_group_put(group);
 }
 
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index b7e670a7c243..3896c2d44bfa 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -5233,6 +5233,9 @@ static int intel_iommu_add_device(struct device *dev)
 	if (IS_ERR(group))
 		return PTR_ERR(group);
 
+	if (irq_remapping_enabled && !disable_sourceid_checking)
+		iommu_group_set_caps(group, 0, IOMMU_GROUP_CAP_ISOLATE_MSIX);
+
 	iommu_group_put(group);
 	return 0;
 }
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 82+ messages in thread
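As a hedged sketch of the isolation described above (the real IRTE has many more fields and qualifier bits; this toy model captures only the source-id check that makes MSI-X exposure safe):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy interrupt-remapping entry: only the fields relevant here. */
struct toy_irte {
	uint16_t source_id;	/* BDF allowed to signal this interrupt */
	bool     present;
};

/* An MSI is delivered only when the entry is present and the
 * requester's BDF matches the programmed source-id; a device writing
 * arbitrary data into its own MSI-X table therefore cannot shoot
 * interrupts on behalf of another device. */
static bool toy_msi_allowed(const struct toy_irte *table, unsigned int idx,
			    uint16_t requester_bdf)
{
	const struct toy_irte *e = &table[idx];

	return e->present && e->source_id == requester_bdf;
}
```

With disable_sourceid_checking ("nosid") in effect, the source_id comparison above would be skipped, which is exactly why the Intel hunk refuses to set the capability in that case.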

* [RFC PATCH v5 4/5] powerpc/iommu: Set IOMMU_GROUP_CAP_ISOLATE_MSIX
  2017-08-07  7:25 ` Alexey Kardashevskiy
  (?)
  (?)
@ 2017-08-07  7:25   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 82+ messages in thread
From: Alexey Kardashevskiy @ 2017-08-07  7:25 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, kvm-ppc, kvm, Yongji Xie,
	Eric Auger, Kyle Mahlkuch, Alex Williamson, Jike Song,
	Bjorn Helgaas, Robin Murphy, Joerg Roedel, Arvind Yadav,
	Benjamin Herrenschmidt, David Woodhouse, Kirti Wankhede,
	Mauricio Faria de Oliveira, Neo Jia, Paul Mackerras,
	Vlad Tsyrklevich, iommu, linux-kernel

This sets IOMMU_GROUP_CAP_ISOLATE_MSIX on a group unconditionally, as
there is no IOMMU-capable hardware without this feature.

On IBM POWERPC (POWER8) [1], the PCI host bridge maintains a BDFN-to-PE
translation (PE stands for "partitionable endpoint"), and the PE index is
used to look up the Interrupt Vector Table (IVT) to identify the interrupt
server. Without these translations in place, MSI-X messages will not pass
the PHB.

[1] 3.2.4. MSI Design
http://openpowerfoundation.org/wp-content/uploads/resources/IODA2Spec/IODA2WGSpec-1.0.0-20160217.pdf

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/kernel/iommu.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 233ca3fe4754..dca0a83f1560 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -936,6 +936,7 @@ void iommu_register_group(struct iommu_table_group *table_group,
 	if (!name)
 		return;
 	iommu_group_set_name(grp, name);
+	iommu_group_set_caps(grp, 0, IOMMU_GROUP_CAP_ISOLATE_MSIX);
 	kfree(name);
 }
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 82+ messages in thread
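A minimal toy model of the PHB behaviour described above (names are invented and the real IODA2 tables are far richer; this only illustrates why the capability can be set unconditionally):

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NO_PE 0xFFFFu

/* Toy PHB: an inbound MSI is forwarded only when the requester's BDFN
 * resolves to a valid PE in the bridge's translation table; a message
 * from a BDFN without a translation is dropped, so the isolation is
 * always in effect. */
static int toy_phb_accepts_msi(const uint16_t *bdfn_to_pe, uint16_t bdfn)
{
	return bdfn_to_pe[bdfn] != TOY_NO_PE;
}
```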

* [RFC PATCH v5 5/5] vfio-pci: Allow to expose MSI-X table to userspace when safe
  2017-08-07  7:25 ` Alexey Kardashevskiy
  (?)
  (?)
@ 2017-08-07  7:25   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 82+ messages in thread
From: Alexey Kardashevskiy @ 2017-08-07  7:25 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, kvm-ppc, kvm, Yongji Xie,
	Eric Auger, Kyle Mahlkuch, Alex Williamson, Jike Song,
	Bjorn Helgaas, Robin Murphy, Joerg Roedel, Arvind Yadav,
	Benjamin Herrenschmidt, David Woodhouse, Kirti Wankhede,
	Mauricio Faria de Oliveira, Neo Jia, Paul Mackerras,
	Vlad Tsyrklevich, iommu, linux-kernel

Some devices have an MSI-X BAR that is not aligned to a system page size
greater than 4K (such as 64K on ppc64), which at the moment prevents
such MMIO pages from being mapped to userspace for the sake of
protecting the MSI-X BAR contents. If such a page happens to share
a system page with frequently accessed registers,
the entire system page has to be emulated, which can seriously affect
performance.

This allows mapping MSI-X tables to userspace if the hardware provides
MSI-X isolation via interrupt remapping or filtering; in other words,
allowing direct access to the MSI-X BAR cannot harm other devices
or cause spurious interrupts visible to the kernel.

This adds a wrapper helper to check whether a capability is supported by
an IOMMU group.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 include/linux/vfio.h             |  1 +
 drivers/vfio/pci/vfio_pci.c      | 20 +++++++++++++++++---
 drivers/vfio/pci/vfio_pci_rdwr.c |  5 ++++-
 drivers/vfio/vfio.c              | 15 +++++++++++++++
 4 files changed, 37 insertions(+), 4 deletions(-)

diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 586809abb273..7110bca2fb60 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -46,6 +46,7 @@ struct vfio_device_ops {
 
 extern struct iommu_group *vfio_iommu_group_get(struct device *dev);
 extern void vfio_iommu_group_put(struct iommu_group *group, struct device *dev);
+extern bool vfio_iommu_group_is_capable(struct device *dev, unsigned long cap);
 
 extern int vfio_add_group_dev(struct device *dev,
 			      const struct vfio_device_ops *ops,
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index d87a0a3cda14..c4c39ed64b1e 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -561,11 +561,17 @@ static int msix_sparse_mmap_cap(struct vfio_pci_device *vdev,
 	struct vfio_region_info_cap_sparse_mmap *sparse;
 	size_t end, size;
 	int nr_areas = 2, i = 0, ret;
+	bool is_msix_isolated = vfio_iommu_group_is_capable(&vdev->pdev->dev,
+			IOMMU_GROUP_CAP_ISOLATE_MSIX);
 
 	end = pci_resource_len(vdev->pdev, vdev->msix_bar);
 
-	/* If MSI-X table is aligned to the start or end, only one area */
-	if (((vdev->msix_offset & PAGE_MASK) == 0) ||
+	/*
+	 * If MSI-X table is allowed to mmap because of the capability
+	 * of IRQ remapping or aligned to the start or end, only one area
+	 */
+	if (is_msix_isolated ||
+	    ((vdev->msix_offset & PAGE_MASK) == 0) ||
 	    (PAGE_ALIGN(vdev->msix_offset + vdev->msix_size) >= end))
 		nr_areas = 1;
 
@@ -577,6 +583,12 @@ static int msix_sparse_mmap_cap(struct vfio_pci_device *vdev,
 
 	sparse->nr_areas = nr_areas;
 
+	if (is_msix_isolated) {
+		sparse->areas[i].offset = 0;
+		sparse->areas[i].size = end;
+		return 0;
+	}
+
 	if (vdev->msix_offset & PAGE_MASK) {
 		sparse->areas[i].offset = 0;
 		sparse->areas[i].size = vdev->msix_offset & PAGE_MASK;
@@ -1094,6 +1106,8 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
 	unsigned int index;
 	u64 phys_len, req_len, pgoff, req_start;
 	int ret;
+	bool is_msix_isolated = vfio_iommu_group_is_capable(&vdev->pdev->dev,
+			IOMMU_GROUP_CAP_ISOLATE_MSIX);
 
 	index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
 
@@ -1115,7 +1129,7 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
 	if (req_start + req_len > phys_len)
 		return -EINVAL;
 
-	if (index == vdev->msix_bar) {
+	if (index == vdev->msix_bar && !is_msix_isolated) {
 		/*
 		 * Disallow mmaps overlapping the MSI-X table; users don't
 		 * get to touch this directly.  We could find somewhere
diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
index 357243d76f10..7514206a5ea7 100644
--- a/drivers/vfio/pci/vfio_pci_rdwr.c
+++ b/drivers/vfio/pci/vfio_pci_rdwr.c
@@ -18,6 +18,7 @@
 #include <linux/uaccess.h>
 #include <linux/io.h>
 #include <linux/vgaarb.h>
+#include <linux/vfio.h>
 
 #include "vfio_pci_private.h"
 
@@ -123,6 +124,8 @@ ssize_t vfio_pci_bar_rw(struct vfio_pci_device *vdev, char __user *buf,
 	resource_size_t end;
 	void __iomem *io;
 	ssize_t done;
+	bool is_msix_isolated = vfio_iommu_group_is_capable(&vdev->pdev->dev,
+			IOMMU_GROUP_CAP_ISOLATE_MSIX);
 
 	if (pci_resource_start(pdev, bar))
 		end = pci_resource_len(pdev, bar);
@@ -164,7 +167,7 @@ ssize_t vfio_pci_bar_rw(struct vfio_pci_device *vdev, char __user *buf,
 	} else
 		io = vdev->barmap[bar];
 
-	if (bar == vdev->msix_bar) {
+	if (bar == vdev->msix_bar && !is_msix_isolated) {
 		x_start = vdev->msix_offset;
 		x_end = vdev->msix_offset + vdev->msix_size;
 	}
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 330d50582f40..5292c4a5ae8f 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -169,6 +169,21 @@ void vfio_iommu_group_put(struct iommu_group *group, struct device *dev)
 }
 EXPORT_SYMBOL_GPL(vfio_iommu_group_put);
 
+bool vfio_iommu_group_is_capable(struct device *dev, unsigned long cap)
+{
+	bool ret = false;
+	struct iommu_group *group = vfio_iommu_group_get(dev);
+
+	if (group) {
+		ret = iommu_group_is_capable(group, cap);
+
+		vfio_iommu_group_put(group, dev);
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_iommu_group_is_capable);
+
 #ifdef CONFIG_VFIO_NOIOMMU
 static void *vfio_noiommu_open(unsigned long arg)
 {
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 82+ messages in thread
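The sparse-mmap area computation in the patch above can be sketched as a standalone toy function (the TOY_* names are invented; the logic mirrors the condition in msix_sparse_mmap_cap()):

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE	0x10000UL	/* e.g. 64K pages on ppc64 */
#define TOY_PAGE_MASK	(~(TOY_PAGE_SIZE - 1))
#define TOY_PAGE_ALIGN(x) (((x) + TOY_PAGE_SIZE - 1) & TOY_PAGE_MASK)

/* Number of mmap-able areas reported for an MSI-X BAR: one when the
 * platform isolates MSI-X (the whole BAR is mappable) or when the table
 * sits in the first page or reaches the end of the BAR; otherwise two
 * (the regions before and after the table's pages). */
static int toy_nr_areas(size_t bar_len, size_t msix_offset,
			size_t msix_size, int msix_isolated)
{
	if (msix_isolated ||
	    (msix_offset & TOY_PAGE_MASK) == 0 ||
	    TOY_PAGE_ALIGN(msix_offset + msix_size) >= bar_len)
		return 1;
	return 2;
}
```

Note that `offset & TOY_PAGE_MASK == 0` means the table lives somewhere in the first page of the BAR, not that the table itself is page-aligned, which is why the original comment says "aligned to the start".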

* [RFC PATCH v5 5/5] vfio-pci: Allow to expose MSI-X table to userspace when safe
@ 2017-08-07  7:25   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 82+ messages in thread
From: Alexey Kardashevskiy @ 2017-08-07  7:25 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, kvm-ppc, kvm, Yongji Xie,
	Eric Auger, Kyle Mahlkuch, Alex Williamson, Jike Song,
	Bjorn Helgaas, Robin Murphy, Joerg Roedel, Arvind Yadav,
	Benjamin Herrenschmidt, David Woodhouse, Kirti Wankhede,
	Mauricio Faria de Oliveira, Neo Jia, Paul Mackerras,
	Vlad Tsyrklevich

Some devices have a MSIX BAR not aligned to the system page size
greater than 4K (like 64k for ppc64) which at the moment prevents
such MMIO pages from being mapped to the userspace for the sake of
the MSIX BAR content protection. If such page happens to share
the same system page with some frequently accessed registers,
the entire system page will be emulated which can seriously affect
performance.

This allows mapping of MSI-X tables to userspace if hardware provides
MSIX isolation via interrupt remapping or filtering; in other words
allowing direct access to the MSIX BAR won't do any harm to other devices
or cause spurious interrupts visible to the kernel.

This adds a wrapping helper to check if a capability is supported by
an IOMMU group.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 include/linux/vfio.h             |  1 +
 drivers/vfio/pci/vfio_pci.c      | 20 +++++++++++++++++---
 drivers/vfio/pci/vfio_pci_rdwr.c |  5 ++++-
 drivers/vfio/vfio.c              | 15 +++++++++++++++
 4 files changed, 37 insertions(+), 4 deletions(-)

diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 586809abb273..7110bca2fb60 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -46,6 +46,7 @@ struct vfio_device_ops {
 
 extern struct iommu_group *vfio_iommu_group_get(struct device *dev);
 extern void vfio_iommu_group_put(struct iommu_group *group, struct device *dev);
+extern bool vfio_iommu_group_is_capable(struct device *dev, unsigned long cap);
 
 extern int vfio_add_group_dev(struct device *dev,
 			      const struct vfio_device_ops *ops,
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index d87a0a3cda14..c4c39ed64b1e 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -561,11 +561,17 @@ static int msix_sparse_mmap_cap(struct vfio_pci_device *vdev,
 	struct vfio_region_info_cap_sparse_mmap *sparse;
 	size_t end, size;
 	int nr_areas = 2, i = 0, ret;
+	bool is_msix_isolated = vfio_iommu_group_is_capable(&vdev->pdev->dev,
+			IOMMU_GROUP_CAP_ISOLATE_MSIX);
 
 	end = pci_resource_len(vdev->pdev, vdev->msix_bar);
 
-	/* If MSI-X table is aligned to the start or end, only one area */
-	if (((vdev->msix_offset & PAGE_MASK) == 0) ||
+	/*
+	 * If MSI-X table is allowed to mmap because of the capability
+	 * of IRQ remapping or aligned to the start or end, only one area
+	 */
+	if (is_msix_isolated ||
+	    ((vdev->msix_offset & PAGE_MASK) == 0) ||
 	    (PAGE_ALIGN(vdev->msix_offset + vdev->msix_size) >= end))
 		nr_areas = 1;
 
@@ -577,6 +583,12 @@ static int msix_sparse_mmap_cap(struct vfio_pci_device *vdev,
 
 	sparse->nr_areas = nr_areas;
 
+	if (is_msix_isolated) {
+		sparse->areas[i].offset = 0;
+		sparse->areas[i].size = end;
+		return 0;
+	}
+
 	if (vdev->msix_offset & PAGE_MASK) {
 		sparse->areas[i].offset = 0;
 		sparse->areas[i].size = vdev->msix_offset & PAGE_MASK;
@@ -1094,6 +1106,8 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
 	unsigned int index;
 	u64 phys_len, req_len, pgoff, req_start;
 	int ret;
+	bool is_msix_isolated = vfio_iommu_group_is_capable(&vdev->pdev->dev,
+			IOMMU_GROUP_CAP_ISOLATE_MSIX);
 
 	index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
 
@@ -1115,7 +1129,7 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
 	if (req_start + req_len > phys_len)
 		return -EINVAL;
 
-	if (index == vdev->msix_bar) {
+	if (index == vdev->msix_bar && !is_msix_isolated) {
 		/*
 		 * Disallow mmaps overlapping the MSI-X table; users don't
 		 * get to touch this directly.  We could find somewhere
diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
index 357243d76f10..7514206a5ea7 100644
--- a/drivers/vfio/pci/vfio_pci_rdwr.c
+++ b/drivers/vfio/pci/vfio_pci_rdwr.c
@@ -18,6 +18,7 @@
 #include <linux/uaccess.h>
 #include <linux/io.h>
 #include <linux/vgaarb.h>
+#include <linux/vfio.h>
 
 #include "vfio_pci_private.h"
 
@@ -123,6 +124,8 @@ ssize_t vfio_pci_bar_rw(struct vfio_pci_device *vdev, char __user *buf,
 	resource_size_t end;
 	void __iomem *io;
 	ssize_t done;
+	bool is_msix_isolated = vfio_iommu_group_is_capable(&vdev->pdev->dev,
+			IOMMU_GROUP_CAP_ISOLATE_MSIX);
 
 	if (pci_resource_start(pdev, bar))
 		end = pci_resource_len(pdev, bar);
@@ -164,7 +167,7 @@ ssize_t vfio_pci_bar_rw(struct vfio_pci_device *vdev, char __user *buf,
 	} else
 		io = vdev->barmap[bar];
 
-	if (bar == vdev->msix_bar) {
+	if (bar == vdev->msix_bar && !is_msix_isolated) {
 		x_start = vdev->msix_offset;
 		x_end = vdev->msix_offset + vdev->msix_size;
 	}
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 330d50582f40..5292c4a5ae8f 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -169,6 +169,21 @@ void vfio_iommu_group_put(struct iommu_group *group, struct device *dev)
 }
 EXPORT_SYMBOL_GPL(vfio_iommu_group_put);
 
+bool vfio_iommu_group_is_capable(struct device *dev, unsigned long cap)
+{
+	bool ret = false;
+	struct iommu_group *group = vfio_iommu_group_get(dev);
+
+	if (group) {
+		ret = iommu_group_is_capable(group, cap);
+
+		vfio_iommu_group_put(group, dev);
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_iommu_group_is_capable);
+
 #ifdef CONFIG_VFIO_NOIOMMU
 static void *vfio_noiommu_open(unsigned long arg)
 {
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [RFC PATCH v5 5/5] vfio-pci: Allow to expose MSI-X table to userspace when safe
@ 2017-08-07  7:25   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 82+ messages in thread
From: Alexey Kardashevskiy @ 2017-08-07  7:25 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, kvm-ppc, kvm, Yongji Xie,
	Eric Auger, Kyle Mahlkuch, Alex Williamson, Jike Song,
	Bjorn Helgaas, Robin Murphy, Joerg Roedel, Arvind Yadav,
	Benjamin Herrenschmidt, David Woodhouse, Kirti Wankhede,
	Mauricio Faria de Oliveira, Neo Jia, Paul Mackerras,
	Vlad Tsyrklevich

Some devices have a MSIX BAR not aligned to the system page size
greater than 4K (like 64k for ppc64) which at the moment prevents
such MMIO pages from being mapped to the userspace for the sake of
the MSIX BAR content protection. If such page happens to share
the same system page with some frequently accessed registers,
the entire system page will be emulated which can seriously affect
performance.

This allows mapping of MSI-X tables to userspace if hardware provides
MSIX isolation via interrupt remapping or filtering; in other words
allowing direct access to the MSIX BAR won't do any harm to other devices
or cause spurious interrupts visible to the kernel.

This adds a wrapping helper to check if a capability is supported by
an IOMMU group.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 include/linux/vfio.h             |  1 +
 drivers/vfio/pci/vfio_pci.c      | 20 +++++++++++++++++---
 drivers/vfio/pci/vfio_pci_rdwr.c |  5 ++++-
 drivers/vfio/vfio.c              | 15 +++++++++++++++
 4 files changed, 37 insertions(+), 4 deletions(-)

diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 586809abb273..7110bca2fb60 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -46,6 +46,7 @@ struct vfio_device_ops {
 
 extern struct iommu_group *vfio_iommu_group_get(struct device *dev);
 extern void vfio_iommu_group_put(struct iommu_group *group, struct device *dev);
+extern bool vfio_iommu_group_is_capable(struct device *dev, unsigned long cap);
 
 extern int vfio_add_group_dev(struct device *dev,
 			      const struct vfio_device_ops *ops,
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index d87a0a3cda14..c4c39ed64b1e 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -561,11 +561,17 @@ static int msix_sparse_mmap_cap(struct vfio_pci_device *vdev,
 	struct vfio_region_info_cap_sparse_mmap *sparse;
 	size_t end, size;
 	int nr_areas = 2, i = 0, ret;
+	bool is_msix_isolated = vfio_iommu_group_is_capable(&vdev->pdev->dev,
+			IOMMU_GROUP_CAP_ISOLATE_MSIX);
 
 	end = pci_resource_len(vdev->pdev, vdev->msix_bar);
 
-	/* If MSI-X table is aligned to the start or end, only one area */
-	if (((vdev->msix_offset & PAGE_MASK) == 0) ||
+	/*
+	 * If MSI-X table is allowed to mmap because of the capability
+	 * of IRQ remapping or aligned to the start or end, only one area
+	 */
+	if (is_msix_isolated ||
+	    ((vdev->msix_offset & PAGE_MASK) == 0) ||
 	    (PAGE_ALIGN(vdev->msix_offset + vdev->msix_size) >= end))
 		nr_areas = 1;
 
@@ -577,6 +583,12 @@ static int msix_sparse_mmap_cap(struct vfio_pci_device *vdev,
 
 	sparse->nr_areas = nr_areas;
 
+	if (is_msix_isolated) {
+		sparse->areas[i].offset = 0;
+		sparse->areas[i].size = end;
+		return 0;
+	}
+
 	if (vdev->msix_offset & PAGE_MASK) {
 		sparse->areas[i].offset = 0;
 		sparse->areas[i].size = vdev->msix_offset & PAGE_MASK;
@@ -1094,6 +1106,8 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
 	unsigned int index;
 	u64 phys_len, req_len, pgoff, req_start;
 	int ret;
+	bool is_msix_isolated = vfio_iommu_group_is_capable(&vdev->pdev->dev,
+			IOMMU_GROUP_CAP_ISOLATE_MSIX);
 
 	index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
 
@@ -1115,7 +1129,7 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
 	if (req_start + req_len > phys_len)
 		return -EINVAL;
 
-	if (index == vdev->msix_bar) {
+	if (index == vdev->msix_bar && !is_msix_isolated) {
 		/*
 		 * Disallow mmaps overlapping the MSI-X table; users don't
 		 * get to touch this directly.  We could find somewhere
diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
index 357243d76f10..7514206a5ea7 100644
--- a/drivers/vfio/pci/vfio_pci_rdwr.c
+++ b/drivers/vfio/pci/vfio_pci_rdwr.c
@@ -18,6 +18,7 @@
 #include <linux/uaccess.h>
 #include <linux/io.h>
 #include <linux/vgaarb.h>
+#include <linux/vfio.h>
 
 #include "vfio_pci_private.h"
 
@@ -123,6 +124,8 @@ ssize_t vfio_pci_bar_rw(struct vfio_pci_device *vdev, char __user *buf,
 	resource_size_t end;
 	void __iomem *io;
 	ssize_t done;
+	bool is_msix_isolated = vfio_iommu_group_is_capable(&vdev->pdev->dev,
+			IOMMU_GROUP_CAP_ISOLATE_MSIX);
 
 	if (pci_resource_start(pdev, bar))
 		end = pci_resource_len(pdev, bar);
@@ -164,7 +167,7 @@ ssize_t vfio_pci_bar_rw(struct vfio_pci_device *vdev, char __user *buf,
 	} else
 		io = vdev->barmap[bar];
 
-	if (bar == vdev->msix_bar) {
+	if (bar == vdev->msix_bar && !is_msix_isolated) {
 		x_start = vdev->msix_offset;
 		x_end = vdev->msix_offset + vdev->msix_size;
 	}
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 330d50582f40..5292c4a5ae8f 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -169,6 +169,21 @@ void vfio_iommu_group_put(struct iommu_group *group, struct device *dev)
 }
 EXPORT_SYMBOL_GPL(vfio_iommu_group_put);
 
+bool vfio_iommu_group_is_capable(struct device *dev, unsigned long cap)
+{
+	bool ret = false;
+	struct iommu_group *group = vfio_iommu_group_get(dev);
+
+	if (group) {
+		ret = iommu_group_is_capable(group, cap);
+
+		vfio_iommu_group_put(group, dev);
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_iommu_group_is_capable);
+
 #ifdef CONFIG_VFIO_NOIOMMU
 static void *vfio_noiommu_open(unsigned long arg)
 {
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v5 1/5] iommu: Add capabilities to a group
@ 2017-08-09  5:55     ` David Gibson
  0 siblings, 0 replies; 82+ messages in thread
From: David Gibson @ 2017-08-09  5:55 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, kvm-ppc, kvm, Yongji Xie, Eric Auger,
	Kyle Mahlkuch, Alex Williamson, Jike Song, Bjorn Helgaas,
	Robin Murphy, Joerg Roedel, Arvind Yadav, Benjamin Herrenschmidt,
	David Woodhouse, Kirti Wankhede, Mauricio Faria de Oliveira,
	Neo Jia, Paul Mackerras, Vlad Tsyrklevich, iommu, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 4138 bytes --]

On Mon, Aug 07, 2017 at 05:25:44PM +1000, Alexey Kardashevskiy wrote:
> This introduces capabilities to IOMMU groups. The first defined
> capability is IOMMU_GROUP_CAP_ISOLATE_MSIX which tells the IOMMU
> group users that a particular IOMMU group is capable of MSIX message
> filtering; this is useful when deciding whether or not to allow mapping
> of MSIX table to the userspace. Various architectures will enable it
> when they decide that it is safe to do so.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

This seems like a reasonable concept that's probably useful for
something, whether or not it's the best approach for the problem at
hand.

> ---
>  include/linux/iommu.h | 20 ++++++++++++++++++++
>  drivers/iommu/iommu.c | 28 ++++++++++++++++++++++++++++
>  2 files changed, 48 insertions(+)
> 
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 2cb54adc4a33..6b6f3c2f4904 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -155,6 +155,9 @@ struct iommu_resv_region {
>  	enum iommu_resv_type	type;
>  };
>  
> +/* IOMMU group capabilities */
> +#define IOMMU_GROUP_CAP_ISOLATE_MSIX		(1U)
> +
>  #ifdef CONFIG_IOMMU_API
>  
>  /**
> @@ -312,6 +315,11 @@ extern void *iommu_group_get_iommudata(struct iommu_group *group);
>  extern void iommu_group_set_iommudata(struct iommu_group *group,
>  				      void *iommu_data,
>  				      void (*release)(void *iommu_data));
> +extern void iommu_group_set_caps(struct iommu_group *group,
> +				 unsigned long clearcaps,
> +				 unsigned long setcaps);
> +extern bool iommu_group_is_capable(struct iommu_group *group,
> +				   unsigned long cap);
>  extern int iommu_group_set_name(struct iommu_group *group, const char *name);
>  extern int iommu_group_add_device(struct iommu_group *group,
>  				  struct device *dev);
> @@ -513,6 +521,18 @@ static inline void iommu_group_set_iommudata(struct iommu_group *group,
>  {
>  }
>  
> +static inline void iommu_group_set_caps(struct iommu_group *group,
> +					unsigned long clearcaps,
> +					unsigned long setcaps)
> +{
> +}
> +
> +static inline bool iommu_group_is_capable(struct iommu_group *group,
> +					  unsigned long cap)
> +{
> +	return false;
> +}
> +
>  static inline int iommu_group_set_name(struct iommu_group *group,
>  				       const char *name)
>  {
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 3f6ea160afed..6b2c34fe2c3d 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -52,6 +52,7 @@ struct iommu_group {
>  	void (*iommu_data_release)(void *iommu_data);
>  	char *name;
>  	int id;
> +	unsigned long caps;
>  	struct iommu_domain *default_domain;
>  	struct iommu_domain *domain;
>  };
> @@ -447,6 +448,33 @@ void iommu_group_set_iommudata(struct iommu_group *group, void *iommu_data,
>  EXPORT_SYMBOL_GPL(iommu_group_set_iommudata);
>  
>  /**
> + * iommu_group_set_caps - Change the group capabilities
> + * @group: the group
> + * @clearcaps: capabilities mask to remove
> + * @setcaps: capabilities mask to add
> + *
> + * IOMMU groups can be capable of various features which device drivers
> + * may read and adjust the behavior.
> + */
> +void iommu_group_set_caps(struct iommu_group *group,
> +		unsigned long clearcaps, unsigned long setcaps)
> +{
> +	group->caps &= ~clearcaps;
> +	group->caps |= setcaps;
> +}
> +EXPORT_SYMBOL_GPL(iommu_group_set_caps);
> +
> +/**
> + * iommu_group_is_capable - Returns if a group capability is present
> + * @group: the group
> + */
> +bool iommu_group_is_capable(struct iommu_group *group, unsigned long cap)
> +{
> +	return !!(group->caps & cap);
> +}
> +EXPORT_SYMBOL_GPL(iommu_group_is_capable);
> +
> +/**
>   * iommu_group_set_name - set name for a group
>   * @group: the group
>   * @name: name

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v5 5/5] vfio-pci: Allow to expose MSI-X table to userspace when safe
@ 2017-08-09  6:59     ` David Gibson
  0 siblings, 0 replies; 82+ messages in thread
From: David Gibson @ 2017-08-09  6:59 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, kvm-ppc, kvm, Yongji Xie, Eric Auger,
	Kyle Mahlkuch, Alex Williamson, Jike Song, Bjorn Helgaas,
	Robin Murphy, Joerg Roedel, Arvind Yadav, Benjamin Herrenschmidt,
	David Woodhouse, Kirti Wankhede, Mauricio Faria de Oliveira,
	Neo Jia, Paul Mackerras, Vlad Tsyrklevich, iommu, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 6228 bytes --]

On Mon, Aug 07, 2017 at 05:25:48PM +1000, Alexey Kardashevskiy wrote:
> Some devices have a MSIX BAR not aligned to the system page size
> greater than 4K (like 64k for ppc64) which at the moment prevents
> such MMIO pages from being mapped to the userspace for the sake of
> the MSIX BAR content protection. If such page happens to share
> the same system page with some frequently accessed registers,
> the entire system page will be emulated which can seriously affect
> performance.
> 
> This allows mapping of MSI-X tables to userspace if hardware provides
> MSIX isolation via interrupt remapping or filtering; in other words
> allowing direct access to the MSIX BAR won't do any harm to other devices
> or cause spurious interrupts visible to the kernel.
> 
> This adds a wrapping helper to check if a capability is supported by
> an IOMMU group.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  include/linux/vfio.h             |  1 +
>  drivers/vfio/pci/vfio_pci.c      | 20 +++++++++++++++++---
>  drivers/vfio/pci/vfio_pci_rdwr.c |  5 ++++-
>  drivers/vfio/vfio.c              | 15 +++++++++++++++
>  4 files changed, 37 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 586809abb273..7110bca2fb60 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -46,6 +46,7 @@ struct vfio_device_ops {
>  
>  extern struct iommu_group *vfio_iommu_group_get(struct device *dev);
>  extern void vfio_iommu_group_put(struct iommu_group *group, struct device *dev);
> +extern bool vfio_iommu_group_is_capable(struct device *dev, unsigned long cap);

This diff probably belongs in the earlier patch adding the function,
rather than here where it's first used.  Not worth respinning just for
that, though.

>  extern int vfio_add_group_dev(struct device *dev,
>  			      const struct vfio_device_ops *ops,
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index d87a0a3cda14..c4c39ed64b1e 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -561,11 +561,17 @@ static int msix_sparse_mmap_cap(struct vfio_pci_device *vdev,
>  	struct vfio_region_info_cap_sparse_mmap *sparse;
>  	size_t end, size;
>  	int nr_areas = 2, i = 0, ret;
> +	bool is_msix_isolated = vfio_iommu_group_is_capable(&vdev->pdev->dev,
> +			IOMMU_GROUP_CAP_ISOLATE_MSIX);
>  
>  	end = pci_resource_len(vdev->pdev, vdev->msix_bar);
>  
> -	/* If MSI-X table is aligned to the start or end, only one area */
> -	if (((vdev->msix_offset & PAGE_MASK) == 0) ||
> +	/*
> +	 * If MSI-X table is allowed to mmap because of the capability
> +	 * of IRQ remapping or aligned to the start or end, only one area
> +	 */
> +	if (is_msix_isolated ||
> +	    ((vdev->msix_offset & PAGE_MASK) == 0) ||
>  	    (PAGE_ALIGN(vdev->msix_offset + vdev->msix_size) >= end))
>  		nr_areas = 1;
>  
> @@ -577,6 +583,12 @@ static int msix_sparse_mmap_cap(struct vfio_pci_device *vdev,
>  
>  	sparse->nr_areas = nr_areas;
>  
> +	if (is_msix_isolated) {
> +		sparse->areas[i].offset = 0;
> +		sparse->areas[i].size = end;
> +		return 0;
> +	}
> +
>  	if (vdev->msix_offset & PAGE_MASK) {
>  		sparse->areas[i].offset = 0;
>  		sparse->areas[i].size = vdev->msix_offset & PAGE_MASK;
> @@ -1094,6 +1106,8 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
>  	unsigned int index;
>  	u64 phys_len, req_len, pgoff, req_start;
>  	int ret;
> +	bool is_msix_isolated = vfio_iommu_group_is_capable(&vdev->pdev->dev,
> +			IOMMU_GROUP_CAP_ISOLATE_MSIX);
>  
>  	index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
>  
> @@ -1115,7 +1129,7 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
>  	if (req_start + req_len > phys_len)
>  		return -EINVAL;
>  
> -	if (index == vdev->msix_bar) {
> +	if (index == vdev->msix_bar && !is_msix_isolated) {
>  		/*
>  		 * Disallow mmaps overlapping the MSI-X table; users don't
>  		 * get to touch this directly.  We could find somewhere
> diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
> index 357243d76f10..7514206a5ea7 100644
> --- a/drivers/vfio/pci/vfio_pci_rdwr.c
> +++ b/drivers/vfio/pci/vfio_pci_rdwr.c
> @@ -18,6 +18,7 @@
>  #include <linux/uaccess.h>
>  #include <linux/io.h>
>  #include <linux/vgaarb.h>
> +#include <linux/vfio.h>
>  
>  #include "vfio_pci_private.h"
>  
> @@ -123,6 +124,8 @@ ssize_t vfio_pci_bar_rw(struct vfio_pci_device *vdev, char __user *buf,
>  	resource_size_t end;
>  	void __iomem *io;
>  	ssize_t done;
> +	bool is_msix_isolated = vfio_iommu_group_is_capable(&vdev->pdev->dev,
> +			IOMMU_GROUP_CAP_ISOLATE_MSIX);
>  
>  	if (pci_resource_start(pdev, bar))
>  		end = pci_resource_len(pdev, bar);
> @@ -164,7 +167,7 @@ ssize_t vfio_pci_bar_rw(struct vfio_pci_device *vdev, char __user *buf,
>  	} else
>  		io = vdev->barmap[bar];
>  
> -	if (bar == vdev->msix_bar) {
> +	if (bar == vdev->msix_bar && !is_msix_isolated) {
>  		x_start = vdev->msix_offset;
>  		x_end = vdev->msix_offset + vdev->msix_size;
>  	}
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 330d50582f40..5292c4a5ae8f 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -169,6 +169,21 @@ void vfio_iommu_group_put(struct iommu_group *group, struct device *dev)
>  }
>  EXPORT_SYMBOL_GPL(vfio_iommu_group_put);
>  
> +bool vfio_iommu_group_is_capable(struct device *dev, unsigned long cap)
> +{
> +	bool ret = false;
> +	struct iommu_group *group = vfio_iommu_group_get(dev);
> +
> +	if (group) {
> +		ret = iommu_group_is_capable(group, cap);
> +
> +		vfio_iommu_group_put(group, dev);
> +	}
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(vfio_iommu_group_is_capable);
> +
>  #ifdef CONFIG_VFIO_NOIOMMU
>  static void *vfio_noiommu_open(unsigned long arg)
>  {

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 82+ messages in thread

> diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
> index 357243d76f10..7514206a5ea7 100644
> --- a/drivers/vfio/pci/vfio_pci_rdwr.c
> +++ b/drivers/vfio/pci/vfio_pci_rdwr.c
> @@ -18,6 +18,7 @@
>  #include <linux/uaccess.h>
>  #include <linux/io.h>
>  #include <linux/vgaarb.h>
> +#include <linux/vfio.h>
>  
>  #include "vfio_pci_private.h"
>  
> @@ -123,6 +124,8 @@ ssize_t vfio_pci_bar_rw(struct vfio_pci_device *vdev, char __user *buf,
>  	resource_size_t end;
>  	void __iomem *io;
>  	ssize_t done;
> +	bool is_msix_isolated = vfio_iommu_group_is_capable(&vdev->pdev->dev,
> +			IOMMU_GROUP_CAP_ISOLATE_MSIX);
>  
>  	if (pci_resource_start(pdev, bar))
>  		end = pci_resource_len(pdev, bar);
> @@ -164,7 +167,7 @@ ssize_t vfio_pci_bar_rw(struct vfio_pci_device *vdev, char __user *buf,
>  	} else
>  		io = vdev->barmap[bar];
>  
> -	if (bar == vdev->msix_bar) {
> +	if (bar == vdev->msix_bar && !is_msix_isolated) {
>  		x_start = vdev->msix_offset;
>  		x_end = vdev->msix_offset + vdev->msix_size;
>  	}
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 330d50582f40..5292c4a5ae8f 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -169,6 +169,21 @@ void vfio_iommu_group_put(struct iommu_group *group, struct device *dev)
>  }
>  EXPORT_SYMBOL_GPL(vfio_iommu_group_put);
>  
> +bool vfio_iommu_group_is_capable(struct device *dev, unsigned long cap)
> +{
> +	bool ret = false;
> +	struct iommu_group *group = vfio_iommu_group_get(dev);
> +
> +	if (group) {
> +		ret = iommu_group_is_capable(group, cap);
> +
> +		vfio_iommu_group_put(group, dev);
> +	}
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(vfio_iommu_group_is_capable);
> +
>  #ifdef CONFIG_VFIO_NOIOMMU
>  static void *vfio_noiommu_open(unsigned long arg)
>  {

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson




^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table
@ 2017-08-14  9:45   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 82+ messages in thread
From: Alexey Kardashevskiy @ 2017-08-14  9:45 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: David Gibson, kvm-ppc, kvm, Yongji Xie, Eric Auger,
	Kyle Mahlkuch, Alex Williamson, Jike Song, Bjorn Helgaas,
	Robin Murphy, Joerg Roedel, Arvind Yadav, Benjamin Herrenschmidt,
	David Woodhouse, Kirti Wankhede, Mauricio Faria de Oliveira,
	Neo Jia, Paul Mackerras, Vlad Tsyrklevich, iommu, linux-kernel

Folks,

Is there anything to change besides those compiler errors and David's
comment in 5/5? Or is the whole patchset too bad? Thanks.



On 07/08/17 17:25, Alexey Kardashevskiy wrote:
> This is a followup for "[PATCH kernel v4 0/6] vfio-pci: Add support for mmapping MSI-X table"
> http://www.spinics.net/lists/kvm/msg152232.html
> 
> This time it is using "caps" in IOMMU groups. The main question is if PCI
> bus flags or IOMMU domains are still better (and which one).

> 
> 
> 
> Here is some background:
> 
> Current vfio-pci implementation disallows to mmap the page
> containing MSI-X table in case that users can write directly
> to MSI-X table and generate an incorrect MSIs.
> 
> However, this will cause some performance issue when there
> are some critical device registers in the same page as the
> MSI-X table. We have to handle the mmio access to these
> registers in QEMU emulation rather than in guest.
> 
> To solve this issue, this series allows to expose MSI-X table
> to userspace when hardware enables the capability of interrupt
> remapping which can ensure that a given PCI device can only
> shoot the MSIs assigned for it. And we introduce a new bus_flags
> PCI_BUS_FLAGS_MSI_REMAP to test this capability on PCI side
> for different archs.
> 
> 
> This is based on sha1
> 26c5cebfdb6c "Merge branch 'parisc-4.13-4' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux"
> 
> Please comment. Thanks.
> 
> Changelog:
> 
> v5:
> * redid the whole thing via so-called IOMMU group capabilities
> 
> v4:
> * rebased on recent upstream
> * got all 6 patches from v2 (v3 was missing some)
> 
> 
> 
> 
> Alexey Kardashevskiy (5):
>   iommu: Add capabilities to a group
>   iommu: Set IOMMU_GROUP_CAP_ISOLATE_MSIX if MSI controller enables IRQ
>     remapping
>   iommu/intel/amd: Set IOMMU_GROUP_CAP_ISOLATE_MSIX if IRQ remapping is
>     enabled
>   powerpc/iommu: Set IOMMU_GROUP_CAP_ISOLATE_MSIX
>   vfio-pci: Allow to expose MSI-X table to userspace when safe
> 
>  include/linux/iommu.h            | 20 ++++++++++++++++++++
>  include/linux/vfio.h             |  1 +
>  arch/powerpc/kernel/iommu.c      |  1 +
>  drivers/iommu/amd_iommu.c        |  3 +++
>  drivers/iommu/intel-iommu.c      |  3 +++
>  drivers/iommu/iommu.c            | 35 +++++++++++++++++++++++++++++++++++
>  drivers/vfio/pci/vfio_pci.c      | 20 +++++++++++++++++---
>  drivers/vfio/pci/vfio_pci_rdwr.c |  5 ++++-
>  drivers/vfio/vfio.c              | 15 +++++++++++++++
>  9 files changed, 99 insertions(+), 4 deletions(-)
> 


-- 
Alexey

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table
  2017-08-14  9:45   ` Alexey Kardashevskiy
@ 2017-08-14 13:12     ` Robin Murphy
  -1 siblings, 0 replies; 82+ messages in thread
From: Robin Murphy @ 2017-08-14 13:12 UTC (permalink / raw)
  To: Alexey Kardashevskiy, linuxppc-dev
  Cc: David Gibson, kvm-ppc, kvm, Yongji Xie, Eric Auger,
	Kyle Mahlkuch, Alex Williamson, Jike Song, Bjorn Helgaas,
	Joerg Roedel, Arvind Yadav, Benjamin Herrenschmidt,
	David Woodhouse, Kirti Wankhede, Mauricio Faria de Oliveira,
	Neo Jia, Paul Mackerras, Vlad Tsyrklevich, iommu, linux-kernel

On 14/08/17 10:45, Alexey Kardashevskiy wrote:
> Folks,
> 
> Is there anything to change besides those compiler errors and David's
> comment in 5/5? Or the while patchset is too bad? Thanks.

While I now understand it's not the low-level thing I first thought it
was, so my reasoning has changed, personally I don't like this approach
any more than the previous one - it still smells of abusing external
APIs to pass information from one part of VFIO to another (and it has
the same conceptual problem of attributing something to interrupt
sources that is actually a property of the interrupt target).

Taking a step back, though, why does vfio-pci perform this check in the
first place? If a malicious guest already has control of a device, any
kind of interrupt spoofing it could do by fiddling with the MSI-X
message address/data it could simply do with a DMA write anyway, so the
security argument doesn't stand up in general (sure, not all PCIe
devices may be capable of arbitrary DMA, but that seems like more of a
tenuous security-by-obscurity angle to me). Besides, with Type1 IOMMU
the fact that we've let a device be assigned at all means that this is
already a non-issue (because either the hardware provides isolation or
the user has explicitly accepted the consequences of an unsafe
configuration) - from patch #4 that's apparently the same for SPAPR TCE,
in which case it seems this flag doesn't even need to be propagated and
could simply be assumed always.

On the other hand, if the check is not so much to mitigate malicious
guests attacking the system as to prevent dumb guests breaking
themselves (e.g. if some or all of the MSI-X capability is actually
emulated), then allowing things to sometimes go wrong on the grounds of
an irrelevant hardware feature doesn't seem correct :/

Robin.

> On 07/08/17 17:25, Alexey Kardashevskiy wrote:
>> This is a followup for "[PATCH kernel v4 0/6] vfio-pci: Add support for mmapping MSI-X table"
>> http://www.spinics.net/lists/kvm/msg152232.html
>>
>> This time it is using "caps" in IOMMU groups. The main question is if PCI
>> bus flags or IOMMU domains are still better (and which one).
> 
>>
>>
>>
>> Here is some background:
>>
>> Current vfio-pci implementation disallows to mmap the page
>> containing MSI-X table in case that users can write directly
>> to MSI-X table and generate an incorrect MSIs.
>>
>> However, this will cause some performance issue when there
>> are some critical device registers in the same page as the
>> MSI-X table. We have to handle the mmio access to these
>> registers in QEMU emulation rather than in guest.
>>
>> To solve this issue, this series allows to expose MSI-X table
>> to userspace when hardware enables the capability of interrupt
>> remapping which can ensure that a given PCI device can only
>> shoot the MSIs assigned for it. And we introduce a new bus_flags
>> PCI_BUS_FLAGS_MSI_REMAP to test this capability on PCI side
>> for different archs.
>>
>>
>> This is based on sha1
>> 26c5cebfdb6c "Merge branch 'parisc-4.13-4' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux"
>>
>> Please comment. Thanks.
>>
>> Changelog:
>>
>> v5:
>> * redid the whole thing via so-called IOMMU group capabilities
>>
>> v4:
>> * rebased on recent upstream
>> * got all 6 patches from v2 (v3 was missing some)
>>
>>
>>
>>
>> Alexey Kardashevskiy (5):
>>   iommu: Add capabilities to a group
>>   iommu: Set IOMMU_GROUP_CAP_ISOLATE_MSIX if MSI controller enables IRQ
>>     remapping
>>   iommu/intel/amd: Set IOMMU_GROUP_CAP_ISOLATE_MSIX if IRQ remapping is
>>     enabled
>>   powerpc/iommu: Set IOMMU_GROUP_CAP_ISOLATE_MSIX
>>   vfio-pci: Allow to expose MSI-X table to userspace when safe
>>
>>  include/linux/iommu.h            | 20 ++++++++++++++++++++
>>  include/linux/vfio.h             |  1 +
>>  arch/powerpc/kernel/iommu.c      |  1 +
>>  drivers/iommu/amd_iommu.c        |  3 +++
>>  drivers/iommu/intel-iommu.c      |  3 +++
>>  drivers/iommu/iommu.c            | 35 +++++++++++++++++++++++++++++++++++
>>  drivers/vfio/pci/vfio_pci.c      | 20 +++++++++++++++++---
>>  drivers/vfio/pci/vfio_pci_rdwr.c |  5 ++++-
>>  drivers/vfio/vfio.c              | 15 +++++++++++++++
>>  9 files changed, 99 insertions(+), 4 deletions(-)
>>
> 
> 

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table
@ 2017-08-15  1:16       ` Jike Song
  0 siblings, 0 replies; 82+ messages in thread
From: Jike Song @ 2017-08-15  1:16 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Alexey Kardashevskiy, linuxppc-dev, David Gibson, kvm-ppc, kvm,
	Yongji Xie, Eric Auger, Kyle Mahlkuch, Alex Williamson,
	Bjorn Helgaas, Joerg Roedel, Arvind Yadav,
	Benjamin Herrenschmidt, David Woodhouse, Kirti Wankhede,
	Mauricio Faria de Oliveira, Neo Jia, Paul Mackerras,
	Vlad Tsyrklevich, iommu, linux-kernel

On 08/14/2017 09:12 PM, Robin Murphy wrote:
> On 14/08/17 10:45, Alexey Kardashevskiy wrote:
>> Folks,
>>
>> Is there anything to change besides those compiler errors and David's
>> comment in 5/5? Or the while patchset is too bad? Thanks.
> 
> While I now understand it's not the low-level thing I first thought it
> was, so my reasoning has changed, personally I don't like this approach
> any more than the previous one - it still smells of abusing external
> APIs to pass information from one part of VFIO to another (and it has
> the same conceptual problem of attributing something to interrupt
> sources that is actually a property of the interrupt target).
> 
> Taking a step back, though, why does vfio-pci perform this check in the
> first place? If a malicious guest already has control of a device, any
> kind of interrupt spoofing it could do by fiddling with the MSI-X
> message address/data it could simply do with a DMA write anyway, so the
> security argument doesn't stand up in general (sure, not all PCIe
> devices may be capable of arbitrary DMA, but that seems like more of a
> tenuous security-by-obscurity angle to me).

Hi Robin,

DMA writes will be translated (thereby censored) by DMA Remapping hardware,
while MSI/MSI-X will not. Is this different for non-x86?

--
Thanks,
Jike

> Besides, with Type1 IOMMU
> the fact that we've let a device be assigned at all means that this is
> already a non-issue (because either the hardware provides isolation or
> the user has explicitly accepted the consequences of an unsafe
> configuration) - from patch #4 that's apparently the same for SPAPR TCE,
> in which case it seems this flag doesn't even need to be propagated and
> could simply be assumed always.
> 
> On the other hand, if the check is not so much to mitigate malicious
> guests attacking the system as to prevent dumb guests breaking
> themselves (e.g. if some or all of the MSI-X capability is actually
> emulated), then allowing things to sometimes go wrong on the grounds of
> an irrelevant hardware feature doesn't seem correct :/
> 
> Robin.
> 
>> On 07/08/17 17:25, Alexey Kardashevskiy wrote:
>>> This is a followup for "[PATCH kernel v4 0/6] vfio-pci: Add support for mmapping MSI-X table"
>>> http://www.spinics.net/lists/kvm/msg152232.html
>>>
>>> This time it is using "caps" in IOMMU groups. The main question is if PCI
>>> bus flags or IOMMU domains are still better (and which one).
>>
>>>
>>>
>>>
>>> Here is some background:
>>>
>>> The current vfio-pci implementation disallows mmapping the page
>>> containing the MSI-X table, because users could write directly
>>> to the MSI-X table and generate incorrect MSIs.
>>>
>>> However, this causes a performance issue when performance-critical
>>> device registers sit in the same page as the MSI-X table: MMIO
>>> accesses to those registers must be handled in QEMU emulation
>>> rather than in the guest.
>>>
>>> To solve this, the series exposes the MSI-X table to userspace
>>> when the hardware provides interrupt remapping, which ensures
>>> that a given PCI device can only fire the MSIs assigned to it.
>>> A new bus flag, PCI_BUS_FLAGS_MSI_REMAP, is introduced to test
>>> this capability on the PCI side across different archs.
>>>
>>>
>>> This is based on sha1
>>> 26c5cebfdb6c "Merge branch 'parisc-4.13-4' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux"
>>>
>>> Please comment. Thanks.
>>>
>>> Changelog:
>>>
>>> v5:
>>> * redid the whole thing via so-called IOMMU group capabilities
>>>
>>> v4:
>>> * rebased on recent upstream
>>> * got all 6 patches from v2 (v3 was missing some)
>>>
>>>
>>>
>>>
>>> Alexey Kardashevskiy (5):
>>>   iommu: Add capabilities to a group
>>>   iommu: Set IOMMU_GROUP_CAP_ISOLATE_MSIX if MSI controller enables IRQ
>>>     remapping
>>>   iommu/intel/amd: Set IOMMU_GROUP_CAP_ISOLATE_MSIX if IRQ remapping is
>>>     enabled
>>>   powerpc/iommu: Set IOMMU_GROUP_CAP_ISOLATE_MSIX
>>>   vfio-pci: Allow to expose MSI-X table to userspace when safe
>>>
>>>  include/linux/iommu.h            | 20 ++++++++++++++++++++
>>>  include/linux/vfio.h             |  1 +
>>>  arch/powerpc/kernel/iommu.c      |  1 +
>>>  drivers/iommu/amd_iommu.c        |  3 +++
>>>  drivers/iommu/intel-iommu.c      |  3 +++
>>>  drivers/iommu/iommu.c            | 35 +++++++++++++++++++++++++++++++++++
>>>  drivers/vfio/pci/vfio_pci.c      | 20 +++++++++++++++++---
>>>  drivers/vfio/pci/vfio_pci_rdwr.c |  5 ++++-
>>>  drivers/vfio/vfio.c              | 15 +++++++++++++++
>>>  9 files changed, 99 insertions(+), 4 deletions(-)
>>>
>>
>>
> 

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table
@ 2017-08-15  1:33         ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 82+ messages in thread
From: Benjamin Herrenschmidt @ 2017-08-15  1:33 UTC (permalink / raw)
  To: Jike Song, Robin Murphy
  Cc: Alexey Kardashevskiy, linuxppc-dev, David Gibson, kvm-ppc, kvm,
	Yongji Xie, Eric Auger, Kyle Mahlkuch, Alex Williamson,
	Bjorn Helgaas, Joerg Roedel, Arvind Yadav, David Woodhouse,
	Kirti Wankhede, Mauricio Faria de Oliveira, Neo Jia,
	Paul Mackerras, Vlad Tsyrklevich, iommu, linux-kernel

On Tue, 2017-08-15 at 09:16 +0800, Jike Song wrote:
> > Taking a step back, though, why does vfio-pci perform this check in the
> > first place? If a malicious guest already has control of a device, any
> > kind of interrupt spoofing it could do by fiddling with the MSI-X
> > message address/data it could simply do with a DMA write anyway, so the
> > security argument doesn't stand up in general (sure, not all PCIe
> > devices may be capable of arbitrary DMA, but that seems like more of a
> > tenuous security-by-obscurity angle to me).

I tried to make that point for years, thanks for re-iterating it :-)

> Hi Robin,
> 
> DMA writes will be translated (thereby censored) by DMA Remapping hardware,
> while MSI/MSI-X will not. Is this different for non-x86?

There is no way your DMA remapping HW can differentiate. The only
difference between a DMA write and an MSI is ... the address. So if I
can make my device DMA to the MSI address range, I've defeated your
security.

The table obfuscation in qemu is only useful as an insecure way of
"making things sort-of-work" for HW that doesn't have proper remapping
or filtering.

On pseries we don't have that problem because:

 1) Our hypervisor (which is qemu) provides the DMA address for MSIs/X
so there is no need for "magic remapping" to give the guest a value
that works.

 2) Our HW (configured by VFIO/KVM) filters which device can DMA to
what address (including which MSIs/X) thus even if the guest doesn't
use the address passed and messes around, it can only shoot itself in
the foot.

So all we need is a way to tell qemu to stop doing that filtering on
our platform. This is *one bit* of information, it's taken 3 years of
arguments and we still don't have a solution. In the meantime,
workloads *are* being hurt by significant performance degradation due
to the MSI-X table sharing a 64K page (our page size) with other MMIOs.

Yay !

Ben.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table
@ 2017-08-15  1:47           ` Jike Song
  0 siblings, 0 replies; 82+ messages in thread
From: Jike Song @ 2017-08-15  1:47 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Robin Murphy, Alexey Kardashevskiy, linuxppc-dev, David Gibson,
	kvm-ppc, kvm, Yongji Xie, Eric Auger, Kyle Mahlkuch,
	Alex Williamson, Bjorn Helgaas, Joerg Roedel, Arvind Yadav,
	David Woodhouse, Kirti Wankhede, Mauricio Faria de Oliveira,
	Neo Jia, Paul Mackerras, Vlad Tsyrklevich, iommu, linux-kernel

On 08/15/2017 09:33 AM, Benjamin Herrenschmidt wrote:
> On Tue, 2017-08-15 at 09:16 +0800, Jike Song wrote:
>>> Taking a step back, though, why does vfio-pci perform this check in the
>>> first place? If a malicious guest already has control of a device, any
>>> kind of interrupt spoofing it could do by fiddling with the MSI-X
>>> message address/data it could simply do with a DMA write anyway, so the
>>> security argument doesn't stand up in general (sure, not all PCIe
>>> devices may be capable of arbitrary DMA, but that seems like more of a
>>> tenuous security-by-obscurity angle to me).
> 
> I tried to make that point for years, thanks for re-iterating it :-)
> 
>> Hi Robin,
>>
>> DMA writes will be translated (thereby censored) by DMA Remapping hardware,
>> while MSI/MSI-X will not. Is this different for non-x86?
> 
> There is no way your DMA remapping HW can differenciate. The only
> difference between a DMA write and an MSI is ... the address. So if I
> can make my device DMA to the MSI address range, I've defeated your
> security.

I don't think that, with IRQ remapping enabled, you can make your device
DMA to an MSI address without it being treated as an IRQ and remapped.
If you could, the IRQ remapping hardware would simply be broken :)

--
Thanks,
Jike

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table
@ 2017-08-15  5:38             ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 82+ messages in thread
From: Benjamin Herrenschmidt @ 2017-08-15  5:38 UTC (permalink / raw)
  To: Jike Song
  Cc: Robin Murphy, Alexey Kardashevskiy, linuxppc-dev, David Gibson,
	kvm-ppc, kvm, Yongji Xie, Eric Auger, Kyle Mahlkuch,
	Alex Williamson, Bjorn Helgaas, Joerg Roedel, Arvind Yadav,
	David Woodhouse, Kirti Wankhede, Mauricio Faria de Oliveira,
	Neo Jia, Paul Mackerras, Vlad Tsyrklevich, iommu, linux-kernel

On Tue, 2017-08-15 at 09:47 +0800, Jike Song wrote:
> On 08/15/2017 09:33 AM, Benjamin Herrenschmidt wrote:
> > On Tue, 2017-08-15 at 09:16 +0800, Jike Song wrote:
> > > > Taking a step back, though, why does vfio-pci perform this check in the
> > > > first place? If a malicious guest already has control of a device, any
> > > > kind of interrupt spoofing it could do by fiddling with the MSI-X
> > > > message address/data it could simply do with a DMA write anyway, so the
> > > > security argument doesn't stand up in general (sure, not all PCIe
> > > > devices may be capable of arbitrary DMA, but that seems like more of a
> > > > tenuous security-by-obscurity angle to me).
> > 
> > I tried to make that point for years, thanks for re-iterating it :-)
> > 
> > > Hi Robin,
> > > 
> > > DMA writes will be translated (thereby censored) by DMA Remapping hardware,
> > > while MSI/MSI-X will not. Is this different for non-x86?
> > 
> > There is no way your DMA remapping HW can differenciate. The only
> > difference between a DMA write and an MSI is ... the address. So if I
> > can make my device DMA to the MSI address range, I've defeated your
> > security.
> 
> I don't think with IRQ remapping enabled, you can make your device DMA to
> MSI address, without being treated as an IRQ and remapped. If so, the IRQ
> remapping hardware is simply broken :)

You are mixing things here.

Robin's point is that there is no security provided by qemu's
obfuscation of the MSI-X table, because whatever qemu does to "filter"
the MSI-X target addresses can be worked around by making the device
DMA wherever you want.

None of what you say invalidates that basic fact.

Now, as far as your remapping HW goes, either it filters interrupts or
it doesn't. If it does then yes, it can't be spoofed, and thus you
don't need the filtering of the table in qemu.

If it doesn't, then the guest can spoof any interrupt using DMAs and
whatever qemu does to filter the table is not going to fix it.

Thus the point remains that the only value in qemu filtering the table
is to enable already insecure use cases to work, without actually
making them any more secure.

Ben.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table
@ 2017-08-15  5:38             ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 82+ messages in thread
From: Benjamin Herrenschmidt @ 2017-08-15  5:38 UTC (permalink / raw)
  To: Jike Song
  Cc: Vlad Tsyrklevich, kvm-u79uwXL29TY76Z2rM5mHXA,
	Alexey Kardashevskiy, Kyle Mahlkuch, Kirti Wankhede,
	kvm-ppc-u79uwXL29TY76Z2rM5mHXA,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Yongji Xie,
	Mauricio Faria de Oliveira, Paul Mackerras, Bjorn Helgaas,
	Arvind Yadav, linuxppc-dev-uLR06cmDAlY/bJ5BZ2RsiQ,
	David Woodhouse, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	David Gibson

On Tue, 2017-08-15 at 09:47 +0800, Jike Song wrote:
> On 08/15/2017 09:33 AM, Benjamin Herrenschmidt wrote:
> > On Tue, 2017-08-15 at 09:16 +0800, Jike Song wrote:
> > > > Taking a step back, though, why does vfio-pci perform this check in the
> > > > first place? If a malicious guest already has control of a device, any
> > > > kind of interrupt spoofing it could do by fiddling with the MSI-X
> > > > message address/data it could simply do with a DMA write anyway, so the
> > > > security argument doesn't stand up in general (sure, not all PCIe
> > > > devices may be capable of arbitrary DMA, but that seems like more of a
> > > > tenuous security-by-obscurity angle to me).
> > 
> > I tried to make that point for years, thanks for re-iterating it :-)
> > 
> > > Hi Robin,
> > > 
> > > DMA writes will be translated (thereby censored) by DMA Remapping hardware,
> > > while MSI/MSI-X will not. Is this different for non-x86?
> > 
> > There is no way your DMA remapping HW can differentiate. The only
> > difference between a DMA write and an MSI is ... the address. So if I
> > can make my device DMA to the MSI address range, I've defeated your
> > security.
> 
> I don't think with IRQ remapping enabled, you can make your device DMA to
> MSI address, without being treated as an IRQ and remapped. If so, the IRQ
> remapping hardware is simply broken :)

You are mixing things here.

Robin's point is that there is no security provided by qemu's
obfuscation of the MSI-X table, because whatever qemu does to "filter"
the MSI-X target addresses can be worked around by making the device
DMA wherever you want.

None of what you say invalidates that basic fact.

Now, as far as your remapping HW goes, either it filters interrupts or
it doesn't. If it does then yes, it can't be spoofed, and thus you
don't need the filtering of the table in qemu.

If it doesn't, then the guest can spoof any interrupt using DMAs and
whatever qemu does to filter the table is not going to fix it.

Thus the point remains that the only value in qemu filtering the table
is to enable already insecure use cases to work, without actually
making them any more secure.

Ben.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table
@ 2017-08-15  5:42       ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 82+ messages in thread
From: Benjamin Herrenschmidt @ 2017-08-15  5:42 UTC (permalink / raw)
  To: Robin Murphy, Alexey Kardashevskiy, linuxppc-dev
  Cc: David Gibson, kvm-ppc, kvm, Yongji Xie, Eric Auger,
	Kyle Mahlkuch, Alex Williamson, Jike Song, Bjorn Helgaas,
	Joerg Roedel, Arvind Yadav, David Woodhouse, Kirti Wankhede,
	Mauricio Faria de Oliveira, Neo Jia, Paul Mackerras,
	Vlad Tsyrklevich, iommu, linux-kernel

On Mon, 2017-08-14 at 14:12 +0100, Robin Murphy wrote:
> On the other hand, if the check is not so much to mitigate malicious
> guests attacking the system as to prevent dumb guests breaking
> themselves (e.g. if some or all of the MSI-X capability is actually
> emulated), then allowing things to sometimes go wrong on the grounds of
> an irrelevant hardware feature doesn't seem correct :/

There is 0 value in trying to prevent the guest kernel from shooting
itself in the foot. There are so many other ways it can do it that I
fail to see the point of even attempting it here.

In addition, this actually harms performance on some devices. There
are cases where the MSI-X table shares a page with other registers
that are used during normal device operation. This is especially
problematic on architectures such as powerpc that use 64K pages.

Those devices thus suffer a massive performance loss, for the sake of
something that never happens in practice (especially on pseries where
the MSI configuration is done by paravirt calls, thus by qemu itself).

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table
  2017-08-15  1:33         ` Benjamin Herrenschmidt
                             ` (2 preceding siblings ...)
  (?)
@ 2017-08-15 14:48           ` David Laight
  -1 siblings, 0 replies; 82+ messages in thread
From: David Laight @ 2017-08-15 14:48 UTC (permalink / raw)
  To: 'Benjamin Herrenschmidt', Jike Song, Robin Murphy
  Cc: Vlad Tsyrklevich, Neo Jia, kvm, Eric Auger, Alexey Kardashevskiy,
	David Woodhouse, Joerg Roedel, Kyle Mahlkuch, Kirti Wankhede,
	kvm-ppc, iommu, Yongji Xie, Alex Williamson,
	Mauricio Faria de Oliveira, Paul Mackerras, Bjorn Helgaas,
	Arvind Yadav, linuxppc-dev, linux-kernel, David Gibson

From: Benjamin Herrenschmidt
> Sent: 15 August 2017 02:34
> On Tue, 2017-08-15 at 09:16 +0800, Jike Song wrote:
> > > Taking a step back, though, why does vfio-pci perform this check in the
> > > first place? If a malicious guest already has control of a device, any
> > > kind of interrupt spoofing it could do by fiddling with the MSI-X
> > > message address/data it could simply do with a DMA write anyway, so the
> > > security argument doesn't stand up in general (sure, not all PCIe
> > > devices may be capable of arbitrary DMA, but that seems like more of a
> > > tenuous security-by-obscurity angle to me).
> 
> I tried to make that point for years, thanks for re-iterating it :-)

Indeed, we have an FPGA based PCIe card where the MSI-X table is just a
piece of PCIe accessible memory.
The device driver has to read the MSI-X table and write the address+data
values to other registers which are then used to raise the interrupt.
(Ok, I've written a better interrupt generator so we don't do that
any more.)

	David

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table
@ 2017-08-15 16:37       ` Alex Williamson
  0 siblings, 0 replies; 82+ messages in thread
From: Alex Williamson @ 2017-08-15 16:37 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Alexey Kardashevskiy, linuxppc-dev, David Gibson, kvm-ppc, kvm,
	Yongji Xie, Eric Auger, Kyle Mahlkuch, Jike Song, Bjorn Helgaas,
	Joerg Roedel, Arvind Yadav, Benjamin Herrenschmidt,
	David Woodhouse, Kirti Wankhede, Mauricio Faria de Oliveira,
	Neo Jia, Paul Mackerras, Vlad Tsyrklevich, iommu, linux-kernel

On Mon, 14 Aug 2017 14:12:33 +0100
Robin Murphy <robin.murphy@arm.com> wrote:

> On 14/08/17 10:45, Alexey Kardashevskiy wrote:
> > Folks,
> > 
> > Is there anything to change besides those compiler errors and David's
> > comment in 5/5? Or the whole patchset is too bad? Thanks.  
> 
> While I now understand it's not the low-level thing I first thought it
> was, so my reasoning has changed, personally I don't like this approach
> any more than the previous one - it still smells of abusing external
> APIs to pass information from one part of VFIO to another (and it has
> the same conceptual problem of attributing something to interrupt
> sources that is actually a property of the interrupt target).
> 
> Taking a step back, though, why does vfio-pci perform this check in the
> first place? If a malicious guest already has control of a device, any
> kind of interrupt spoofing it could do by fiddling with the MSI-X
> message address/data it could simply do with a DMA write anyway, so the
> security argument doesn't stand up in general (sure, not all PCIe
> devices may be capable of arbitrary DMA, but that seems like more of a
> tenuous security-by-obscurity angle to me). Besides, with Type1 IOMMU
> the fact that we've let a device be assigned at all means that this is
> already a non-issue (because either the hardware provides isolation or
> the user has explicitly accepted the consequences of an unsafe
> configuration) - from patch #4 that's apparently the same for SPAPR TCE,
> in which case it seems this flag doesn't even need to be propagated and
> could simply be assumed always.
> 
> On the other hand, if the check is not so much to mitigate malicious
> guests attacking the system as to prevent dumb guests breaking
> themselves (e.g. if some or all of the MSI-X capability is actually
> emulated), then allowing things to sometimes go wrong on the grounds of
> an irrelevant hardware feature doesn't seem correct :/

While the theoretical security provided by preventing direct access to
the MSI-X vector table may be mostly a matter of obfuscation, in
practice, I think it changes the problem of creating arbitrary DMA
writes from a generic, trivial, PCI spec based exercise to a more device
specific challenge.  I do however have evidence that there are
consumers of the vfio API who would have attempted to program device
interrupts by directly manipulating the vector table had they not been
prevented from doing so and contacting me to learn about the SET_IRQ
ioctl.  Therefore I think the behavior also contributes to making the
overall API more difficult to use incorrectly.

Of course I don't think either of those are worth imposing a
performance penalty where we don't otherwise need one.  However, if we
look at a VM scenario where the guest is following the PCI standard for
programming MSI-X interrupts (ie. not POWER), we need some mechanism to
intercept those MMIO writes to the vector table and configure the host
interrupt domain of the device rather than allowing the guest direct
access.  This is simply part of virtualizing the device to the guest.
So even if the kernel allows mmap'ing the vector table, the hypervisor
needs to trap it, so the mmap isn't required or used anyway.  It's only
when you define a non-PCI standard for your guest to program
interrupts, as POWER has done, and can therefore trust that the
hypervisor does not need to trap on the vector table that having that
mmap'able vector table becomes fully useful.  AIUI, ARM supports 64k
pages too... does ARM have any strategy that would actually make it
possible to make use of an mmap covering the vector table?  Thanks,

Alex

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table
@ 2017-08-16  0:35         ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 82+ messages in thread
From: Benjamin Herrenschmidt @ 2017-08-16  0:35 UTC (permalink / raw)
  To: Alex Williamson, Robin Murphy
  Cc: Alexey Kardashevskiy, linuxppc-dev, David Gibson, kvm-ppc, kvm,
	Yongji Xie, Eric Auger, Kyle Mahlkuch, Jike Song, Bjorn Helgaas,
	Joerg Roedel, Arvind Yadav, David Woodhouse, Kirti Wankhede,
	Mauricio Faria de Oliveira, Neo Jia, Paul Mackerras,
	Vlad Tsyrklevich, iommu, linux-kernel

On Tue, 2017-08-15 at 10:37 -0600, Alex Williamson wrote:
> Of course I don't think either of those are worth imposing a
> performance penalty where we don't otherwise need one.  However, if we
> look at a VM scenario where the guest is following the PCI standard for
> programming MSI-X interrupts (ie. not POWER), we need some mechanism to
> intercept those MMIO writes to the vector table and configure the host
> interrupt domain of the device rather than allowing the guest direct
> access.  This is simply part of virtualizing the device to the guest.
> So even if the kernel allows mmap'ing the vector table, the hypervisor
> needs to trap it, so the mmap isn't required or used anyway.  It's only
> when you define a non-PCI standard for your guest to program
> interrupts, as POWER has done, and can therefore trust that the
> hypervisor does not need to trap on the vector table that having that
> mmap'able vector table becomes fully useful.  AIUI, ARM supports 64k
> pages too... does ARM have any strategy that would actually make it
> possible to make use of an mmap covering the vector table?  Thanks,

WTF???? Alex, can you stop once and for all with all that "POWER is
not standard" bullshit please? It's completely wrong.

This has nothing to do with the PCIe standard!

The PCIe standard says strictly *nothing* whatsoever about how an OS
obtains the magic address/values to put in the device and how the PCIe
host bridge may do appropriate filtering.

There is nothing on POWER that prevents the guest from writing the MSI-
X address/data by hand. The problem isn't who writes the values or even
how. The problem breaks down into these two things that are NOT covered
by any aspect of the PCIe standard:

  1- The OS needs to obtain address/data values for an MSI that will
"work" for the device.

  2- The HW+HV needs to prevent collateral damage caused by a device
issuing stores to an incorrect address or with incorrect data. Now
*this* is necessary for *ANY* kind of DMA, whether it's an MSI or
something else anyway.

Now, the filtering done by qemu is NOT a reasonable way to handle 2),
and whatever excuse about "making it harder" doesn't fly a meter when
it comes to security. Making it "harder to break accidentally" I also
don't buy; people don't just randomly put things in their MSI-X tables
"accidentally", that stuff works or doesn't.

That leaves us with 1). Now this is purely a platform-specific matter,
not a spec matter. Once the HW has a way to enforce that you can only
generate "allowed" MSIs, it becomes a matter of having some FW
mechanism that can be used to inform the OS what address/values to use
for a given interrupt.

This is provided on POWER by a combination of device-tree and RTAS. It
could be that x86/ARM64 doesn't provide good enough mechanisms via
ACPI, but this is in no way a problem of standard compliance, just
inferior firmware interfaces.

So again, for the 234789246th time in years, can we get that 1-bit-of-
information sorted one way or another so we can fix our massive
performance issue instead of adding yet another dozen layers of paint
on that shed?

Ben.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table
@ 2017-08-16 16:56           ` Alex Williamson
  0 siblings, 0 replies; 82+ messages in thread
From: Alex Williamson @ 2017-08-16 16:56 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Robin Murphy, Alexey Kardashevskiy, linuxppc-dev, David Gibson,
	kvm-ppc, kvm, Yongji Xie, Eric Auger, Kyle Mahlkuch, Jike Song,
	Bjorn Helgaas, Joerg Roedel, Arvind Yadav, David Woodhouse,
	Kirti Wankhede, Mauricio Faria de Oliveira, Neo Jia,
	Paul Mackerras, Vlad Tsyrklevich, iommu, linux-kernel

On Wed, 16 Aug 2017 10:35:49 +1000
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> On Tue, 2017-08-15 at 10:37 -0600, Alex Williamson wrote:
> > Of course I don't think either of those are worth imposing a
> > performance penalty where we don't otherwise need one.  However, if we
> > look at a VM scenario where the guest is following the PCI standard for
> > programming MSI-X interrupts (ie. not POWER), we need some mechanism to
> > intercept those MMIO writes to the vector table and configure the host
> > interrupt domain of the device rather than allowing the guest direct
> > access.  This is simply part of virtualizing the device to the guest.
> > So even if the kernel allows mmap'ing the vector table, the hypervisor
> > needs to trap it, so the mmap isn't required or used anyway.  It's only
> > when you define a non-PCI standard for your guest to program
> > interrupts, as POWER has done, and can therefore trust that the
> > hypervisor does not need to trap on the vector table that having that
> > mmap'able vector table becomes fully useful.  AIUI, ARM supports 64k
> > pages too... does ARM have any strategy that would actually make it
> > possible to make use of an mmap covering the vector table?  Thanks,  
> 
> WTF ???? Alex, can you stop once and for all with all that "POWER is
> not standard" bullshit please ? It's completely wrong.

As you've stated, the MSI-X vector table on POWER is currently updated
via a hypercall.  POWER is overall PCI compliant (I assume), but the
guest does not directly modify the vector table in MMIO space of the
device.  This is important...

> This has nothing to do with PCIe standard !

Yes, it actually does, because if the guest relies on the vector table
to be virtualized then it doesn't particularly matter whether the
vfio-pci kernel driver allows that portion of device MMIO space to be
directly accessed or mapped because QEMU needs for it to be trapped in
order to provide that virtualization.

I'm not knocking POWER, it's a smart thing for virtualization to have
defined this hypercall which negates the need for vector table
virtualization and allows efficient mapping of the device.  On other
platforms, it's not necessarily practical given the broad base of legacy
guests supported where we'd never get agreement to implement this as
part of the platform spec... if there even was such a thing.  Maybe we
could provide the hypercall and dynamically enable direct vector table
mapping (disabling vector table virtualization) only if the hypercall
is used.

> The PCIe standard says strictly *nothing* whatsoever about how an OS
> obtains the magic address/values to put in the device and how the PCIe
> host bridge may do appropriate filtering.

And now we've jumped the tracks...  The only way the platform specific
address/data values become important is if we allow direct access to
the vector table AND now we're formulating how the user/guest might
write to it directly.  Otherwise the virtualization of the vector
table, or paravirtualization via hypercall provides the translation
where the host and guest address/data pairs can operate in completely
different address spaces.

> There is nothing on POWER that prevents the guest from writing the MSI-
> X address/data by hand. The problem isn't who writes the values or even
> how. The problem breaks down into these two things that are NOT covered
> by any aspect of the PCIe standard:

You've moved on to a different problem; I think everyone aside from
POWER is still back at the problem where who writes the vector table
values is a forefront problem.
 
>   1- The OS needs to obtain address/data values for an MSI that will
> "work" for the device.
> 
>   2- The HW+HV needs to prevent collateral damage caused by a device
> issuing stores to incorrect address or with incorrect data. Now *this*
> is necessary for *ANY* kind of DMA whether it's an MSI or something
> else anyway.
> 
> Now, the filtering done by qemu is NOT a reasonable way to handle 2)
> and whatever excuse about "making it harder" doesn't fly a meter when
> it comes to security. Making it "harder to break accidentally" I also
> don't buy, people don't just randomly put things in their MSI-X tables
> "accidentally", that stuff works or doesn't.

As I said before, I'm not willing to preserve the weak attributes that
blocking direct vector table access provides over pursuing a more
performant interface, but I also don't think their value is absolute
zero either.

> That leaves us with 1). Now this is purely a platform-specific matter,
> not a spec matter. Once the HW has a way to enforce that you can only
> generate "allowed" MSIs, it becomes a matter of having some FW
> mechanism that can be used to inform the OS what address/values to use
> for a given interrupt.
> 
> This is provided on POWER by a combination of device-tree and RTAS. It
> could be that x86/ARM64 doesn't provide good enough mechanisms via ACPI
> but this is in no way a problem of standard compliance, just inferior
> firmware interfaces.

Firmware pissing match...  Processors running with 8k or less page size
fall within the recommendations of the PCI spec for register alignment
of MMIO regions of the device and this whole problem becomes less of an
issue.

> So again, for the 234789246th time in years, can we get that 1-bit-of-
> information sorted one way or another so we can fix our massive
> performance issue instead of adding yet another dozen layers of paint
> on that shed ?

TBH, I'm not even sure which bikeshed we're looking at with this latest
distraction of interfaces through which the user/guest could discover
viable address/data values to write the vector table directly.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table
@ 2017-08-16 16:56           ` Alex Williamson
  0 siblings, 0 replies; 82+ messages in thread
From: Alex Williamson @ 2017-08-16 16:56 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: kvm-u79uwXL29TY76Z2rM5mHXA, Alexey Kardashevskiy, Kyle Mahlkuch,
	Vlad Tsyrklevich, kvm-ppc-u79uwXL29TY76Z2rM5mHXA,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Yongji Xie,
	Kirti Wankhede, Mauricio Faria de Oliveira, Paul Mackerras,
	Bjorn Helgaas, Arvind Yadav, linuxppc-dev-uLR06cmDAlY/bJ5BZ2RsiQ,
	David Woodhouse, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	David Gibson

On Wed, 16 Aug 2017 10:35:49 +1000
Benjamin Herrenschmidt <benh-XVmvHMARGAS8U2dJNN8I7kB+6BGkLq7r@public.gmane.org> wrote:

> On Tue, 2017-08-15 at 10:37 -0600, Alex Williamson wrote:
> > Of course I don't think either of those are worth imposing a
> > performance penalty where we don't otherwise need one.  However, if we
> > look at a VM scenario where the guest is following the PCI standard for
> > programming MSI-X interrupts (ie. not POWER), we need some mechanism to
> > intercept those MMIO writes to the vector table and configure the host
> > interrupt domain of the device rather than allowing the guest direct
> > access.  This is simply part of virtualizing the device to the guest.
> > So even if the kernel allows mmap'ing the vector table, the hypervisor
> > needs to trap it, so the mmap isn't required or used anyway.  It's only
> > when you define a non-PCI standard for your guest to program
> > interrupts, as POWER has done, and can therefore trust that the
> > hypervisor does not need to trap on the vector table that having that
> > mmap'able vector table becomes fully useful.  AIUI, ARM supports 64k
> > pages too... does ARM have any strategy that would actually make it
> > possible to make use of an mmap covering the vector table?  Thanks,  
> 
> WTF ???? Alex, can you stop once and for all with all that "POWER is
> not standard" bullshit please ? It's completely wrong.

As you've stated, the MSI-X vector table on POWER is currently updated
via a hypercall.  POWER is overall PCI compliant (I assume), but the
guest does not directly modify the vector table in MMIO space of the
device.  This is important...

> This has nothing to do with PCIe standard !

Yes, it actually does, because if the guest relies on the vector table
to be virtualized then it doesn't particularly matter whether the
vfio-pci kernel driver allows that portion of device MMIO space to be
directly accessed or mapped because QEMU needs for it to be trapped in
order to provide that virtualization.

I'm not knocking POWER, it's a smart thing for virtualization to have
defined this hypercall which negates the need for vector table
virtualization and allows efficient mapping of the device.  On other
platform, it's not necessarily practical given the broad base of legacy
guests supported where we'd never get agreement to implement this as
part of the platform spec... if there even was such a thing.  Maybe we
could provide the hypercall and dynamically enable direct vector table
mapping (disabling vector table virtualization) only if the hypercall
is used.

> The PCIe standard says strictly *nothing* whatsoever about how an OS
> obtains the magic address/values to put in the device and how the PCIe
> host bridge may do appropriate fitering.

And now we've jumped the tracks...  The only way the platform specific
address/data values become important is if we allow direct access to
the vector table AND now we're formulating how the user/guest might
write to it directly.  Otherwise the virtualization of the vector
table, or paravirtualization via hypercall provides the translation
where the host and guest address/data pairs can operate in completely
different address spaces.

> There is nothing on POWER that prevents the guest from writing the MSI-
> X address/data by hand. The problem isn't who writes the values or even
> how. The problem breaks down into these two things that are NOT covered
> by any aspect of the PCIe standard:

You've moved on to a different problem, I think everyone aside from
POWER is still back at the problem where who writes the vector table
values is a forefront problem.
 
>   1- The OS needs to obtain address/data values for an MSI that will
> "work" for the device.
> 
>   2- The HW+HV needs to prevent collateral damage caused by a device
> issuing stores to incorrect address or with incorrect data. Now *this*
> is necessary for *ANY* kind of DMA whether it's an MSI or something
> else anyway.
> 
> Now, the filtering done by qemu is NOT a reasonable way to handle 2)
> and whatever excluse about "making it harder" doesn't fly a meter when
> it comes to security. Making it "harder to break accidentally" I also
> don't buy, people don't just randomly put things in their MSI-X tables
> "accidentally", that stuff works or doesn't.

As I said before, I'm not willing to preserve the weak attributes that
blocking direct vector table access provides over pursuing a more
performant interface, but I also don't think their value is absolute
zero either.

> That leaves us with 1). Now this is purely a platform specific matters,
> not a spec matter. Once the HW has a way to enforce you can only
> generate "allowed" MSIs it becomes a matter of having some FW mechanism
> that can be used to informed the OS what address/values to use for a
> given interrupts.
> 
> This is provided on POWER by a combination of device-tree and RTAS. It
> could be that x86/ARM64 doesn't provide good enough mechanisms via ACPI
> but this is no way a problem of standard compliance, just inferior
> firmware interfaces.

Firmware pissing match...  Processors running with 8k or less page size
fall within the recommendations of the PCI spec for register alignment
of MMIO regions of the device and this whole problem becomes less of an
issue.

> So again, for the 234789246th time in years, can we get that 1-bit-of-
> information sorted one way or another so we can fix our massive
> performance issue instead of adding yet another dozen layers of paint
> on that shed ?

TBH, I'm not even sure which bikeshed we're looking at with this latest
distraction of interfaces through which the user/guest could discover
viable address/data values to write the vector table directly.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table
@ 2017-08-16 16:56           ` Alex Williamson
  0 siblings, 0 replies; 82+ messages in thread
From: Alex Williamson @ 2017-08-16 16:56 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Robin Murphy, Alexey Kardashevskiy, linuxppc-dev, David Gibson,
	kvm-ppc, kvm, Yongji Xie, Eric Auger, Kyle Mahlkuch, Jike Song,
	Bjorn Helgaas, Joerg Roedel, Arvind Yadav, David Woodhouse,
	Kirti Wankhede, Mauricio Faria de Oliveira, Neo Jia,
	Paul Mackerras, Vlad Tsyrklevich, iommu, linux-kernel

On Wed, 16 Aug 2017 10:35:49 +1000
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> On Tue, 2017-08-15 at 10:37 -0600, Alex Williamson wrote:
> > Of course I don't think either of those are worth imposing a
> > performance penalty where we don't otherwise need one.  However, if we
> > look at a VM scenario where the guest is following the PCI standard for
> > programming MSI-X interrupts (ie. not POWER), we need some mechanism to
> > intercept those MMIO writes to the vector table and configure the host
> > interrupt domain of the device rather than allowing the guest direct
> > access.  This is simply part of virtualizing the device to the guest.
> > So even if the kernel allows mmap'ing the vector table, the hypervisor
> > needs to trap it, so the mmap isn't required or used anyway.  It's only
> > when you define a non-PCI standard for your guest to program
> > interrupts, as POWER has done, and can therefore trust that the
> > hypervisor does not need to trap on the vector table that having that
> > mmap'able vector table becomes fully useful.  AIUI, ARM supports 64k
> > pages too... does ARM have any strategy that would actually make it
> > possible to make use of an mmap covering the vector table?  Thanks,  
> 
> WTF ???? Alex, can you stop once and for all with all that "POWER is
> not standard" bullshit please ? It's completely wrong.

As you've stated, the MSI-X vector table on POWER is currently updated
via a hypercall.  POWER is overall PCI compliant (I assume), but the
guest does not directly modify the vector table in MMIO space of the
device.  This is important...

> This has nothing to do with PCIe standard !

Yes, it actually does, because if the guest relies on the vector table
to be virtualized then it doesn't particularly matter whether the
vfio-pci kernel driver allows that portion of device MMIO space to be
directly accessed or mapped because QEMU needs for it to be trapped in
order to provide that virtualization.

I'm not knocking POWER, it's a smart thing for virtualization to have
defined this hypercall which negates the need for vector table
virtualization and allows efficient mapping of the device.  On other
platform, it's not necessarily practical given the broad base of legacy
guests supported where we'd never get agreement to implement this as
part of the platform spec... if there even was such a thing.  Maybe we
could provide the hypercall and dynamically enable direct vector table
mapping (disabling vector table virtualization) only if the hypercall
is used.

> The PCIe standard says strictly *nothing* whatsoever about how an OS
> obtains the magic address/values to put in the device and how the PCIe
> host bridge may do appropriate fitering.

And now we've jumped the tracks...  The only way the platform specific
address/data values become important is if we allow direct access to
the vector table AND now we're formulating how the user/guest might
write to it directly.  Otherwise the virtualization of the vector
table, or paravirtualization via hypercall provides the translation
where the host and guest address/data pairs can operate in completely
different address spaces.

> There is nothing on POWER that prevents the guest from writing the MSI-
> X address/data by hand. The problem isn't who writes the values or even
> how. The problem breaks down into these two things that are NOT covered
> by any aspect of the PCIe standard:

You've moved on to a different problem, I think everyone aside from
POWER is still back at the problem where who writes the vector table
values is a forefront problem.
 
>   1- The OS needs to obtain address/data values for an MSI that will
> "work" for the device.
> 
>   2- The HW+HV needs to prevent collateral damage caused by a device
> issuing stores to incorrect address or with incorrect data. Now *this*
> is necessary for *ANY* kind of DMA whether it's an MSI or something
> else anyway.
> 
> Now, the filtering done by qemu is NOT a reasonable way to handle 2)
> and whatever excluse about "making it harder" doesn't fly a meter when
> it comes to security. Making it "harder to break accidentally" I also
> don't buy, people don't just randomly put things in their MSI-X tables
> "accidentally", that stuff works or doesn't.

As I said before, I'm not willing to preserve the weak attributes that
blocking direct vector table access provides over pursuing a more
performant interface, but I also don't think their value is absolute
zero either.

> That leaves us with 1). Now this is purely a platform specific matters,
> not a spec matter. Once the HW has a way to enforce you can only
> generate "allowed" MSIs it becomes a matter of having some FW mechanism
> that can be used to informed the OS what address/values to use for a
> given interrupts.
> 
> This is provided on POWER by a combination of device-tree and RTAS. It
> could be that x86/ARM64 doesn't provide good enough mechanisms via ACPI
> but this is no way a problem of standard compliance, just inferior
> firmware interfaces.

Firmware pissing match...  Processors running with 8k or less page size
fall within the recommendations of the PCI spec for register alignment
of MMIO regions of the device and this whole problem becomes less of an
issue.

> So again, for the 234789246th time in years, can we get that 1-bit-of-
> information sorted one way or another so we can fix our massive
> performance issue instead of adding yet another dozen layers of paint
> on that shed ?

TBH, I'm not even sure which bikeshed we're looking at with this latest
distraction of interfaces through which the user/guest could discover
viable address/data values to write the vector table directly.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table
@ 2017-08-16 16:56           ` Alex Williamson
  0 siblings, 0 replies; 82+ messages in thread
From: Alex Williamson @ 2017-08-16 16:56 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Robin Murphy, Alexey Kardashevskiy, linuxppc-dev, David Gibson,
	kvm-ppc, kvm, Yongji Xie, Eric Auger, Kyle Mahlkuch, Jike Song,
	Bjorn Helgaas, Joerg Roedel, Arvind Yadav, David Woodhouse,
	Kirti Wankhede, Mauricio Faria de Oliveira, Neo Jia,
	Paul Mackerras, Vlad Tsyrklevich, iommu, linux-kernel

On Wed, 16 Aug 2017 10:35:49 +1000
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> On Tue, 2017-08-15 at 10:37 -0600, Alex Williamson wrote:
> > Of course I don't think either of those are worth imposing a
> > performance penalty where we don't otherwise need one.  However, if we
> > look at a VM scenario where the guest is following the PCI standard for
> > programming MSI-X interrupts (ie. not POWER), we need some mechanism to
> > intercept those MMIO writes to the vector table and configure the host
> > interrupt domain of the device rather than allowing the guest direct
> > access.  This is simply part of virtualizing the device to the guest.
> > So even if the kernel allows mmap'ing the vector table, the hypervisor
> > needs to trap it, so the mmap isn't required or used anyway.  It's only
> > when you define a non-PCI standard for your guest to program
> > interrupts, as POWER has done, and can therefore trust that the
> > hypervisor does not need to trap on the vector table that having that
> > mmap'able vector table becomes fully useful.  AIUI, ARM supports 64k
> > pages too... does ARM have any strategy that would actually make it
> > possible to make use of an mmap covering the vector table?  Thanks,  
> 
> WTF ???? Alex, can you stop once and for all with all that "POWER is
> not standard" bullshit please ? It's completely wrong.

As you've stated, the MSI-X vector table on POWER is currently updated
via a hypercall.  POWER is overall PCI compliant (I assume), but the
guest does not directly modify the vector table in MMIO space of the
device.  This is important...

> This has nothing to do with PCIe standard !

Yes, it actually does, because if the guest relies on the vector table
to be virtualized then it doesn't particularly matter whether the
vfio-pci kernel driver allows that portion of device MMIO space to be
directly accessed or mapped because QEMU needs for it to be trapped in
order to provide that virtualization.

I'm not knocking POWER, it's a smart thing for virtualization to have
defined this hypercall which negates the need for vector table
virtualization and allows efficient mapping of the device.  On other
platforms, it's not necessarily practical given the broad base of legacy
guests supported where we'd never get agreement to implement this as
part of the platform spec... if there even was such a thing.  Maybe we
could provide the hypercall and dynamically enable direct vector table
mapping (disabling vector table virtualization) only if the hypercall
is used.

> The PCIe standard says strictly *nothing* whatsoever about how an OS
> obtains the magic address/values to put in the device and how the PCIe
> host bridge may do appropriate filtering.

And now we've jumped the tracks...  The only way the platform specific
address/data values become important is if we allow direct access to
the vector table AND now we're formulating how the user/guest might
write to it directly.  Otherwise the virtualization of the vector
table, or paravirtualization via hypercall provides the translation
where the host and guest address/data pairs can operate in completely
different address spaces.

> There is nothing on POWER that prevents the guest from writing the MSI-
> X address/data by hand. The problem isn't who writes the values or even
> how. The problem breaks down into these two things that are NOT covered
> by any aspect of the PCIe standard:

You've moved on to a different problem; I think everyone aside from
POWER is still back at the one where who writes the vector table
values is the forefront concern.
 
>   1- The OS needs to obtain address/data values for an MSI that will
> "work" for the device.
> 
>   2- The HW+HV needs to prevent collateral damage caused by a device
> issuing stores to incorrect address or with incorrect data. Now *this*
> is necessary for *ANY* kind of DMA whether it's an MSI or something
> else anyway.
> 
> Now, the filtering done by qemu is NOT a reasonable way to handle 2)
> and whatever excuse about "making it harder" doesn't fly a meter when
> it comes to security. Making it "harder to break accidentally" I also
> don't buy, people don't just randomly put things in their MSI-X tables
> "accidentally", that stuff works or doesn't.

As I said before, I'm not willing to preserve the weak attributes that
blocking direct vector table access provides over pursuing a more
performant interface, but I also don't think their value is absolute
zero either.

> That leaves us with 1). Now this is purely a platform specific matter,
> not a spec matter. Once the HW has a way to enforce you can only
> generate "allowed" MSIs it becomes a matter of having some FW mechanism
> that can be used to inform the OS what address/values to use for a
> given interrupt.
> 
> This is provided on POWER by a combination of device-tree and RTAS. It
> could be that x86/ARM64 doesn't provide good enough mechanisms via ACPI
> but this is no way a problem of standard compliance, just inferior
> firmware interfaces.

Firmware pissing match...  Processors running with 8k or less page size
fall within the recommendations of the PCI spec for register alignment
of MMIO regions of the device and this whole problem becomes less of an
issue.

> So again, for the 234789246th time in years, can we get that 1-bit-of-
> information sorted one way or another so we can fix our massive
> performance issue instead of adding yet another dozen layers of paint
> on that shed ?

TBH, I'm not even sure which bikeshed we're looking at with this latest
distraction of interfaces through which the user/guest could discover
viable address/data values to write the vector table directly.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table
@ 2017-08-17  4:43             ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 82+ messages in thread
From: Benjamin Herrenschmidt @ 2017-08-17  4:43 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Robin Murphy, Alexey Kardashevskiy, linuxppc-dev, David Gibson,
	kvm-ppc, kvm, Yongji Xie, Eric Auger, Kyle Mahlkuch, Jike Song,
	Bjorn Helgaas, Joerg Roedel, Arvind Yadav, David Woodhouse,
	Kirti Wankhede, Mauricio Faria de Oliveira, Neo Jia,
	Paul Mackerras, Vlad Tsyrklevich, iommu, linux-kernel

On Wed, 2017-08-16 at 10:56 -0600, Alex Williamson wrote:
> 
> > WTF ???? Alex, can you stop once and for all with all that "POWER is
> > not standard" bullshit please ? It's completely wrong.
> 
> As you've stated, the MSI-X vector table on POWER is currently updated
> via a hypercall.  POWER is overall PCI compliant (I assume), but the
> guest does not directly modify the vector table in MMIO space of the
> device.  This is important...

Well, no. Under qemu the guest doesn't always write the table directly
(though it can save/restore it), but on PowerVM this is done by the FW
running inside the partition itself, and that firmware just does normal
stores to the device table.

IE. The problem here isn't so much who does the actual stores to the
device table but where they get the address and data values from, which
isn't covered by the spec.

The added fact that qemu hijacks the stores not just to "remap" them
but also to do the whole requesting of the interrupt etc. in the host
system is a qemu design choice which also has no relation to the
spec (and arguably isn't a great choice for our systems).

For example, on PowerVM, the HV assigns a pile of MSIs to the guest to
assign to its devices. The FW inside the guest does a default
assignment but that can be changed.

Thus the interrupts are effectively "hooked up" at the HV level at the
point where the PCI bridge is mapped into the guest.

> > This has nothing to do with PCIe standard !
> 
> Yes, it actually does, because if the guest relies on the vector table
> to be virtualized then it doesn't particularly matter whether the
> vfio-pci kernel driver allows that portion of device MMIO space to be
> directly accessed or mapped because QEMU needs for it to be trapped in
> order to provide that virtualization.

And this has nothing to do with the PCIe standard... this has
everything to do with a combination of qemu design choices and
deficient FW interfaces on x86 platforms.

> I'm not knocking POWER, it's a smart thing for virtualization to have
> defined this hypercall which negates the need for vector table
> virtualization and allows efficient mapping of the device.  On other
> platforms, it's not necessarily practical given the broad base of legacy
> guests supported where we'd never get agreement to implement this as
> part of the platform spec... if there even was such a thing.  Maybe we
> could provide the hypercall and dynamically enable direct vector table
> mapping (disabling vector table virtualization) only if the hypercall
> is used.

No, I think a better approach would be to provide the guest with a pile
of MSIs to use with devices and have FW (such as ACPI) tell the guest
about them.

> > The PCIe standard says strictly *nothing* whatsoever about how an OS
> > obtains the magic address/values to put in the device and how the PCIe
> > host bridge may do appropriate filtering.
> 
> And now we've jumped the tracks...  The only way the platform specific
> address/data values become important is if we allow direct access to
> the vector table AND now we're formulating how the user/guest might
> write to it directly.  Otherwise the virtualization of the vector
> table, or paravirtualization via hypercall provides the translation
> where the host and guest address/data pairs can operate in completely
> different address spaces.

They can regardless if things are done properly :-)

> > There is nothing on POWER that prevents the guest from writing the MSI-
> > X address/data by hand. The problem isn't who writes the values or even
> > how. The problem breaks down into these two things that are NOT covered
> > by any aspect of the PCIe standard:
> 
> You've moved on to a different problem, I think everyone aside from
> POWER is still back at the problem where who writes the vector table
> values is a forefront problem.
>  
> >   1- The OS needs to obtain address/data values for an MSI that will
> > "work" for the device.
> > 
> >   2- The HW+HV needs to prevent collateral damage caused by a device
> > issuing stores to incorrect address or with incorrect data. Now *this*
> > is necessary for *ANY* kind of DMA whether it's an MSI or something
> > else anyway.
> > 
> > Now, the filtering done by qemu is NOT a reasonable way to handle 2)
> > and whatever excuse about "making it harder" doesn't fly a meter when
> > it comes to security. Making it "harder to break accidentally" I also
> > don't buy, people don't just randomly put things in their MSI-X tables
> > "accidentally", that stuff works or doesn't.
> 
> As I said before, I'm not willing to preserve the weak attributes that
> blocking direct vector table access provides over pursuing a more
> performant interface, but I also don't think their value is absolute
> zero either.
> 
> > That leaves us with 1). Now this is purely a platform specific matter,
> > not a spec matter. Once the HW has a way to enforce you can only
> > generate "allowed" MSIs it becomes a matter of having some FW mechanism
> > that can be used to inform the OS what address/values to use for a
> > given interrupt.
> > 
> > This is provided on POWER by a combination of device-tree and RTAS. It
> > could be that x86/ARM64 doesn't provide good enough mechanisms via ACPI
> > but this is no way a problem of standard compliance, just inferior
> > firmware interfaces.
> 
> Firmware pissing match...  Processors running with 8k or less page size
> fall within the recommendations of the PCI spec for register alignment
> of MMIO regions of the device and this whole problem becomes less of an
> issue.
> 
> > So again, for the 234789246th time in years, can we get that 1-bit-of-
> > information sorted one way or another so we can fix our massive
> > performance issue instead of adding yet another dozen layers of paint
> > on that shed ?
> 
> TBH, I'm not even sure which bikeshed we're looking at with this latest
> distraction of interfaces through which the user/guest could discover
> viable address/data values to write the vector table directly.  Thanks,
> 
> Alex

^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table
  2017-08-16 16:56           ` Alex Williamson
  (?)
@ 2017-08-17 10:56             ` David Laight
  -1 siblings, 0 replies; 82+ messages in thread
From: David Laight @ 2017-08-17 10:56 UTC (permalink / raw)
  To: 'Alex Williamson', Benjamin Herrenschmidt
  Cc: Jike Song, Neo Jia, kvm, Eric Auger, Alexey Kardashevskiy,
	Robin Murphy, Joerg Roedel, Kyle Mahlkuch, Vlad Tsyrklevich,
	kvm-ppc, iommu, Yongji Xie, Kirti Wankhede,
	Mauricio Faria de Oliveira, Paul Mackerras, Bjorn Helgaas,
	Arvind Yadav, linuxppc-dev, David Woodhouse, linux-kernel,
	David Gibson

From: Alex Williamson
> Sent: 16 August 2017 17:56
...
> Firmware pissing match...  Processors running with 8k or less page size
> fall within the recommendations of the PCI spec for register alignment
> of MMIO regions of the device and this whole problem becomes less of an
> issue.

Actually if qemu is causing the MSI-X table accesses to fault, why doesn't
it just lie to the guest about the physical address of the MSI-X table?
Then mmio access to anything in the same physical page will just work.

It has already been pointed out that you can't actually police the
interrupts that are raised without host hardware support.

Actually, putting other vectors in the MSI-X table is boring, most
drivers will ignore unexpected interrupts.
Much more interesting are physical memory addresses and accessible IO
addresses.
Of course, a lot of boards have PCI master capability and can probably
be persuaded to do writes to a specific location anyway.

	David

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table
@ 2017-08-17 19:25               ` Alex Williamson
  0 siblings, 0 replies; 82+ messages in thread
From: Alex Williamson @ 2017-08-17 19:25 UTC (permalink / raw)
  To: David Laight
  Cc: Benjamin Herrenschmidt, Jike Song, Neo Jia, kvm, Eric Auger,
	Alexey Kardashevskiy, Robin Murphy, Joerg Roedel, Kyle Mahlkuch,
	Vlad Tsyrklevich, kvm-ppc, iommu, Yongji Xie, Kirti Wankhede,
	Mauricio Faria de Oliveira, Paul Mackerras, Bjorn Helgaas,
	Arvind Yadav, linuxppc-dev, David Woodhouse, linux-kernel,
	David Gibson

On Thu, 17 Aug 2017 10:56:35 +0000
David Laight <David.Laight@ACULAB.COM> wrote:

> From: Alex Williamson
> > Sent: 16 August 2017 17:56  
> ...
> > Firmware pissing match...  Processors running with 8k or less page size
> > fall within the recommendations of the PCI spec for register alignment
> > of MMIO regions of the device and this whole problem becomes less of an
> > issue.  
> 
> Actually if qemu is causing the MSI-X table accesses to fault, why doesn't
> it just lie to the guest about the physical address of the MSI-X table?
> Then mmio access to anything in the same physical page will just work.

That's an interesting idea, but now you need to add a BAR for the
virtualized vector table, and you'll also need to support extending a
BAR because there won't necessarily be a BAR available to add.  Of
course PCI requires natural alignment of BARs, thus an extra few bytes
on the end doubles the BAR size.  So also hope that if we need to
extend a BAR that there's a relatively small one available.  In either
case you're changing the layout of the device from what the driver might
expect.  We try pretty hard with device assignment to leave things in
the same place as they appear on bare metal, perhaps removing things,
but not actually moving things.  It might work in the majority of
cases, but it seems a bit precarious overall.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table
@ 2017-08-21  2:47     ` Alexey Kardashevskiy
  0 siblings, 0 replies; 82+ messages in thread
From: Alexey Kardashevskiy @ 2017-08-21  2:47 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: David Gibson, kvm-ppc, kvm, Yongji Xie, Eric Auger,
	Kyle Mahlkuch, Alex Williamson, Jike Song, Bjorn Helgaas,
	Robin Murphy, Joerg Roedel, Arvind Yadav, Benjamin Herrenschmidt,
	David Woodhouse, Kirti Wankhede, Mauricio Faria de Oliveira,
	Neo Jia, Paul Mackerras, Vlad Tsyrklevich, iommu, linux-kernel

Folks,

Ok, people did talk, exchanged ideas, lovely :) What happens now? Do I
repost this or go back to PCI bus flags or something else? Thanks.



On 14/08/17 19:45, Alexey Kardashevskiy wrote:
> Folks,
> 
> Is there anything to change besides those compiler errors and David's
> comment in 5/5? Or is the whole patchset too bad? Thanks.
> 
> 
> 
> On 07/08/17 17:25, Alexey Kardashevskiy wrote:
>> This is a followup for "[PATCH kernel v4 0/6] vfio-pci: Add support for mmapping MSI-X table"
>> http://www.spinics.net/lists/kvm/msg152232.html
>>
>> This time it is using "caps" in IOMMU groups. The main question is whether
>> PCI bus flags or IOMMU domains would still be better (and which one).
> 
>>
>>
>>
>> Here is some background:
>>
>> The current vfio-pci implementation disallows mmapping the page
>> containing the MSI-X table, because users could write directly
>> to the MSI-X table and generate incorrect MSIs.
>>
>> However, this causes a performance issue when critical device
>> registers sit in the same page as the MSI-X table: MMIO accesses
>> to those registers have to be handled by QEMU emulation rather
>> than in the guest.
>>
>> To solve this issue, this series allows exposing the MSI-X table
>> to userspace when the hardware provides interrupt remapping, which
>> ensures that a given PCI device can only fire the MSIs assigned
>> to it. We also introduce a new bus flag, PCI_BUS_FLAGS_MSI_REMAP,
>> to test this capability on the PCI side across different archs.
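[Editor's note] The restriction being relaxed here boils down to a page-granularity overlap check; a rough userspace sketch (the function name and fixed `PAGE_SIZE` are illustrative, not the series' actual code):

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096UL	/* illustrative; real code uses the arch page size */

/* Does a requested mmap range [req_off, req_off + req_len) of a BAR
 * touch any page occupied by the MSI-X table?  Mappings are page
 * granular, so the table's extent is rounded out to page boundaries. */
static bool range_overlaps_msix(uint64_t req_off, uint64_t req_len,
				uint64_t table_off, uint64_t table_len)
{
	uint64_t tbl_start = table_off & ~(PAGE_SIZE - 1);
	uint64_t tbl_end = (table_off + table_len + PAGE_SIZE - 1)
			   & ~(PAGE_SIZE - 1);

	return req_off < tbl_end && req_off + req_len > tbl_start;
}
```

Today vfio-pci refuses such a mapping outright; the series lets it through when the IOMMU guarantees MSI isolation.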
>>
>>
>> This is based on sha1
>> 26c5cebfdb6c "Merge branch 'parisc-4.13-4' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux"
>>
>> Please comment. Thanks.
>>
>> Changelog:
>>
>> v5:
>> * redid the whole thing via so-called IOMMU group capabilities
>>
>> v4:
>> * rebased on recent upstream
>> * got all 6 patches from v2 (v3 was missing some)
>>
>>
>>
>>
>> Alexey Kardashevskiy (5):
>>   iommu: Add capabilities to a group
>>   iommu: Set IOMMU_GROUP_CAP_ISOLATE_MSIX if MSI controller enables IRQ
>>     remapping
>>   iommu/intel/amd: Set IOMMU_GROUP_CAP_ISOLATE_MSIX if IRQ remapping is
>>     enabled
>>   powerpc/iommu: Set IOMMU_GROUP_CAP_ISOLATE_MSIX
>>   vfio-pci: Allow to expose MSI-X table to userspace when safe
>>
>>  include/linux/iommu.h            | 20 ++++++++++++++++++++
>>  include/linux/vfio.h             |  1 +
>>  arch/powerpc/kernel/iommu.c      |  1 +
>>  drivers/iommu/amd_iommu.c        |  3 +++
>>  drivers/iommu/intel-iommu.c      |  3 +++
>>  drivers/iommu/iommu.c            | 35 +++++++++++++++++++++++++++++++++++
>>  drivers/vfio/pci/vfio_pci.c      | 20 +++++++++++++++++---
>>  drivers/vfio/pci/vfio_pci_rdwr.c |  5 ++++-
>>  drivers/vfio/vfio.c              | 15 +++++++++++++++
>>  9 files changed, 99 insertions(+), 4 deletions(-)
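[Editor's note] The shape of the gate the series adds, as suggested by the patch titles: the IOMMU layer sets an ISOLATE_MSIX capability on the group, and vfio-pci consults it before exposing the MSI-X table page. A toy model (the bit value, struct, and function names here are illustrative; the real definitions are in the series itself):

```c
#include <stdbool.h>
#include <stdint.h>

#define IOMMU_GROUP_CAP_ISOLATE_MSIX (1u << 0)	/* illustrative bit value */

struct iommu_group_model {
	uint32_t caps;	/* set by the IOMMU driver, e.g. when IRQ remapping is on */
};

/* vfio-pci's decision: expose the MSI-X table page only when the
 * group guarantees a device can fire only its own assigned MSIs. */
static bool msix_mmap_allowed(const struct iommu_group_model *grp)
{
	return grp->caps & IOMMU_GROUP_CAP_ISOLATE_MSIX;
}
```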
>>
> 
> 


-- 
Alexey

^ permalink raw reply	[flat|nested] 82+ messages in thread


* Re: [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table
  2017-08-21  2:47     ` Alexey Kardashevskiy
  (?)
@ 2017-08-29  2:58       ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 82+ messages in thread
From: Alexey Kardashevskiy @ 2017-08-29  2:58 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: David Gibson, kvm-ppc, kvm, Yongji Xie, Eric Auger,
	Kyle Mahlkuch, Alex Williamson, Jike Song, Bjorn Helgaas,
	Robin Murphy, Joerg Roedel, Arvind Yadav, Benjamin Herrenschmidt,
	David Woodhouse, Kirti Wankhede, Mauricio Faria de Oliveira,
	Neo Jia, Paul Mackerras, Vlad Tsyrklevich, iommu, linux-kernel

On 21/08/17 12:47, Alexey Kardashevskiy wrote:
> Folks,
> 
> Ok, people did talk, exchanged ideas, lovely :) What happens now? Do I
> repost this or go back to PCI bus flags or something else? Thanks.


Anyone, any help? How do we proceed with this? Thanks.



> 
> 
> 
> On 14/08/17 19:45, Alexey Kardashevskiy wrote:
>> Folks,
>>
>> Is there anything to change besides those compiler errors and David's
>> comment in 5/5? Or is the whole patchset too bad? Thanks.
>>
>>
>>
>> On 07/08/17 17:25, Alexey Kardashevskiy wrote:
>>> This is a followup for "[PATCH kernel v4 0/6] vfio-pci: Add support for mmapping MSI-X table"
>>> http://www.spinics.net/lists/kvm/msg152232.html
>>>
>>> This time it is using "caps" in IOMMU groups. The main question is whether
>>> PCI bus flags or IOMMU domains would still be better (and which one).
>>
>>>
>>>
>>>
>>> Here is some background:
>>>
>>> The current vfio-pci implementation disallows mmapping the page
>>> containing the MSI-X table, because users could write directly
>>> to the MSI-X table and generate incorrect MSIs.
>>>
>>> However, this causes a performance issue when critical device
>>> registers sit in the same page as the MSI-X table: MMIO accesses
>>> to those registers have to be handled by QEMU emulation rather
>>> than in the guest.
>>>
>>> To solve this issue, this series allows exposing the MSI-X table
>>> to userspace when the hardware provides interrupt remapping, which
>>> ensures that a given PCI device can only fire the MSIs assigned
>>> to it. We also introduce a new bus flag, PCI_BUS_FLAGS_MSI_REMAP,
>>> to test this capability on the PCI side across different archs.
>>>
>>>
>>> This is based on sha1
>>> 26c5cebfdb6c "Merge branch 'parisc-4.13-4' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux"
>>>
>>> Please comment. Thanks.
>>>
>>> Changelog:
>>>
>>> v5:
>>> * redid the whole thing via so-called IOMMU group capabilities
>>>
>>> v4:
>>> * rebased on recent upstream
>>> * got all 6 patches from v2 (v3 was missing some)
>>>
>>>
>>>
>>>
>>> Alexey Kardashevskiy (5):
>>>   iommu: Add capabilities to a group
>>>   iommu: Set IOMMU_GROUP_CAP_ISOLATE_MSIX if MSI controller enables IRQ
>>>     remapping
>>>   iommu/intel/amd: Set IOMMU_GROUP_CAP_ISOLATE_MSIX if IRQ remapping is
>>>     enabled
>>>   powerpc/iommu: Set IOMMU_GROUP_CAP_ISOLATE_MSIX
>>>   vfio-pci: Allow to expose MSI-X table to userspace when safe
>>>
>>>  include/linux/iommu.h            | 20 ++++++++++++++++++++
>>>  include/linux/vfio.h             |  1 +
>>>  arch/powerpc/kernel/iommu.c      |  1 +
>>>  drivers/iommu/amd_iommu.c        |  3 +++
>>>  drivers/iommu/intel-iommu.c      |  3 +++
>>>  drivers/iommu/iommu.c            | 35 +++++++++++++++++++++++++++++++++++
>>>  drivers/vfio/pci/vfio_pci.c      | 20 +++++++++++++++++---
>>>  drivers/vfio/pci/vfio_pci_rdwr.c |  5 ++++-
>>>  drivers/vfio/vfio.c              | 15 +++++++++++++++
>>>  9 files changed, 99 insertions(+), 4 deletions(-)
>>>
>>
>>
> 
> 


-- 
Alexey

^ permalink raw reply	[flat|nested] 82+ messages in thread


* Re: [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table
@ 2017-09-11  3:27         ` Alexey Kardashevskiy
  0 siblings, 0 replies; 82+ messages in thread
From: Alexey Kardashevskiy @ 2017-09-11  3:27 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: David Gibson, kvm-ppc, kvm, Yongji Xie, Eric Auger,
	Kyle Mahlkuch, Alex Williamson, Jike Song, Bjorn Helgaas,
	Robin Murphy, Joerg Roedel, Arvind Yadav, Benjamin Herrenschmidt,
	David Woodhouse, Kirti Wankhede, Mauricio Faria de Oliveira,
	Neo Jia, Paul Mackerras, Vlad Tsyrklevich, iommu, linux-kernel

On 29/08/17 12:58, Alexey Kardashevskiy wrote:
> On 21/08/17 12:47, Alexey Kardashevskiy wrote:
>> Folks,
>>
>> Ok, people did talk, exchanged ideas, lovely :) What happens now? Do I
>> repost this or go back to PCI bus flags or something else? Thanks.
> 
> 
> Anyone, any help? How do we proceed with this? Thanks.

Ping, anyone?

Or is everybody just waiting for the KVM Forum to discuss it in person? Thanks.



> 
> 
> 
>>
>>
>>
>> On 14/08/17 19:45, Alexey Kardashevskiy wrote:
>>> Folks,
>>>
>>> Is there anything to change besides those compiler errors and David's
>>> comment in 5/5? Or is the whole patchset too bad? Thanks.
>>>
>>>
>>>
>>> On 07/08/17 17:25, Alexey Kardashevskiy wrote:
>>>> This is a followup for "[PATCH kernel v4 0/6] vfio-pci: Add support for mmapping MSI-X table"
>>>> http://www.spinics.net/lists/kvm/msg152232.html
>>>>
>>>> This time it is using "caps" in IOMMU groups. The main question is whether
>>>> PCI bus flags or IOMMU domains would still be better (and which one).
>>>
>>>>
>>>>
>>>>
>>>> Here is some background:
>>>>
>>>> The current vfio-pci implementation disallows mmapping the page
>>>> containing the MSI-X table, because users could write directly
>>>> to the MSI-X table and generate incorrect MSIs.
>>>>
>>>> However, this causes a performance issue when critical device
>>>> registers sit in the same page as the MSI-X table: MMIO accesses
>>>> to those registers have to be handled by QEMU emulation rather
>>>> than in the guest.
>>>>
>>>> To solve this issue, this series allows exposing the MSI-X table
>>>> to userspace when the hardware provides interrupt remapping, which
>>>> ensures that a given PCI device can only fire the MSIs assigned
>>>> to it. We also introduce a new bus flag, PCI_BUS_FLAGS_MSI_REMAP,
>>>> to test this capability on the PCI side across different archs.
>>>>
>>>>
>>>> This is based on sha1
>>>> 26c5cebfdb6c "Merge branch 'parisc-4.13-4' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux"
>>>>
>>>> Please comment. Thanks.
>>>>
>>>> Changelog:
>>>>
>>>> v5:
>>>> * redid the whole thing via so-called IOMMU group capabilities
>>>>
>>>> v4:
>>>> * rebased on recent upstream
>>>> * got all 6 patches from v2 (v3 was missing some)
>>>>
>>>>
>>>>
>>>>
>>>> Alexey Kardashevskiy (5):
>>>>   iommu: Add capabilities to a group
>>>>   iommu: Set IOMMU_GROUP_CAP_ISOLATE_MSIX if MSI controller enables IRQ
>>>>     remapping
>>>>   iommu/intel/amd: Set IOMMU_GROUP_CAP_ISOLATE_MSIX if IRQ remapping is
>>>>     enabled
>>>>   powerpc/iommu: Set IOMMU_GROUP_CAP_ISOLATE_MSIX
>>>>   vfio-pci: Allow to expose MSI-X table to userspace when safe
>>>>
>>>>  include/linux/iommu.h            | 20 ++++++++++++++++++++
>>>>  include/linux/vfio.h             |  1 +
>>>>  arch/powerpc/kernel/iommu.c      |  1 +
>>>>  drivers/iommu/amd_iommu.c        |  3 +++
>>>>  drivers/iommu/intel-iommu.c      |  3 +++
>>>>  drivers/iommu/iommu.c            | 35 +++++++++++++++++++++++++++++++++++
>>>>  drivers/vfio/pci/vfio_pci.c      | 20 +++++++++++++++++---
>>>>  drivers/vfio/pci/vfio_pci_rdwr.c |  5 ++++-
>>>>  drivers/vfio/vfio.c              | 15 +++++++++++++++
>>>>  9 files changed, 99 insertions(+), 4 deletions(-)
>>>>
>>>
>>>
>>
>>
> 
> 


-- 
Alexey

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table
@ 2017-09-11  3:27         ` Alexey Kardashevskiy
  0 siblings, 0 replies; 82+ messages in thread
From: Alexey Kardashevskiy @ 2017-09-11  3:27 UTC (permalink / raw)
  To: linuxppc-dev-uLR06cmDAlY/bJ5BZ2RsiQ
  Cc: Vlad Tsyrklevich, kvm-u79uwXL29TY76Z2rM5mHXA,
	Benjamin Herrenschmidt, Kyle Mahlkuch, Kirti Wankhede,
	kvm-ppc-u79uwXL29TY76Z2rM5mHXA,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Yongji Xie,
	Mauricio Faria de Oliveira, Paul Mackerras, Bjorn Helgaas,
	Arvind Yadav, David Woodhouse,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, David Gibson

On 29/08/17 12:58, Alexey Kardashevskiy wrote:
> On 21/08/17 12:47, Alexey Kardashevskiy wrote:
>> Folks,
>>
>> Ok, people did talk, exchanged ideas, lovely :) What happens now? Do I
>> repost this or go back to PCI bus flags or something else? Thanks.
> 
> 
> Anyone, any help? How do we proceed with this? Thanks.

Ping, anyone?

Or everybody is just waiting for the kvm forum to discuss it in person? Thanks.



> 
> 
> 
>>
>>
>>
>> On 14/08/17 19:45, Alexey Kardashevskiy wrote:
>>> Folks,
>>>
>>> Is there anything to change besides those compiler errors and David's
>>> comment in 5/5? Or the while patchset is too bad? Thanks.
>>>
>>>
>>>
>>> On 07/08/17 17:25, Alexey Kardashevskiy wrote:
>>>> This is a followup for "[PATCH kernel v4 0/6] vfio-pci: Add support for mmapping MSI-X table"
>>>> http://www.spinics.net/lists/kvm/msg152232.html
>>>>
>>>> This time it is using "caps" in IOMMU groups. The main question is if PCI
>>>> bus flags or IOMMU domains are still better (and which one).
>>>
>>>>
>>>>
>>>>
>>>> Here is some background:
>>>>
>>>> Current vfio-pci implementation disallows to mmap the page
>>>> containing MSI-X table in case that users can write directly
>>>> to MSI-X table and generate an incorrect MSIs.
>>>>
>>>> However, this will cause some performance issue when there
>>>> are some critical device registers in the same page as the
>>>> MSI-X table. We have to handle the mmio access to these
>>>> registers in QEMU emulation rather than in guest.
>>>>
>>>> To solve this issue, this series allows to expose MSI-X table
>>>> to userspace when hardware enables the capability of interrupt
>>>> remapping which can ensure that a given PCI device can only
>>>> shoot the MSIs assigned for it. And we introduce a new bus_flags
>>>> PCI_BUS_FLAGS_MSI_REMAP to test this capability on PCI side
>>>> for different archs.
>>>>
>>>>
>>>> This is based on sha1
>>>> 26c5cebfdb6c "Merge branch 'parisc-4.13-4' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux"
>>>>
>>>> Please comment. Thanks.
>>>>
>>>> Changelog:
>>>>
>>>> v5:
>>>> * redid the whole thing via so-called IOMMU group capabilities
>>>>
>>>> v4:
>>>> * rebased on recent upstream
>>>> * got all 6 patches from v2 (v3 was missing some)
>>>>
>>>>
>>>>
>>>>
>>>> Alexey Kardashevskiy (5):
>>>>   iommu: Add capabilities to a group
>>>>   iommu: Set IOMMU_GROUP_CAP_ISOLATE_MSIX if MSI controller enables IRQ
>>>>     remapping
>>>>   iommu/intel/amd: Set IOMMU_GROUP_CAP_ISOLATE_MSIX if IRQ remapping is
>>>>     enabled
>>>>   powerpc/iommu: Set IOMMU_GROUP_CAP_ISOLATE_MSIX
>>>>   vfio-pci: Allow to expose MSI-X table to userspace when safe
>>>>
>>>>  include/linux/iommu.h            | 20 ++++++++++++++++++++
>>>>  include/linux/vfio.h             |  1 +
>>>>  arch/powerpc/kernel/iommu.c      |  1 +
>>>>  drivers/iommu/amd_iommu.c        |  3 +++
>>>>  drivers/iommu/intel-iommu.c      |  3 +++
>>>>  drivers/iommu/iommu.c            | 35 +++++++++++++++++++++++++++++++++++
>>>>  drivers/vfio/pci/vfio_pci.c      | 20 +++++++++++++++++---
>>>>  drivers/vfio/pci/vfio_pci_rdwr.c |  5 ++++-
>>>>  drivers/vfio/vfio.c              | 15 +++++++++++++++
>>>>  9 files changed, 99 insertions(+), 4 deletions(-)
>>>>
-- 
Alexey

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table
@ 2017-09-11  3:27         ` Alexey Kardashevskiy
  0 siblings, 0 replies; 82+ messages in thread
From: Alexey Kardashevskiy @ 2017-09-11  3:27 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: David Gibson, kvm-ppc, kvm, Yongji Xie, Eric Auger,
	Kyle Mahlkuch, Alex Williamson, Jike Song, Bjorn Helgaas,
	Robin Murphy, Joerg Roedel, Arvind Yadav, Benjamin Herrenschmidt,
	David Woodhouse, Kirti Wankhede, Mauricio Faria de Oliveira,
	Neo Jia, Paul Mackerras, Vlad Tsyrklevich, iommu, linux-kernel

On 29/08/17 12:58, Alexey Kardashevskiy wrote:
> On 21/08/17 12:47, Alexey Kardashevskiy wrote:
>> Folks,
>>
>> Ok, people did talk, exchanged ideas, lovely :) What happens now? Do I
>> repost this or go back to PCI bus flags or something else? Thanks.
> 
> 
> Anyone, any help? How do we proceed with this? Thanks.

Ping, anyone?

Or is everybody just waiting for the KVM Forum to discuss it in person? Thanks.



> 
> 
> 
>>
>>
>>
>> On 14/08/17 19:45, Alexey Kardashevskiy wrote:
>>> Folks,
>>>
>>> Is there anything to change besides those compiler errors and David's
>>> comment in 5/5? Or is the whole patchset too bad? Thanks.
>>>
>>>
>>>
>>> On 07/08/17 17:25, Alexey Kardashevskiy wrote:
>>>> This is a followup for "[PATCH kernel v4 0/6] vfio-pci: Add support for mmapping MSI-X table"
>>>> http://www.spinics.net/lists/kvm/msg152232.html
>>>>
>>>> This time it is using "caps" in IOMMU groups. The main question is whether
>>>> PCI bus flags or IOMMU domains would still be better (and which one).
>>>
>>>>
>>>>
>>>>
>>>> Here is some background:
>>>>
>>>> The current vfio-pci implementation disallows mmapping the page
>>>> containing the MSI-X table, since a user could write directly to
>>>> the MSI-X table and generate incorrect MSIs.
>>>>
>>>> However, this causes a performance problem when performance-critical
>>>> device registers share a page with the MSI-X table: MMIO accesses to
>>>> those registers then have to be handled in QEMU emulation rather than
>>>> going straight to the guest.
>>>>
>>>> To solve this, the series exposes the MSI-X table to userspace when
>>>> the hardware supports interrupt remapping, which ensures that a given
>>>> PCI device can only trigger the MSIs assigned to it. A new bus flag,
>>>> PCI_BUS_FLAGS_MSI_REMAP, is introduced so that different architectures
>>>> can advertise this capability on the PCI side.
>>>>
>>>>
>>>> This is based on sha1
>>>> 26c5cebfdb6c "Merge branch 'parisc-4.13-4' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux"
>>>>
>>>> Please comment. Thanks.
>>>>
>>>> Changelog:
>>>>
>>>> v5:
>>>> * redid the whole thing via so-called IOMMU group capabilities
>>>>
>>>> v4:
>>>> * rebased on recent upstream
>>>> * got all 6 patches from v2 (v3 was missing some)
>>>>
>>>>
>>>>
>>>>
>>>> Alexey Kardashevskiy (5):
>>>>   iommu: Add capabilities to a group
>>>>   iommu: Set IOMMU_GROUP_CAP_ISOLATE_MSIX if MSI controller enables IRQ
>>>>     remapping
>>>>   iommu/intel/amd: Set IOMMU_GROUP_CAP_ISOLATE_MSIX if IRQ remapping is
>>>>     enabled
>>>>   powerpc/iommu: Set IOMMU_GROUP_CAP_ISOLATE_MSIX
>>>>   vfio-pci: Allow to expose MSI-X table to userspace when safe
>>>>
>>>>  include/linux/iommu.h            | 20 ++++++++++++++++++++
>>>>  include/linux/vfio.h             |  1 +
>>>>  arch/powerpc/kernel/iommu.c      |  1 +
>>>>  drivers/iommu/amd_iommu.c        |  3 +++
>>>>  drivers/iommu/intel-iommu.c      |  3 +++
>>>>  drivers/iommu/iommu.c            | 35 +++++++++++++++++++++++++++++++++++
>>>>  drivers/vfio/pci/vfio_pci.c      | 20 +++++++++++++++++---
>>>>  drivers/vfio/pci/vfio_pci_rdwr.c |  5 ++++-
>>>>  drivers/vfio/vfio.c              | 15 +++++++++++++++
>>>>  9 files changed, 99 insertions(+), 4 deletions(-)
>>>>

-- 
Alexey


end of thread, other threads:[~2017-09-11  3:27 UTC | newest]

Thread overview: 82+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-07  7:25 [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table Alexey Kardashevskiy
2017-08-07  7:25 ` [RFC PATCH v5 1/5] iommu: Add capabilities to a group Alexey Kardashevskiy
2017-08-09  5:55   ` David Gibson
2017-08-07  7:25 ` [RFC PATCH v5 2/5] iommu: Set IOMMU_GROUP_CAP_ISOLATE_MSIX if MSI controller enables IRQ remapping Alexey Kardashevskiy
2017-08-07  7:25 ` [RFC PATCH v5 3/5] iommu/intel/amd: Set IOMMU_GROUP_CAP_ISOLATE_MSIX if IRQ remapping is enabled Alexey Kardashevskiy
2017-08-07  7:25 ` [RFC PATCH v5 4/5] powerpc/iommu: Set IOMMU_GROUP_CAP_ISOLATE_MSIX Alexey Kardashevskiy
2017-08-07  7:25 ` [RFC PATCH v5 5/5] vfio-pci: Allow to expose MSI-X table to userspace when safe Alexey Kardashevskiy
2017-08-09  6:59   ` David Gibson
2017-08-14  9:45 ` [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table Alexey Kardashevskiy
2017-08-14 13:12   ` Robin Murphy
2017-08-15  1:16     ` Jike Song
2017-08-15  1:33       ` Benjamin Herrenschmidt
2017-08-15  1:47         ` Jike Song
2017-08-15  5:38           ` Benjamin Herrenschmidt
2017-08-15 14:48         ` David Laight
2017-08-15  5:42     ` Benjamin Herrenschmidt
2017-08-15 16:37     ` Alex Williamson
2017-08-16  0:35       ` Benjamin Herrenschmidt
2017-08-16 16:56         ` Alex Williamson
2017-08-17  4:43           ` Benjamin Herrenschmidt
2017-08-17 10:56           ` David Laight
2017-08-17 19:25             ` Alex Williamson
2017-08-21  2:47   ` Alexey Kardashevskiy
2017-08-29  2:58     ` Alexey Kardashevskiy
2017-09-11  3:27       ` Alexey Kardashevskiy
