* [PATCH v6 0/5] KVM PCIe/MSI passthrough on ARM/ARM64: kernel part 3/3: vfio changes
From: Eric Auger @ 2016-04-04  8:30 UTC
  To: eric.auger, eric.auger, robin.murphy, alex.williamson,
	will.deacon, joro, tglx, jason, marc.zyngier, christoffer.dall,
	linux-arm-kernel, kvmarm, kvm
  Cc: suravee.suthikulpanit, patches, linux-kernel, Manish.Jaggi,
	Bharat.Bhushan, pranav.sawargaonkar, p.fedin, iommu,
	Jean-Philippe.Brucker, julien.grall

This series allows user-space to register a reserved IOVA domain. It
completes the kernel integration of the whole functionality on top of
parts 1 & 2.

We reuse the VFIO DMA MAP ioctl with a new flag to bridge to the
dma-reserved-iommu API. The number of IOVA pages to provision for MSI
binding is reported through the VFIO_IOMMU_GET_INFO ioctl.
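
As an illustration, here is a minimal user-space sketch of the intended
flow. It is not part of the series itself: container_fd, the IOVA base
and the window size are placeholder values; in practice the size would be
derived from the page requirement reported by VFIO_IOMMU_GET_INFO, and
the VFIO_IOMMU_INFO_REQUIRE_MSI_MAP attribute comes from patch 5:

#include <sys/ioctl.h>
#include <linux/vfio.h>

static int register_msi_iova(int container_fd)
{
	struct vfio_iommu_type1_info info = { .argsz = sizeof(info) };
	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA,
		.iova  = 0x8000000,	/* placeholder: must be IOMMU-page aligned */
		.size  = 64 * 1024,	/* placeholder: from the GET_INFO page count */
	};

	if (ioctl(container_fd, VFIO_IOMMU_GET_INFO, &info) < 0)
		return -1;

	/* nothing to do on platforms which do not require an MSI mapping */
	if (!(info.flags & VFIO_IOMMU_INFO_REQUIRE_MSI_MAP))
		return 0;

	/* vaddr and prot are ignored when MSI_RESERVED_IOVA is set */
	return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}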

vfio_iommu_type1 checks if the MSI mapping is safe when attaching the
vfio group to the container (allow_unsafe_interrupts modality).

On ARM/ARM64, the IOMMU does not abstract IRQ remapping; the modality is
abstracted on the MSI controller side. The GICv3 ITS is the first
controller advertising the modality.

More details & context can be found at:
http://www.linaro.org/blog/core-dump/kvm-pciemsi-passthrough-armarm64/

Best Regards

Eric

Testing:
- functional on ARM64 AMD Overdrive HW (single GICv2m frame) with
  x Intel e1000e PCIe card
  x Intel X540-T2 (SR-IOV capable)
- Not tested: ARM GICv3 ITS

References:
[1] [RFC 0/2] VFIO: Add virtual MSI doorbell support
    (https://lkml.org/lkml/2015/7/24/135)
[2] [RFC PATCH 0/6] vfio: Add interface to map MSI pages
    (https://lists.cs.columbia.edu/pipermail/kvmarm/2015-September/016607.html)
[3] [PATCH v2 0/3] Introduce MSI hardware mapping for VFIO
    (http://permalink.gmane.org/gmane.comp.emulators.kvm.arm.devel/3858)

Git: complete series available at
https://git.linaro.org/people/eric.auger/linux.git/shortlog/refs/heads/v4.6-rc1-pcie-passthrough-v6

previous version at
https://git.linaro.org/people/eric.auger/linux.git/shortlog/refs/heads/v4.5-rc6-pcie-passthrough-rfcv5

QEMU Integration:
[RFC v2 0/8] KVM PCI/MSI passthrough with mach-virt
(http://lists.gnu.org/archive/html/qemu-arm/2016-01/msg00444.html)
https://git.linaro.org/people/eric.auger/qemu.git/shortlog/refs/heads/v2.5.0-pci-passthrough-rfc-v2

History:

RFC v5 -> patch v6:
- split to ease the review process

RFC v4 -> RFC v5:
- take into account Thomas' comments on MSI related patches
  - split "msi: IOMMU map the doorbell address when needed"
  - increase readability and add comments
  - fix style issues
 - split "iommu: Add DOMAIN_ATTR_MSI_MAPPING attribute"
 - platform ITS now advertises IOMMU_CAP_INTR_REMAP
 - fix compilation issue with CONFIG_IOMMU API unset
 - arm-smmu-v3 now advertises DOMAIN_ATTR_MSI_MAPPING

RFC v3 -> v4:
- move doorbell mapping/unmapping into msi.c
- fix ref count issue on set_affinity: in case of a change in the address,
  the previous address's ref count is decremented
- doorbell map/unmap is now done on MSI composition, which should enable the
  platform MSI controller use case
- create dma-reserved-iommu.h/c exposing/implementing a new API dedicated
  to reserved IOVA management (looking like dma-iommu glue)
- series reordering to ease the review:
  - the first part is related to the IOMMU
  - the second to the MSI sub-system
  - the third to VFIO (except the arm-smmu IOMMU_CAP_INTR_REMAP removal)
- expose the number of requested IOVA pages through VFIO_IOMMU_GET_INFO
  [this partially addresses Marc's comments on the iommu_get/put_single_reserved
   size/alignment problem - which I did not ignore - but I don't know
   how much I can do at the moment]

RFC v2 -> RFC v3:
- should fix wrong handling of some CONFIG combinations:
  CONFIG_IOVA, CONFIG_IOMMU_API, CONFIG_PCI_MSI_IRQ_DOMAIN
- fix MSI_FLAG_IRQ_REMAPPING setting in GICv3 ITS (although not tested)

PATCH v1 -> RFC v2:
- reverted to RFC since it looks more reasonable ;-) the code is split
  between VFIO, IOMMU and the MSI controller, and I am not sure I made the
  right choices. Also the API needs further discussion.
- iova API usage in arm-smmu.c.
- the MSI controller natively programs the MSI address with either the PA or
  the IOVA; this is no longer done in the vfio-pci driver, as suggested by Alex.
- check irq remapping capability of the group

RFC v1 [2] -> PATCH v1:
- use the existing dma map/unmap ioctl interface with a flag to register a
  reserved IOVA range; use the legacy RB tree to store this special vfio_dma
- a single contiguous reserved IOVA region is now allowed
- use of an RB tree indexed by PA to store allocated reserved slots
- use of a vfio_domain iova_domain to manage IOVA allocation within the
  window provided by userspace
- vfio alloc_map/unmap_free take a vfio_group handle
- vfio_group handle is cached in vfio_pci_device
- add ref counting to bindings
- user modality enabled at the end of the series


Eric Auger (5):
  vfio: introduce VFIO_IOVA_RESERVED vfio_dma type
  vfio: allow the user to register reserved iova range for MSI mapping
  vfio/type1: also check IRQ remapping capability at msi domain
  iommu/arm-smmu: do not advertise IOMMU_CAP_INTR_REMAP
  vfio/type1: return MSI mapping requirements with VFIO_IOMMU_GET_INFO

 drivers/iommu/arm-smmu.c        |   2 +-
 drivers/vfio/vfio_iommu_type1.c | 349 +++++++++++++++++++++++++++++++++++++++-
 include/uapi/linux/vfio.h       |  14 +-
 3 files changed, 358 insertions(+), 7 deletions(-)

-- 
1.9.1

* [PATCH v6 1/5] vfio: introduce VFIO_IOVA_RESERVED vfio_dma type
From: Eric Auger @ 2016-04-04  8:30 UTC
  To: eric.auger, eric.auger, robin.murphy, alex.williamson,
	will.deacon, joro, tglx, jason, marc.zyngier, christoffer.dall,
	linux-arm-kernel, kvmarm, kvm
  Cc: suravee.suthikulpanit, patches, linux-kernel, Manish.Jaggi,
	Bharat.Bhushan, pranav.sawargaonkar, p.fedin, iommu,
	Jean-Philippe.Brucker, julien.grall

We introduce a vfio_dma type since we will need to discriminate legacy
vfio_dma's from the new reserved ones. Since the latter are not mapped at
registration time, some code paths need to be reworked: removal and replay.
For now reserved entries are simply skipped there; subsequent patches will
rework those paths.

Signed-off-by: Eric Auger <eric.auger@linaro.org>
---
 drivers/vfio/vfio_iommu_type1.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 75b24e9..c9ddbde 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -53,6 +53,15 @@ module_param_named(disable_hugepages,
 MODULE_PARM_DESC(disable_hugepages,
 		 "Disable VFIO IOMMU support for IOMMU hugepages.");
 
+enum vfio_iova_type {
+	VFIO_IOVA_USER = 0, /* standard IOVA used to map user vaddr */
+	/*
+	 * IOVA reserved to map special host physical addresses,
+	 * MSI frames for instance
+	 */
+	VFIO_IOVA_RESERVED,
+};
+
 struct vfio_iommu {
 	struct list_head	domain_list;
 	struct mutex		lock;
@@ -75,6 +84,7 @@ struct vfio_dma {
 	unsigned long		vaddr;		/* Process virtual addr */
 	size_t			size;		/* Map size (bytes) */
 	int			prot;		/* IOMMU_READ/WRITE */
+	enum vfio_iova_type	type;		/* type of IOVA */
 };
 
 struct vfio_group {
@@ -395,7 +405,8 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
 
 static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
 {
-	vfio_unmap_unpin(iommu, dma);
+	if (likely(dma->type != VFIO_IOVA_RESERVED))
+		vfio_unmap_unpin(iommu, dma);
 	vfio_unlink_dma(iommu, dma);
 	kfree(dma);
 }
@@ -671,6 +682,10 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
 		dma_addr_t iova;
 
 		dma = rb_entry(n, struct vfio_dma, node);
+
+		if (unlikely(dma->type == VFIO_IOVA_RESERVED))
+			continue;
+
 		iova = dma->iova;
 
 		while (iova < dma->iova + dma->size) {
-- 
1.9.1

* [PATCH v6 2/5] vfio: allow the user to register reserved iova range for MSI mapping
From: Eric Auger @ 2016-04-04  8:30 UTC
  To: eric.auger, eric.auger, robin.murphy, alex.williamson,
	will.deacon, joro, tglx, jason, marc.zyngier, christoffer.dall,
	linux-arm-kernel, kvmarm, kvm
  Cc: suravee.suthikulpanit, patches, linux-kernel, Manish.Jaggi,
	Bharat.Bhushan, pranav.sawargaonkar, p.fedin, iommu,
	Jean-Philippe.Brucker, julien.grall

The user is allowed to [un]register a reserved IOVA range by using the
DMA MAP API and setting the new flag VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA,
providing the base address and the size. This region is stored in the
vfio_dma RB tree. At that point the IOVA range is not mapped to any target
address yet. The host kernel will use those IOVAs when needed, typically
when the VFIO-PCI device allocates its MSIs.

This patch also handles the destruction of the reserved binding RB tree
and the domains' iova_domains.
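
For illustration, a matching user-space sketch of the unregistration path
(not part of the patch; container_fd and the iova/size placeholders must
match the region registered earlier):

#include <sys/ioctl.h>
#include <linux/vfio.h>

static int unregister_msi_iova(int container_fd)
{
	struct vfio_iommu_type1_dma_unmap unmap = {
		.argsz = sizeof(unmap),
		.flags = VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA,
		.iova  = 0x8000000,	/* placeholder: registered base */
		.size  = 64 * 1024,	/* placeholder: registered size */
	};

	/* on success the kernel returns the unmapped size in unmap.size */
	return ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);
}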

Signed-off-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>

---
v3 -> v4:
- use iommu_alloc/free_reserved_iova_domain exported by dma-reserved-iommu
- protect vfio_register_reserved_iova_range implementation with
  CONFIG_IOMMU_DMA_RESERVED
- handle unregistration by user-space and on vfio_iommu_type1 release

v1 -> v2:
- set returned value according to alloc_reserved_iova_domain result
- free the iova domains in case any error occurs

RFC v1 -> v1:
- takes into account Alex's comments, based on
  [RFC PATCH 1/6] vfio: Add interface for add/del reserved iova region:
- use the existing dma map/unmap ioctl interface with a flag to register
  a reserved IOVA range. A single reserved iova region is allowed.

Conflicts:
	drivers/vfio/vfio_iommu_type1.c
---
 drivers/vfio/vfio_iommu_type1.c | 141 +++++++++++++++++++++++++++++++++++++++-
 include/uapi/linux/vfio.h       |  12 +++-
 2 files changed, 150 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index c9ddbde..4497b20 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -36,6 +36,7 @@
 #include <linux/uaccess.h>
 #include <linux/vfio.h>
 #include <linux/workqueue.h>
+#include <linux/dma-reserved-iommu.h>
 
 #define DRIVER_VERSION  "0.2"
 #define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
@@ -403,10 +404,22 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
 	vfio_lock_acct(-unlocked);
 }
 
+static void vfio_unmap_reserved(struct vfio_iommu *iommu)
+{
+#ifdef CONFIG_IOMMU_DMA_RESERVED
+	struct vfio_domain *d;
+
+	list_for_each_entry(d, &iommu->domain_list, next)
+		iommu_unmap_reserved(d->domain);
+#endif
+}
+
 static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
 {
 	if (likely(dma->type != VFIO_IOVA_RESERVED))
 		vfio_unmap_unpin(iommu, dma);
+	else
+		vfio_unmap_reserved(iommu);
 	vfio_unlink_dma(iommu, dma);
 	kfree(dma);
 }
@@ -489,7 +502,8 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 	 */
 	if (iommu->v2) {
 		dma = vfio_find_dma(iommu, unmap->iova, 0);
-		if (dma && dma->iova != unmap->iova) {
+		if (dma && (dma->iova != unmap->iova ||
+			   (dma->type == VFIO_IOVA_RESERVED))) {
 			ret = -EINVAL;
 			goto unlock;
 		}
@@ -501,6 +515,10 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 	}
 
 	while ((dma = vfio_find_dma(iommu, unmap->iova, unmap->size))) {
+		if (dma->type == VFIO_IOVA_RESERVED) {
+			ret = -EINVAL;
+			goto unlock;
+		}
 		if (!iommu->v2 && unmap->iova > dma->iova)
 			break;
 		unmapped += dma->size;
@@ -650,6 +668,114 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	return ret;
 }
 
+static int vfio_register_reserved_iova_range(struct vfio_iommu *iommu,
+			   struct vfio_iommu_type1_dma_map *map)
+{
+#ifdef CONFIG_IOMMU_DMA_RESERVED
+	dma_addr_t iova = map->iova;
+	size_t size = map->size;
+	uint64_t mask;
+	struct vfio_dma *dma;
+	int ret = 0;
+	struct vfio_domain *d;
+	unsigned long order;
+
+	/* Verify that none of our __u64 fields overflow */
+	if (map->size != size || map->iova != iova)
+		return -EINVAL;
+
+	order =  __ffs(vfio_pgsize_bitmap(iommu));
+	mask = ((uint64_t)1 << order) - 1;
+
+	WARN_ON(mask & PAGE_MASK);
+
+	if (!size || (size | iova) & mask)
+		return -EINVAL;
+
+	/* Don't allow IOVA address wrap */
+	if (iova + size - 1 < iova)
+		return -EINVAL;
+
+	mutex_lock(&iommu->lock);
+
+	if (vfio_find_dma(iommu, iova, size)) {
+		ret =  -EEXIST;
+		goto out;
+	}
+
+	dma = kzalloc(sizeof(*dma), GFP_KERNEL);
+	if (!dma) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	dma->iova = iova;
+	dma->size = size;
+	dma->type = VFIO_IOVA_RESERVED;
+
+	list_for_each_entry(d, &iommu->domain_list, next)
+		ret |= iommu_alloc_reserved_iova_domain(d->domain, iova,
+							size, order);
+
+	if (ret) {
+		list_for_each_entry(d, &iommu->domain_list, next)
+			iommu_free_reserved_iova_domain(d->domain);
+		goto out;
+	}
+
+	vfio_link_dma(iommu, dma);
+
+out:
+	mutex_unlock(&iommu->lock);
+	return ret;
+#else /* CONFIG_IOMMU_DMA_RESERVED */
+	return -ENODEV;
+#endif
+}
+
+static void vfio_unregister_reserved_iova_range(struct vfio_iommu *iommu,
+				struct vfio_iommu_type1_dma_unmap *unmap)
+{
+#ifdef CONFIG_IOMMU_DMA_RESERVED
+	dma_addr_t iova = unmap->iova;
+	struct vfio_dma *dma;
+	size_t size = unmap->size;
+	uint64_t mask;
+	unsigned long order;
+
+	/* Verify that none of our __u64 fields overflow */
+	if (unmap->size != size || unmap->iova != iova)
+		return;
+
+	order =  __ffs(vfio_pgsize_bitmap(iommu));
+	mask = ((uint64_t)1 << order) - 1;
+
+	WARN_ON(mask & PAGE_MASK);
+
+	if (!size || (size | iova) & mask)
+		return;
+
+	/* Don't allow IOVA address wrap */
+	if (iova + size - 1 < iova)
+		return;
+
+	mutex_lock(&iommu->lock);
+
+	dma = vfio_find_dma(iommu, iova, size);
+
+	if (!dma || (dma->type != VFIO_IOVA_RESERVED)) {
+		unmap->size = 0;
+		goto out;
+	}
+
+	unmap->size =  dma->size;
+	vfio_remove_dma(iommu, dma);
+
+out:
+	mutex_unlock(&iommu->lock);
+#endif
+}
+
 static int vfio_bus_type(struct device *dev, void *data)
 {
 	struct bus_type **bus = data;
@@ -946,6 +1072,7 @@ static void vfio_iommu_type1_release(void *iommu_data)
 	struct vfio_group *group, *group_tmp;
 
 	vfio_iommu_unmap_unpin_all(iommu);
+	vfio_unmap_reserved(iommu);
 
 	list_for_each_entry_safe(domain, domain_tmp,
 				 &iommu->domain_list, next) {
@@ -1020,7 +1147,8 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
 	} else if (cmd == VFIO_IOMMU_MAP_DMA) {
 		struct vfio_iommu_type1_dma_map map;
 		uint32_t mask = VFIO_DMA_MAP_FLAG_READ |
-				VFIO_DMA_MAP_FLAG_WRITE;
+				VFIO_DMA_MAP_FLAG_WRITE |
+				VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA;
 
 		minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
 
@@ -1030,6 +1158,9 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
 		if (map.argsz < minsz || map.flags & ~mask)
 			return -EINVAL;
 
+		if (map.flags & VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA)
+			return vfio_register_reserved_iova_range(iommu, &map);
+
 		return vfio_dma_do_map(iommu, &map);
 
 	} else if (cmd == VFIO_IOMMU_UNMAP_DMA) {
@@ -1044,10 +1175,16 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
 		if (unmap.argsz < minsz || unmap.flags)
 			return -EINVAL;
 
+		if (unmap.flags & VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA) {
+			vfio_unregister_reserved_iova_range(iommu, &unmap);
+			goto out;
+		}
+
 		ret = vfio_dma_do_unmap(iommu, &unmap);
 		if (ret)
 			return ret;
 
+out:
 		return copy_to_user((void __user *)arg, &unmap, minsz) ?
 			-EFAULT : 0;
 	}
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 255a211..a49be8a 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -498,12 +498,21 @@ struct vfio_iommu_type1_info {
  *
  * Map process virtual addresses to IO virtual addresses using the
  * provided struct vfio_dma_map. Caller sets argsz. READ &/ WRITE required.
+ *
+ * In case the MSI_RESERVED_IOVA flag is set, the API only aims at registering
+ * an IOVA region which will be used on some platforms to map the host MSI
+ * frame. In that specific case, vaddr and prot are ignored. The requirement
+ * for provisioning such an IOVA range can be checked by calling
+ * VFIO_IOMMU_GET_INFO with the VFIO_IOMMU_INFO_REQUIRE_MSI_MAP attribute.
+ * A single MSI_RESERVED_IOVA region can be registered.
  */
 struct vfio_iommu_type1_dma_map {
 	__u32	argsz;
 	__u32	flags;
 #define VFIO_DMA_MAP_FLAG_READ (1 << 0)		/* readable from device */
 #define VFIO_DMA_MAP_FLAG_WRITE (1 << 1)	/* writable from device */
+/* reserved IOVA for MSI vectors */
+#define VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA (1 << 2)
 	__u64	vaddr;				/* Process virtual address */
 	__u64	iova;				/* IO virtual address */
 	__u64	size;				/* Size of mapping (bytes) */
@@ -519,7 +528,8 @@ struct vfio_iommu_type1_dma_map {
  * Caller sets argsz.  The actual unmapped size is returned in the size
  * field.  No guarantee is made to the user that arbitrary unmaps of iova
  * or size different from those used in the original mapping call will
- * succeed.
+ * succeed. A reserved IOVA region must be unmapped with the
+ * MSI_RESERVED_IOVA flag set.
  */
 struct vfio_iommu_type1_dma_unmap {
 	__u32	argsz;
-- 
1.9.1

* [PATCH v6 3/5] vfio/type1: also check IRQ remapping capability at msi domain
From: Eric Auger @ 2016-04-04  8:30 UTC
  To: eric.auger, eric.auger, robin.murphy, alex.williamson,
	will.deacon, joro, tglx, jason, marc.zyngier, christoffer.dall,
	linux-arm-kernel, kvmarm, kvm
  Cc: suravee.suthikulpanit, patches, linux-kernel, Manish.Jaggi,
	Bharat.Bhushan, pranav.sawargaonkar, p.fedin, iommu,
	Jean-Philippe.Brucker, julien.grall

On x86, IRQ remapping is abstracted by the IOMMU. On ARM it is abstracted
by the MSI controller. vfio_safe_irq_domain checks whether interrupts are
"safe" for a given device: they are if the device does not use MSI, or if
the device uses MSI and the msi-parent controller supports IRQ remapping.

Then we check at group level whether all devices have safe interrupts: if
not, we only allow the group to be attached if allow_unsafe_interrupts is
set.

At this point the ARM SMMU still advertises IOMMU_CAP_INTR_REMAP. This is
changed in the next patch.
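
For illustration, below is a sketch of how an MSI controller driver would
advertise the capability under this scheme, modelled on the GICv3 ITS
PCI/MSI domain (the first controller advertising the modality, per the
cover letter). Only MSI_FLAG_IRQ_REMAPPING is specific to this series;
the other flags and the ops/chip names are assumed from the ITS driver
rather than taken from these patches:

static struct msi_domain_info its_pci_msi_domain_info = {
	/* MSI_FLAG_IRQ_REMAPPING marks this domain as MSI-isolation safe */
	.flags	= MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS |
		  MSI_FLAG_MULTI_PCI_MSI | MSI_FLAG_PCI_MSIX |
		  MSI_FLAG_IRQ_REMAPPING,
	.ops	= &its_pci_msi_ops,
	.chip	= &its_msi_irq_chip,
};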

Signed-off-by: Eric Auger <eric.auger@linaro.org>

---
v3 -> v4:
- rename vfio_msi_parent_irq_remapping_capable into vfio_safe_irq_domain
  and irq_remapping into safe_irq_domains

v2 -> v3:
- protect vfio_msi_parent_irq_remapping_capable with
  CONFIG_GENERIC_MSI_IRQ_DOMAIN
---
 drivers/vfio/vfio_iommu_type1.c | 44 +++++++++++++++++++++++++++++++++++++++--
 1 file changed, 42 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 4497b20..b330b81 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -37,6 +37,8 @@
 #include <linux/vfio.h>
 #include <linux/workqueue.h>
 #include <linux/dma-reserved-iommu.h>
+#include <linux/irqdomain.h>
+#include <linux/msi.h>
 
 #define DRIVER_VERSION  "0.2"
 #define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
@@ -788,6 +790,33 @@ static int vfio_bus_type(struct device *dev, void *data)
 	return 0;
 }
 
+/**
+ * vfio_safe_irq_domain: returns whether the irq domain
+ * the device is attached to is safe with respect to MSI isolation.
+ * If the irq domain is not an MSI domain, we return it is safe.
+ *
+ * @dev: device handle
+ * @data: unused
+ * returns 0 if the irq domain is safe, -1 if not.
+ */
+static int vfio_safe_irq_domain(struct device *dev, void *data)
+{
+#ifdef CONFIG_GENERIC_MSI_IRQ_DOMAIN
+	struct irq_domain *domain;
+	struct msi_domain_info *info;
+
+	domain = dev_get_msi_domain(dev);
+	if (!domain)
+		return 0;
+
+	info = msi_get_domain_info(domain);
+
+	if (!(info->flags & MSI_FLAG_IRQ_REMAPPING))
+		return -1;
+#endif
+	return 0;
+}
+
 static int vfio_iommu_replay(struct vfio_iommu *iommu,
 			     struct vfio_domain *domain)
 {
@@ -882,7 +911,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 	struct vfio_group *group, *g;
 	struct vfio_domain *domain, *d;
 	struct bus_type *bus = NULL;
-	int ret;
+	int ret, safe_irq_domains;
 
 	mutex_lock(&iommu->lock);
 
@@ -905,6 +934,13 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 
 	group->iommu_group = iommu_group;
 
+	/*
+	 * Determine if all the devices of the group have a safe irq domain
+	 * with respect to MSI isolation
+	 */
+	safe_irq_domains = !iommu_group_for_each_dev(iommu_group, &bus,
+				       vfio_safe_irq_domain);
+
 	/* Determine bus_type in order to allocate a domain */
 	ret = iommu_group_for_each_dev(iommu_group, &bus, vfio_bus_type);
 	if (ret)
@@ -932,8 +968,12 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 	INIT_LIST_HEAD(&domain->group_list);
 	list_add(&group->next, &domain->group_list);
 
+	/*
+	 * to advertise safe interrupts either the IOMMU or the MSI controllers
+	 * must support IRQ remapping/interrupt translation
+	 */
 	if (!allow_unsafe_interrupts &&
-	    !iommu_capable(bus, IOMMU_CAP_INTR_REMAP)) {
+	    (!iommu_capable(bus, IOMMU_CAP_INTR_REMAP) && !safe_irq_domains)) {
 		pr_warn("%s: No interrupt remapping support.  Use the module param \"allow_unsafe_interrupts\" to enable VFIO IOMMU support on this platform\n",
 		       __func__);
 		ret = -EPERM;
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v6 4/5] iommu/arm-smmu: do not advertise IOMMU_CAP_INTR_REMAP
  2016-04-04  8:30 ` Eric Auger
@ 2016-04-04  8:30   ` Eric Auger
  -1 siblings, 0 replies; 48+ messages in thread
From: Eric Auger @ 2016-04-04  8:30 UTC (permalink / raw)
  To: eric.auger, eric.auger, robin.murphy, alex.williamson,
	will.deacon, joro, tglx, jason, marc.zyngier, christoffer.dall,
	linux-arm-kernel, kvmarm, kvm
  Cc: suravee.suthikulpanit, patches, linux-kernel, Manish.Jaggi,
	Bharat.Bhushan, pranav.sawargaonkar, p.fedin, iommu,
	Jean-Philippe.Brucker, julien.grall

Do not advertise IOMMU_CAP_INTR_REMAP for arm-smmu: on ARM the IRQ
remapping capability is abstracted on the irqchip side, unlike the
Intel IOMMU which features IRQ remapping HW.

So, to check the IRQ remapping capability, the MSI domain needs to be
checked instead.

This commit needs to be applied after "vfio/type1: also check IRQ
remapping capability at msi domain"; otherwise legacy interrupt
assignment gets broken with arm-smmu.
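
For clarity, the attach-time policy that results once both patches are
applied, condensed from the previous patch (names as in
vfio_iommu_type1.c); this is a summary sketch, not new code:

	/*
	 * Interrupts are safe if either the IOMMU remaps them (x86) or
	 * every device in the group sits behind a translating MSI
	 * domain (e.g. GICv3 ITS). arm-smmu now returns false for
	 * IOMMU_CAP_INTR_REMAP, so only the MSI domain path applies.
	 */
	if (!allow_unsafe_interrupts &&
	    !iommu_capable(bus, IOMMU_CAP_INTR_REMAP) && !safe_irq_domains)
		ret = -EPERM;	/* the group attach is rejected */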

Signed-off-by: Eric Auger <eric.auger@linaro.org>
---
 drivers/iommu/arm-smmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index 6562752..83d5200 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -1304,7 +1304,7 @@ static bool arm_smmu_capable(enum iommu_cap cap)
 		 */
 		return true;
 	case IOMMU_CAP_INTR_REMAP:
-		return true; /* MSIs are just memory writes */
+		return false; /* interrupt translation handled at MSI controller level */
 	case IOMMU_CAP_NOEXEC:
 		return true;
 	default:
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v6 5/5] vfio/type1: return MSI mapping requirements with VFIO_IOMMU_GET_INFO
@ 2016-04-04  8:30   ` Eric Auger
  0 siblings, 0 replies; 48+ messages in thread
From: Eric Auger @ 2016-04-04  8:30 UTC (permalink / raw)
  To: eric.auger, eric.auger, robin.murphy, alex.williamson,
	will.deacon, joro, tglx, jason, marc.zyngier, christoffer.dall,
	linux-arm-kernel, kvmarm, kvm
  Cc: suravee.suthikulpanit, patches, linux-kernel, Manish.Jaggi,
	Bharat.Bhushan, pranav.sawargaonkar, p.fedin, iommu,
	Jean-Philippe.Brucker, julien.grall

This patch lets user space know whether MSI addresses need to be mapped
in the IOMMU. User space calls the VFIO_IOMMU_GET_INFO ioctl and
VFIO_IOMMU_INFO_REQUIRE_MSI_MAP gets set in the flags if they do.

Also the number of IOMMU pages required to map those addresses is
returned in the msi_iova_pages field. User space must use this
information to allocate a contiguous IOVA region of msi_iova_pages
times the minimum supported IOMMU page size (the least significant bit
set in iova_pgsizes) and register it with the VFIO_IOMMU_MAP_DMA ioctl
(VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA set), as sketched below.
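
For illustration, a minimal user-space sketch of this flow (error
handling trimmed; the base IOVA is an arbitrary user-space choice, and
the flag and field names assume the uapi additions from patches 2/5 and
5/5):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

#define MSI_IOVA_BASE	0x8000000ULL	/* arbitrary, picked by user space */

int reserve_msi_iova(int container)
{
	struct vfio_iommu_type1_info info = { .argsz = sizeof(info) };

	if (ioctl(container, VFIO_IOMMU_GET_INFO, &info))
		return -1;

	if (!(info.flags & VFIO_IOMMU_INFO_REQUIRE_MSI_MAP))
		return 0;	/* no MSI mapping needed, e.g. x86 */

	/* minimum supported IOMMU page size: lowest bit of the bitmap */
	uint64_t pgsize = info.iova_pgsizes & ~(info.iova_pgsizes - 1);

	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA,
		.iova  = MSI_IOVA_BASE,
		.size  = info.msi_iova_pages * pgsize,
	};

	return ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
}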

Signed-off-by: Eric Auger <eric.auger@linaro.org>

---

Currently it is assumed that a single doorbell page is used per MSI
controller. This is the case for the known ARM MSI controllers (GICv2M,
GICv3 ITS, ...). If an MSI controller were to expose more doorbells it
could implement a new callback at the irq_chip interface, as sketched
below.
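
A purely hypothetical sketch of such a callback, for illustration only
(neither the callback nor its name exists in this series):

	/* hypothetical irq_chip extension */
	unsigned int (*get_doorbell_pages)(struct irq_chip *chip);

	/* vfio_dev_compute_msi_map_info() could then do, instead of += 1: */
	msi_info->iova_pages += chip->get_doorbell_pages ?
				chip->get_doorbell_pages(chip) : 1;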

v4 -> v5:
- move msi_info and ret declaration within the conditional code

v3 -> v4:
- replace former vfio_domains_require_msi_mapping by
  more complex computation of MSI mapping requirements, especially the
  number of pages to be provided by the user-space.
- reword patch title

RFC v1 -> v1:
- derived from
  [RFC PATCH 3/6] vfio: Extend iommu-info to return MSIs automap state
- renamed allow_msi_reconfig into require_msi_mapping
- fixed VFIO_IOMMU_GET_INFO
---
 drivers/vfio/vfio_iommu_type1.c | 147 ++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/vfio.h       |   2 +
 2 files changed, 149 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index b330b81..f1def50 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -39,6 +39,7 @@
 #include <linux/dma-reserved-iommu.h>
 #include <linux/irqdomain.h>
 #include <linux/msi.h>
+#include <linux/irq.h>
 
 #define DRIVER_VERSION  "0.2"
 #define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
@@ -95,6 +96,17 @@ struct vfio_group {
 	struct list_head	next;
 };
 
+struct vfio_irq_chip {
+	struct list_head next;
+	struct irq_chip *chip;
+};
+
+struct vfio_msi_map_info {
+	bool mapping_required;
+	unsigned int iova_pages;
+	struct list_head irq_chip_list;
+};
+
 /*
  * This code handles mapping and unmapping of user data buffers
  * into DMA'ble space using the IOMMU
@@ -267,6 +279,127 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
 	return ret;
 }
 
+#if defined(CONFIG_GENERIC_MSI_IRQ_DOMAIN) && defined(CONFIG_IOMMU_DMA_RESERVED)
+/**
+ * vfio_dev_compute_msi_map_info: augment MSI mapping info (@data) with
+ * the @dev device requirements.
+ *
+ * @dev: device handle
+ * @data: opaque pointing to a struct vfio_msi_map_info
+ *
+ * returns 0 upon success or -ENOMEM
+ */
+static int vfio_dev_compute_msi_map_info(struct device *dev, void *data)
+{
+	struct irq_domain *domain;
+	struct msi_domain_info *info;
+	struct vfio_msi_map_info *msi_info = (struct vfio_msi_map_info *)data;
+	struct irq_chip *chip;
+	struct vfio_irq_chip *iter, *new;
+
+	domain = dev_get_msi_domain(dev);
+	if (!domain)
+		return 0;
+
+	/* Let's compute the needs for the MSI domain */
+	info = msi_get_domain_info(domain);
+	chip = info->chip;
+	list_for_each_entry(iter, &msi_info->irq_chip_list, next) {
+		if (iter->chip == chip)
+			return 0;
+	}
+
+	new = kzalloc(sizeof(*new), GFP_KERNEL);
+	if (!new)
+		return -ENOMEM;
+
+	new->chip = chip;
+
+	list_add(&new->next, &msi_info->irq_chip_list);
+
+	/*
+	 * new irq_chip to be taken into account; we currently assume
+	 * a single IOVA doorbell per irq_chip requesting MSI mapping
+	 */
+	msi_info->iova_pages += 1;
+	return 0;
+}
+
+/**
+ * vfio_domain_compute_msi_map_info: compute MSI mapping requirements (@data)
+ * for vfio_domain @d
+ *
+ * @d: vfio domain handle
+ * @data: opaque pointing to a struct vfio_msi_map_info
+ *
+ * returns 0 upon success or -ENOMEM
+ */
+static int vfio_domain_compute_msi_map_info(struct vfio_domain *d, void *data)
+{
+	int ret = 0;
+	struct vfio_msi_map_info *msi_info = (struct vfio_msi_map_info *)data;
+	struct vfio_irq_chip *iter, *tmp;
+	struct vfio_group *g;
+
+	msi_info->iova_pages = 0;
+	INIT_LIST_HEAD(&msi_info->irq_chip_list);
+
+	if (iommu_domain_get_attr(d->domain,
+				   DOMAIN_ATTR_MSI_MAPPING, NULL))
+		return 0;
+	msi_info->mapping_required = true;
+	list_for_each_entry(g, &d->group_list, next) {
+		ret = iommu_group_for_each_dev(g->iommu_group, msi_info,
+			   vfio_dev_compute_msi_map_info);
+		if (ret)
+			goto out;
+	}
+out:
+	list_for_each_entry_safe(iter, tmp, &msi_info->irq_chip_list, next) {
+		list_del(&iter->next);
+		kfree(iter);
+	}
+	return ret;
+}
+
+/**
+ * vfio_compute_msi_map_info: compute MSI mapping requirements
+ *
+ * Determine whether some MSI addresses need to be mapped, the common
+ * IOMMU page sizes, and the max number of IOVA pages needed by any
+ * domain to map MSIs.
+ *
+ * @iommu: iommu handle
+ * @info: msi map info handle
+ *
+ * returns 0 upon success or -ENOMEM
+ */
+static int vfio_compute_msi_map_info(struct vfio_iommu *iommu,
+				 struct vfio_msi_map_info *msi_info)
+{
+	int ret = 0;
+	struct vfio_domain *d;
+	unsigned long bitmap = ULONG_MAX;
+	unsigned int iova_pages = 0;
+
+	msi_info->mapping_required = false;
+
+	mutex_lock(&iommu->lock);
+	list_for_each_entry(d, &iommu->domain_list, next) {
+		bitmap &= d->domain->ops->pgsize_bitmap;
+		ret = vfio_domain_compute_msi_map_info(d, msi_info);
+		if (ret)
+			goto out;
+		if (msi_info->iova_pages > iova_pages)
+			iova_pages = msi_info->iova_pages;
+	}
+out:
+	msi_info->iova_pages = iova_pages;
+	mutex_unlock(&iommu->lock);
+	return ret;
+}
+
+#endif
+
 /*
  * Attempt to pin pages.  We really don't want to track all the pfns and
  * the iommu can only map chunks of consecutive pfns anyway, so get the
@@ -1179,6 +1312,20 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
 
 		info.flags = VFIO_IOMMU_INFO_PGSIZES;
 
+#if defined(CONFIG_GENERIC_MSI_IRQ_DOMAIN) && defined(CONFIG_IOMMU_DMA_RESERVED)
+		{
+			struct vfio_msi_map_info msi_info;
+			int ret;
+
+			ret = vfio_compute_msi_map_info(iommu, &msi_info);
+			if (ret)
+				return ret;
+
+			if (msi_info.mapping_required)
+				info.flags |= VFIO_IOMMU_INFO_REQUIRE_MSI_MAP;
+			info.msi_iova_pages = msi_info.iova_pages;
+		}
+#endif
 		info.iova_pgsizes = vfio_pgsize_bitmap(iommu);
 
 		return copy_to_user((void __user *)arg, &info, minsz) ?
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index a49be8a..e3e501c 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -488,7 +488,9 @@ struct vfio_iommu_type1_info {
 	__u32	argsz;
 	__u32	flags;
 #define VFIO_IOMMU_INFO_PGSIZES (1 << 0)	/* supported page sizes info */
+#define VFIO_IOMMU_INFO_REQUIRE_MSI_MAP (1 << 1)/* MSI must be mapped */
 	__u64	iova_pgsizes;		/* Bitmap of supported page sizes */
+	__u32   msi_iova_pages;	/* number of IOVA pages needed to map MSIs */
 };
 
 #define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH v6 2/5] vfio: allow the user to register reserved iova range for MSI mapping
@ 2016-04-04  9:30     ` kbuild test robot
  0 siblings, 0 replies; 48+ messages in thread
From: kbuild test robot @ 2016-04-04  9:30 UTC (permalink / raw)
  To: Eric Auger
  Cc: kbuild-all, eric.auger, eric.auger, robin.murphy,
	alex.williamson, will.deacon, joro, tglx, jason, marc.zyngier,
	christoffer.dall, linux-arm-kernel, kvmarm, kvm,
	suravee.suthikulpanit, patches, linux-kernel, Manish.Jaggi,
	Bharat.Bhushan, pranav.sawargaonkar, p.fedin, iommu,
	Jean-Philippe.Brucker, julien.grall

[-- Attachment #1: Type: text/plain, Size: 1316 bytes --]

Hi Eric,

[auto build test ERROR on iommu/next]
[also build test ERROR on v4.6-rc2 next-20160404]
[if your patch is applied to the wrong git tree, please drop us a note to help improving the system]

url:    https://github.com/0day-ci/linux/commits/Eric-Auger/KVM-PCIe-MSI-passthrough-on-ARM-ARM64-kernel-part-3-3-vfio-changes/20160404-163335
base:   https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git next
config: x86_64-randconfig-s1-04041632 (attached as .config)
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

>> drivers/vfio/vfio_iommu_type1.c:39:38: fatal error: linux/dma-reserved-iommu.h: No such file or directory
   compilation terminated.

vim +39 drivers/vfio/vfio_iommu_type1.c

    33	#include <linux/rbtree.h>
    34	#include <linux/sched.h>
    35	#include <linux/slab.h>
    36	#include <linux/uaccess.h>
    37	#include <linux/vfio.h>
    38	#include <linux/workqueue.h>
  > 39	#include <linux/dma-reserved-iommu.h>
    40	
    41	#define DRIVER_VERSION  "0.2"
    42	#define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 27844 bytes --]

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v6 2/5] vfio: allow the user to register reserved iova range for MSI mapping
@ 2016-04-04  9:35       ` Eric Auger
  0 siblings, 0 replies; 48+ messages in thread
From: Eric Auger @ 2016-04-04  9:35 UTC (permalink / raw)
  To: kbuild test robot
  Cc: kbuild-all, eric.auger, robin.murphy, alex.williamson,
	will.deacon, joro, tglx, jason, marc.zyngier, christoffer.dall,
	linux-arm-kernel, kvmarm, kvm, suravee.suthikulpanit, patches,
	linux-kernel, Manish.Jaggi, Bharat.Bhushan, pranav.sawargaonkar,
	p.fedin, iommu, Jean-Philippe.Brucker, julien.grall

On 04/04/2016 11:30 AM, kbuild test robot wrote:
> Hi Eric,
> 
> [auto build test ERROR on iommu/next]
> [also build test ERROR on v4.6-rc2 next-20160404]
> [if your patch is applied to the wrong git tree, please drop us a note to help improving the system]
> 
> url:    https://github.com/0day-ci/linux/commits/Eric-Auger/KVM-PCIe-MSI-passthrough-on-ARM-ARM64-kernel-part-3-3-vfio-changes/20160404-163335
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git next
> config: x86_64-randconfig-s1-04041632 (attached as .config)
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=x86_64 
This series needs to be applied on top of
- part 2: [PATCH v6 0/4] KVM PCIe/MSI passthrough on ARM/ARM64: kernel
part 2/3: msi changes, https://lkml.org/lkml/2016/4/4/104
- part 1 ([PATCH v6 0/7] KVM PCIe/MSI passthrough on ARM/ARM64: kernel
part 1/3: iommu changes, https://lkml.org/lkml/2016/4/4/90)


Best Regards

Eric
> 
> All errors (new ones prefixed by >>):
> 
>>> drivers/vfio/vfio_iommu_type1.c:39:38: fatal error: linux/dma-reserved-iommu.h: No such file or directory
>    compilation terminated.
> 
> vim +39 drivers/vfio/vfio_iommu_type1.c
> 
>     33	#include <linux/rbtree.h>
>     34	#include <linux/sched.h>
>     35	#include <linux/slab.h>
>     36	#include <linux/uaccess.h>
>     37	#include <linux/vfio.h>
>     38	#include <linux/workqueue.h>
>   > 39	#include <linux/dma-reserved-iommu.h>
>     40	
>     41	#define DRIVER_VERSION  "0.2"
>     42	#define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
> 
> ---
> 0-DAY kernel test infrastructure                Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v6 2/5] vfio: allow the user to register reserved iova range for MSI mapping
  2016-04-04  8:30   ` Eric Auger
@ 2016-04-06 22:07     ` Alex Williamson
  -1 siblings, 0 replies; 48+ messages in thread
From: Alex Williamson @ 2016-04-06 22:07 UTC (permalink / raw)
  To: Eric Auger
  Cc: eric.auger, robin.murphy, will.deacon, joro, tglx, jason,
	marc.zyngier, christoffer.dall, linux-arm-kernel, kvmarm, kvm,
	suravee.suthikulpanit, patches, linux-kernel, Manish.Jaggi,
	Bharat.Bhushan, pranav.sawargaonkar, p.fedin, iommu,
	Jean-Philippe.Brucker, julien.grall

On Mon,  4 Apr 2016 08:30:08 +0000
Eric Auger <eric.auger@linaro.org> wrote:

> The user is allowed to [un]register a reserved IOVA range by using the
> DMA MAP API and setting the new flag: VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA.
> It provides the base address and the size. This region is stored in the
> vfio_dma rb tree. At that point the iova range is not mapped to any target
> address yet. The host kernel will use those iova when needed, typically
> when the VFIO-PCI device allocates its MSIs.
> 
> This patch also handles the destruction of the reserved binding RB-tree and
> domain's iova_domains.
> 
> Signed-off-by: Eric Auger <eric.auger@linaro.org>
> Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
> 
> ---
> v3 -> v4:
> - use iommu_alloc/free_reserved_iova_domain exported by dma-reserved-iommu
> - protect vfio_register_reserved_iova_range implementation with
>   CONFIG_IOMMU_DMA_RESERVED
> - handle unregistration by user-space and on vfio_iommu_type1 release
> 
> v1 -> v2:
> - set returned value according to alloc_reserved_iova_domain result
> - free the iova domains in case any error occurs
> 
> RFC v1 -> v1:
> - takes into account Alex comments, based on
>   [RFC PATCH 1/6] vfio: Add interface for add/del reserved iova region:
> - use the existing dma map/unmap ioctl interface with a flag to register
>   a reserved IOVA range. A single reserved iova region is allowed.
> 
> Conflicts:
> 	drivers/vfio/vfio_iommu_type1.c
> ---
>  drivers/vfio/vfio_iommu_type1.c | 141 +++++++++++++++++++++++++++++++++++++++-
>  include/uapi/linux/vfio.h       |  12 +++-
>  2 files changed, 150 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index c9ddbde..4497b20 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -36,6 +36,7 @@
>  #include <linux/uaccess.h>
>  #include <linux/vfio.h>
>  #include <linux/workqueue.h>
> +#include <linux/dma-reserved-iommu.h>
>  
>  #define DRIVER_VERSION  "0.2"
>  #define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
> @@ -403,10 +404,22 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
>  	vfio_lock_acct(-unlocked);
>  }
>  
> +static void vfio_unmap_reserved(struct vfio_iommu *iommu)
> +{
> +#ifdef CONFIG_IOMMU_DMA_RESERVED
> +	struct vfio_domain *d;
> +
> +	list_for_each_entry(d, &iommu->domain_list, next)
> +		iommu_unmap_reserved(d->domain);
> +#endif
> +}
> +
>  static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
>  {
>  	if (likely(dma->type != VFIO_IOVA_RESERVED))
>  		vfio_unmap_unpin(iommu, dma);
> +	else
> +		vfio_unmap_reserved(iommu);
>  	vfio_unlink_dma(iommu, dma);
>  	kfree(dma);
>  }

This makes me nervous: apparently we can add reserved mappings
individually, but we have absolutely no granularity on remove, so if we
remove one, we've removed them all even though we still have them
linked in our rb tree.  I see later that only one reserved region is
allowed, but that seems very short-sighted, especially to impose that
on the user level API.

> @@ -489,7 +502,8 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>  	 */
>  	if (iommu->v2) {
>  		dma = vfio_find_dma(iommu, unmap->iova, 0);
> -		if (dma && dma->iova != unmap->iova) {
> +		if (dma && (dma->iova != unmap->iova ||
> +			   (dma->type == VFIO_IOVA_RESERVED))) {

This seems unnecessary; won't the reserved entries fall out in the
while loop below?

>  			ret = -EINVAL;
>  			goto unlock;
>  		}
> @@ -501,6 +515,10 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>  	}
>  
>  	while ((dma = vfio_find_dma(iommu, unmap->iova, unmap->size))) {
> +		if (dma->type == VFIO_IOVA_RESERVED) {
> +			ret = -EINVAL;
> +			goto unlock;
> +		}

Hmm, API concerns here.  Previously a user could unmap from iova = 0 to
size = 2^64 - 1 and expect all mappings to get cleared.  Now they can't
do that if they've registered any reserved regions.  Seems like maybe
we should ignore it and continue instead of abort, but then we need to
change the parameters of vfio_find_dma() to get it to move on, or pass
the type to the function, which would prevent us from getting here in
the first place.
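
For reference, the blanket unmap idiom being discussed looks like this
from user space (a sketch; error handling omitted):

	struct vfio_iommu_type1_dma_unmap unmap = {
		.argsz = sizeof(unmap),
		.iova  = 0,
		.size  = (__u64)-1,	/* 2^64 - 1, i.e. "everything" */
	};
	ioctl(container, VFIO_IOMMU_UNMAP_DMA, &unmap);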

>  		if (!iommu->v2 && unmap->iova > dma->iova)
>  			break;
>  		unmapped += dma->size;
> @@ -650,6 +668,114 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  	return ret;
>  }
>  
> +static int vfio_register_reserved_iova_range(struct vfio_iommu *iommu,
> +			   struct vfio_iommu_type1_dma_map *map)
> +{
> +#ifdef CONFIG_IOMMU_DMA_RESERVED
> +	dma_addr_t iova = map->iova;
> +	size_t size = map->size;
> +	uint64_t mask;
> +	struct vfio_dma *dma;
> +	int ret = 0;
> +	struct vfio_domain *d;
> +	unsigned long order;
> +
> +	/* Verify that none of our __u64 fields overflow */
> +	if (map->size != size || map->iova != iova)
> +		return -EINVAL;
> +
> +	order =  __ffs(vfio_pgsize_bitmap(iommu));
> +	mask = ((uint64_t)1 << order) - 1;
> +
> +	WARN_ON(mask & PAGE_MASK);
> +
> +	if (!size || (size | iova) & mask)
> +		return -EINVAL;
> +
> +	/* Don't allow IOVA address wrap */
> +	if (iova + size - 1 < iova)
> +		return -EINVAL;
> +
> +	mutex_lock(&iommu->lock);
> +
> +	if (vfio_find_dma(iommu, iova, size)) {
> +		ret =  -EEXIST;
> +		goto out;
> +	}
> +
> +	dma = kzalloc(sizeof(*dma), GFP_KERNEL);
> +	if (!dma) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	dma->iova = iova;
> +	dma->size = size;
> +	dma->type = VFIO_IOVA_RESERVED;
> +
> +	list_for_each_entry(d, &iommu->domain_list, next)
> +		ret |= iommu_alloc_reserved_iova_domain(d->domain, iova,
> +							size, order);
> +
> +	if (ret) {
> +		list_for_each_entry(d, &iommu->domain_list, next)
> +			iommu_free_reserved_iova_domain(d->domain);
> +		goto out;
> +	}
> +
> +	vfio_link_dma(iommu, dma);
> +
> +out:
> +	mutex_unlock(&iommu->lock);
> +	return ret;
> +#else /* CONFIG_IOMMU_DMA_RESERVED */
> +	return -ENODEV;
> +#endif
> +}
> +
> +static void vfio_unregister_reserved_iova_range(struct vfio_iommu *iommu,
> +				struct vfio_iommu_type1_dma_unmap *unmap)
> +{
> +#ifdef CONFIG_IOMMU_DMA_RESERVED
> +	dma_addr_t iova = unmap->iova;
> +	struct vfio_dma *dma;
> +	size_t size = unmap->size;
> +	uint64_t mask;
> +	unsigned long order;
> +
> +	/* Verify that none of our __u64 fields overflow */
> +	if (unmap->size != size || unmap->iova != iova)
> +		return;
> +
> +	order =  __ffs(vfio_pgsize_bitmap(iommu));
> +	mask = ((uint64_t)1 << order) - 1;
> +
> +	WARN_ON(mask & PAGE_MASK);
> +
> +	if (!size || (size | iova) & mask)
> +		return;
> +
> +	/* Don't allow IOVA address wrap */
> +	if (iova + size - 1 < iova)
> +		return;
> +
> +	mutex_lock(&iommu->lock);
> +
> +	dma = vfio_find_dma(iommu, iova, size);
> +
> +	if (!dma || (dma->type != VFIO_IOVA_RESERVED)) {
> +		unmap->size = 0;
> +		goto out;
> +	}
> +
> +	unmap->size =  dma->size;
> +	vfio_remove_dma(iommu, dma);
> +
> +out:
> +	mutex_unlock(&iommu->lock);
> +#endif

Having a find_dma that accepts a type and a remove_reserved here seems
like it might simplify things.
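
Roughly, a sketch of that suggestion (with the caveat that
iommu_unmap_reserved() currently tears down every reserved mapping in
the domain, so real per-region granularity would also need a ranged
variant on the IOMMU side):

static void vfio_remove_reserved_dma(struct vfio_iommu *iommu,
				     struct vfio_dma *dma)
{
	/* would become a per-region unmap once the IOMMU API grows one */
	vfio_unmap_reserved(iommu);
	vfio_unlink_dma(iommu, dma);
	kfree(dma);
}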

> +}
> +
>  static int vfio_bus_type(struct device *dev, void *data)
>  {
>  	struct bus_type **bus = data;
> @@ -946,6 +1072,7 @@ static void vfio_iommu_type1_release(void *iommu_data)
>  	struct vfio_group *group, *group_tmp;
>  
>  	vfio_iommu_unmap_unpin_all(iommu);
> +	vfio_unmap_reserved(iommu);

If we call vfio_unmap_reserved() here, then why does vfio_remove_dma()
need to handle reserved entries?  We might as well have a separate
vfio_remove_reserved_dma().

>  
>  	list_for_each_entry_safe(domain, domain_tmp,
>  				 &iommu->domain_list, next) {
> @@ -1020,7 +1147,8 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>  	} else if (cmd == VFIO_IOMMU_MAP_DMA) {
>  		struct vfio_iommu_type1_dma_map map;
>  		uint32_t mask = VFIO_DMA_MAP_FLAG_READ |
> -				VFIO_DMA_MAP_FLAG_WRITE;
> +				VFIO_DMA_MAP_FLAG_WRITE |
> +				VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA;
>  
>  		minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
>  
> @@ -1030,6 +1158,9 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>  		if (map.argsz < minsz || map.flags & ~mask)
>  			return -EINVAL;
>  
> +		if (map.flags & VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA)
> +			return vfio_register_reserved_iova_range(iommu, &map);
> +
>  		return vfio_dma_do_map(iommu, &map);
>  
>  	} else if (cmd == VFIO_IOMMU_UNMAP_DMA) {
> @@ -1044,10 +1175,16 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>  		if (unmap.argsz < minsz || unmap.flags)
>  			return -EINVAL;
>  
> +		if (unmap.flags & VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA) {
> +			vfio_unregister_reserved_iova_range(iommu, &unmap);
> +			goto out;
> +		}
> +
>  		ret = vfio_dma_do_unmap(iommu, &unmap);
>  		if (ret)
>  			return ret;
>  
> +out:
>  		return copy_to_user((void __user *)arg, &unmap, minsz) ?
>  			-EFAULT : 0;
>  	}
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 255a211..a49be8a 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -498,12 +498,21 @@ struct vfio_iommu_type1_info {
>   *
>   * Map process virtual addresses to IO virtual addresses using the
>   * provided struct vfio_dma_map. Caller sets argsz. READ &/ WRITE required.
> + *
> + * In case MSI_RESERVED_IOVA flag is set, the API only aims at registering an
> + * IOVA region which will be used on some platforms to map the host MSI frame.
> + * In that specific case, vaddr and prot are ignored. The requirement for
> + * provisioning such an IOVA range can be checked by calling VFIO_IOMMU_GET_INFO
> + * with the VFIO_IOMMU_INFO_REQUIRE_MSI_MAP attribute. A single
> + * MSI_RESERVED_IOVA region can be registered.
>   */

Why do we ignore read/write flags?  I'm not sure how useful a read-only
reserved region might be, but certainly some platforms might support
write-only or read-write.  Isn't this something we should let the IOMMU
driver decide?  I.e., pass it down and let it fail or not?  Also, why are
we making it the API spec to only allow a single reserved region of
this type?  We could simply let additional ones fail, or better yet add
a capability to the info ioctl to indicate the number available and
then fail if the user exceeds it.
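
For context, the user-space flow this series appears to expect would be
something like the sketch below, leaning on the VFIO_IOMMU_GET_INFO
extension from patch 5/5; MSI_IOVA_BASE is a made-up constant and error
handling is omitted:

	struct vfio_iommu_type1_info info = { .argsz = sizeof(info) };
	struct vfio_iommu_type1_dma_map map = { .argsz = sizeof(map) };

	ioctl(container, VFIO_IOMMU_GET_INFO, &info);
	if (info.flags & VFIO_IOMMU_INFO_REQUIRE_MSI_MAP) {
		/* smallest supported IOMMU page size from the bitmap */
		size_t pgsize = 1UL << __builtin_ctzll(info.iova_pgsizes);

		map.flags = VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA;
		map.iova  = MSI_IOVA_BASE;	/* chosen by user-space */
		map.size  = info.msi_iova_pages * pgsize;
		ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
	}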

>  struct vfio_iommu_type1_dma_map {
>  	__u32	argsz;
>  	__u32	flags;
>  #define VFIO_DMA_MAP_FLAG_READ (1 << 0)		/* readable from device */
>  #define VFIO_DMA_MAP_FLAG_WRITE (1 << 1)	/* writable from device */
> +/* reserved iova for MSI vectors */
> +#define VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA (1 << 2)

nit: ...RESERVED_MSI_IOVA makes a tad more sense, and if we add new
reserved flags it seems like it puts the precedence in order.

>  	__u64	vaddr;				/* Process virtual address */
>  	__u64	iova;				/* IO virtual address */
>  	__u64	size;				/* Size of mapping (bytes) */
> @@ -519,7 +528,8 @@ struct vfio_iommu_type1_dma_map {
>   * Caller sets argsz.  The actual unmapped size is returned in the size
>   * field.  No guarantee is made to the user that arbitrary unmaps of iova
>   * or size different from those used in the original mapping call will
> - * succeed.
> + * succeed. A reserved DMA region must be unmapped with the MSI_RESERVED_IOVA
> + * flag set.

So map/unmap become bi-modal: with this flag set they should only
operate on reserved entries; otherwise they should operate on legacy
entries.  So clearly, as a user, I should be able to continue doing an
unmap from 0-(-1) of legacy entries and not stumble over reserved
entries.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v6 5/5] vfio/type1: return MSI mapping requirements with VFIO_IOMMU_GET_INFO
@ 2016-04-06 22:32     ` Alex Williamson
  0 siblings, 0 replies; 48+ messages in thread
From: Alex Williamson @ 2016-04-06 22:32 UTC (permalink / raw)
  To: Eric Auger
  Cc: eric.auger, robin.murphy, will.deacon, joro, tglx, jason,
	marc.zyngier, christoffer.dall, linux-arm-kernel, kvmarm, kvm,
	suravee.suthikulpanit, patches, linux-kernel, Manish.Jaggi,
	Bharat.Bhushan, pranav.sawargaonkar, p.fedin, iommu,
	Jean-Philippe.Brucker, julien.grall

On Mon,  4 Apr 2016 08:30:11 +0000
Eric Auger <eric.auger@linaro.org> wrote:

> This patch allows user-space to know whether MSI addresses need to
> be mapped in the IOMMU. User-space uses the VFIO_IOMMU_GET_INFO ioctl, and
> IOMMU_INFO_REQUIRE_MSI_MAP gets set if they do.
> 
> Also the number of IOMMU pages requested to map those is returned in
> the msi_iova_pages field. User-space must use this information to allocate
> a contiguous IOVA region of size msi_iova_pages * ffs(iova_pgsizes) and pass
> it with the VFIO_IOMMU_MAP_DMA ioctl (VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA set).
> 
> Signed-off-by: Eric Auger <eric.auger@linaro.org>
> 
> ---
> 
> Currently it is assumed a single doorbell page is used per MSI controller.
> This is the case for known ARM MSI controllers (GICv2M, GICv3 ITS, ...).
> If an MSI controller were to expose more doorbells it could implement a
> new callback at the irq_chip interface.
> 
> v4 -> v5:
> - move msi_info and ret declaration within the conditional code
> 
> v3 -> v4:
> - replace former vfio_domains_require_msi_mapping by
>   more complex computation of MSI mapping requirements, especially the
>   number of pages to be provided by the user-space.
> - reword patch title
> 
> RFC v1 -> v1:
> - derived from
>   [RFC PATCH 3/6] vfio: Extend iommu-info to return MSIs automap state
> - renamed allow_msi_reconfig into require_msi_mapping
> - fixed VFIO_IOMMU_GET_INFO
> ---
>  drivers/vfio/vfio_iommu_type1.c | 147 ++++++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/vfio.h       |   2 +
>  2 files changed, 149 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index b330b81..f1def50 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -39,6 +39,7 @@
>  #include <linux/dma-reserved-iommu.h>
>  #include <linux/irqdomain.h>
>  #include <linux/msi.h>
> +#include <linux/irq.h>
>  
>  #define DRIVER_VERSION  "0.2"
>  #define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
> @@ -95,6 +96,17 @@ struct vfio_group {
>  	struct list_head	next;
>  };
>  
> +struct vfio_irq_chip {
> +	struct list_head next;
> +	struct irq_chip *chip;
> +};
> +
> +struct vfio_msi_map_info {
> +	bool mapping_required;
> +	unsigned int iova_pages;
> +	struct list_head irq_chip_list;
> +};
> +
>  /*
>   * This code handles mapping and unmapping of user data buffers
>   * into DMA'ble space using the IOMMU
> @@ -267,6 +279,127 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
>  	return ret;
>  }
>  
> +#if defined(CONFIG_GENERIC_MSI_IRQ_DOMAIN) && defined(CONFIG_IOMMU_DMA_RESERVED)
> +/**
> + * vfio_dev_compute_msi_map_info: augment MSI mapping info (@data) with
> + * the @dev device requirements.
> + *
> + * @dev: device handle
> + * @data: opaque pointer to a struct vfio_msi_map_info
> + *
> + * returns 0 upon success or -ENOMEM
> + */
> +static int vfio_dev_compute_msi_map_info(struct device *dev, void *data)
> +{
> +	struct irq_domain *domain;
> +	struct msi_domain_info *info;
> +	struct vfio_msi_map_info *msi_info = (struct vfio_msi_map_info *)data;
> +	struct irq_chip *chip;
> +	struct vfio_irq_chip *iter, *new;
> +
> +	domain = dev_get_msi_domain(dev);
> +	if (!domain)
> +		return 0;
> +
> +	/* Let's compute the needs for the MSI domain */
> +	info = msi_get_domain_info(domain);
> +	chip = info->chip;
> +	list_for_each_entry(iter, &msi_info->irq_chip_list, next) {
> +		if (iter->chip == chip)
> +			return 0;
> +	}
> +
> +	new = kzalloc(sizeof(*new), GFP_KERNEL);
> +	if (!new)
> +		return -ENOMEM;
> +
> +	new->chip = chip;
> +
> +	list_add(&new->next, &msi_info->irq_chip_list);
> +
> +	/*
> +	 * new irq_chip to be taken into account; we currently assume
> +	 * a single iova doorbell by irq chip requesting MSI mapping
> +	 */
> +	msi_info->iova_pages += 1;
> +	return 0;
> +}
> +
> +/**
> + * vfio_domain_compute_msi_map_info: compute MSI mapping requirements (@data)
> + * for vfio_domain @d
> + *
> + * @d: vfio domain handle
> + * @data: opaque pointer to a struct vfio_msi_map_info
> + *
> + * returns 0 upon success or -ENOMEM
> + */
> +static int vfio_domain_compute_msi_map_info(struct vfio_domain *d, void *data)
> +{
> +	int ret = 0;
> +	struct vfio_msi_map_info *msi_info = (struct vfio_msi_map_info *)data;
> +	struct vfio_irq_chip *iter, *tmp;
> +	struct vfio_group *g;
> +
> +	msi_info->iova_pages = 0;
> +	INIT_LIST_HEAD(&msi_info->irq_chip_list);
> +
> +	if (iommu_domain_get_attr(d->domain,
> +				   DOMAIN_ATTR_MSI_MAPPING, NULL))
> +		return 0;
> +	msi_info->mapping_required = true;
> +	list_for_each_entry(g, &d->group_list, next) {
> +		ret = iommu_group_for_each_dev(g->iommu_group, msi_info,
> +			   vfio_dev_compute_msi_map_info);
> +		if (ret)
> +			goto out;
> +	}
> +out:
> +	list_for_each_entry_safe(iter, tmp, &msi_info->irq_chip_list, next) {
> +		list_del(&iter->next);
> +		kfree(iter);
> +	}
> +	return ret;
> +}
> +
> +/**
> + * vfio_compute_msi_map_info: compute MSI mapping requirements
> + *
> + * Do some MSI addresses need to be mapped? IOMMU page size?
> + * Max number of IOVA pages needed by any domain to map MSI
> + *
> + * @iommu: iommu handle
> + * @info: msi map info handle
> + *
> + * returns 0 upon success or -ENOMEM
> + */
> +static int vfio_compute_msi_map_info(struct vfio_iommu *iommu,
> +				 struct vfio_msi_map_info *msi_info)
> +{
> +	int ret = 0;
> +	struct vfio_domain *d;
> +	unsigned long bitmap = ULONG_MAX;
> +	unsigned int iova_pages = 0;
> +
> +	msi_info->mapping_required = false;
> +
> +	mutex_lock(&iommu->lock);
> +	list_for_each_entry(d, &iommu->domain_list, next) {
> +		bitmap &= d->domain->ops->pgsize_bitmap;
> +		ret = vfio_domain_compute_msi_map_info(d, msi_info);
> +		if (ret)
> +			goto out;
> +		if (msi_info->iova_pages > iova_pages)
> +			iova_pages = msi_info->iova_pages;
> +	}
> +out:
> +	msi_info->iova_pages = iova_pages;
> +	mutex_unlock(&iommu->lock);
> +	return ret;
> +}
> +
> +#endif
> +
>  /*
>   * Attempt to pin pages.  We really don't want to track all the pfns and
>   * the iommu can only map chunks of consecutive pfns anyway, so get the
> @@ -1179,6 +1312,20 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>  
>  		info.flags = VFIO_IOMMU_INFO_PGSIZES;
>  
> +#if defined(CONFIG_GENERIC_MSI_IRQ_DOMAIN) && defined(CONFIG_IOMMU_DMA_RESERVED)
> +		{
> +			struct vfio_msi_map_info msi_info;
> +			int ret;
> +
> +			ret = vfio_compute_msi_map_info(iommu, &msi_info);
> +			if (ret)
> +				return ret;
> +
> +			if (msi_info.mapping_required)
> +				info.flags |= VFIO_IOMMU_INFO_REQUIRE_MSI_MAP;
> +			info.msi_iova_pages = msi_info.iova_pages;
> +		}
> +#endif
>  		info.iova_pgsizes = vfio_pgsize_bitmap(iommu);
>  
>  		return copy_to_user((void __user *)arg, &info, minsz) ?
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index a49be8a..e3e501c 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -488,7 +488,9 @@ struct vfio_iommu_type1_info {
>  	__u32	argsz;
>  	__u32	flags;
>  #define VFIO_IOMMU_INFO_PGSIZES (1 << 0)	/* supported page sizes info */
> +#define VFIO_IOMMU_INFO_REQUIRE_MSI_MAP (1 << 1)/* MSI must be mapped */
>  	__u64	iova_pgsizes;		/* Bitmap of supported page sizes */
> +	__u32   msi_iova_pages;	/* number of IOVA pages needed to map MSIs */
>  };
>  
>  #define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)

Take a look at the capability chain extensions I used for adding some
new capabilities for vfio regions and let me know why we shouldn't do
something similar for this info ioctl.  A fixed structure gets messy
almost instantly when we start adding new fields to it.  Thanks,

Alex

c84982a vfio: Define capability chains
d7a8d5e vfio: Add capability chain helpers
ff63eb6 vfio: Define sparse mmap capability for regions
188ad9d vfio/pci: Include sparse mmap capability for MSI-X table regions
c7bb4cb vfio: Define device specific region type capability
28541d4 vfio/pci: Add infrastructure for additional device specific regions
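
As a rough sketch, the MSI requirements could ride that chain as a
dedicated capability instead of fixed fields; the capability structure
and flag below are invented purely for illustration:

struct vfio_iommu_type1_info_cap_msi_geometry {
	struct vfio_info_cap_header header;
	__u32 flags;
#define VFIO_IOMMU_MSI_GEOMETRY_MAP_REQUIRED	(1 << 0)
	__u32 msi_iova_pages;	/* IOVA pages user-space must provide */
};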

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v6 2/5] vfio: allow the user to register reserved iova range for MSI mapping
@ 2016-04-07 13:43       ` Eric Auger
  0 siblings, 0 replies; 48+ messages in thread
From: Eric Auger @ 2016-04-07 13:43 UTC (permalink / raw)
  To: Alex Williamson
  Cc: eric.auger, robin.murphy, will.deacon, joro, tglx, jason,
	marc.zyngier, christoffer.dall, linux-arm-kernel, kvmarm, kvm,
	suravee.suthikulpanit, patches, linux-kernel, Manish.Jaggi,
	Bharat.Bhushan, pranav.sawargaonkar, p.fedin, iommu,
	Jean-Philippe.Brucker, julien.grall

Hi Alex,
On 04/07/2016 12:07 AM, Alex Williamson wrote:
> On Mon,  4 Apr 2016 08:30:08 +0000
> Eric Auger <eric.auger@linaro.org> wrote:
> 
>> The user is allowed to [un]register a reserved IOVA range by using the
>> DMA MAP API and setting the new flag: VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA.
>> It provides the base address and the size. This region is stored in the
>> vfio_dma rb tree. At that point the iova range is not mapped to any target
>> address yet. The host kernel will use those iova when needed, typically
>> when the VFIO-PCI device allocates its MSIs.
>>
>> This patch also handles the destruction of the reserved binding RB-tree and
>> domain's iova_domains.
>>
>> Signed-off-by: Eric Auger <eric.auger@linaro.org>
>> Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
>>
>> ---
>> v3 -> v4:
>> - use iommu_alloc/free_reserved_iova_domain exported by dma-reserved-iommu
>> - protect vfio_register_reserved_iova_range implementation with
>>   CONFIG_IOMMU_DMA_RESERVED
>> - handle unregistration by user-space and on vfio_iommu_type1 release
>>
>> v1 -> v2:
>> - set returned value according to alloc_reserved_iova_domain result
>> - free the iova domains in case any error occurs
>>
>> RFC v1 -> v1:
>> - takes into account Alex comments, based on
>>   [RFC PATCH 1/6] vfio: Add interface for add/del reserved iova region:
>> - use the existing dma map/unmap ioctl interface with a flag to register
>>   a reserved IOVA range. A single reserved iova region is allowed.
>>
>> Conflicts:
>> 	drivers/vfio/vfio_iommu_type1.c
>> ---
>>  drivers/vfio/vfio_iommu_type1.c | 141 +++++++++++++++++++++++++++++++++++++++-
>>  include/uapi/linux/vfio.h       |  12 +++-
>>  2 files changed, 150 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index c9ddbde..4497b20 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -36,6 +36,7 @@
>>  #include <linux/uaccess.h>
>>  #include <linux/vfio.h>
>>  #include <linux/workqueue.h>
>> +#include <linux/dma-reserved-iommu.h>
>>  
>>  #define DRIVER_VERSION  "0.2"
>>  #define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
>> @@ -403,10 +404,22 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
>>  	vfio_lock_acct(-unlocked);
>>  }
>>  
>> +static void vfio_unmap_reserved(struct vfio_iommu *iommu)
>> +{
>> +#ifdef CONFIG_IOMMU_DMA_RESERVED
>> +	struct vfio_domain *d;
>> +
>> +	list_for_each_entry(d, &iommu->domain_list, next)
>> +		iommu_unmap_reserved(d->domain);
>> +#endif
>> +}
>> +
>>  static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
>>  {
>>  	if (likely(dma->type != VFIO_IOVA_RESERVED))
>>  		vfio_unmap_unpin(iommu, dma);
>> +	else
>> +		vfio_unmap_reserved(iommu);
>>  	vfio_unlink_dma(iommu, dma);
>>  	kfree(dma);
>>  }
> 
> This makes me nervous, apparently we can add reserved mappings
> individually, but we have absolutely no granularity on remove, so if we
> remove one, we've removed them all even though we still have them
> linked in our rb tree.  I see later that only one reserved region is
> allowed, but that seems very short sighted, especially to impose that
> on the user level API.
On the kernel side, the reserved region is currently backed by a unique
iova_domain. Do you mean you would like me to handle a list of
iova_domains instead of using a single "cookie"?
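A possible shape for such finer-grained tracking, with one allocator per
registered region so regions can be torn down individually (all names
below are hypothetical):

  /* per-domain bookkeeping, one entry per reserved region */
  struct vfio_reserved_region {
  	struct list_head	next;	/* on the vfio_domain's list */
  	dma_addr_t		iova;
  	size_t			size;
  	struct iova_domain	*iovad;	/* allocator for this region only */
  };

Freeing one entry would then only destroy its own iova_domain rather than
the single shared cookie.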
> 
>> @@ -489,7 +502,8 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>>  	 */
>>  	if (iommu->v2) {
>>  		dma = vfio_find_dma(iommu, unmap->iova, 0);
>> -		if (dma && dma->iova != unmap->iova) {
>> +		if (dma && (dma->iova != unmap->iova ||
>> +			   (dma->type == VFIO_IOVA_RESERVED))) {
> 
> This seems unnecessary, won't the reserved entries fall out in the
> while loop below?
Yes, that's correct.
> 
>>  			ret = -EINVAL;
>>  			goto unlock;
>>  		}
>> @@ -501,6 +515,10 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>>  	}
>>  
>>  	while ((dma = vfio_find_dma(iommu, unmap->iova, unmap->size))) {
>> +		if (dma->type == VFIO_IOVA_RESERVED) {
>> +			ret = -EINVAL;
>> +			goto unlock;
>> +		}
> 
> Hmm, API concerns here.  Previously a user could unmap from iova = 0 to
> size = 2^64 - 1 and expect all mappings to get cleared.  Now they can't
> do that if they've registered any reserved regions.  Seems like maybe
> we should ignore it and continue instead of abort, but then we need to
> change the parameters of vfio_find_dma() to get it to move on, or pass
> the type to the function, which would prevent us from getting here in
> the first place.
OK, I will rework this to match the existing use cases.
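One way to make full-range unmaps skip reserved entries is a type-filtered
lookup. A deliberately simple linear sketch (a real version would keep the
rb-tree binary search; the range is assumed non-empty and overflow-checked
by the caller, as on the map path):

  static struct vfio_dma *vfio_find_dma_type(struct vfio_iommu *iommu,
  					   dma_addr_t start, size_t size,
  					   int type)
  {
  	struct rb_node *n;

  	for (n = rb_first(&iommu->dma_list); n; n = rb_next(n)) {
  		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);

  		if (dma->iova > start + size - 1)
  			break;		/* past the requested range */
  		if (dma->iova + dma->size - 1 < start)
  			continue;	/* ends before the range */
  		if (dma->type == type)
  			return dma;
  	}
  	return NULL;
  }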
> 
>>  		if (!iommu->v2 && unmap->iova > dma->iova)
>>  			break;
>>  		unmapped += dma->size;
>> @@ -650,6 +668,114 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>  	return ret;
>>  }
>>  
>> +static int vfio_register_reserved_iova_range(struct vfio_iommu *iommu,
>> +			   struct vfio_iommu_type1_dma_map *map)
>> +{
>> +#ifdef CONFIG_IOMMU_DMA_RESERVED
>> +	dma_addr_t iova = map->iova;
>> +	size_t size = map->size;
>> +	uint64_t mask;
>> +	struct vfio_dma *dma;
>> +	int ret = 0;
>> +	struct vfio_domain *d;
>> +	unsigned long order;
>> +
>> +	/* Verify that none of our __u64 fields overflow */
>> +	if (map->size != size || map->iova != iova)
>> +		return -EINVAL;
>> +
>> +	order =  __ffs(vfio_pgsize_bitmap(iommu));
>> +	mask = ((uint64_t)1 << order) - 1;
>> +
>> +	WARN_ON(mask & PAGE_MASK);
>> +
>> +	if (!size || (size | iova) & mask)
>> +		return -EINVAL;
>> +
>> +	/* Don't allow IOVA address wrap */
>> +	if (iova + size - 1 < iova)
>> +		return -EINVAL;
>> +
>> +	mutex_lock(&iommu->lock);
>> +
>> +	if (vfio_find_dma(iommu, iova, size)) {
>> +		ret =  -EEXIST;
>> +		goto out;
>> +	}
>> +
>> +	dma = kzalloc(sizeof(*dma), GFP_KERNEL);
>> +	if (!dma) {
>> +		ret = -ENOMEM;
>> +		goto out;
>> +	}
>> +
>> +	dma->iova = iova;
>> +	dma->size = size;
>> +	dma->type = VFIO_IOVA_RESERVED;
>> +
>> +	list_for_each_entry(d, &iommu->domain_list, next)
>> +		ret |= iommu_alloc_reserved_iova_domain(d->domain, iova,
>> +							size, order);
>> +
>> +	if (ret) {
>> +		list_for_each_entry(d, &iommu->domain_list, next)
>> +			iommu_free_reserved_iova_domain(d->domain);
>> +		goto out;
>> +	}
>> +
>> +	vfio_link_dma(iommu, dma);
>> +
>> +out:
>> +	mutex_unlock(&iommu->lock);
>> +	return ret;
>> +#else /* CONFIG_IOMMU_DMA_RESERVED */
>> +	return -ENODEV;
>> +#endif
>> +}
>> +
>> +static void vfio_unregister_reserved_iova_range(struct vfio_iommu *iommu,
>> +				struct vfio_iommu_type1_dma_unmap *unmap)
>> +{
>> +#ifdef CONFIG_IOMMU_DMA_RESERVED
>> +	dma_addr_t iova = unmap->iova;
>> +	struct vfio_dma *dma;
>> +	size_t size = unmap->size;
>> +	uint64_t mask;
>> +	unsigned long order;
>> +
>> +	/* Verify that none of our __u64 fields overflow */
>> +	if (unmap->size != size || unmap->iova != iova)
>> +		return;
>> +
>> +	order =  __ffs(vfio_pgsize_bitmap(iommu));
>> +	mask = ((uint64_t)1 << order) - 1;
>> +
>> +	WARN_ON(mask & PAGE_MASK);
>> +
>> +	if (!size || (size | iova) & mask)
>> +		return;
>> +
>> +	/* Don't allow IOVA address wrap */
>> +	if (iova + size - 1 < iova)
>> +		return;
>> +
>> +	mutex_lock(&iommu->lock);
>> +
>> +	dma = vfio_find_dma(iommu, iova, size);
>> +
>> +	if (!dma || (dma->type != VFIO_IOVA_RESERVED)) {
>> +		unmap->size = 0;
>> +		goto out;
>> +	}
>> +
>> +	unmap->size =  dma->size;
>> +	vfio_remove_dma(iommu, dma);
>> +
>> +out:
>> +	mutex_unlock(&iommu->lock);
>> +#endif
> 
> Having a find_dma that accepts a type and a remove_reserved here seems
> like it might simplify things.
> 
>> +}
>> +
>>  static int vfio_bus_type(struct device *dev, void *data)
>>  {
>>  	struct bus_type **bus = data;
>> @@ -946,6 +1072,7 @@ static void vfio_iommu_type1_release(void *iommu_data)
>>  	struct vfio_group *group, *group_tmp;
>>  
>>  	vfio_iommu_unmap_unpin_all(iommu);
>> +	vfio_unmap_reserved(iommu);
> 
> If we call vfio_unmap_reserved() here, then why does vfio_remove_dma()
> need to handle reserved entries?  We might as well have a separate
> vfio_remove_reserved_dma().
> 
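A minimal sketch of that split, reusing the helpers quoted above
(illustrative only):

  static void vfio_remove_reserved_dma(struct vfio_iommu *iommu,
  				     struct vfio_dma *dma)
  {
  	vfio_unmap_reserved(iommu);
  	vfio_unlink_dma(iommu, dma);
  	kfree(dma);
  }

vfio_remove_dma() would then only handle regular pinned mappings.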
>>  
>>  	list_for_each_entry_safe(domain, domain_tmp,
>>  				 &iommu->domain_list, next) {
>> @@ -1020,7 +1147,8 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>>  	} else if (cmd == VFIO_IOMMU_MAP_DMA) {
>>  		struct vfio_iommu_type1_dma_map map;
>>  		uint32_t mask = VFIO_DMA_MAP_FLAG_READ |
>> -				VFIO_DMA_MAP_FLAG_WRITE;
>> +				VFIO_DMA_MAP_FLAG_WRITE |
>> +				VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA;
>>  
>>  		minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
>>  
>> @@ -1030,6 +1158,9 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>>  		if (map.argsz < minsz || map.flags & ~mask)
>>  			return -EINVAL;
>>  
>> +		if (map.flags & VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA)
>> +			return vfio_register_reserved_iova_range(iommu, &map);
>> +
>>  		return vfio_dma_do_map(iommu, &map);
>>  
>>  	} else if (cmd == VFIO_IOMMU_UNMAP_DMA) {
>> @@ -1044,10 +1175,16 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>>  		if (unmap.argsz < minsz || unmap.flags)
>>  			return -EINVAL;
>>  
>> +		if (unmap.flags & VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA) {
>> +			vfio_unregister_reserved_iova_range(iommu, &unmap);
>> +			goto out;
>> +		}
>> +
>>  		ret = vfio_dma_do_unmap(iommu, &unmap);
>>  		if (ret)
>>  			return ret;
>>  
>> +out:
>>  		return copy_to_user((void __user *)arg, &unmap, minsz) ?
>>  			-EFAULT : 0;
>>  	}
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index 255a211..a49be8a 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -498,12 +498,21 @@ struct vfio_iommu_type1_info {
>>   *
>>   * Map process virtual addresses to IO virtual addresses using the
>>   * provided struct vfio_dma_map. Caller sets argsz. READ &/ WRITE required.
>> + *
>> + * In case MSI_RESERVED_IOVA flag is set, the API only aims at registering an
>> + * IOVA region which will be used on some platforms to map the host MSI frame.
>> + * In that specific case, vaddr and prot are ignored. The requirement for
>> + * provisioning such IOVA range can be checked by calling VFIO_IOMMU_GET_INFO
>> + * with the VFIO_IOMMU_INFO_REQUIRE_MSI_MAP attribute. A single
>> + * MSI_RESERVED_IOVA region can be registered.
>>   */
> 
> Why do we ignore read/write flags?  I'm not sure how useful a read-only
> reserved region might be, but certainly some platforms might support
> write-only or read-write.  Isn't this something we should let the IOMMU
> driver decide?  ie. pass it down and let it fail or not?
OK, makes sense. Actually I am not entirely clear on whether this API is
meant for MSI binding only or likely to be used for something else.

> Also why are
> we making it the API spec to only allow a single reserved region of
> this type?  We could simply let additional ones fail, or better yet add
> a capability to the info ioctl to indicate the number available and
> then fail if the user exceeds it.
But this means that underneath we need to manage several iova_domains,
right?
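A sketch of what passing the protection flags down could look like,
assuming the proposed iommu_alloc_reserved_iova_domain() grew a prot
parameter (hypothetical signature):

  	int prot = 0;

  	if (map->flags & VFIO_DMA_MAP_FLAG_READ)
  		prot |= IOMMU_READ;
  	if (map->flags & VFIO_DMA_MAP_FLAG_WRITE)
  		prot |= IOMMU_WRITE;

  	/* the IOMMU driver can then reject what it cannot support */
  	ret = iommu_alloc_reserved_iova_domain(d->domain, iova, size,
  					       prot, order);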
> 
>>  struct vfio_iommu_type1_dma_map {
>>  	__u32	argsz;
>>  	__u32	flags;
>>  #define VFIO_DMA_MAP_FLAG_READ (1 << 0)		/* readable from device */
>>  #define VFIO_DMA_MAP_FLAG_WRITE (1 << 1)	/* writable from device */
>> +/* reserved iova for MSI vectors */
>> +#define VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA (1 << 2)
> 
> nit, ...RESERVED_MSI_IOVA makes a tad more sense and if we add new
> reserved flags seems like it puts the precedence in order.
OK
> 
>>  	__u64	vaddr;				/* Process virtual address */
>>  	__u64	iova;				/* IO virtual address */
>>  	__u64	size;				/* Size of mapping (bytes) */
>> @@ -519,7 +528,8 @@ struct vfio_iommu_type1_dma_map {
>>   * Caller sets argsz.  The actual unmapped size is returned in the size
>>   * field.  No guarantee is made to the user that arbitrary unmaps of iova
>>   * or size different from those used in the original mapping call will
>> - * succeed.
>> + * succeed. A reserved DMA region must be unmapped with the MSI_RESERVED_IOVA
>> + * flag set.
> 
> So map/unmap become bi-modal, with this flag set they should only
> operate on reserved entries, otherwise they should operate on legacy
> entries.  So clearly as a user I should be able to continue doing an
> unmap from 0-(-1) of legacy entries and not stumble over reserved
> entries.  Thanks,
OK that's clear

Best Regards

Eric
> 
> Alex
> 

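Putting the pieces together, user-space registration under the proposed
interface would look roughly like this (error handling omitted;
container_fd, msi_base and msi_size are placeholders, with msi_size a
multiple of the IOMMU page size):

  	struct vfio_iommu_type1_dma_map map = {
  		.argsz = sizeof(map),
  		.flags = VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA,
  		.iova  = msi_base,	/* an unused IOVA window */
  		.size  = msi_size,
  		/* vaddr and prot are ignored for this flag in v6 */
  	};

  	ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);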
* Re: [PATCH v6 5/5] vfio/type1: return MSI mapping requirements with VFIO_IOMMU_GET_INFO
@ 2016-04-07 13:44       ` Eric Auger
  0 siblings, 0 replies; 48+ messages in thread
From: Eric Auger @ 2016-04-07 13:44 UTC (permalink / raw)
  To: Alex Williamson
  Cc: eric.auger, robin.murphy, will.deacon, joro, tglx, jason,
	marc.zyngier, christoffer.dall, linux-arm-kernel, kvmarm, kvm,
	suravee.suthikulpanit, patches, linux-kernel, Manish.Jaggi,
	Bharat.Bhushan, pranav.sawargaonkar, p.fedin, iommu,
	Jean-Philippe.Brucker, julien.grall

On 04/07/2016 12:32 AM, Alex Williamson wrote:
> On Mon,  4 Apr 2016 08:30:11 +0000
> Eric Auger <eric.auger@linaro.org> wrote:
> 
>> This patch allows the user-space to know whether MSI addresses need to
>> be mapped in the IOMMU. The user-space uses VFIO_IOMMU_GET_INFO ioctl and
>> IOMMU_INFO_REQUIRE_MSI_MAP gets set if they need to.
>>
>> Also the number of IOMMU pages requested to map those is returned in
>> msi_iova_pages field. User-space must use this information to allocate an
>> IOVA contiguous region of size msi_iova_pages * ffs(iova_pgsizes) and pass
>> it with VFIO_IOMMU_MAP_DMA ioctl (VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA set).
>>
>> Signed-off-by: Eric Auger <eric.auger@linaro.org>
>>
>> ---
>>
>> Currently it is assumed a single doorbell page is used per MSI controller.
>> This is the case for known ARM MSI controllers (GICv2M, GICv3 ITS, ...).
>> If an MSI controller were to expose more doorbells it could implement a
>> new callback at irq_chip interface.
>>
>> v4 -> v5:
>> - move msi_info and ret declaration within the conditional code
>>
>> v3 -> v4:
>> - replace former vfio_domains_require_msi_mapping by
>>   more complex computation of MSI mapping requirements, especially the
>>   number of pages to be provided by the user-space.
>> - reword patch title
>>
>> RFC v1 -> v1:
>> - derived from
>>   [RFC PATCH 3/6] vfio: Extend iommu-info to return MSIs automap state
>> - renamed allow_msi_reconfig into require_msi_mapping
>> - fixed VFIO_IOMMU_GET_INFO
>> ---
>>  drivers/vfio/vfio_iommu_type1.c | 147 ++++++++++++++++++++++++++++++++++++++++
>>  include/uapi/linux/vfio.h       |   2 +
>>  2 files changed, 149 insertions(+)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index b330b81..f1def50 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -39,6 +39,7 @@
>>  #include <linux/dma-reserved-iommu.h>
>>  #include <linux/irqdomain.h>
>>  #include <linux/msi.h>
>> +#include <linux/irq.h>
>>  
>>  #define DRIVER_VERSION  "0.2"
>>  #define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
>> @@ -95,6 +96,17 @@ struct vfio_group {
>>  	struct list_head	next;
>>  };
>>  
>> +struct vfio_irq_chip {
>> +	struct list_head next;
>> +	struct irq_chip *chip;
>> +};
>> +
>> +struct vfio_msi_map_info {
>> +	bool mapping_required;
>> +	unsigned int iova_pages;
>> +	struct list_head irq_chip_list;
>> +};
>> +
>>  /*
>>   * This code handles mapping and unmapping of user data buffers
>>   * into DMA'ble space using the IOMMU
>> @@ -267,6 +279,127 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
>>  	return ret;
>>  }
>>  
>> +#if defined(CONFIG_GENERIC_MSI_IRQ_DOMAIN) && defined(CONFIG_IOMMU_DMA_RESERVED)
>> +/**
>> + * vfio_dev_compute_msi_map_info: augment MSI mapping info (@data) with
>> + * the @dev device requirements.
>> + *
>> + * @dev: device handle
>> + * @data: opaque pointing to a struct vfio_msi_map_info
>> + *
>> + * returns 0 upon success or -ENOMEM
>> + */
>> +static int vfio_dev_compute_msi_map_info(struct device *dev, void *data)
>> +{
>> +	struct irq_domain *domain;
>> +	struct msi_domain_info *info;
>> +	struct vfio_msi_map_info *msi_info = (struct vfio_msi_map_info *)data;
>> +	struct irq_chip *chip;
>> +	struct vfio_irq_chip *iter, *new;
>> +
>> +	domain = dev_get_msi_domain(dev);
>> +	if (!domain)
>> +		return 0;
>> +
>> +	/* Let's compute the needs for the MSI domain */
>> +	info = msi_get_domain_info(domain);
>> +	chip = info->chip;
>> +	list_for_each_entry(iter, &msi_info->irq_chip_list, next) {
>> +		if (iter->chip == chip)
>> +			return 0;
>> +	}
>> +
>> +	new = kzalloc(sizeof(*new), GFP_KERNEL);
>> +	if (!new)
>> +		return -ENOMEM;
>> +
>> +	new->chip = chip;
>> +
>> +	list_add(&new->next, &msi_info->irq_chip_list);
>> +
>> +	/*
>> +	 * new irq_chip to be taken into account; we currently assume
>> +	 * a single iova doorbell per irq chip requesting MSI mapping
>> +	 */
>> +	msi_info->iova_pages += 1;
>> +	return 0;
>> +}
>> +
>> +/**
>> + * vfio_domain_compute_msi_map_info: compute MSI mapping requirements (@data)
>> + * for vfio_domain @d
>> + *
>> + * @d: vfio domain handle
>> + * @data: opaque pointing to a struct vfio_msi_map_info
>> + *
>> + * returns 0 upon success or -ENOMEM
>> + */
>> +static int vfio_domain_compute_msi_map_info(struct vfio_domain *d, void *data)
>> +{
>> +	int ret = 0;
>> +	struct vfio_msi_map_info *msi_info = (struct vfio_msi_map_info *)data;
>> +	struct vfio_irq_chip *iter, *tmp;
>> +	struct vfio_group *g;
>> +
>> +	msi_info->iova_pages = 0;
>> +	INIT_LIST_HEAD(&msi_info->irq_chip_list);
>> +
>> +	if (iommu_domain_get_attr(d->domain,
>> +				   DOMAIN_ATTR_MSI_MAPPING, NULL))
>> +		return 0;
>> +	msi_info->mapping_required = true;
>> +	list_for_each_entry(g, &d->group_list, next) {
>> +		ret = iommu_group_for_each_dev(g->iommu_group, msi_info,
>> +			   vfio_dev_compute_msi_map_info);
>> +		if (ret)
>> +			goto out;
>> +	}
>> +out:
>> +	list_for_each_entry_safe(iter, tmp, &msi_info->irq_chip_list, next) {
>> +		list_del(&iter->next);
>> +		kfree(iter);
>> +	}
>> +	return ret;
>> +}
>> +
>> +/**
>> + * vfio_compute_msi_map_info: compute MSI mapping requirements
>> + *
>> + * Determines whether some MSI addresses need to be mapped in the IOMMU,
>> + * with which IOMMU page sizes, and the maximum number of IOVA pages any
>> + * domain needs to map them.
>> + *
>> + * @iommu: iommu handle
>> + * @info: msi map info handle
>> + *
>> + * returns 0 upon success or -ENOMEM
>> + */
>> +static int vfio_compute_msi_map_info(struct vfio_iommu *iommu,
>> +				 struct vfio_msi_map_info *msi_info)
>> +{
>> +	int ret = 0;
>> +	struct vfio_domain *d;
>> +	unsigned long bitmap = ULONG_MAX;
>> +	unsigned int iova_pages = 0;
>> +
>> +	msi_info->mapping_required = false;
>> +
>> +	mutex_lock(&iommu->lock);
>> +	list_for_each_entry(d, &iommu->domain_list, next) {
>> +		bitmap &= d->domain->ops->pgsize_bitmap;
>> +		ret = vfio_domain_compute_msi_map_info(d, msi_info);
>> +		if (ret)
>> +			goto out;
>> +		if (msi_info->iova_pages > iova_pages)
>> +			iova_pages = msi_info->iova_pages;
>> +	}
>> +out:
>> +	msi_info->iova_pages = iova_pages;
>> +	mutex_unlock(&iommu->lock);
>> +	return ret;
>> +}
>> +
>> +#endif
>> +
>>  /*
>>   * Attempt to pin pages.  We really don't want to track all the pfns and
>>   * the iommu can only map chunks of consecutive pfns anyway, so get the
>> @@ -1179,6 +1312,20 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>>  
>>  		info.flags = VFIO_IOMMU_INFO_PGSIZES;
>>  
>> +#if defined(CONFIG_GENERIC_MSI_IRQ_DOMAIN) && defined(CONFIG_IOMMU_DMA_RESERVED)
>> +		{
>> +			struct vfio_msi_map_info msi_info;
>> +			int ret;
>> +
>> +			ret = vfio_compute_msi_map_info(iommu, &msi_info);
>> +			if (ret)
>> +				return ret;
>> +
>> +			if (msi_info.mapping_required)
>> +				info.flags |= VFIO_IOMMU_INFO_REQUIRE_MSI_MAP;
>> +			info.msi_iova_pages = msi_info.iova_pages;
>> +		}
>> +#endif
>>  		info.iova_pgsizes = vfio_pgsize_bitmap(iommu);
>>  
>>  		return copy_to_user((void __user *)arg, &info, minsz) ?
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index a49be8a..e3e501c 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -488,7 +488,9 @@ struct vfio_iommu_type1_info {
>>  	__u32	argsz;
>>  	__u32	flags;
>>  #define VFIO_IOMMU_INFO_PGSIZES (1 << 0)	/* supported page sizes info */
>> +#define VFIO_IOMMU_INFO_REQUIRE_MSI_MAP (1 << 1)/* MSI must be mapped */
>>  	__u64	iova_pgsizes;		/* Bitmap of supported page sizes */
>> +	__u32   msi_iova_pages;	/* number of IOVA pages needed to map MSIs */
>>  };
>>  
>>  #define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
> 
> Take a look at the capability chain extensions I used for adding some
> new capabilities for vfio regions and let me know why we shouldn't do
> something similar for this info ioctl.  A fixed structure gets messy
> almost instantly when we start adding new fields to it.  Thanks,

Ok

Thank you for your time

Best Regards

Eric
> 
> Alex
> 
> c84982a vfio: Define capability chains
> d7a8d5e vfio: Add capability chain helpers
> ff63eb6 vfio: Define sparse mmap capability for regions
> 188ad9d vfio/pci: Include sparse mmap capability for MSI-X table regions
> c7bb4cb vfio: Define device specific region type capability
> 28541d4 vfio/pci: Add infrastructure for additional device specific regions
> 


* Re: [PATCH v6 5/5] vfio/type1: return MSI mapping requirements with VFIO_IOMMU_GET_INFO
@ 2016-04-07 13:44       ` Eric Auger
  0 siblings, 0 replies; 48+ messages in thread
From: Eric Auger @ 2016-04-07 13:44 UTC (permalink / raw)
  To: Alex Williamson
  Cc: julien.grall-5wv7dgnIgG8, eric.auger-qxv4g6HH51o,
	jason-NLaQJdtUoK4Be96aLqz0jA, kvm-u79uwXL29TY76Z2rM5mHXA,
	patches-QSEj5FYQhm4dnm+yROfE0A, marc.zyngier-5wv7dgnIgG8,
	p.fedin-Sze3O3UU22JBDgjK7y7TUQ, will.deacon-5wv7dgnIgG8,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Manish.Jaggi-M3mlKVOIwJVv6pq1l3V1OdBPR1lH4CV8,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	pranav.sawargaonkar-Re5JQEeQqe8AvxtiuMwx3w,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	tglx-hfZtesqFncYOwBW4kG4KsQ,
	kvmarm-FPEHb7Xf0XXUo1n7N8X6UoWGPAHP3yOg,
	christoffer.dall-QSEj5FYQhm4dnm+yROfE0A

On 04/07/2016 12:32 AM, Alex Williamson wrote:
> On Mon,  4 Apr 2016 08:30:11 +0000
> Eric Auger <eric.auger-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org> wrote:
> 
>> This patch allows the user-space to know whether MSI addresses need to
>> be mapped in the IOMMU. The user-space uses VFIO_IOMMU_GET_INFO ioctl and
>> IOMMU_INFO_REQUIRE_MSI_MAP gets set if they need to.
>>
>> Also the number of IOMMU pages requested to map those is returned in
>> msi_iova_pages field. User-space must use this information to allocate an
>> IOVA contiguous region of size msi_iova_pages * ffs(iova_pgsizes) and pass
>> it with VFIO_IOMMU_MAP_DMA iotcl (VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA set).
>>
>> Signed-off-by: Eric Auger <eric.auger-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org>
>>
>> ---
>>
>> Currently it is assumed a single doorbell page is used per MSI controller.
>> This is the case for known ARM MSI controllers (GICv2M, GICv3 ITS, ...).
>> If an MSI controller were to expose more doorbells it could implement a
>> new callback at irq_chip interface.
>>
>> v4 -> v5:
>> - move msi_info and ret declaration within the conditional code
>>
>> v3 -> v4:
>> - replace former vfio_domains_require_msi_mapping by
>>   more complex computation of MSI mapping requirements, especially the
>>   number of pages to be provided by the user-space.
>> - reword patch title
>>
>> RFC v1 -> v1:
>> - derived from
>>   [RFC PATCH 3/6] vfio: Extend iommu-info to return MSIs automap state
>> - renamed allow_msi_reconfig into require_msi_mapping
>> - fixed VFIO_IOMMU_GET_INFO
>> ---
>>  drivers/vfio/vfio_iommu_type1.c | 147 ++++++++++++++++++++++++++++++++++++++++
>>  include/uapi/linux/vfio.h       |   2 +
>>  2 files changed, 149 insertions(+)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index b330b81..f1def50 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -39,6 +39,7 @@
>>  #include <linux/dma-reserved-iommu.h>
>>  #include <linux/irqdomain.h>
>>  #include <linux/msi.h>
>> +#include <linux/irq.h>
>>  
>>  #define DRIVER_VERSION  "0.2"
>>  #define DRIVER_AUTHOR   "Alex Williamson <alex.williamson-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>"
>> @@ -95,6 +96,17 @@ struct vfio_group {
>>  	struct list_head	next;
>>  };
>>  
>> +struct vfio_irq_chip {
>> +	struct list_head next;
>> +	struct irq_chip *chip;
>> +};
>> +
>> +struct vfio_msi_map_info {
>> +	bool mapping_required;
>> +	unsigned int iova_pages;
>> +	struct list_head irq_chip_list;
>> +};
>> +
>>  /*
>>   * This code handles mapping and unmapping of user data buffers
>>   * into DMA'ble space using the IOMMU
>> @@ -267,6 +279,127 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
>>  	return ret;
>>  }
>>  
>> +#if defined(CONFIG_GENERIC_MSI_IRQ_DOMAIN) && defined(CONFIG_IOMMU_DMA_RESERVED)
>> +/**
>> + * vfio_dev_compute_msi_map_info: augment MSI mapping info (@data) with
>> + * the @dev device requirements.
>> + *
>> + * @dev: device handle
>> + * @data: opaque pointing to a struct vfio_msi_map_info
>> + *
>> + * returns 0 upon success or -ENOMEM
>> + */
>> +static int vfio_dev_compute_msi_map_info(struct device *dev, void *data)
>> +{
>> +	struct irq_domain *domain;
>> +	struct msi_domain_info *info;
>> +	struct vfio_msi_map_info *msi_info = (struct vfio_msi_map_info *)data;
>> +	struct irq_chip *chip;
>> +	struct vfio_irq_chip *iter, *new;
>> +
>> +	domain = dev_get_msi_domain(dev);
>> +	if (!domain)
>> +		return 0;
>> +
>> +	/* Let's compute the needs for the MSI domain */
>> +	info = msi_get_domain_info(domain);
>> +	chip = info->chip;
>> +	list_for_each_entry(iter, &msi_info->irq_chip_list, next) {
>> +		if (iter->chip == chip)
>> +			return 0;
>> +	}
>> +
>> +	new = kzalloc(sizeof(*new), GFP_KERNEL);
>> +	if (!new)
>> +		return -ENOMEM;
>> +
>> +	new->chip = chip;
>> +
>> +	list_add(&new->next, &msi_info->irq_chip_list);
>> +
>> +	/*
>> +	 * new irq_chip to be taken into account; we currently assume
>> +	 * a single iova doorbell by irq chip requesting MSI mapping
>> +	 */
>> +	msi_info->iova_pages += 1;
>> +	return 0;
>> +}
>> +
>> +/**
>> + * vfio_domain_compute_msi_map_info: compute MSI mapping requirements (@data)
>> + * for vfio_domain @d
>> + *
>> + * @d: vfio domain handle
>> + * @data: opaque pointing to a struct vfio_msi_map_info
>> + *
>> + * returns 0 upon success or -ENOMEM
>> + */
>> +static int vfio_domain_compute_msi_map_info(struct vfio_domain *d, void *data)
>> +{
>> +	int ret = 0;
>> +	struct vfio_msi_map_info *msi_info = (struct vfio_msi_map_info *)data;
>> +	struct vfio_irq_chip *iter, *tmp;
>> +	struct vfio_group *g;
>> +
>> +	msi_info->iova_pages = 0;
>> +	INIT_LIST_HEAD(&msi_info->irq_chip_list);
>> +
>> +	if (iommu_domain_get_attr(d->domain,
>> +				   DOMAIN_ATTR_MSI_MAPPING, NULL))
>> +		return 0;
>> +	msi_info->mapping_required = true;
>> +	list_for_each_entry(g, &d->group_list, next) {
>> +		ret = iommu_group_for_each_dev(g->iommu_group, msi_info,
>> +			   vfio_dev_compute_msi_map_info);
>> +		if (ret)
>> +			goto out;
>> +	}
>> +out:
>> +	list_for_each_entry_safe(iter, tmp, &msi_info->irq_chip_list, next) {
>> +		list_del(&iter->next);
>> +		kfree(iter);
>> +	}
>> +	return ret;
>> +}
>> +
>> +/**
>> + * vfio_compute_msi_map_info: compute MSI mapping requirements
>> + *
>> + * Do some MSI addresses need to be mapped? IOMMU page size?
>> + * Max number of IOVA pages needed by any domain to map MSI
>> + *
>> + * @iommu: iommu handle
>> + * @info: msi map info handle
>> + *
>> + * returns 0 upon success or -ENOMEM
>> + */
>> +static int vfio_compute_msi_map_info(struct vfio_iommu *iommu,
>> +				 struct vfio_msi_map_info *msi_info)
>> +{
>> +	int ret = 0;
>> +	struct vfio_domain *d;
>> +	unsigned long bitmap = ULONG_MAX;
>> +	unsigned int iova_pages = 0;
>> +
>> +	msi_info->mapping_required = false;
>> +
>> +	mutex_lock(&iommu->lock);
>> +	list_for_each_entry(d, &iommu->domain_list, next) {
>> +		bitmap &= d->domain->ops->pgsize_bitmap;
>> +		ret = vfio_domain_compute_msi_map_info(d, msi_info);
>> +		if (ret)
>> +			goto out;
>> +		if (msi_info->iova_pages > iova_pages)
>> +			iova_pages = msi_info->iova_pages;
>> +	}
>> +out:
>> +	msi_info->iova_pages = iova_pages;
>> +	mutex_unlock(&iommu->lock);
>> +	return ret;
>> +}
>> +
>> +#endif
>> +
>>  /*
>>   * Attempt to pin pages.  We really don't want to track all the pfns and
>>   * the iommu can only map chunks of consecutive pfns anyway, so get the
>> @@ -1179,6 +1312,20 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>>  
>>  		info.flags = VFIO_IOMMU_INFO_PGSIZES;
>>  
>> +#if defined(CONFIG_GENERIC_MSI_IRQ_DOMAIN) && defined(CONFIG_IOMMU_DMA_RESERVED)
>> +		{
>> +			struct vfio_msi_map_info msi_info;
>> +			int ret;
>> +
>> +			ret = vfio_compute_msi_map_info(iommu, &msi_info);
>> +			if (ret)
>> +				return ret;
>> +
>> +			if (msi_info.mapping_required)
>> +				info.flags |= VFIO_IOMMU_INFO_REQUIRE_MSI_MAP;
>> +			info.msi_iova_pages = msi_info.iova_pages;
>> +		}
>> +#endif
>>  		info.iova_pgsizes = vfio_pgsize_bitmap(iommu);
>>  
>>  		return copy_to_user((void __user *)arg, &info, minsz) ?
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index a49be8a..e3e501c 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -488,7 +488,9 @@ struct vfio_iommu_type1_info {
>>  	__u32	argsz;
>>  	__u32	flags;
>>  #define VFIO_IOMMU_INFO_PGSIZES (1 << 0)	/* supported page sizes info */
>> +#define VFIO_IOMMU_INFO_REQUIRE_MSI_MAP (1 << 1)/* MSI must be mapped */
>>  	__u64	iova_pgsizes;		/* Bitmap of supported page sizes */
>> +	__u32   msi_iova_pages;	/* number of IOVA pages needed to map MSIs */
>>  };
>>  
>>  #define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
> 
> Take a look at the capability chain extensions I used for adding some
> new capabilities for vfio regions and let me know why we shouldn't do
> something similar for this info ioctl.  A fixed structure gets messy
> almost instantly when we start adding new fields to it.  Thanks,

Ok

Thank you for your time

Best Regards

Eric
> 
> Alex
> 
> c84982a vfio: Define capability chains
> d7a8d5e vfio: Add capability chain helpers
> ff63eb6 vfio: Define sparse mmap capability for regions
> 188ad9d vfio/pci: Include sparse mmap capability for MSI-X table regions
> c7bb4cb vfio: Define device specific region type capability
> 28541d4 vfio/pci: Add infrastructure for additional device specific regions
> 
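For reference, the capability chain Alex refers to (the commits listed
above) prefixes every capability with a fixed header and links them by
offset; user-space walks the chain until next is 0. The header below
matches those commits, while the MSI-geometry capability is only a
sketch of how the msi_iova_pages information could travel through such
a chain: its name, ID value and layout are assumptions, not code from
any posted patch.

#include <linux/types.h>

/* chain header as defined by c84982a and d7a8d5e */
struct vfio_info_cap_header {
	__u16	id;		/* capability identifier */
	__u16	version;	/* capability version */
	__u32	next;		/* offset of next capability, 0 if last */
};

/* hypothetical capability carrying the MSI mapping requirements */
#define VFIO_IOMMU_TYPE1_INFO_CAP_MSI_GEOMETRY	1	/* assumed ID */
struct vfio_iommu_type1_info_cap_msi_geometry {
	struct vfio_info_cap_header	header;
	__u32	flags;
#define VFIO_IOMMU_MSI_GEOMETRY_MAP_REQUIRED	(1 << 0)
	__u32	iova_pages;	/* IOVA pages user-space must reserve */
};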

^ permalink raw reply	[flat|nested] 48+ messages in thread
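Putting the two halves of the proposed interface together, a hedged
userspace sketch of the flow these patches define. It assumes the
series is applied (VFIO_IOMMU_INFO_REQUIRE_MSI_MAP, msi_iova_pages and
VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA do not exist otherwise), and
MSI_IOVA_BASE stands in for whatever IOVA range the guest leaves
unused:

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

#define MSI_IOVA_BASE	0x8000000ULL	/* arbitrary free IOVA, illustration only */

static int reserve_msi_window(int container)
{
	struct vfio_iommu_type1_info info;
	struct vfio_iommu_type1_dma_map map;
	uint64_t pgsize;

	memset(&info, 0, sizeof(info));
	info.argsz = sizeof(info);
	if (ioctl(container, VFIO_IOMMU_GET_INFO, &info))
		return -1;

	/* x86 and friends map MSIs implicitly: nothing to do */
	if (!(info.flags & VFIO_IOMMU_INFO_REQUIRE_MSI_MAP))
		return 0;

	/* smallest supported IOMMU page size = lowest set bit of the bitmap */
	pgsize = info.iova_pgsizes & ~(info.iova_pgsizes - 1);

	memset(&map, 0, sizeof(map));
	map.argsz = sizeof(map);
	map.flags = VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA;	/* vaddr/prot ignored */
	map.iova = MSI_IOVA_BASE;
	map.size = (uint64_t)info.msi_iova_pages * pgsize;

	return ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
}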

* Re: [PATCH v6 2/5] vfio: allow the user to register reserved iova range for MSI mapping
  2016-04-07 13:43       ` Eric Auger
@ 2016-04-07 18:29         ` Alex Williamson
  -1 siblings, 0 replies; 48+ messages in thread
From: Alex Williamson @ 2016-04-07 18:29 UTC (permalink / raw)
  To: Eric Auger
  Cc: eric.auger, robin.murphy, will.deacon, joro, tglx, jason,
	marc.zyngier, christoffer.dall, linux-arm-kernel, kvmarm, kvm,
	suravee.suthikulpanit, patches, linux-kernel, Manish.Jaggi,
	Bharat.Bhushan, pranav.sawargaonkar, p.fedin, iommu,
	Jean-Philippe.Brucker, julien.grall

On Thu, 7 Apr 2016 15:43:29 +0200
Eric Auger <eric.auger@linaro.org> wrote:

> Hi Alex,
> On 04/07/2016 12:07 AM, Alex Williamson wrote:
> > On Mon,  4 Apr 2016 08:30:08 +0000
> > Eric Auger <eric.auger@linaro.org> wrote:
> >   
> >> The user is allowed to [un]register a reserved IOVA range by using the
> >> DMA MAP API and setting the new flag: VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA.
> >> It provides the base address and the size. This region is stored in the
> >> vfio_dma rb tree. At that point the iova range is not mapped to any target
> >> address yet. The host kernel will use those iova when needed, typically
> >> when the VFIO-PCI device allocates its MSIs.
> >>
> >> This patch also handles the destruction of the reserved binding RB-tree and
> >> domain's iova_domains.
> >>
> >> Signed-off-by: Eric Auger <eric.auger@linaro.org>
> >> Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
> >>
> >> ---
> >> v3 -> v4:
> >> - use iommu_alloc/free_reserved_iova_domain exported by dma-reserved-iommu
> >> - protect vfio_register_reserved_iova_range implementation with
> >>   CONFIG_IOMMU_DMA_RESERVED
> >> - handle unregistration by user-space and on vfio_iommu_type1 release
> >>
> >> v1 -> v2:
> >> - set returned value according to alloc_reserved_iova_domain result
> >> - free the iova domains in case any error occurs
> >>
> >> RFC v1 -> v1:
> >> - takes into account Alex comments, based on
> >>   [RFC PATCH 1/6] vfio: Add interface for add/del reserved iova region:
> >> - use the existing dma map/unmap ioctl interface with a flag to register
> >>   a reserved IOVA range. A single reserved iova region is allowed.
> >>
> >> Conflicts:
> >> 	drivers/vfio/vfio_iommu_type1.c
> >> ---
> >>  drivers/vfio/vfio_iommu_type1.c | 141 +++++++++++++++++++++++++++++++++++++++-
> >>  include/uapi/linux/vfio.h       |  12 +++-
> >>  2 files changed, 150 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> >> index c9ddbde..4497b20 100644
> >> --- a/drivers/vfio/vfio_iommu_type1.c
> >> +++ b/drivers/vfio/vfio_iommu_type1.c
> >> @@ -36,6 +36,7 @@
> >>  #include <linux/uaccess.h>
> >>  #include <linux/vfio.h>
> >>  #include <linux/workqueue.h>
> >> +#include <linux/dma-reserved-iommu.h>
> >>  
> >>  #define DRIVER_VERSION  "0.2"
> >>  #define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
> >> @@ -403,10 +404,22 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
> >>  	vfio_lock_acct(-unlocked);
> >>  }
> >>  
> >> +static void vfio_unmap_reserved(struct vfio_iommu *iommu)
> >> +{
> >> +#ifdef CONFIG_IOMMU_DMA_RESERVED
> >> +	struct vfio_domain *d;
> >> +
> >> +	list_for_each_entry(d, &iommu->domain_list, next)
> >> +		iommu_unmap_reserved(d->domain);
> >> +#endif
> >> +}
> >> +
> >>  static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
> >>  {
> >>  	if (likely(dma->type != VFIO_IOVA_RESERVED))
> >>  		vfio_unmap_unpin(iommu, dma);
> >> +	else
> >> +		vfio_unmap_reserved(iommu);
> >>  	vfio_unlink_dma(iommu, dma);
> >>  	kfree(dma);
> >>  }  
> > 
> > This makes me nervous, apparently we can add reserved mappings
> > individually, but we have absolutely no granularity on remove, so if we
> > remove one, we've removed them all even though we still have them
> > linked in our rb tree.  I see later that only one reserved region is
> > allowed, but that seems very short sighted, especially to impose that
> > on the user level API.  
> On the kernel side the reserved region is currently backed by a unique
> iova_domain. Do you mean you would like me to handle a list of
> iova_domains instead of using a single "cookie"?

TBH, I'm not really sure how this works with a single iova domain.  If
we have multiple irq chips and each gets mapped by a separate page in
the iova space, then is it really sufficient to do a lookup from the
irq_data to the msi_desc to the device to the domain in order to get a
reserved iova to map that msi doorbell?  Don't we need an iova from the
pool mapping the specific irqchip associated with our device?  The IOMMU
domain might span any number of irq chips, how can we assume there's
only one reserved iova space? Maybe I'm not understanding how the code
works.

Conceptually, this is a generic IOMMU API extension to include reserved
iova space, MSI mappings are a consumer of that reserved iova pool but
I don't think we can say they will necessarily be the only consumer.
So building into the interface that there's only one is like making a
fixed length array to hold a string, it works for the initial
implementation, but it's not a robust solution.

I'm also trying to figure out how this maps to x86, the difference of
course being that for ARM you have a user specified, explicit MSI iova
space while x86 has an implicit MSI iova space.  So should x86 be
creating a reserved iova pool for the implicit mapping?  Should the
user have some way to query the mapping, whether implicit or explicit?
For instance, a new capability within the vfio iommu INFO ioctl might
expose reserved regions.  It might be initially present on x86 due to
the implicit nature of the reservation, while it might only appear on
ARM after submitting a reserved mapping.  Thanks,

Alex
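One possible shape for the capability floated above, reusing the chain
header from the region-info series; the name, ID, flag and layout are
illustrative only and appear in no posted patch:

struct vfio_iommu_type1_info_cap_reserved_iova {
	struct vfio_info_cap_header	header;
	__u32	flags;
#define VFIO_RESERVED_IOVA_IMPLICIT	(1 << 0)	/* e.g. the x86 MSI window */
	__u32	nr_regions;
	struct {
		__u64	base;	/* start of reserved IOVA region */
		__u64	size;	/* length in bytes */
	} regions[];	/* nr_regions entries follow */
};

On x86 the kernel could pre-populate one implicit entry; on ARM,
entries would appear as the user registers reserved ranges, which is
exactly the asymmetry described above.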

 
> >> @@ -489,7 +502,8 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> >>  	 */
> >>  	if (iommu->v2) {
> >>  		dma = vfio_find_dma(iommu, unmap->iova, 0);
> >> -		if (dma && dma->iova != unmap->iova) {
> >> +		if (dma && (dma->iova != unmap->iova ||
> >> +			   (dma->type == VFIO_IOVA_RESERVED))) {  
> > 
> > This seems unnecessary, won't the reserved entries fall out in the
> > while loop below?  
> yes that's correct
> >   
> >>  			ret = -EINVAL;
> >>  			goto unlock;
> >>  		}
> >> @@ -501,6 +515,10 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> >>  	}
> >>  
> >>  	while ((dma = vfio_find_dma(iommu, unmap->iova, unmap->size))) {
> >> +		if (dma->type == VFIO_IOVA_RESERVED) {
> >> +			ret = -EINVAL;
> >> +			goto unlock;
> >> +		}  
> > 
> > Hmm, API concerns here.  Previously a user could unmap from iova = 0 to
> > size = 2^64 - 1 and expect all mappings to get cleared.  Now they can't
> > do that if they've registered any reserved regions.  Seems like maybe
> > we should ignore it and continue instead of abort, but then we need to
> > change the parameters of vfio_find_dma() to get it to move on, or pass
> > the type to the function, which would prevent us from getting here in
> > the first place.  
> OK I will rework this to match the existing use cases
> >   
> >>  		if (!iommu->v2 && unmap->iova > dma->iova)
> >>  			break;
> >>  		unmapped += dma->size;
> >> @@ -650,6 +668,114 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
> >>  	return ret;
> >>  }
> >>  
> >> +static int vfio_register_reserved_iova_range(struct vfio_iommu *iommu,
> >> +			   struct vfio_iommu_type1_dma_map *map)
> >> +{
> >> +#ifdef CONFIG_IOMMU_DMA_RESERVED
> >> +	dma_addr_t iova = map->iova;
> >> +	size_t size = map->size;
> >> +	uint64_t mask;
> >> +	struct vfio_dma *dma;
> >> +	int ret = 0;
> >> +	struct vfio_domain *d;
> >> +	unsigned long order;
> >> +
> >> +	/* Verify that none of our __u64 fields overflow */
> >> +	if (map->size != size || map->iova != iova)
> >> +		return -EINVAL;
> >> +
> >> +	order =  __ffs(vfio_pgsize_bitmap(iommu));
> >> +	mask = ((uint64_t)1 << order) - 1;
> >> +
> >> +	WARN_ON(mask & PAGE_MASK);
> >> +
> >> +	if (!size || (size | iova) & mask)
> >> +		return -EINVAL;
> >> +
> >> +	/* Don't allow IOVA address wrap */
> >> +	if (iova + size - 1 < iova)
> >> +		return -EINVAL;
> >> +
> >> +	mutex_lock(&iommu->lock);
> >> +
> >> +	if (vfio_find_dma(iommu, iova, size)) {
> >> +		ret =  -EEXIST;
> >> +		goto out;
> >> +	}
> >> +
> >> +	dma = kzalloc(sizeof(*dma), GFP_KERNEL);
> >> +	if (!dma) {
> >> +		ret = -ENOMEM;
> >> +		goto out;
> >> +	}
> >> +
> >> +	dma->iova = iova;
> >> +	dma->size = size;
> >> +	dma->type = VFIO_IOVA_RESERVED;
> >> +
> >> +	list_for_each_entry(d, &iommu->domain_list, next)
> >> +		ret |= iommu_alloc_reserved_iova_domain(d->domain, iova,
> >> +							size, order);
> >> +
> >> +	if (ret) {
> >> +		list_for_each_entry(d, &iommu->domain_list, next)
> >> +			iommu_free_reserved_iova_domain(d->domain);
> >> +		goto out;
> >> +	}
> >> +
> >> +	vfio_link_dma(iommu, dma);
> >> +
> >> +out:
> >> +	mutex_unlock(&iommu->lock);
> >> +	return ret;
> >> +#else /* CONFIG_IOMMU_DMA_RESERVED */
> >> +	return -ENODEV;
> >> +#endif
> >> +}
> >> +
> >> +static void vfio_unregister_reserved_iova_range(struct vfio_iommu *iommu,
> >> +				struct vfio_iommu_type1_dma_unmap *unmap)
> >> +{
> >> +#ifdef CONFIG_IOMMU_DMA_RESERVED
> >> +	dma_addr_t iova = unmap->iova;
> >> +	struct vfio_dma *dma;
> >> +	size_t size = unmap->size;
> >> +	uint64_t mask;
> >> +	unsigned long order;
> >> +
> >> +	/* Verify that none of our __u64 fields overflow */
> >> +	if (unmap->size != size || unmap->iova != iova)
> >> +		return;
> >> +
> >> +	order =  __ffs(vfio_pgsize_bitmap(iommu));
> >> +	mask = ((uint64_t)1 << order) - 1;
> >> +
> >> +	WARN_ON(mask & PAGE_MASK);
> >> +
> >> +	if (!size || (size | iova) & mask)
> >> +		return;
> >> +
> >> +	/* Don't allow IOVA address wrap */
> >> +	if (iova + size - 1 < iova)
> >> +		return;
> >> +
> >> +	mutex_lock(&iommu->lock);
> >> +
> >> +	dma = vfio_find_dma(iommu, iova, size);
> >> +
> >> +	if (!dma || (dma->type != VFIO_IOVA_RESERVED)) {
> >> +		unmap->size = 0;
> >> +		goto out;
> >> +	}
> >> +
> >> +	unmap->size =  dma->size;
> >> +	vfio_remove_dma(iommu, dma);
> >> +
> >> +out:
> >> +	mutex_unlock(&iommu->lock);
> >> +#endif  
> > 
> > Having a find_dma that accepts a type and a remove_reserved here seems
> > like it might simplify things.
> >   
> >> +}
> >> +
> >>  static int vfio_bus_type(struct device *dev, void *data)
> >>  {
> >>  	struct bus_type **bus = data;
> >> @@ -946,6 +1072,7 @@ static void vfio_iommu_type1_release(void *iommu_data)
> >>  	struct vfio_group *group, *group_tmp;
> >>  
> >>  	vfio_iommu_unmap_unpin_all(iommu);
> >> +	vfio_unmap_reserved(iommu);  
> > 
> > If we call vfio_unmap_reserved() here, then why does vfio_remove_dma()
> > need to handle reserved entries?  We might as well have a separate
> > vfio_remove_reserved_dma().
> >   
> >>  
> >>  	list_for_each_entry_safe(domain, domain_tmp,
> >>  				 &iommu->domain_list, next) {
> >> @@ -1020,7 +1147,8 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> >>  	} else if (cmd == VFIO_IOMMU_MAP_DMA) {
> >>  		struct vfio_iommu_type1_dma_map map;
> >>  		uint32_t mask = VFIO_DMA_MAP_FLAG_READ |
> >> -				VFIO_DMA_MAP_FLAG_WRITE;
> >> +				VFIO_DMA_MAP_FLAG_WRITE |
> >> +				VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA;
> >>  
> >>  		minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
> >>  
> >> @@ -1030,6 +1158,9 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> >>  		if (map.argsz < minsz || map.flags & ~mask)
> >>  			return -EINVAL;
> >>  
> >> +		if (map.flags & VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA)
> >> +			return vfio_register_reserved_iova_range(iommu, &map);
> >> +
> >>  		return vfio_dma_do_map(iommu, &map);
> >>  
> >>  	} else if (cmd == VFIO_IOMMU_UNMAP_DMA) {
> >> @@ -1044,10 +1175,16 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> >>  		if (unmap.argsz < minsz || unmap.flags)
> >>  			return -EINVAL;
> >>  
> >> +		if (unmap.flags & VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA) {
> >> +			vfio_unregister_reserved_iova_range(iommu, &unmap);
> >> +			goto out;
> >> +		}
> >> +
> >>  		ret = vfio_dma_do_unmap(iommu, &unmap);
> >>  		if (ret)
> >>  			return ret;
> >>  
> >> +out:
> >>  		return copy_to_user((void __user *)arg, &unmap, minsz) ?
> >>  			-EFAULT : 0;
> >>  	}
> >> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >> index 255a211..a49be8a 100644
> >> --- a/include/uapi/linux/vfio.h
> >> +++ b/include/uapi/linux/vfio.h
> >> @@ -498,12 +498,21 @@ struct vfio_iommu_type1_info {
> >>   *
> >>   * Map process virtual addresses to IO virtual addresses using the
> >>   * provided struct vfio_dma_map. Caller sets argsz. READ &/ WRITE required.
> >> + *
> >> + * In case MSI_RESERVED_IOVA flag is set, the API only aims at registering an
> >> + * IOVA region which will be used on some platforms to map the host MSI frame.
> >> + * in that specific case, vaddr and prot are ignored. The requirement for
> >> + * provisioning such IOVA range can be checked by calling VFIO_IOMMU_GET_INFO
> >> + * with the VFIO_IOMMU_INFO_REQUIRE_MSI_MAP attribute. A single
> >> + * MSI_RESERVED_IOVA region can be registered
> >>   */  
> > 
> > Why do we ignore read/write flags?  I'm not sure how useful a read-only
> > reserved region might be, but certainly some platforms might support
> > write-only or read-write.  Isn't this something we should let the IOMMU
> > driver decide?  ie. pass it down and let it fail or not?  
> OK Makes sense. Actually I am not very clear about whether this API is
> used for MSI binding only or likely to be used for something else.
> 
>   Also why are
> > we making it the API spec to only allow a single reserved region of
> > this type?  We could simply let additional ones fail, or better yet add
> > a capability to the info ioctl to indicate the number available and
> > then fail if the user exceeds it.  
> But this means that underneath we need to manage several iova_domains,
> right?
> >   
> >>  struct vfio_iommu_type1_dma_map {
> >>  	__u32	argsz;
> >>  	__u32	flags;
> >>  #define VFIO_DMA_MAP_FLAG_READ (1 << 0)		/* readable from device */
> >>  #define VFIO_DMA_MAP_FLAG_WRITE (1 << 1)	/* writable from device */
> >> +/* reserved iova for MSI vectors*/
> >> +#define VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA (1 << 2)  
> > 
> > nit, ...RESERVED_MSI_IOVA makes a tad more sense and if we add new
> > reserved flags seems like it puts the precedence in order.  
> OK
> >   
> >>  	__u64	vaddr;				/* Process virtual address */
> >>  	__u64	iova;				/* IO virtual address */
> >>  	__u64	size;				/* Size of mapping (bytes) */
> >> @@ -519,7 +528,8 @@ struct vfio_iommu_type1_dma_map {
> >>   * Caller sets argsz.  The actual unmapped size is returned in the size
> >>   * field.  No guarantee is made to the user that arbitrary unmaps of iova
> >>   * or size different from those used in the original mapping call will
> >> - * succeed.
> >> + * succeed. A Reserved DMA region must be unmapped with MSI_RESERVED_IOVA
> >> + * flag set.  
> > 
> > So map/unmap become bi-modal, with this flag set they should only
> > operate on reserved entries, otherwise they should operate on legacy
> > entries.  So clearly as a user I should be able to continue doing an
> > unmap from 0-(-1) of legacy entries and not stumble over reserved
> > entries.  Thanks,  
> OK that's clear
> 
> Best Regards
> 
> Eric
> > 
> > Alex
> >   
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v6 2/5] vfio: allow the user to register reserved iova range for MSI mapping
@ 2016-04-08 15:48           ` Eric Auger
  0 siblings, 0 replies; 48+ messages in thread
From: Eric Auger @ 2016-04-08 15:48 UTC (permalink / raw)
  To: Alex Williamson
  Cc: eric.auger, robin.murphy, will.deacon, joro, tglx, jason,
	marc.zyngier, christoffer.dall, linux-arm-kernel, kvmarm, kvm,
	suravee.suthikulpanit, patches, linux-kernel, Manish.Jaggi,
	Bharat.Bhushan, pranav.sawargaonkar, p.fedin, iommu,
	Jean-Philippe.Brucker, julien.grall

Hi Alex,
On 04/07/2016 08:29 PM, Alex Williamson wrote:
> On Thu, 7 Apr 2016 15:43:29 +0200
> Eric Auger <eric.auger@linaro.org> wrote:
> 
>> Hi Alex,
>> On 04/07/2016 12:07 AM, Alex Williamson wrote:
>>> On Mon,  4 Apr 2016 08:30:08 +0000
>>> Eric Auger <eric.auger@linaro.org> wrote:
>>>   
>>>> The user is allowed to [un]register a reserved IOVA range by using the
>>>> DMA MAP API and setting the new flag: VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA.
>>>> It provides the base address and the size. This region is stored in the
>>>> vfio_dma rb tree. At that point the iova range is not mapped to any target
>>>> address yet. The host kernel will use those iova when needed, typically
>>>> when the VFIO-PCI device allocates its MSIs.
>>>>
>>>> This patch also handles the destruction of the reserved binding RB-tree and
>>>> domain's iova_domains.
>>>>
>>>> Signed-off-by: Eric Auger <eric.auger@linaro.org>
>>>> Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
>>>>
>>>> ---
>>>> v3 -> v4:
>>>> - use iommu_alloc/free_reserved_iova_domain exported by dma-reserved-iommu
>>>> - protect vfio_register_reserved_iova_range implementation with
>>>>   CONFIG_IOMMU_DMA_RESERVED
>>>> - handle unregistration by user-space and on vfio_iommu_type1 release
>>>>
>>>> v1 -> v2:
>>>> - set returned value according to alloc_reserved_iova_domain result
>>>> - free the iova domains in case any error occurs
>>>>
>>>> RFC v1 -> v1:
>>>> - takes into account Alex comments, based on
>>>>   [RFC PATCH 1/6] vfio: Add interface for add/del reserved iova region:
>>>> - use the existing dma map/unmap ioctl interface with a flag to register
>>>>   a reserved IOVA range. A single reserved iova region is allowed.
>>>>
>>>> Conflicts:
>>>> 	drivers/vfio/vfio_iommu_type1.c
>>>> ---
>>>>  drivers/vfio/vfio_iommu_type1.c | 141 +++++++++++++++++++++++++++++++++++++++-
>>>>  include/uapi/linux/vfio.h       |  12 +++-
>>>>  2 files changed, 150 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>>> index c9ddbde..4497b20 100644
>>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>>> @@ -36,6 +36,7 @@
>>>>  #include <linux/uaccess.h>
>>>>  #include <linux/vfio.h>
>>>>  #include <linux/workqueue.h>
>>>> +#include <linux/dma-reserved-iommu.h>
>>>>  
>>>>  #define DRIVER_VERSION  "0.2"
>>>>  #define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
>>>> @@ -403,10 +404,22 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
>>>>  	vfio_lock_acct(-unlocked);
>>>>  }
>>>>  
>>>> +static void vfio_unmap_reserved(struct vfio_iommu *iommu)
>>>> +{
>>>> +#ifdef CONFIG_IOMMU_DMA_RESERVED
>>>> +	struct vfio_domain *d;
>>>> +
>>>> +	list_for_each_entry(d, &iommu->domain_list, next)
>>>> +		iommu_unmap_reserved(d->domain);
>>>> +#endif
>>>> +}
>>>> +
>>>>  static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
>>>>  {
>>>>  	if (likely(dma->type != VFIO_IOVA_RESERVED))
>>>>  		vfio_unmap_unpin(iommu, dma);
>>>> +	else
>>>> +		vfio_unmap_reserved(iommu);
>>>>  	vfio_unlink_dma(iommu, dma);
>>>>  	kfree(dma);
>>>>  }  
>>>
>>> This makes me nervous, apparently we can add reserved mappings
>>> individually, but we have absolutely no granularity on remove, so if we
>>> remove one, we've removed them all even though we still have them
>>> linked in our rb tree.  I see later that only one reserved region is
>>> allowed, but that seems very short sighted, especially to impose that
>>> on the user level API.  
>> On the kernel side the reserved region is currently backed by a unique
>> iova_domain. Do you mean you would like me to handle a list of
>> iova_domains instead of using a single "cookie"?
> 
> TBH, I'm not really sure how this works with a single iova domain.  If
> we have multiple irq chips and each gets mapped by a separate page in
> the iova space, then is it really sufficient to do a lookup from the
> irq_data to the msi_desc to the device to the domain in order to get a
> reserved iova to map that msi doorbell?  Don't we need an iova from the
> pool mapping the specific irqchip associated with our device?  The IOMMU
> domain might span any number of irq chips, how can we assume there's
> only one reserved iova space? Maybe I'm not understanding how the code
> works.

In vfio_iommu_type1 we currently compute the reserved iova needs of
each domain and take the max. Each domain is then assigned a reserved
iova domain of this max size.

So let's say domain1 has the largest needs (say 2 doorbells)
domain 1: iova domain size = 2
dev A --> doorbell 1
dev B -> doorbell 1
dev C -> doorbell 2
2 iova pages are used

domain 2: iova domain size = 2
dev D -> doorbell 1
1 iova page is used.

Do you see something wrong here?
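A toy, standalone rendering of this example, under the series'
assumption of one IOVA page per doorbell (irq_chip): deduplicate the
doorbells each domain's devices target, then size every reserved pool
to the maximum.

#include <stdio.h>

/* count distinct doorbell ids in a device -> doorbell assignment */
static unsigned int distinct(const int *db, int n)
{
	unsigned int count = 0;

	for (int i = 0; i < n; i++) {
		int seen = 0;

		for (int j = 0; j < i; j++)
			if (db[j] == db[i])
				seen = 1;
		count += !seen;
	}
	return count;
}

int main(void)
{
	int domain1[] = { 1, 1, 2 };	/* devs A, B -> doorbell 1; C -> 2 */
	int domain2[] = { 1 };		/* dev D -> doorbell 1 */
	unsigned int n1 = distinct(domain1, 3);
	unsigned int n2 = distinct(domain2, 1);

	/* every domain's reserved iova domain is sized to the max */
	printf("domain1: %u page(s), domain2: %u, pool size: %u\n",
	       n1, n2, n1 > n2 ? n1 : n2);	/* prints 2, 1, 2 */
	return 0;
}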

> 
> Conceptually, this is a generic IOMMU API extension to include reserved
> iova space, MSI mappings are a consumer of that reserved iova pool but
> I don't think we can say they will necessarily be the only consumer.
> So building into the interface that there's only one is like making a
> fixed length array to hold a string, it works for the initial
> implementation, but it's not a robust solution.

I see. On the other hand, the code today is quite specific to the MSI
binding problem (rb-tree indexed on PA, locking, ...). Argh, a storm
in a teacup...
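For readers following the internals, the parenthetical above suggests
the dma-reserved-iommu layer tracks each doorbell with roughly this
kind of node; the field names are guesses inferred from this thread,
not code quoted from the series:

struct reserved_binding {
	struct kref	kref;	/* shared by all MSIs hitting this doorbell */
	struct rb_node	node;	/* rb-tree keyed on addr, i.e. the PA */
	phys_addr_t	addr;	/* doorbell physical address */
	dma_addr_t	iova;	/* reserved IOVA it is mapped at */
	size_t		size;
};

Keying the tree on the physical address is precisely what ties the
pool to the MSI use case.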

Best Regards

Eric
> 
> I'm also trying to figure out how this maps to x86, the difference of
> course being that for ARM you have a user specified, explicit MSI iova
> space while x86 has an implicit MSI iova space.  So should x86 be
> creating a reserved iova pool for the implicit mapping?  Should the
> user have some way to query the mapping, whether implicit or explicit?
> For instance, a new capability within the vfio iommu INFO ioctl might
> expose reserved regions.  It might be initially present on x86 due to
> the implicit nature of the reservation, while it might only appear on
> ARM after submitting a reserved mapping.  Thanks,
> 
> Alex
> 
>  
>>>> @@ -489,7 +502,8 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>>>>  	 */
>>>>  	if (iommu->v2) {
>>>>  		dma = vfio_find_dma(iommu, unmap->iova, 0);
>>>> -		if (dma && dma->iova != unmap->iova) {
>>>> +		if (dma && (dma->iova != unmap->iova ||
>>>> +			   (dma->type == VFIO_IOVA_RESERVED))) {  
>>>
>>> This seems unnecessary, won't the reserved entries fall out in the
>>> while loop below?  
>> yes that's correct
>>>   
>>>>  			ret = -EINVAL;
>>>>  			goto unlock;
>>>>  		}
>>>> @@ -501,6 +515,10 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>>>>  	}
>>>>  
>>>>  	while ((dma = vfio_find_dma(iommu, unmap->iova, unmap->size))) {
>>>> +		if (dma->type == VFIO_IOVA_RESERVED) {
>>>> +			ret = -EINVAL;
>>>> +			goto unlock;
>>>> +		}  
>>>
>>> Hmm, API concerns here.  Previously a user could unmap from iova = 0 to
>>> size = 2^64 - 1 and expect all mappings to get cleared.  Now they can't
>>> do that if they've registered any reserved regions.  Seems like maybe
>>> we should ignore it and continue instead of abort, but then we need to
>>> change the parameters of vfio_find_dma() to get it to move on, or pass
>>> the type to the function, which would prevent us from getting here in
>>> the first place.  
>> OK I will rework this to match the existing use cases
>>>   
>>>>  		if (!iommu->v2 && unmap->iova > dma->iova)
>>>>  			break;
>>>>  		unmapped += dma->size;
>>>> @@ -650,6 +668,114 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>>>  	return ret;
>>>>  }
>>>>  
>>>> +static int vfio_register_reserved_iova_range(struct vfio_iommu *iommu,
>>>> +			   struct vfio_iommu_type1_dma_map *map)
>>>> +{
>>>> +#ifdef CONFIG_IOMMU_DMA_RESERVED
>>>> +	dma_addr_t iova = map->iova;
>>>> +	size_t size = map->size;
>>>> +	uint64_t mask;
>>>> +	struct vfio_dma *dma;
>>>> +	int ret = 0;
>>>> +	struct vfio_domain *d;
>>>> +	unsigned long order;
>>>> +
>>>> +	/* Verify that none of our __u64 fields overflow */
>>>> +	if (map->size != size || map->iova != iova)
>>>> +		return -EINVAL;
>>>> +
>>>> +	order =  __ffs(vfio_pgsize_bitmap(iommu));
>>>> +	mask = ((uint64_t)1 << order) - 1;
>>>> +
>>>> +	WARN_ON(mask & PAGE_MASK);
>>>> +
>>>> +	if (!size || (size | iova) & mask)
>>>> +		return -EINVAL;
>>>> +
>>>> +	/* Don't allow IOVA address wrap */
>>>> +	if (iova + size - 1 < iova)
>>>> +		return -EINVAL;
>>>> +
>>>> +	mutex_lock(&iommu->lock);
>>>> +
>>>> +	if (vfio_find_dma(iommu, iova, size)) {
>>>> +		ret =  -EEXIST;
>>>> +		goto out;
>>>> +	}
>>>> +
>>>> +	dma = kzalloc(sizeof(*dma), GFP_KERNEL);
>>>> +	if (!dma) {
>>>> +		ret = -ENOMEM;
>>>> +		goto out;
>>>> +	}
>>>> +
>>>> +	dma->iova = iova;
>>>> +	dma->size = size;
>>>> +	dma->type = VFIO_IOVA_RESERVED;
>>>> +
>>>> +	list_for_each_entry(d, &iommu->domain_list, next)
>>>> +		ret |= iommu_alloc_reserved_iova_domain(d->domain, iova,
>>>> +							size, order);
>>>> +
>>>> +	if (ret) {
>>>> +		list_for_each_entry(d, &iommu->domain_list, next)
>>>> +			iommu_free_reserved_iova_domain(d->domain);
>>>> +		goto out;
>>>> +	}
>>>> +
>>>> +	vfio_link_dma(iommu, dma);
>>>> +
>>>> +out:
>>>> +	mutex_unlock(&iommu->lock);
>>>> +	return ret;
>>>> +#else /* CONFIG_IOMMU_DMA_RESERVED */
>>>> +	return -ENODEV;
>>>> +#endif
>>>> +}
>>>> +
>>>> +static void vfio_unregister_reserved_iova_range(struct vfio_iommu *iommu,
>>>> +				struct vfio_iommu_type1_dma_unmap *unmap)
>>>> +{
>>>> +#ifdef CONFIG_IOMMU_DMA_RESERVED
>>>> +	dma_addr_t iova = unmap->iova;
>>>> +	struct vfio_dma *dma;
>>>> +	size_t size = unmap->size;
>>>> +	uint64_t mask;
>>>> +	unsigned long order;
>>>> +
>>>> +	/* Verify that none of our __u64 fields overflow */
>>>> +	if (unmap->size != size || unmap->iova != iova)
>>>> +		return;
>>>> +
>>>> +	order =  __ffs(vfio_pgsize_bitmap(iommu));
>>>> +	mask = ((uint64_t)1 << order) - 1;
>>>> +
>>>> +	WARN_ON(mask & PAGE_MASK);
>>>> +
>>>> +	if (!size || (size | iova) & mask)
>>>> +		return;
>>>> +
>>>> +	/* Don't allow IOVA address wrap */
>>>> +	if (iova + size - 1 < iova)
>>>> +		return;
>>>> +
>>>> +	mutex_lock(&iommu->lock);
>>>> +
>>>> +	dma = vfio_find_dma(iommu, iova, size);
>>>> +
>>>> +	if (!dma || (dma->type != VFIO_IOVA_RESERVED)) {
>>>> +		unmap->size = 0;
>>>> +		goto out;
>>>> +	}
>>>> +
>>>> +	unmap->size =  dma->size;
>>>> +	vfio_remove_dma(iommu, dma);
>>>> +
>>>> +out:
>>>> +	mutex_unlock(&iommu->lock);
>>>> +#endif  
>>>
>>> Having a find_dma that accepts a type and a remove_reserved here seems
>>> like it might simplify things.
>>>   
>>>> +}
>>>> +
>>>>  static int vfio_bus_type(struct device *dev, void *data)
>>>>  {
>>>>  	struct bus_type **bus = data;
>>>> @@ -946,6 +1072,7 @@ static void vfio_iommu_type1_release(void *iommu_data)
>>>>  	struct vfio_group *group, *group_tmp;
>>>>  
>>>>  	vfio_iommu_unmap_unpin_all(iommu);
>>>> +	vfio_unmap_reserved(iommu);  
>>>
>>> If we call vfio_unmap_reserved() here, then why does vfio_remove_dma()
>>> need to handle reserved entries?  We might as well have a separate
>>> vfio_remove_reserved_dma().
>>>   
>>>>  
>>>>  	list_for_each_entry_safe(domain, domain_tmp,
>>>>  				 &iommu->domain_list, next) {
>>>> @@ -1020,7 +1147,8 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>>>>  	} else if (cmd == VFIO_IOMMU_MAP_DMA) {
>>>>  		struct vfio_iommu_type1_dma_map map;
>>>>  		uint32_t mask = VFIO_DMA_MAP_FLAG_READ |
>>>> -				VFIO_DMA_MAP_FLAG_WRITE;
>>>> +				VFIO_DMA_MAP_FLAG_WRITE |
>>>> +				VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA;
>>>>  
>>>>  		minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
>>>>  
>>>> @@ -1030,6 +1158,9 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>>>>  		if (map.argsz < minsz || map.flags & ~mask)
>>>>  			return -EINVAL;
>>>>  
>>>> +		if (map.flags & VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA)
>>>> +			return vfio_register_reserved_iova_range(iommu, &map);
>>>> +
>>>>  		return vfio_dma_do_map(iommu, &map);
>>>>  
>>>>  	} else if (cmd == VFIO_IOMMU_UNMAP_DMA) {
>>>> @@ -1044,10 +1175,16 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>>>>  		if (unmap.argsz < minsz || unmap.flags)
>>>>  			return -EINVAL;
>>>>  
>>>> +		if (unmap.flags & VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA) {
>>>> +			vfio_unregister_reserved_iova_range(iommu, &unmap);
>>>> +			goto out;
>>>> +		}
>>>> +
>>>>  		ret = vfio_dma_do_unmap(iommu, &unmap);
>>>>  		if (ret)
>>>>  			return ret;
>>>>  
>>>> +out:
>>>>  		return copy_to_user((void __user *)arg, &unmap, minsz) ?
>>>>  			-EFAULT : 0;
>>>>  	}
>>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>>> index 255a211..a49be8a 100644
>>>> --- a/include/uapi/linux/vfio.h
>>>> +++ b/include/uapi/linux/vfio.h
>>>> @@ -498,12 +498,21 @@ struct vfio_iommu_type1_info {
>>>>   *
>>>>   * Map process virtual addresses to IO virtual addresses using the
>>>>   * provided struct vfio_dma_map. Caller sets argsz. READ &/ WRITE required.
>>>> + *
>>>> + * In case MSI_RESERVED_IOVA flag is set, the API only aims at registering an
>>>> + * IOVA region which will be used on some platforms to map the host MSI frame.
>>>> + * in that specific case, vaddr and prot are ignored. The requirement for
>>>> + * provisioning such IOVA range can be checked by calling VFIO_IOMMU_GET_INFO
>>>> + * with the VFIO_IOMMU_INFO_REQUIRE_MSI_MAP attribute. A single
>>>> + * MSI_RESERVED_IOVA region can be registered
>>>>   */  
>>>
>>> Why do we ignore read/write flags?  I'm not sure how useful a read-only
>>> reserved region might be, but certainly some platforms might support
>>> write-only or read-write.  Isn't this something we should let the IOMMU
>>> driver decide?  ie. pass it down and let it fail or not?  
>> OK Makes sense. Actually I am not very clear about whether this API is
>> used for MSI binding only or likely to be used for something else.
>>
>>   Also why are
>>> we making it the API spec to only allow a single reserved region of
>>> this type?  We could simply let additional ones fail, or better yet add
>>> a capability to the info ioctl to indicate the number available and
>>> then fail if the user exceeds it.  
>> But this means that underneath we need to manage several iova_domains,
>> right?
>>>   
>>>>  struct vfio_iommu_type1_dma_map {
>>>>  	__u32	argsz;
>>>>  	__u32	flags;
>>>>  #define VFIO_DMA_MAP_FLAG_READ (1 << 0)		/* readable from device */
>>>>  #define VFIO_DMA_MAP_FLAG_WRITE (1 << 1)	/* writable from device */
>>>> +/* reserved iova for MSI vectors*/
>>>> +#define VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA (1 << 2)  
>>>
>>> nit, ...RESERVED_MSI_IOVA makes a tad more sense and if we add new
>>> reserved flags seems like it puts the precedence in order.  
>> OK
>>>   
>>>>  	__u64	vaddr;				/* Process virtual address */
>>>>  	__u64	iova;				/* IO virtual address */
>>>>  	__u64	size;				/* Size of mapping (bytes) */
>>>> @@ -519,7 +528,8 @@ struct vfio_iommu_type1_dma_map {
>>>>   * Caller sets argsz.  The actual unmapped size is returned in the size
>>>>   * field.  No guarantee is made to the user that arbitrary unmaps of iova
>>>>   * or size different from those used in the original mapping call will
>>>> - * succeed.
>>>> + * succeed. A Reserved DMA region must be unmapped with MSI_RESERVED_IOVA
>>>> + * flag set.  
>>>
>>> So map/unmap become bi-modal, with this flag set they should only
>>> operate on reserved entries, otherwise they should operate on legacy
>>> entries.  So clearly as a user I should be able to continue doing an
>>> unmap from 0-(-1) of legacy entries and not stumble over reserved
>>> entries.  Thanks,  
>> OK that's clear
>>
>> Best Regards
>>
>> Eric
>>>
>>> Alex
>>>   
>>
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v6 2/5] vfio: allow the user to register reserved iova range for MSI mapping
@ 2016-04-08 16:41             ` Alex Williamson
  0 siblings, 0 replies; 48+ messages in thread
From: Alex Williamson @ 2016-04-08 16:41 UTC (permalink / raw)
  To: Eric Auger
  Cc: eric.auger, robin.murphy, will.deacon, joro, tglx, jason,
	marc.zyngier, christoffer.dall, linux-arm-kernel, kvmarm, kvm,
	suravee.suthikulpanit, patches, linux-kernel, Manish.Jaggi,
	Bharat.Bhushan, pranav.sawargaonkar, p.fedin, iommu,
	Jean-Philippe.Brucker, julien.grall

On Fri, 8 Apr 2016 17:48:01 +0200
Eric Auger <eric.auger@linaro.org> wrote:

Hi Eric,

> Hi Alex,
> On 04/07/2016 08:29 PM, Alex Williamson wrote:
> > On Thu, 7 Apr 2016 15:43:29 +0200
> > Eric Auger <eric.auger@linaro.org> wrote:
> >   
> >> Hi Alex,
> >> On 04/07/2016 12:07 AM, Alex Williamson wrote:  
> >>> On Mon,  4 Apr 2016 08:30:08 +0000
> >>> Eric Auger <eric.auger@linaro.org> wrote:
> >>>     
> >>>> The user is allowed to [un]register a reserved IOVA range by using the
> >>>> DMA MAP API and setting the new flag: VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA.
> >>>> It provides the base address and the size. This region is stored in the
> >>>> vfio_dma rb tree. At that point the iova range is not mapped to any target
> >>>> address yet. The host kernel will use those iova when needed, typically
> >>>> when the VFIO-PCI device allocates its MSIs.
> >>>>
> >>>> This patch also handles the destruction of the reserved binding RB-tree and
> >>>> domain's iova_domains.
> >>>>
> >>>> Signed-off-by: Eric Auger <eric.auger@linaro.org>
> >>>> Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
> >>>>
> >>>> ---
> >>>> v3 -> v4:
> >>>> - use iommu_alloc/free_reserved_iova_domain exported by dma-reserved-iommu
> >>>> - protect vfio_register_reserved_iova_range implementation with
> >>>>   CONFIG_IOMMU_DMA_RESERVED
> >>>> - handle unregistration by user-space and on vfio_iommu_type1 release
> >>>>
> >>>> v1 -> v2:
> >>>> - set returned value according to alloc_reserved_iova_domain result
> >>>> - free the iova domains in case any error occurs
> >>>>
> >>>> RFC v1 -> v1:
> >>>> - takes into account Alex comments, based on
> >>>>   [RFC PATCH 1/6] vfio: Add interface for add/del reserved iova region:
> >>>> - use the existing dma map/unmap ioctl interface with a flag to register
> >>>>   a reserved IOVA range. A single reserved iova region is allowed.
> >>>>
> >>>> Conflicts:
> >>>> 	drivers/vfio/vfio_iommu_type1.c
> >>>> ---
> >>>>  drivers/vfio/vfio_iommu_type1.c | 141 +++++++++++++++++++++++++++++++++++++++-
> >>>>  include/uapi/linux/vfio.h       |  12 +++-
> >>>>  2 files changed, 150 insertions(+), 3 deletions(-)
> >>>>
> >>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> >>>> index c9ddbde..4497b20 100644
> >>>> --- a/drivers/vfio/vfio_iommu_type1.c
> >>>> +++ b/drivers/vfio/vfio_iommu_type1.c
> >>>> @@ -36,6 +36,7 @@
> >>>>  #include <linux/uaccess.h>
> >>>>  #include <linux/vfio.h>
> >>>>  #include <linux/workqueue.h>
> >>>> +#include <linux/dma-reserved-iommu.h>
> >>>>  
> >>>>  #define DRIVER_VERSION  "0.2"
> >>>>  #define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
> >>>> @@ -403,10 +404,22 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
> >>>>  	vfio_lock_acct(-unlocked);
> >>>>  }
> >>>>  
> >>>> +static void vfio_unmap_reserved(struct vfio_iommu *iommu)
> >>>> +{
> >>>> +#ifdef CONFIG_IOMMU_DMA_RESERVED
> >>>> +	struct vfio_domain *d;
> >>>> +
> >>>> +	list_for_each_entry(d, &iommu->domain_list, next)
> >>>> +		iommu_unmap_reserved(d->domain);
> >>>> +#endif
> >>>> +}
> >>>> +
> >>>>  static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
> >>>>  {
> >>>>  	if (likely(dma->type != VFIO_IOVA_RESERVED))
> >>>>  		vfio_unmap_unpin(iommu, dma);
> >>>> +	else
> >>>> +		vfio_unmap_reserved(iommu);
> >>>>  	vfio_unlink_dma(iommu, dma);
> >>>>  	kfree(dma);
> >>>>  }    
> >>>
> >>> This makes me nervous, apparently we can add reserved mappings
> >>> individually, but we have absolutely no granularity on remove, so if we
> >>> remove one, we've removed them all even though we still have them
> >>> linked in our rb tree.  I see later that only one reserved region is
> >>> allowed, but that seems very short sighted, especially to impose that
> >>> on the user level API.    
> >> On the kernel side, the reserved region is currently backed by a unique
> >> iova_domain. Do you mean you would like me to handle a list of
> >> iova_domains instead of using a single "cookie"?  
> > 
> > TBH, I'm not really sure how this works with a single iova domain.  If
> > we have multiple irq chips and each gets mapped by a separate page in
> > the iova space, then is it really sufficient to do a lookup from the
> > irq_data to the msi_desc to the device to the domain in order to get
> > reserved iova to map that msi doorbell?  Don't we need an iova from the
> > pool mapping the specific irqchip associated with our device?  The IOMMU
> > domain might span any number of irq chips, how can we assume there's
> > one only reserved iova space?  Maybe I'm not understanding how the code
> > works.  
> 
> On vfio_iommu_type1 we currently compute the reserved iova needs for
> each domain and we take the max. Each domain then is assigned a reserved
> iova domain of this max size.
> 
> So let's say domain1 has the largest needs (say 2 doorbells)
> domain 1: iova domain size = 2
> dev A --> doorbell 1
> dev B -> doorbell 1
> dev C -> doorbell 2
> 2 iova pages are used
> 
> domain 2: iova domain size = 2
> dev D -> doorbell 1
> 1 iova page is used.
> 
> Do you see something wrong here?

Can we really know the maximum reserved iova space for a domain?  It
seems like this depends on the current composition of the domain, so it
could change as devices are added to the domain.  Or perhaps the
maximum is based on a maximally configured domain, but even then the
system itself may be expandable so it might need to account for an
architectural maximum.  A user like QEMU would likely have an easier
time dealing with an absolute maximum than a current maximum.  Maybe a
single range would be sufficient under those conditions.

> > Conceptually, this is a generic IOMMU API extension to include reserved
> > iova space, MSI mappings are a consumer of that reserved iova pool but
> > I don't think we can say they will necessarily be the only consumer.
> > So building into the interface that there's only one is like making a
> > fixed length array to hold a string, it works for the initial
> > implementation, but it's not a robust solution.  
> 
> I see. On the other hand the code is quite specific to MSI binding
> problematic today (rb-tree indexed on PA, locking, ...). argh, storm in
> a teacup...

For the vfio api, the interface is already specific to MSI, so that
seems reasonable.  I'd still rather expose somehow to the user that
only a single reserved MSI region is supported, even if that's all the
implementation can handle, just so we have the option to expand that in
the future.  The iommu api is internal, so we can expand it as we go, I
just want to be sure to raise the issue even if we think the
restrictions are sufficient for now.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 48+ messages in thread
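
As a sketch of the capability Alex floats above — none of these names exist
in the v6 series, and the layout is purely illustrative — the info ioctl
could advertise how many reserved MSI regions are supported and an
architectural worst-case size, so user space sizes its window once instead
of the API hard-coding a single region. This assumes the vfio_info_cap_header
capability chain added for the region-info work:

#include <linux/types.h>
#include <linux/vfio.h>

/* Hypothetical capability for the VFIO_IOMMU_GET_INFO chain; the id,
 * struct name and fields below are illustrative only. */
#define VFIO_IOMMU_TYPE1_INFO_CAP_MSI_RESV	1	/* made-up id */

struct vfio_iommu_type1_info_cap_msi_resv {
	struct vfio_info_cap_header header;
	__u32	nr_regions;	/* reserved MSI regions supported (1 today) */
	__u32	flags;
	__u64	max_size;	/* architectural max doorbell space, bytes */
};

User space would walk the capability chain returned by VFIO_IOMMU_GET_INFO
and register a window of max_size bytes, which matches Alex's point below
that QEMU is better served by an absolute maximum than by a current one.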

* Re: [PATCH v6 2/5] vfio: allow the user to register reserved iova range for MSI mapping
@ 2016-04-08 16:57               ` Eric Auger
  0 siblings, 0 replies; 48+ messages in thread
From: Eric Auger @ 2016-04-08 16:57 UTC (permalink / raw)
  To: Alex Williamson
  Cc: eric.auger, robin.murphy, will.deacon, joro, tglx, jason,
	marc.zyngier, christoffer.dall, linux-arm-kernel, kvmarm, kvm,
	suravee.suthikulpanit, patches, linux-kernel, Manish.Jaggi,
	Bharat.Bhushan, pranav.sawargaonkar, p.fedin, iommu,
	Jean-Philippe.Brucker, julien.grall

Hi Alex,
On 04/08/2016 06:41 PM, Alex Williamson wrote:
> On Fri, 8 Apr 2016 17:48:01 +0200
> Eric Auger <eric.auger@linaro.org> wrote:
> 
> Hi Eric,
> 
>> Hi Alex,
>> On 04/07/2016 08:29 PM, Alex Williamson wrote:
>>> On Thu, 7 Apr 2016 15:43:29 +0200
>>> Eric Auger <eric.auger@linaro.org> wrote:
>>>   
>>>> Hi Alex,
>>>> On 04/07/2016 12:07 AM, Alex Williamson wrote:  
>>>>> On Mon,  4 Apr 2016 08:30:08 +0000
>>>>> Eric Auger <eric.auger@linaro.org> wrote:
>>>>>     
>>>>>> The user is allowed to [un]register a reserved IOVA range by using the
>>>>>> DMA MAP API and setting the new flag: VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA.
>>>>>> It provides the base address and the size. This region is stored in the
>>>>>> vfio_dma rb tree. At that point the iova range is not mapped to any target
>>>>>> address yet. The host kernel will use those iova when needed, typically
>>>>>> when the VFIO-PCI device allocates its MSIs.
>>>>>>
>>>>>> This patch also handles the destruction of the reserved binding RB-tree and
>>>>>> domain's iova_domains.
>>>>>>
>>>>>> Signed-off-by: Eric Auger <eric.auger@linaro.org>
>>>>>> Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
>>>>>>
>>>>>> ---
>>>>>> v3 -> v4:
>>>>>> - use iommu_alloc/free_reserved_iova_domain exported by dma-reserved-iommu
>>>>>> - protect vfio_register_reserved_iova_range implementation with
>>>>>>   CONFIG_IOMMU_DMA_RESERVED
>>>>>> - handle unregistration by user-space and on vfio_iommu_type1 release
>>>>>>
>>>>>> v1 -> v2:
>>>>>> - set returned value according to alloc_reserved_iova_domain result
>>>>>> - free the iova domains in case any error occurs
>>>>>>
>>>>>> RFC v1 -> v1:
>>>>>> - takes into account Alex comments, based on
>>>>>>   [RFC PATCH 1/6] vfio: Add interface for add/del reserved iova region:
>>>>>> - use the existing dma map/unmap ioctl interface with a flag to register
>>>>>>   a reserved IOVA range. A single reserved iova region is allowed.
>>>>>>
>>>>>> Conflicts:
>>>>>> 	drivers/vfio/vfio_iommu_type1.c
>>>>>> ---
>>>>>>  drivers/vfio/vfio_iommu_type1.c | 141 +++++++++++++++++++++++++++++++++++++++-
>>>>>>  include/uapi/linux/vfio.h       |  12 +++-
>>>>>>  2 files changed, 150 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>>>>> index c9ddbde..4497b20 100644
>>>>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>>>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>>>>> @@ -36,6 +36,7 @@
>>>>>>  #include <linux/uaccess.h>
>>>>>>  #include <linux/vfio.h>
>>>>>>  #include <linux/workqueue.h>
>>>>>> +#include <linux/dma-reserved-iommu.h>
>>>>>>  
>>>>>>  #define DRIVER_VERSION  "0.2"
>>>>>>  #define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
>>>>>> @@ -403,10 +404,22 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
>>>>>>  	vfio_lock_acct(-unlocked);
>>>>>>  }
>>>>>>  
>>>>>> +static void vfio_unmap_reserved(struct vfio_iommu *iommu)
>>>>>> +{
>>>>>> +#ifdef CONFIG_IOMMU_DMA_RESERVED
>>>>>> +	struct vfio_domain *d;
>>>>>> +
>>>>>> +	list_for_each_entry(d, &iommu->domain_list, next)
>>>>>> +		iommu_unmap_reserved(d->domain);
>>>>>> +#endif
>>>>>> +}
>>>>>> +
>>>>>>  static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
>>>>>>  {
>>>>>>  	if (likely(dma->type != VFIO_IOVA_RESERVED))
>>>>>>  		vfio_unmap_unpin(iommu, dma);
>>>>>> +	else
>>>>>> +		vfio_unmap_reserved(iommu);
>>>>>>  	vfio_unlink_dma(iommu, dma);
>>>>>>  	kfree(dma);
>>>>>>  }    
>>>>>
>>>>> This makes me nervous, apparently we can add reserved mappings
>>>>> individually, but we have absolutely no granularity on remove, so if we
>>>>> remove one, we've removed them all even though we still have them
>>>>> linked in our rb tree.  I see later that only one reserved region is
>>>>> allowed, but that seems very short-sighted, especially to impose that
>>>>> on the user level API.    
>>>> On the kernel side, the reserved region is currently backed by a unique
>>>> iova_domain. Do you mean you would like me to handle a list of
>>>> iova_domains instead of using a single "cookie"?  
>>>
>>> TBH, I'm not really sure how this works with a single iova domain.  If
>>> we have multiple irq chips and each gets mapped by a separate page in
>>> the iova space, then is it really sufficient to do a lookup from the
>>> irq_data to the msi_desc to the device to the domain in order to get a
>>> reserved iova to map that msi doorbell?  Don't we need an iova from the
>>> pool mapping the specific irqchip associated with our device?  The IOMMU
>>> domain might span any number of irq chips, how can we assume there's
>>> only one reserved iova space?  Maybe I'm not understanding how the code
>>> works.  
>>
>> On vfio_iommu_type1 we currently compute the reserved iova needs for
>> each domain and we take the max. Each domain is then assigned a reserved
>> iova domain of this max size.
>>
>> So let's say domain 1 has the largest needs (say 2 doorbells):
>> domain 1: iova domain size = 2
>> dev A -> doorbell 1
>> dev B -> doorbell 1
>> dev C -> doorbell 2
>> 2 iova pages are used
>>
>> domain 2: iova domain size = 2
>> dev D -> doorbell 1
>> 1 iova page is used.
>>
>> Do you see something wrong here?
> 
> Can we really know the maximum reserved iova space for a domain?  It
> seems like this depends on the current composition of the domain, so it
> could change as devices are added to the domain.  Or perhaps the
> maximum is based on a maximally configured domain, but even then the
> system itself may be expandable so it might need to account for an
> architectural maximum.  A user like QEMU would likely have an easier
> time dealing with an absolute maximum than a current maximum.  Maybe a
> single range would be sufficient under those conditions.

Yes, definitely: if the domain evolves we may need to extend the
reserved iova domain. Also, dealing with an arbitrary absolute maximum is
much easier to integrate on the (QEMU) user side.
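
For illustration, a minimal user-side sketch of the registration: the
helper name and the way base/size are chosen are assumptions, only the
flag comes from this series, and it needs uapi headers carrying the
series to build.

    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /* Reserve an IOVA window for MSI doorbells on a VFIO container fd.
     * 'msi_pages' is assumed to be the count reported through
     * VFIO_IOMMU_GET_INFO. */
    static int reserve_msi_iova(int container, uint64_t base,
                                uint64_t msi_pages, uint64_t pgsize)
    {
            struct vfio_iommu_type1_dma_map map;

            memset(&map, 0, sizeof(map));
            map.argsz = sizeof(map);
            /* proposed flag: register the range, map nothing yet */
            map.flags = VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA;
            map.vaddr = 0;                  /* no user memory backs it */
            map.iova  = base;               /* IOVA base unused by the guest */
            map.size  = msi_pages * pgsize; /* provision every doorbell */

            /* the host allocates doorbell IOVAs from this window later */
            return ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
    }

Sizing the window from an architectural maximum rather than the current
doorbell count would match the concern above.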

> 
>>> Conceptually, this is a generic IOMMU API extension to include reserved
>>> iova space, MSI mappings are a consumer of that reserved iova pool but
>>> I don't think we can say they will necessarily be the only consumer.
>>> So building into the interface that there's only one is like making a
>>> fixed length array to hold a string, it works for the initial
>>> implementation, but it's not a robust solution.  
>>
>> I see. On the other hand, the code is quite specific to the MSI binding
>> problem today (rb-tree indexed on PA, locking, ...). Argh, a storm in
>> a teacup...
> 
> For the VFIO API, the interface is already specific to MSI, so that
> seems reasonable.  I'd still rather expose somehow to the user that
> only a single reserved MSI region is supported, even if that's all the
> implementation can handle, just so we have the option to expand that in
> the future.  The IOMMU API is internal, so we can expand it as we go; I
> just want to be sure to raise the issue even if we think the
> restrictions are sufficient for now.  Thanks,

OK, I agree. I will change the doc/code accordingly.
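
A possible shape for that, with purely hypothetical flag and field
names, just to illustrate how the single-region limit could be
advertised through VFIO_IOMMU_GET_INFO:

    struct vfio_iommu_type1_info {
            __u32 argsz;
            __u32 flags;
    #define VFIO_IOMMU_INFO_PGSIZES  (1 << 0) /* existing */
    #define VFIO_IOMMU_INFO_MSI_RESV (1 << 1) /* hypothetical */
            __u64 iova_pgsizes;
            __u32 msi_resv_regions; /* hypothetical: regions supported, 1 today */
            __u32 msi_resv_pages;   /* hypothetical: IOVA pages to provision */
    };

A user that checks msi_resv_regions before registering keeps working if
a later kernel raises the limit.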

Best Regards

Eric
> 
> Alex
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2016-04-08 16:58 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-04-04  8:30 [PATCH v6 0/5] KVM PCIe/MSI passthrough on ARM/ARM64: kernel part 3/3: vfio changes Eric Auger
2016-04-04  8:30 ` [PATCH v6 1/5] vfio: introduce VFIO_IOVA_RESERVED vfio_dma type Eric Auger
2016-04-04  8:30 ` [PATCH v6 2/5] vfio: allow the user to register reserved iova range for MSI mapping Eric Auger
2016-04-04  9:30   ` kbuild test robot
2016-04-04  9:35     ` Eric Auger
2016-04-06 22:07   ` Alex Williamson
2016-04-07 13:43     ` Eric Auger
2016-04-07 18:29       ` Alex Williamson
2016-04-08 15:48         ` Eric Auger
2016-04-08 16:41           ` Alex Williamson
2016-04-08 16:57             ` Eric Auger
2016-04-04  8:30 ` [PATCH v6 3/5] vfio/type1: also check IRQ remapping capability at msi domain Eric Auger
2016-04-04  8:30 ` [PATCH v6 4/5] iommu/arm-smmu: do not advertise IOMMU_CAP_INTR_REMAP Eric Auger
2016-04-04  8:30 ` [PATCH v6 5/5] vfio/type1: return MSI mapping requirements with VFIO_IOMMU_GET_INFO Eric Auger
2016-04-06 22:32   ` Alex Williamson
2016-04-07 13:44     ` Eric Auger
