* [PATCH kernel v2 00/11] powerpc/kvm/vfio: Enable in-kernel acceleration
@ 2016-12-18  1:28 ` Alexey Kardashevskiy
  0 siblings, 0 replies; 56+ messages in thread
From: Alexey Kardashevskiy @ 2016-12-18  1:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

This is my current queue of patches to add acceleration of TCE
updates in KVM.

This is based on a merge of the "next" branch of
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git
and the "next" branch of https://github.com/awilliam/linux-vfio.git,
and is pushed to https://github.com/aik/linux.git , vfio-kvm-next branch.

Changes:
v2:
* reworked 11/11 to use the new notifiers; it is still rather an RFC as
it has an outstanding issue;
* changed 09/11 and 10/11 to provide the notifiers used by 11/11;
* added David's Reviewed-by to most of the patches and added a comment in 05/11.

Please comment. Thanks.

Alexey Kardashevskiy (11):
  KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number
  powerpc/iommu: Cleanup iommu_table disposal
  powerpc/vfio_spapr_tce: Add reference counting to iommu_table
  powerpc/mmu: Add real mode support for IOMMU preregistered memory
  KVM: PPC: Use preregistered memory API to access TCE list
  powerpc/powernv/iommu: Add real mode version of
    iommu_table_ops::exchange()
  KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently
  KVM: PPC: Pass kvm* to kvmppc_find_table()
  vfio iommu: Add helpers to (un)register blocking notifiers per group
  vfio: Check for unregistered notifiers when group is actually released
  KVM: PPC: Add in-kernel acceleration for VFIO

 Documentation/virtual/kvm/devices/vfio.txt |  22 +-
 arch/powerpc/include/asm/iommu.h           |  12 +-
 arch/powerpc/include/asm/kvm_host.h        |  11 +
 arch/powerpc/include/asm/kvm_ppc.h         |   6 +-
 arch/powerpc/include/asm/mmu_context.h     |   4 +
 include/linux/vfio.h                       |   6 +
 include/uapi/linux/kvm.h                   |   9 +
 arch/powerpc/kernel/iommu.c                |  49 +++-
 arch/powerpc/kvm/book3s_64_vio.c           | 359 ++++++++++++++++++++++++++++-
 arch/powerpc/kvm/book3s_64_vio_hv.c        | 256 ++++++++++++++++++--
 arch/powerpc/kvm/powerpc.c                 |   2 +
 arch/powerpc/mm/mmu_context_iommu.c        |  39 ++++
 arch/powerpc/platforms/powernv/pci-ioda.c  |  42 +++-
 arch/powerpc/platforms/powernv/pci.c       |   1 +
 arch/powerpc/platforms/pseries/iommu.c     |   3 +-
 arch/powerpc/platforms/pseries/vio.c       |   2 +-
 drivers/vfio/vfio.c                        |  81 ++++++-
 drivers/vfio/vfio_iommu_spapr_tce.c        |   2 +-
 virt/kvm/vfio.c                            |  82 +++++++
 arch/powerpc/kvm/Kconfig                   |   1 +
 20 files changed, 939 insertions(+), 50 deletions(-)

-- 
2.11.0


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH kernel v2 01/11] KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number
  2016-12-18  1:28 ` Alexey Kardashevskiy
@ 2016-12-18  1:28   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 56+ messages in thread
From: Alexey Kardashevskiy @ 2016-12-18  1:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

This adds a capability number for in-kernel support for VFIO on
the sPAPR platform.

The capability tells the user space whether the in-kernel H_PUT_TCE
handlers can handle VFIO-targeted requests. If they cannot, the user space
must not create a TCE table in the host kernel via the KVM_CREATE_SPAPR_TCE
KVM ioctl, because then TCE requests would not be passed to the user space,
which is the desired behaviour in that situation.
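
As an illustration only (not part of this patch; the fd and variable
names are assumptions), a user space consumer could probe the capability
before asking for an in-kernel TCE table:

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Hypothetical sketch: kvm_fd is the /dev/kvm fd, vm_fd is the VM fd. */
static int example_create_tce_table(int kvm_fd, int vm_fd,
		unsigned long liobn, unsigned int window_size)
{
	struct kvm_create_spapr_tce args = {
		.liobn = liobn,
		.window_size = window_size,
	};

	/* Only ask for an in-kernel table if the handlers are VFIO-aware */
	if (ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_SPAPR_TCE_VFIO) <= 0)
		return -1;	/* keep handling H_PUT_TCE in user space */

	return ioctl(vm_fd, KVM_CREATE_SPAPR_TCE, &args); /* TCE table fd */
}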

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 include/uapi/linux/kvm.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 300ef255d1e0..810f74317987 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -870,6 +870,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_S390_USER_INSTR0 130
 #define KVM_CAP_MSI_DEVID 131
 #define KVM_CAP_PPC_HTM 132
+#define KVM_CAP_SPAPR_TCE_VFIO 133
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH kernel v2 02/11] powerpc/iommu: Cleanup iommu_table disposal
  2016-12-18  1:28 ` Alexey Kardashevskiy
@ 2016-12-18  1:28   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 56+ messages in thread
From: Alexey Kardashevskiy @ 2016-12-18  1:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

At the moment an iommu_table can be disposed of either by calling
iommu_free_table() directly or via it_ops::free(); the only implementation
of free() is in IODA2 - pnv_ioda2_table_free() - and it calls
iommu_free_table() anyway.

As we are going to have reference counting on tables, we need a unified
way of disposing of tables.

This moves the it_ops::free() call into iommu_free_table() and makes
the callers use the latter. The free() callback now handles only
platform-specific data.

This should cause no behavioral change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 arch/powerpc/kernel/iommu.c               | 4 ++++
 arch/powerpc/platforms/powernv/pci-ioda.c | 6 ++----
 drivers/vfio/vfio_iommu_spapr_tce.c       | 2 +-
 3 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 5f202a566ec5..6744a2771769 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -719,6 +719,9 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 	if (!tbl)
 		return;
 
+	if (tbl->it_ops->free)
+		tbl->it_ops->free(tbl);
+
 	if (!tbl->it_map) {
 		kfree(tbl);
 		return;
@@ -745,6 +748,7 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 	/* free table */
 	kfree(tbl);
 }
+EXPORT_SYMBOL_GPL(iommu_free_table);
 
 /* Creates TCEs for a user provided buffer.  The user buffer must be
  * contiguous real kernel storage (not vmalloc).  The address passed here
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 5fcae29107e1..c4f9e812ca6c 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1422,7 +1422,6 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe
 		iommu_group_put(pe->table_group.group);
 		BUG_ON(pe->table_group.group);
 	}
-	pnv_pci_ioda2_table_free_pages(tbl);
 	iommu_free_table(tbl, of_node_full_name(dev->dev.of_node));
 }
 
@@ -2013,7 +2012,6 @@ static void pnv_ioda2_tce_free(struct iommu_table *tbl, long index,
 static void pnv_ioda2_table_free(struct iommu_table *tbl)
 {
 	pnv_pci_ioda2_table_free_pages(tbl);
-	iommu_free_table(tbl, "pnv");
 }
 
 static struct iommu_table_ops pnv_ioda2_iommu_ops = {
@@ -2339,7 +2337,7 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
 	if (rc) {
 		pe_err(pe, "Failed to configure 32-bit TCE table, err %ld\n",
 				rc);
-		pnv_ioda2_table_free(tbl);
+		iommu_free_table(tbl, "");
 		return rc;
 	}
 
@@ -2425,7 +2423,7 @@ static void pnv_ioda2_take_ownership(struct iommu_table_group *table_group)
 
 	pnv_pci_ioda2_set_bypass(pe, false);
 	pnv_pci_ioda2_unset_window(&pe->table_group, 0);
-	pnv_ioda2_table_free(tbl);
+	iommu_free_table(tbl, "pnv");
 }
 
 static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group)
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index c8823578a1b2..cbac08af400e 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -677,7 +677,7 @@ static void tce_iommu_free_table(struct tce_container *container,
 	unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT;
 
 	tce_iommu_userspace_view_free(tbl, container->mm);
-	tbl->it_ops->free(tbl);
+	iommu_free_table(tbl, "");
 	decrement_locked_vm(container->mm, pages);
 }
 
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH kernel v2 03/11] powerpc/vfio_spapr_tce: Add reference counting to iommu_table
  2016-12-18  1:28 ` Alexey Kardashevskiy
@ 2016-12-18  1:28   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 56+ messages in thread
From: Alexey Kardashevskiy @ 2016-12-18  1:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

So far iommu_table objects were only used in virtual mode and had
a single owner. We are going to change this by implementing in-kernel
acceleration of DMA mapping requests. The proposed acceleration
will handle requests in real mode, and KVM will keep references to tables.

This adds a kref to iommu_table and defines new helpers to update it.
Callers of iommu_free_table() are switched to iommu_table_put(), and
the free routine becomes the static kref release callback
iommu_table_free(). iommu_table_get() is not used in this patch
but it will be in a following patch.

Since this touches the prototypes anyway, this also removes the @node_name
parameter: it has never been really useful on powernv, and carrying it
through the pseries platform code just to pass it to the free routine
is not worth it either.

This should cause no behavioral change.
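
A minimal sketch of the intended usage (an assumed pattern for
illustration; the real KVM user arrives later in this series, and
example_owner is a made-up type):

struct example_owner {
	struct iommu_table *tbl;
};

/* Take an extra reference while a second user (e.g. KVM) holds the table. */
static void example_attach_table(struct example_owner *owner,
		struct iommu_table *tbl)
{
	iommu_table_get(tbl);
	owner->tbl = tbl;
}

/*
 * Drop the reference; the last iommu_table_put() invokes the static
 * iommu_table_free() kref release callback and frees the table.
 */
static void example_detach_table(struct example_owner *owner)
{
	iommu_table_put(owner->tbl);
	owner->tbl = NULL;
}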

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 arch/powerpc/include/asm/iommu.h          |  5 +++--
 arch/powerpc/kernel/iommu.c               | 24 +++++++++++++++++++-----
 arch/powerpc/platforms/powernv/pci-ioda.c | 14 +++++++-------
 arch/powerpc/platforms/powernv/pci.c      |  1 +
 arch/powerpc/platforms/pseries/iommu.c    |  3 ++-
 arch/powerpc/platforms/pseries/vio.c      |  2 +-
 drivers/vfio/vfio_iommu_spapr_tce.c       |  2 +-
 7 files changed, 34 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 2c1d50792944..9de8bad1fdf9 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -114,6 +114,7 @@ struct iommu_table {
 	struct list_head it_group_list;/* List of iommu_table_group_link */
 	unsigned long *it_userspace; /* userspace view of the table */
 	struct iommu_table_ops *it_ops;
+	struct kref    it_kref;
 };
 
 #define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \
@@ -146,8 +147,8 @@ static inline void *get_iommu_table_base(struct device *dev)
 
 extern int dma_iommu_dma_supported(struct device *dev, u64 mask);
 
-/* Frees table for an individual device node */
-extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
+extern void iommu_table_get(struct iommu_table *tbl);
+extern void iommu_table_put(struct iommu_table *tbl);
 
 /* Initializes an iommu_table based in values set in the passed-in
  * structure
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 6744a2771769..d12496889ce9 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -711,13 +711,13 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
 	return tbl;
 }
 
-void iommu_free_table(struct iommu_table *tbl, const char *node_name)
+static void iommu_table_free(struct kref *kref)
 {
 	unsigned long bitmap_sz;
 	unsigned int order;
+	struct iommu_table *tbl;
 
-	if (!tbl)
-		return;
+	tbl = container_of(kref, struct iommu_table, it_kref);
 
 	if (tbl->it_ops->free)
 		tbl->it_ops->free(tbl);
@@ -736,7 +736,7 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 
 	/* verify that table contains no entries */
 	if (!bitmap_empty(tbl->it_map, tbl->it_size))
-		pr_warn("%s: Unexpected TCEs for %s\n", __func__, node_name);
+		pr_warn("%s: Unexpected TCEs\n", __func__);
 
 	/* calculate bitmap size in bytes */
 	bitmap_sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
@@ -748,7 +748,21 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 	/* free table */
 	kfree(tbl);
 }
-EXPORT_SYMBOL_GPL(iommu_free_table);
+
+void iommu_table_get(struct iommu_table *tbl)
+{
+	kref_get(&tbl->it_kref);
+}
+EXPORT_SYMBOL_GPL(iommu_table_get);
+
+void iommu_table_put(struct iommu_table *tbl)
+{
+	if (!tbl)
+		return;
+
+	kref_put(&tbl->it_kref, iommu_table_free);
+}
+EXPORT_SYMBOL_GPL(iommu_table_put);
 
 /* Creates TCEs for a user provided buffer.  The user buffer must be
  * contiguous real kernel storage (not vmalloc).  The address passed here
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index c4f9e812ca6c..ea181f02bebd 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1422,7 +1422,7 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe
 		iommu_group_put(pe->table_group.group);
 		BUG_ON(pe->table_group.group);
 	}
-	iommu_free_table(tbl, of_node_full_name(dev->dev.of_node));
+	iommu_table_put(tbl);
 }
 
 static void pnv_ioda_release_vf_PE(struct pci_dev *pdev)
@@ -2197,7 +2197,7 @@ static void pnv_pci_ioda1_setup_dma_pe(struct pnv_phb *phb,
 		__free_pages(tce_mem, get_order(tce32_segsz * segs));
 	if (tbl) {
 		pnv_pci_unlink_table_and_group(tbl, &pe->table_group);
-		iommu_free_table(tbl, "pnv");
+		iommu_table_put(tbl);
 	}
 }
 
@@ -2291,7 +2291,7 @@ static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
 			bus_offset, page_shift, window_size,
 			levels, tbl);
 	if (ret) {
-		iommu_free_table(tbl, "pnv");
+		iommu_table_put(tbl);
 		return ret;
 	}
 
@@ -2337,7 +2337,7 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
 	if (rc) {
 		pe_err(pe, "Failed to configure 32-bit TCE table, err %ld\n",
 				rc);
-		iommu_free_table(tbl, "");
+		iommu_table_put(tbl);
 		return rc;
 	}
 
@@ -2423,7 +2423,7 @@ static void pnv_ioda2_take_ownership(struct iommu_table_group *table_group)
 
 	pnv_pci_ioda2_set_bypass(pe, false);
 	pnv_pci_ioda2_unset_window(&pe->table_group, 0);
-	iommu_free_table(tbl, "pnv");
+	iommu_table_put(tbl);
 }
 
 static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group)
@@ -3393,7 +3393,7 @@ static void pnv_pci_ioda1_release_pe_dma(struct pnv_ioda_pe *pe)
 	}
 
 	free_pages(tbl->it_base, get_order(tbl->it_size << 3));
-	iommu_free_table(tbl, "pnv");
+	iommu_table_put(tbl);
 }
 
 static void pnv_pci_ioda2_release_pe_dma(struct pnv_ioda_pe *pe)
@@ -3420,7 +3420,7 @@ static void pnv_pci_ioda2_release_pe_dma(struct pnv_ioda_pe *pe)
 	}
 
 	pnv_pci_ioda2_table_free_pages(tbl);
-	iommu_free_table(tbl, "pnv");
+	iommu_table_put(tbl);
 }
 
 static void pnv_ioda_free_pe_seg(struct pnv_ioda_pe *pe,
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index c6d554fe585c..471210913e42 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -767,6 +767,7 @@ struct iommu_table *pnv_pci_table_alloc(int nid)
 
 	tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, nid);
 	INIT_LIST_HEAD_RCU(&tbl->it_group_list);
+	kref_init(&tbl->it_kref);
 
 	return tbl;
 }
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index dc2577fc5fbb..47f0501a94f9 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -74,6 +74,7 @@ static struct iommu_table_group *iommu_pseries_alloc_group(int node)
 		goto fail_exit;
 
 	INIT_LIST_HEAD_RCU(&tbl->it_group_list);
+	kref_init(&tbl->it_kref);
 	tgl->table_group = table_group;
 	list_add_rcu(&tgl->next, &tbl->it_group_list);
 
@@ -115,7 +116,7 @@ static void iommu_pseries_free_group(struct iommu_table_group *table_group,
 		BUG_ON(table_group->group);
 	}
 #endif
-	iommu_free_table(tbl, node_name);
+	iommu_table_put(tbl);
 
 	kfree(table_group);
 }
diff --git a/arch/powerpc/platforms/pseries/vio.c b/arch/powerpc/platforms/pseries/vio.c
index 2c8fb3ec989e..41e8aa5c0d6a 100644
--- a/arch/powerpc/platforms/pseries/vio.c
+++ b/arch/powerpc/platforms/pseries/vio.c
@@ -1318,7 +1318,7 @@ static void vio_dev_release(struct device *dev)
 	struct iommu_table *tbl = get_iommu_table_base(dev);
 
 	if (tbl)
-		iommu_free_table(tbl, of_node_full_name(dev->of_node));
+		iommu_table_put(tbl);
 	of_node_put(dev->of_node);
 	kfree(to_vio_dev(dev));
 }
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index cbac08af400e..be37905012f0 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -677,7 +677,7 @@ static void tce_iommu_free_table(struct tce_container *container,
 	unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT;
 
 	tce_iommu_userspace_view_free(tbl, container->mm);
-	iommu_free_table(tbl, "");
+	iommu_table_put(tbl);
 	decrement_locked_vm(container->mm, pages);
 }
 
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH kernel v2 04/11] powerpc/mmu: Add real mode support for IOMMU preregistered memory
  2016-12-18  1:28 ` Alexey Kardashevskiy
@ 2016-12-18  1:28   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 56+ messages in thread
From: Alexey Kardashevskiy @ 2016-12-18  1:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

This adds a realmode version of mm_iommu_lookup(), mm_iommu_lookup_rm(),
which uses list_for_each_entry_lockless() instead of
list_for_each_entry_rcu(), as the latter can do debug checks which can
fail in real mode.

This adds a realmode version of mm_iommu_ua_to_hpa(),
mm_iommu_ua_to_hpa_rm(), which adds an explicit vmalloc'd-to-linear
address conversion. Unlike mm_iommu_ua_to_hpa(), mm_iommu_ua_to_hpa_rm()
can fail.

Both realmode helpers receive @mm explicitly, for the same reason
mm_iommu_preregistered() takes @mm: in real mode @current does not
always have a correct pointer.
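
For clarity, a worked example of the translation done by
mm_iommu_ua_to_hpa_rm() (the numbers are invented and assume 64K pages,
i.e. PAGE_SHIFT = 16):

/*
 * mem->ua = 0x3fff80000000  (start of the preregistered region)
 * ua      = 0x3fff80012340  (address being translated)
 *
 * entry = (ua - mem->ua) >> PAGE_SHIFT = 0x12340 >> 16 = 1
 * pa    = vmalloc_to_phys(&mem->hpas[entry])   (real-mode safe lookup)
 * *hpa  = *pa | (ua & ~PAGE_MASK)              (page of entry 1 | 0x2340)
 */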

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 arch/powerpc/include/asm/mmu_context.h |  4 ++++
 arch/powerpc/mm/mmu_context_iommu.c    | 39 ++++++++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index b9e3f0aca261..c70c8272523d 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -29,10 +29,14 @@ extern void mm_iommu_init(struct mm_struct *mm);
 extern void mm_iommu_cleanup(struct mm_struct *mm);
 extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
 		unsigned long ua, unsigned long size);
+extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(
+		struct mm_struct *mm, unsigned long ua, unsigned long size);
 extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
 		unsigned long ua, unsigned long entries);
 extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
 		unsigned long ua, unsigned long *hpa);
+extern long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
+		unsigned long ua, unsigned long *hpa);
 extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem);
 extern void mm_iommu_mapped_dec(struct mm_iommu_table_group_mem_t *mem);
 #endif
diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index 104bad029ce9..631d32f5937b 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -314,6 +314,25 @@ struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
 }
 EXPORT_SYMBOL_GPL(mm_iommu_lookup);
 
+struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(struct mm_struct *mm,
+		unsigned long ua, unsigned long size)
+{
+	struct mm_iommu_table_group_mem_t *mem, *ret = NULL;
+
+	list_for_each_entry_lockless(mem, &mm->context.iommu_group_mem_list,
+			next) {
+		if ((mem->ua <= ua) &&
+				(ua + size <= mem->ua +
+				 (mem->entries << PAGE_SHIFT))) {
+			ret = mem;
+			break;
+		}
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(mm_iommu_lookup_rm);
+
 struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
 		unsigned long ua, unsigned long entries)
 {
@@ -345,6 +364,26 @@ long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
 }
 EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa);
 
+long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
+		unsigned long ua, unsigned long *hpa)
+{
+	const long entry = (ua - mem->ua) >> PAGE_SHIFT;
+	void *va = &mem->hpas[entry];
+	unsigned long *pa;
+
+	if (entry >= mem->entries)
+		return -EFAULT;
+
+	pa = (void *) vmalloc_to_phys(va);
+	if (!pa)
+		return -EFAULT;
+
+	*hpa = *pa | (ua & ~PAGE_MASK);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa_rm);
+
 long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem)
 {
 	if (atomic64_inc_not_zero(&mem->mapped))
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH kernel v2 05/11] KVM: PPC: Use preregistered memory API to access TCE list
  2016-12-18  1:28 ` Alexey Kardashevskiy
@ 2016-12-18  1:28   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 56+ messages in thread
From: Alexey Kardashevskiy @ 2016-12-18  1:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

VFIO on sPAPR already implements guest memory pre-registration
when the entire guest RAM gets pinned. This can be used to translate
the physical address of a guest page containing the TCE list
from H_PUT_TCE_INDIRECT.

This makes use of the pre-registered memory API to access TCE list
pages in order to avoid unnecessary locking on the KVM memory
reverse map: we know that all of guest memory is pinned and
we have a flat array mapping GPA to HPA, which makes it simpler and
quicker to index into that array (even with looking up the
kernel page tables in vmalloc_to_phys) than it is to find the memslot,
lock the rmap entry, look up the user page tables, and unlock the rmap
entry. Note that the rmap pointer is initialized to NULL
where declared (not in this patch).

If a requested chunk of memory has not been preregistered,
this will fail with H_TOO_HARD so the virtual mode handler can
handle the request.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v2:
* updated the commit log with David's comment
---
 arch/powerpc/kvm/book3s_64_vio_hv.c | 65 ++++++++++++++++++++++++++++---------
 1 file changed, 49 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index d461c440889a..a3be4bd6188f 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -180,6 +180,17 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
 EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
 
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+static inline bool kvmppc_preregistered(struct kvm_vcpu *vcpu)
+{
+	return mm_iommu_preregistered(vcpu->kvm->mm);
+}
+
+static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
+		struct kvm_vcpu *vcpu, unsigned long ua, unsigned long size)
+{
+	return mm_iommu_lookup_rm(vcpu->kvm->mm, ua, size);
+}
+
 long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		unsigned long ioba, unsigned long tce)
 {
@@ -260,23 +271,44 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	if (ret != H_SUCCESS)
 		return ret;
 
-	if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
-		return H_TOO_HARD;
+	if (kvmppc_preregistered(vcpu)) {
+		/*
+		 * We get here if guest memory was pre-registered which
+		 * is normally VFIO case and gpa->hpa translation does not
+		 * depend on hpt.
+		 */
+		struct mm_iommu_table_group_mem_t *mem;
 
-	rmap = (void *) vmalloc_to_phys(rmap);
+		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
+			return H_TOO_HARD;
 
-	/*
-	 * Synchronize with the MMU notifier callbacks in
-	 * book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
-	 * While we have the rmap lock, code running on other CPUs
-	 * cannot finish unmapping the host real page that backs
-	 * this guest real page, so we are OK to access the host
-	 * real page.
-	 */
-	lock_rmap(rmap);
-	if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
-		ret = H_TOO_HARD;
-		goto unlock_exit;
+		mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K);
+		if (!mem || mm_iommu_ua_to_hpa_rm(mem, ua, &tces))
+			return H_TOO_HARD;
+	} else {
+		/*
+		 * This is emulated devices case.
+		 * We do not require memory to be preregistered in this case
+		 * so lock rmap and do __find_linux_pte_or_hugepte().
+		 */
+		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
+			return H_TOO_HARD;
+
+		rmap = (void *) vmalloc_to_phys(rmap);
+
+		/*
+		 * Synchronize with the MMU notifier callbacks in
+		 * book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
+		 * While we have the rmap lock, code running on other CPUs
+		 * cannot finish unmapping the host real page that backs
+		 * this guest real page, so we are OK to access the host
+		 * real page.
+		 */
+		lock_rmap(rmap);
+		if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
+			ret = H_TOO_HARD;
+			goto unlock_exit;
+		}
 	}
 
 	for (i = 0; i < npages; ++i) {
@@ -290,7 +322,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	}
 
 unlock_exit:
-	unlock_rmap(rmap);
+	if (rmap)
+		unlock_rmap(rmap);
 
 	return ret;
 }
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH kernel v2 06/11] powerpc/powernv/iommu: Add real mode version of iommu_table_ops::exchange()
  2016-12-18  1:28 ` Alexey Kardashevskiy
@ 2016-12-18  1:28   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 56+ messages in thread
From: Alexey Kardashevskiy @ 2016-12-18  1:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

In real mode, TCE tables are invalidated using special
cache-inhibited store instructions which are not available in
virtual mode.

This defines and implements an exchange_rm() callback. This does not
define set_rm/clear_rm/flush_rm callbacks as there is no user for those -
exchange/exchange_rm are only to be used by KVM for VFIO.

The exchange_rm callback is defined for the IODA1/IODA2 powernv platforms.

This replaces list_for_each_entry_rcu with its lockless version as
from now on pnv_pci_ioda2_tce_invalidate() can be called in
real mode too.
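
A sketch of the expected caller pattern (an assumption for illustration;
the actual KVM user of exchange_rm()/iommu_tce_xchg_rm() is added later
in this series):

/*
 * Illustration only: swap the TCE at @entry for @hpa in real mode and
 * fall back to the virtual mode handler if the exchange cannot be done
 * here. On success @hpa/@dir describe the previous mapping to release.
 */
static long example_rm_put_tce(struct iommu_table *tbl, unsigned long entry,
		unsigned long hpa, enum dma_data_direction dir)
{
	long ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);

	return ret ? H_TOO_HARD : H_SUCCESS;
}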

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 arch/powerpc/include/asm/iommu.h          |  7 +++++++
 arch/powerpc/kernel/iommu.c               | 23 +++++++++++++++++++++++
 arch/powerpc/platforms/powernv/pci-ioda.c | 26 +++++++++++++++++++++++++-
 3 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 9de8bad1fdf9..82e77ebf85f4 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -64,6 +64,11 @@ struct iommu_table_ops {
 			long index,
 			unsigned long *hpa,
 			enum dma_data_direction *direction);
+	/* Real mode */
+	int (*exchange_rm)(struct iommu_table *tbl,
+			long index,
+			unsigned long *hpa,
+			enum dma_data_direction *direction);
 #endif
 	void (*clear)(struct iommu_table *tbl,
 			long index, long npages);
@@ -209,6 +214,8 @@ extern void iommu_del_device(struct device *dev);
 extern int __init tce_iommu_bus_notifier_init(void);
 extern long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
 		unsigned long *hpa, enum dma_data_direction *direction);
+extern long iommu_tce_xchg_rm(struct iommu_table *tbl, unsigned long entry,
+		unsigned long *hpa, enum dma_data_direction *direction);
 #else
 static inline void iommu_register_group(struct iommu_table_group *table_group,
 					int pci_domain_number,
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index d12496889ce9..d02b8d22fb50 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1022,6 +1022,29 @@ long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
 }
 EXPORT_SYMBOL_GPL(iommu_tce_xchg);
 
+long iommu_tce_xchg_rm(struct iommu_table *tbl, unsigned long entry,
+		unsigned long *hpa, enum dma_data_direction *direction)
+{
+	long ret;
+
+	ret = tbl->it_ops->exchange_rm(tbl, entry, hpa, direction);
+
+	if (!ret && ((*direction == DMA_FROM_DEVICE) ||
+			(*direction == DMA_BIDIRECTIONAL))) {
+		struct page *pg = realmode_pfn_to_page(*hpa >> PAGE_SHIFT);
+
+		if (likely(pg)) {
+			SetPageDirty(pg);
+		} else {
+			tbl->it_ops->exchange_rm(tbl, entry, hpa, direction);
+			ret = -EFAULT;
+		}
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_tce_xchg_rm);
+
 int iommu_take_ownership(struct iommu_table *tbl)
 {
 	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index ea181f02bebd..f2c2ab8fbb3e 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1855,6 +1855,17 @@ static int pnv_ioda1_tce_xchg(struct iommu_table *tbl, long index,
 
 	return ret;
 }
+
+static int pnv_ioda1_tce_xchg_rm(struct iommu_table *tbl, long index,
+		unsigned long *hpa, enum dma_data_direction *direction)
+{
+	long ret = pnv_tce_xchg(tbl, index, hpa, direction);
+
+	if (!ret)
+		pnv_pci_p7ioc_tce_invalidate(tbl, index, 1, true);
+
+	return ret;
+}
 #endif
 
 static void pnv_ioda1_tce_free(struct iommu_table *tbl, long index,
@@ -1869,6 +1880,7 @@ static struct iommu_table_ops pnv_ioda1_iommu_ops = {
 	.set = pnv_ioda1_tce_build,
 #ifdef CONFIG_IOMMU_API
 	.exchange = pnv_ioda1_tce_xchg,
+	.exchange_rm = pnv_ioda1_tce_xchg_rm,
 #endif
 	.clear = pnv_ioda1_tce_free,
 	.get = pnv_tce_get,
@@ -1943,7 +1955,7 @@ static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
 {
 	struct iommu_table_group_link *tgl;
 
-	list_for_each_entry_rcu(tgl, &tbl->it_group_list, next) {
+	list_for_each_entry_lockless(tgl, &tbl->it_group_list, next) {
 		struct pnv_ioda_pe *pe = container_of(tgl->table_group,
 				struct pnv_ioda_pe, table_group);
 		struct pnv_phb *phb = pe->phb;
@@ -1999,6 +2011,17 @@ static int pnv_ioda2_tce_xchg(struct iommu_table *tbl, long index,
 
 	return ret;
 }
+
+static int pnv_ioda2_tce_xchg_rm(struct iommu_table *tbl, long index,
+		unsigned long *hpa, enum dma_data_direction *direction)
+{
+	long ret = pnv_tce_xchg(tbl, index, hpa, direction);
+
+	if (!ret)
+		pnv_pci_ioda2_tce_invalidate(tbl, index, 1, true);
+
+	return ret;
+}
 #endif
 
 static void pnv_ioda2_tce_free(struct iommu_table *tbl, long index,
@@ -2018,6 +2041,7 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = {
 	.set = pnv_ioda2_tce_build,
 #ifdef CONFIG_IOMMU_API
 	.exchange = pnv_ioda2_tce_xchg,
+	.exchange_rm = pnv_ioda2_tce_xchg_rm,
 #endif
 	.clear = pnv_ioda2_tce_free,
 	.get = pnv_tce_get,
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH kernel v2 06/11] powerpc/powernv/iommu: Add real mode version of iommu_table_ops::exchange()
@ 2016-12-18  1:28   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 56+ messages in thread
From: Alexey Kardashevskiy @ 2016-12-18  1:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

In real mode, TCE tables are invalidated using special
cache-inhibited store instructions which are not available in
virtual mode.

This defines and implements an exchange_rm() callback. It does not
define set_rm/clear_rm/flush_rm callbacks as there are no users for those -
exchange/exchange_rm are only to be used by KVM for VFIO.

The exchange_rm callback is defined for IODA1/IODA2 powernv platforms.

This replaces list_for_each_entry_rcu with its lockless version as
from now on pnv_pci_ioda2_tce_invalidate() can be called in
real mode too.

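For illustration only (not part of the diff below), a real-mode caller
would use the new helper roughly like this; my_rm_clear_tce() is a made-up
name, while iommu_tce_xchg_rm() and the H_* return codes come from this
series:

	/* Sketch: clear one TCE entry from a real-mode hypercall path */
	static long my_rm_clear_tce(struct iommu_table *tbl, unsigned long entry)
	{
		unsigned long hpa = 0;
		enum dma_data_direction dir = DMA_NONE;

		/* Real-mode variant, backed by it_ops->exchange_rm() */
		if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
			return H_HARDWARE;

		return H_SUCCESS;
	}
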
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 arch/powerpc/include/asm/iommu.h          |  7 +++++++
 arch/powerpc/kernel/iommu.c               | 23 +++++++++++++++++++++++
 arch/powerpc/platforms/powernv/pci-ioda.c | 26 +++++++++++++++++++++++++-
 3 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 9de8bad1fdf9..82e77ebf85f4 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -64,6 +64,11 @@ struct iommu_table_ops {
 			long index,
 			unsigned long *hpa,
 			enum dma_data_direction *direction);
+	/* Real mode */
+	int (*exchange_rm)(struct iommu_table *tbl,
+			long index,
+			unsigned long *hpa,
+			enum dma_data_direction *direction);
 #endif
 	void (*clear)(struct iommu_table *tbl,
 			long index, long npages);
@@ -209,6 +214,8 @@ extern void iommu_del_device(struct device *dev);
 extern int __init tce_iommu_bus_notifier_init(void);
 extern long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
 		unsigned long *hpa, enum dma_data_direction *direction);
+extern long iommu_tce_xchg_rm(struct iommu_table *tbl, unsigned long entry,
+		unsigned long *hpa, enum dma_data_direction *direction);
 #else
 static inline void iommu_register_group(struct iommu_table_group *table_group,
 					int pci_domain_number,
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index d12496889ce9..d02b8d22fb50 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1022,6 +1022,29 @@ long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
 }
 EXPORT_SYMBOL_GPL(iommu_tce_xchg);
 
+long iommu_tce_xchg_rm(struct iommu_table *tbl, unsigned long entry,
+		unsigned long *hpa, enum dma_data_direction *direction)
+{
+	long ret;
+
+	ret = tbl->it_ops->exchange_rm(tbl, entry, hpa, direction);
+
+	if (!ret && ((*direction == DMA_FROM_DEVICE) ||
+			(*direction == DMA_BIDIRECTIONAL))) {
+		struct page *pg = realmode_pfn_to_page(*hpa >> PAGE_SHIFT);
+
+		if (likely(pg)) {
+			SetPageDirty(pg);
+		} else {
+			tbl->it_ops->exchange_rm(tbl, entry, hpa, direction);
+			ret = -EFAULT;
+		}
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_tce_xchg_rm);
+
 int iommu_take_ownership(struct iommu_table *tbl)
 {
 	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index ea181f02bebd..f2c2ab8fbb3e 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1855,6 +1855,17 @@ static int pnv_ioda1_tce_xchg(struct iommu_table *tbl, long index,
 
 	return ret;
 }
+
+static int pnv_ioda1_tce_xchg_rm(struct iommu_table *tbl, long index,
+		unsigned long *hpa, enum dma_data_direction *direction)
+{
+	long ret = pnv_tce_xchg(tbl, index, hpa, direction);
+
+	if (!ret)
+		pnv_pci_p7ioc_tce_invalidate(tbl, index, 1, true);
+
+	return ret;
+}
 #endif
 
 static void pnv_ioda1_tce_free(struct iommu_table *tbl, long index,
@@ -1869,6 +1880,7 @@ static struct iommu_table_ops pnv_ioda1_iommu_ops = {
 	.set = pnv_ioda1_tce_build,
 #ifdef CONFIG_IOMMU_API
 	.exchange = pnv_ioda1_tce_xchg,
+	.exchange_rm = pnv_ioda1_tce_xchg_rm,
 #endif
 	.clear = pnv_ioda1_tce_free,
 	.get = pnv_tce_get,
@@ -1943,7 +1955,7 @@ static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
 {
 	struct iommu_table_group_link *tgl;
 
-	list_for_each_entry_rcu(tgl, &tbl->it_group_list, next) {
+	list_for_each_entry_lockless(tgl, &tbl->it_group_list, next) {
 		struct pnv_ioda_pe *pe = container_of(tgl->table_group,
 				struct pnv_ioda_pe, table_group);
 		struct pnv_phb *phb = pe->phb;
@@ -1999,6 +2011,17 @@ static int pnv_ioda2_tce_xchg(struct iommu_table *tbl, long index,
 
 	return ret;
 }
+
+static int pnv_ioda2_tce_xchg_rm(struct iommu_table *tbl, long index,
+		unsigned long *hpa, enum dma_data_direction *direction)
+{
+	long ret = pnv_tce_xchg(tbl, index, hpa, direction);
+
+	if (!ret)
+		pnv_pci_ioda2_tce_invalidate(tbl, index, 1, true);
+
+	return ret;
+}
 #endif
 
 static void pnv_ioda2_tce_free(struct iommu_table *tbl, long index,
@@ -2018,6 +2041,7 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = {
 	.set = pnv_ioda2_tce_build,
 #ifdef CONFIG_IOMMU_API
 	.exchange = pnv_ioda2_tce_xchg,
+	.exchange_rm = pnv_ioda2_tce_xchg_rm,
 #endif
 	.clear = pnv_ioda2_tce_free,
 	.get = pnv_tce_get,
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH kernel v2 07/11] KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently
  2016-12-18  1:28 ` Alexey Kardashevskiy
@ 2016-12-18  1:28   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 56+ messages in thread
From: Alexey Kardashevskiy @ 2016-12-18  1:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

It does not make much sense to have KVM on book3s-64 and
not have the IOMMU bits for PCI passthrough support, as they cost little
and allow VFIO to function on book3s KVM.

Having IOMMU_API always enabled makes it unnecessary to have a lot of
"#ifdef IOMMU_API" in arch/powerpc/kvm/book3s_64_vio*. With those
ifdefs we could only have userspace-emulated devices accelerated
(but not VFIO), which does not seem very useful.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 arch/powerpc/kvm/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 029be26b5a17..65a471de96de 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -67,6 +67,7 @@ config KVM_BOOK3S_64
 	select KVM_BOOK3S_64_HANDLER
 	select KVM
 	select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
+	select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
 	---help---
 	  Support running unmodified book3s_64 and book3s_32 guest kernels
 	  in virtual machines on book3s_64 host processors.
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH kernel v2 07/11] KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently
@ 2016-12-18  1:28   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 56+ messages in thread
From: Alexey Kardashevskiy @ 2016-12-18  1:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

It does not make much sense to have KVM on book3s-64 and
not have the IOMMU bits for PCI passthrough support, as they cost little
and allow VFIO to function on book3s KVM.

Having IOMMU_API always enabled makes it unnecessary to have a lot of
"#ifdef IOMMU_API" in arch/powerpc/kvm/book3s_64_vio*. With those
ifdefs we could only have userspace-emulated devices accelerated
(but not VFIO), which does not seem very useful.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 arch/powerpc/kvm/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 029be26b5a17..65a471de96de 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -67,6 +67,7 @@ config KVM_BOOK3S_64
 	select KVM_BOOK3S_64_HANDLER
 	select KVM
 	select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
+	select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
 	---help---
 	  Support running unmodified book3s_64 and book3s_32 guest kernels
 	  in virtual machines on book3s_64 host processors.
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH kernel v2 08/11] KVM: PPC: Pass kvm* to kvmppc_find_table()
  2016-12-18  1:28 ` Alexey Kardashevskiy
@ 2016-12-18  1:28   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 56+ messages in thread
From: Alexey Kardashevskiy @ 2016-12-18  1:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

The guest view TCE tables are per KVM anyway (not per VCPU) so pass kvm*
there. This will be used in the following patches where we will be
attaching VFIO containers to LIOBNs via ioctl() to KVM (rather than
to VCPU).

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 arch/powerpc/include/asm/kvm_ppc.h  |  2 +-
 arch/powerpc/kvm/book3s_64_vio.c    |  7 ++++---
 arch/powerpc/kvm/book3s_64_vio_hv.c | 13 +++++++------
 3 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index f6e49640dbe1..0a21c8503974 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -167,7 +167,7 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce_64 *args);
 extern struct kvmppc_spapr_tce_table *kvmppc_find_table(
-		struct kvm_vcpu *vcpu, unsigned long liobn);
+		struct kvm *kvm, unsigned long liobn);
 extern long kvmppc_ioba_validate(struct kvmppc_spapr_tce_table *stt,
 		unsigned long ioba, unsigned long npages);
 extern long kvmppc_tce_validate(struct kvmppc_spapr_tce_table *tt,
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index c379ff5a4438..15df8ae627d9 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -212,12 +212,13 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba, unsigned long tce)
 {
-	struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
+	struct kvmppc_spapr_tce_table *stt;
 	long ret;
 
 	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
 	/* 	    liobn, ioba, tce); */
 
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
@@ -245,7 +246,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	u64 __user *tces;
 	u64 tce;
 
-	stt = kvmppc_find_table(vcpu, liobn);
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
@@ -299,7 +300,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret;
 
-	stt = kvmppc_find_table(vcpu, liobn);
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index a3be4bd6188f..8a6834e6e1c8 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -49,10 +49,9 @@
  * WARNING: This will be called in real or virtual mode on HV KVM and virtual
  *          mode on PR KVM
  */
-struct kvmppc_spapr_tce_table *kvmppc_find_table(struct kvm_vcpu *vcpu,
+struct kvmppc_spapr_tce_table *kvmppc_find_table(struct kvm *kvm,
 		unsigned long liobn)
 {
-	struct kvm *kvm = vcpu->kvm;
 	struct kvmppc_spapr_tce_table *stt;
 
 	list_for_each_entry_lockless(stt, &kvm->arch.spapr_tce_tables, list)
@@ -194,12 +193,13 @@ static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
 long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		unsigned long ioba, unsigned long tce)
 {
-	struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
+	struct kvmppc_spapr_tce_table *stt;
 	long ret;
 
 	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
 	/* 	    liobn, ioba, tce); */
 
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
@@ -252,7 +252,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	unsigned long tces, entry, ua = 0;
 	unsigned long *rmap = NULL;
 
-	stt = kvmppc_find_table(vcpu, liobn);
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
@@ -335,7 +335,7 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret;
 
-	stt = kvmppc_find_table(vcpu, liobn);
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
@@ -356,12 +356,13 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 long kvmppc_h_get_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba)
 {
-	struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
+	struct kvmppc_spapr_tce_table *stt;
 	long ret;
 	unsigned long idx;
 	struct page *page;
 	u64 *tbl;
 
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH kernel v2 08/11] KVM: PPC: Pass kvm* to kvmppc_find_table()
@ 2016-12-18  1:28   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 56+ messages in thread
From: Alexey Kardashevskiy @ 2016-12-18  1:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

The guest view TCE tables are per KVM anyway (not per VCPU) so pass kvm*
there. This will be used in the following patches where we will be
attaching VFIO containers to LIOBNs via ioctl() to KVM (rather than
to VCPU).

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 arch/powerpc/include/asm/kvm_ppc.h  |  2 +-
 arch/powerpc/kvm/book3s_64_vio.c    |  7 ++++---
 arch/powerpc/kvm/book3s_64_vio_hv.c | 13 +++++++------
 3 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index f6e49640dbe1..0a21c8503974 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -167,7 +167,7 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce_64 *args);
 extern struct kvmppc_spapr_tce_table *kvmppc_find_table(
-		struct kvm_vcpu *vcpu, unsigned long liobn);
+		struct kvm *kvm, unsigned long liobn);
 extern long kvmppc_ioba_validate(struct kvmppc_spapr_tce_table *stt,
 		unsigned long ioba, unsigned long npages);
 extern long kvmppc_tce_validate(struct kvmppc_spapr_tce_table *tt,
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index c379ff5a4438..15df8ae627d9 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -212,12 +212,13 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba, unsigned long tce)
 {
-	struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
+	struct kvmppc_spapr_tce_table *stt;
 	long ret;
 
 	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
 	/* 	    liobn, ioba, tce); */
 
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
@@ -245,7 +246,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	u64 __user *tces;
 	u64 tce;
 
-	stt = kvmppc_find_table(vcpu, liobn);
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
@@ -299,7 +300,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret;
 
-	stt = kvmppc_find_table(vcpu, liobn);
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index a3be4bd6188f..8a6834e6e1c8 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -49,10 +49,9 @@
  * WARNING: This will be called in real or virtual mode on HV KVM and virtual
  *          mode on PR KVM
  */
-struct kvmppc_spapr_tce_table *kvmppc_find_table(struct kvm_vcpu *vcpu,
+struct kvmppc_spapr_tce_table *kvmppc_find_table(struct kvm *kvm,
 		unsigned long liobn)
 {
-	struct kvm *kvm = vcpu->kvm;
 	struct kvmppc_spapr_tce_table *stt;
 
 	list_for_each_entry_lockless(stt, &kvm->arch.spapr_tce_tables, list)
@@ -194,12 +193,13 @@ static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
 long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		unsigned long ioba, unsigned long tce)
 {
-	struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
+	struct kvmppc_spapr_tce_table *stt;
 	long ret;
 
 	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
 	/* 	    liobn, ioba, tce); */
 
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
@@ -252,7 +252,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	unsigned long tces, entry, ua = 0;
 	unsigned long *rmap = NULL;
 
-	stt = kvmppc_find_table(vcpu, liobn);
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
@@ -335,7 +335,7 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret;
 
-	stt = kvmppc_find_table(vcpu, liobn);
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
@@ -356,12 +356,13 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 long kvmppc_h_get_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba)
 {
-	struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
+	struct kvmppc_spapr_tce_table *stt;
 	long ret;
 	unsigned long idx;
 	struct page *page;
 	u64 *tbl;
 
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH kernel v2 09/11] vfio iommu: Add helpers to (un)register blocking notifiers per group
  2016-12-18  1:28 ` Alexey Kardashevskiy
@ 2016-12-18  1:28   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 56+ messages in thread
From: Alexey Kardashevskiy @ 2016-12-18  1:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

c086de81 "vfio iommu: Add blocking notifier to notify DMA_UNMAP" added
notifiers to a VFIO group. However, even though the code underneath
uses groups, the API takes device struct pointers.

This adds helpers which do the same thing but take IOMMU groups instead.

This adds vfio_iommu_group_set_kvm() which is a wrapper on top of
vfio_group_set_kvm() but also takes an iommu_group.

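An illustrative sketch of a consumer of the new group-based helpers
(my_group_notifier()/my_watch_group() are made-up names; the
vfio_iommu_group_* calls and VFIO_GROUP_NOTIFY_SET_KVM are the API
used later in this series):

	static int my_group_notifier(struct notifier_block *nb,
			unsigned long action, void *data)
	{
		/* For VFIO_GROUP_NOTIFY_SET_KVM, data carries the new kvm
		 * pointer, or NULL when the group loses its KVM association */
		return NOTIFY_OK;
	}

	static struct notifier_block my_nb = {
		.notifier_call = my_group_notifier,
	};

	static int my_watch_group(struct iommu_group *grp)
	{
		unsigned long events = VFIO_GROUP_NOTIFY_SET_KVM;

		/* Register on the IOMMU group rather than on a struct device */
		return vfio_iommu_group_register_notifier(grp, VFIO_GROUP_NOTIFY,
				&events, &my_nb);
	}
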
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 include/linux/vfio.h |  6 +++++
 drivers/vfio/vfio.c  | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 82 insertions(+)

diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index edf9b2cad277..8a3488ba4732 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -127,9 +127,15 @@ extern int vfio_register_notifier(struct device *dev,
 extern int vfio_unregister_notifier(struct device *dev,
 				    enum vfio_notify_type type,
 				    struct notifier_block *nb);
+extern int vfio_iommu_group_register_notifier(struct iommu_group *grp,
+		enum vfio_notify_type type, unsigned long *events,
+		struct notifier_block *nb);
+extern int vfio_iommu_group_unregister_notifier(struct iommu_group *grp,
+		enum vfio_notify_type type, struct notifier_block *nb);
 
 struct kvm;
 extern void vfio_group_set_kvm(struct vfio_group *group, struct kvm *kvm);
+extern void vfio_iommu_group_set_kvm(struct iommu_group *grp, struct kvm *kvm);
 
 /*
  * Sub-module helpers
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 9901c4671e2f..6b9a98508939 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -2077,6 +2077,23 @@ void vfio_group_set_kvm(struct vfio_group *group, struct kvm *kvm)
 }
 EXPORT_SYMBOL_GPL(vfio_group_set_kvm);
 
+void vfio_iommu_group_set_kvm(struct iommu_group *grp, struct kvm *kvm)
+{
+	struct vfio_group *group;
+
+	if (!grp)
+		return;
+
+	group = vfio_group_get_from_iommu(grp);
+	if (!group)
+		return;
+
+	vfio_group_set_kvm(group, kvm);
+
+	vfio_group_put(group);
+}
+EXPORT_SYMBOL_GPL(vfio_iommu_group_set_kvm);
+
 static int vfio_register_group_notifier(struct vfio_group *group,
 					unsigned long *events,
 					struct notifier_block *nb)
@@ -2197,6 +2214,65 @@ int vfio_unregister_notifier(struct device *dev, enum vfio_notify_type type,
 }
 EXPORT_SYMBOL(vfio_unregister_notifier);
 
+int vfio_iommu_group_register_notifier(struct iommu_group *iommugroup,
+		enum vfio_notify_type type,
+		unsigned long *events, struct notifier_block *nb)
+{
+	struct vfio_group *group;
+	int ret;
+
+	if (!iommugroup || !nb || !events || (*events == 0))
+		return -EINVAL;
+
+	group = vfio_group_get_from_iommu(iommugroup);
+	if (!group)
+		return -ENODEV;
+
+	switch (type) {
+	case VFIO_IOMMU_NOTIFY:
+		ret = vfio_register_iommu_notifier(group, events, nb);
+		break;
+	case VFIO_GROUP_NOTIFY:
+		ret = vfio_register_group_notifier(group, events, nb);
+		break;
+	default:
+		ret = -EINVAL;
+	}
+
+	vfio_group_put(group);
+	return ret;
+}
+EXPORT_SYMBOL(vfio_iommu_group_register_notifier);
+
+int vfio_iommu_group_unregister_notifier(struct iommu_group *grp,
+		enum vfio_notify_type type, struct notifier_block *nb)
+{
+	struct vfio_group *group;
+	int ret;
+
+	if (!grp || !nb)
+		return -EINVAL;
+
+	group = vfio_group_get_from_iommu(grp);
+	if (!group)
+		return -ENODEV;
+
+	switch (type) {
+	case VFIO_IOMMU_NOTIFY:
+		ret = vfio_unregister_iommu_notifier(group, nb);
+		break;
+	case VFIO_GROUP_NOTIFY:
+		ret = vfio_unregister_group_notifier(group, nb);
+		break;
+	default:
+		ret = -EINVAL;
+	}
+
+	vfio_group_put(group);
+	return ret;
+}
+EXPORT_SYMBOL(vfio_iommu_group_unregister_notifier);
+
 /**
  * Module/class support
  */
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH kernel v2 09/11] vfio iommu: Add helpers to (un)register blocking notifiers per group
@ 2016-12-18  1:28   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 56+ messages in thread
From: Alexey Kardashevskiy @ 2016-12-18  1:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

c086de81 "vfio iommu: Add blocking notifier to notify DMA_UNMAP" added
notifiers to a VFIO group. However, even though the code underneath
uses groups, the API takes device struct pointers.

This adds helpers which do the same thing but take IOMMU groups instead.

This adds vfio_iommu_group_set_kvm() which is a wrapper on top of
vfio_group_set_kvm() but also takes an iommu_group.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 include/linux/vfio.h |  6 +++++
 drivers/vfio/vfio.c  | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 82 insertions(+)

diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index edf9b2cad277..8a3488ba4732 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -127,9 +127,15 @@ extern int vfio_register_notifier(struct device *dev,
 extern int vfio_unregister_notifier(struct device *dev,
 				    enum vfio_notify_type type,
 				    struct notifier_block *nb);
+extern int vfio_iommu_group_register_notifier(struct iommu_group *grp,
+		enum vfio_notify_type type, unsigned long *events,
+		struct notifier_block *nb);
+extern int vfio_iommu_group_unregister_notifier(struct iommu_group *grp,
+		enum vfio_notify_type type, struct notifier_block *nb);
 
 struct kvm;
 extern void vfio_group_set_kvm(struct vfio_group *group, struct kvm *kvm);
+extern void vfio_iommu_group_set_kvm(struct iommu_group *grp, struct kvm *kvm);
 
 /*
  * Sub-module helpers
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 9901c4671e2f..6b9a98508939 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -2077,6 +2077,23 @@ void vfio_group_set_kvm(struct vfio_group *group, struct kvm *kvm)
 }
 EXPORT_SYMBOL_GPL(vfio_group_set_kvm);
 
+void vfio_iommu_group_set_kvm(struct iommu_group *grp, struct kvm *kvm)
+{
+	struct vfio_group *group;
+
+	if (!grp)
+		return;
+
+	group = vfio_group_get_from_iommu(grp);
+	if (!group)
+		return;
+
+	vfio_group_set_kvm(group, kvm);
+
+	vfio_group_put(group);
+}
+EXPORT_SYMBOL_GPL(vfio_iommu_group_set_kvm);
+
 static int vfio_register_group_notifier(struct vfio_group *group,
 					unsigned long *events,
 					struct notifier_block *nb)
@@ -2197,6 +2214,65 @@ int vfio_unregister_notifier(struct device *dev, enum vfio_notify_type type,
 }
 EXPORT_SYMBOL(vfio_unregister_notifier);
 
+int vfio_iommu_group_register_notifier(struct iommu_group *iommugroup,
+		enum vfio_notify_type type,
+		unsigned long *events, struct notifier_block *nb)
+{
+	struct vfio_group *group;
+	int ret;
+
+	if (!iommugroup || !nb || !events || (*events == 0))
+		return -EINVAL;
+
+	group = vfio_group_get_from_iommu(iommugroup);
+	if (!group)
+		return -ENODEV;
+
+	switch (type) {
+	case VFIO_IOMMU_NOTIFY:
+		ret = vfio_register_iommu_notifier(group, events, nb);
+		break;
+	case VFIO_GROUP_NOTIFY:
+		ret = vfio_register_group_notifier(group, events, nb);
+		break;
+	default:
+		ret = -EINVAL;
+	}
+
+	vfio_group_put(group);
+	return ret;
+}
+EXPORT_SYMBOL(vfio_iommu_group_register_notifier);
+
+int vfio_iommu_group_unregister_notifier(struct iommu_group *grp,
+		enum vfio_notify_type type, struct notifier_block *nb)
+{
+	struct vfio_group *group;
+	int ret;
+
+	if (!grp || !nb)
+		return -EINVAL;
+
+	group = vfio_group_get_from_iommu(grp);
+	if (!group)
+		return -ENODEV;
+
+	switch (type) {
+	case VFIO_IOMMU_NOTIFY:
+		ret = vfio_unregister_iommu_notifier(group, nb);
+		break;
+	case VFIO_GROUP_NOTIFY:
+		ret = vfio_unregister_group_notifier(group, nb);
+		break;
+	default:
+		ret = -EINVAL;
+	}
+
+	vfio_group_put(group);
+	return ret;
+}
+EXPORT_SYMBOL(vfio_iommu_group_unregister_notifier);
+
 /**
  * Module/class support
  */
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH kernel v2 10/11] vfio: Check for unregistered notifiers when group is actually released
  2016-12-18  1:28 ` Alexey Kardashevskiy
@ 2016-12-18  1:28   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 56+ messages in thread
From: Alexey Kardashevskiy @ 2016-12-18  1:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

This moves the check for unregistered notifiers from the fops release
callback to the place where the group is actually released.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---

This is going to be used in the following patch in the cleanup
path. Since the next patch is an RFC, this one might not be needed.
---
 drivers/vfio/vfio.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 6b9a98508939..083b581e87c0 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -403,6 +403,8 @@ static void vfio_group_release(struct kref *kref)
 	struct iommu_group *iommu_group = group->iommu_group;
 
 	WARN_ON(!list_empty(&group->device_list));
+	/* Any user didn't unregister? */
+	WARN_ON(group->notifier.head);
 
 	list_for_each_entry_safe(unbound, tmp,
 				 &group->unbound_list, unbound_next) {
@@ -1584,9 +1586,6 @@ static int vfio_group_fops_release(struct inode *inode, struct file *filep)
 
 	filep->private_data = NULL;
 
-	/* Any user didn't unregister? */
-	WARN_ON(group->notifier.head);
-
 	vfio_group_try_dissolve_container(group);
 
 	atomic_dec(&group->opened);
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH kernel v2 10/11] vfio: Check for unregistered notifiers when group is actually released
@ 2016-12-18  1:28   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 56+ messages in thread
From: Alexey Kardashevskiy @ 2016-12-18  1:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

This moves the check for unregistered notifiers from the fops release
callback to the place where the group is actually released.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---

This is going to be used in the following patch in the cleanup
path. Since the next patch is an RFC, this one might not be needed.
---
 drivers/vfio/vfio.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 6b9a98508939..083b581e87c0 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -403,6 +403,8 @@ static void vfio_group_release(struct kref *kref)
 	struct iommu_group *iommu_group = group->iommu_group;
 
 	WARN_ON(!list_empty(&group->device_list));
+	/* Any user didn't unregister? */
+	WARN_ON(group->notifier.head);
 
 	list_for_each_entry_safe(unbound, tmp,
 				 &group->unbound_list, unbound_next) {
@@ -1584,9 +1586,6 @@ static int vfio_group_fops_release(struct inode *inode, struct file *filep)
 
 	filep->private_data = NULL;
 
-	/* Any user didn't unregister? */
-	WARN_ON(group->notifier.head);
-
 	vfio_group_try_dissolve_container(group);
 
 	atomic_dec(&group->opened);
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH kernel v2 11/11] KVM: PPC: Add in-kernel acceleration for VFIO
  2016-12-18  1:28 ` Alexey Kardashevskiy
@ 2016-12-18  1:29   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 56+ messages in thread
From: Alexey Kardashevskiy @ 2016-12-18  1:29 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
without passing them to user space, which saves time on switching
to user space and back.

This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
KVM tries to handle a TCE request in real mode; if that fails,
it passes the request to virtual mode to complete the operation.
If the virtual mode handler fails as well, the request is passed to
user space; this is not expected to happen though.

To avoid dealing with page use counters (which is tricky in real mode),
this only accelerates SPAPR TCE IOMMU v2 clients which are required
to pre-register the userspace memory. The very first TCE request will
be handled in the VFIO SPAPR TCE driver anyway as the userspace view
of the TCE table (iommu_table::it_userspace) is not allocated till
the very first mapping happens and we cannot call vmalloc in real mode.

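For reference, a minimal userspace sketch of that pre-registration step
as performed by a v2 client such as QEMU (container_fd, guest_ram and
guest_ram_size are placeholders; the ioctl and struct are the existing
SPAPR TCE IOMMU v2 API):

	struct vfio_iommu_spapr_register_memory reg = {
		.argsz = sizeof(reg),
		.flags = 0,
		.vaddr = (__u64)(uintptr_t)guest_ram,	/* userspace address of guest RAM */
		.size  = guest_ram_size,
	};

	/* container_fd: VFIO container backed by the SPAPR TCE IOMMU v2 driver */
	if (ioctl(container_fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg))
		perror("VFIO_IOMMU_SPAPR_REGISTER_MEMORY");
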
This adds a new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
the VFIO KVM device. It takes a VFIO group fd and an SPAPR TCE table fd
and associates a physical IOMMU table with the SPAPR TCE table (which
is a guest view of the hardware IOMMU table). The iommu_table object
is referenced so we do not have to retrieve it in real mode when a
hypercall happens.

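A minimal userspace sketch of setting the new attribute
(vfio_kvm_device_fd, group_fd and table_fd are placeholders for the fds
obtained from KVM_CREATE_DEVICE, the open VFIO group and
KVM_CREATE_SPAPR_TCE respectively):

	struct kvm_vfio_spapr_tce param = {
		.argsz   = sizeof(param),
		.flags   = 0,
		.groupfd = group_fd,	/* VFIO group file descriptor */
		.tablefd = table_fd,	/* fd returned by KVM_CREATE_SPAPR_TCE */
	};
	struct kvm_device_attr attr = {
		.group = KVM_DEV_VFIO_GROUP,
		.attr  = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE,
		.addr  = (__u64)(uintptr_t)&param,
	};

	if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr))
		perror("KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE");
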
This does not implement the UNSET counterpart as there is no use for it -
once the acceleration is enabled, the existing userspace won't
disable it unless a VFIO container is destroyed - so this adds the necessary
cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.

This uses the kvm->lock mutex to protect against a race between
the VFIO KVM device's kvm_vfio_destroy() and SPAPR TCE table fd's
release() callback.

This uses the recently introduced per-group VFIO notifiers to do VFIO KVM
device cleanup. The KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE handler registers
a notifier which removes itself from the notifier chain when called
with kvm==NULL. Since a blocking notifier cannot remove itself while
it is executing, this uses RCU to do the actual removal. To make this work,
KVM holds an additional (to KVM_DEV_VFIO_GROUP_ADD) external user
reference to the group and also relies on
"vfio: Check for unregistered notifiers when group is actually released".
While it works in most cases, there is a problem with VFIO PCI hotunplug:
QEMU calls vfio_kvm_device_del_group() and vfio_disconnect_container(),
and the VFIO_GROUP_UNSET_CONTAINER handler may fail if the RCU handler
has not been called yet, so the group->container_users counter is still
non-zero; this is worked around by retrying VFIO_GROUP_UNSET_CONTAINER
in QEMU (this needs to be fixed, suggestions are welcome).

As this creates a descriptor per IOMMU table/LIOBN pair (called
kvmppc_spapr_tce_iommu_table), it is possible to have several
descriptors with the same iommu_table (hardware IOMMU table) attached
to the same LIOBN; this is done to simplify the cleanup and can be
improved later (even though there does not seem to be a user for it
any time soon).

This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
space.

This finally makes use of vfio_external_user_iommu_id() which was
introduced quite some time ago and was considered for removal.

Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb ethernet card).

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v2:
* reworked to use new VFIO notifiers
* now same iommu_table may appear in the list several times, to be fixed later
---
 Documentation/virtual/kvm/devices/vfio.txt |  22 +-
 arch/powerpc/include/asm/kvm_host.h        |  11 +
 arch/powerpc/include/asm/kvm_ppc.h         |   4 +
 include/uapi/linux/kvm.h                   |   8 +
 arch/powerpc/kvm/book3s_64_vio.c           | 352 +++++++++++++++++++++++++++++
 arch/powerpc/kvm/book3s_64_vio_hv.c        | 178 +++++++++++++++
 arch/powerpc/kvm/powerpc.c                 |   2 +
 virt/kvm/vfio.c                            |  82 +++++++
 8 files changed, 657 insertions(+), 2 deletions(-)

diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
index ef51740c67ca..f95d867168ea 100644
--- a/Documentation/virtual/kvm/devices/vfio.txt
+++ b/Documentation/virtual/kvm/devices/vfio.txt
@@ -16,7 +16,25 @@ Groups:
 
 KVM_DEV_VFIO_GROUP attributes:
   KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
+	kvm_device_attr.addr points to an int32_t file descriptor
+	for the VFIO group.
   KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
+	kvm_device_attr.addr points to an int32_t file descriptor
+	for the VFIO group.
+  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
+	allocated by sPAPR KVM.
+	kvm_device_attr.addr points to a struct:
 
-For each, kvm_device_attr.addr points to an int32_t file descriptor
-for the VFIO group.
+	struct kvm_vfio_spapr_tce {
+		__u32	argsz;
+		__u32	flags;
+		__s32	groupfd;
+		__s32	tablefd;
+	};
+
+	where
+	@argsz is the size of struct kvm_vfio_spapr_tce;
+	@flags are not supported now, must be zero;
+	@groupfd is a file descriptor for a VFIO group;
+	@tablefd is a file descriptor for a TCE table allocated via
+		KVM_CREATE_SPAPR_TCE.
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 28350a294b1e..1079c25c1973 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -191,6 +191,16 @@ struct kvmppc_pginfo {
 	atomic_t refcnt;
 };
 
+struct kvmppc_spapr_tce_iommu_table {
+	struct rcu_head rcu;
+	struct list_head next;
+	struct iommu_group *grp;
+	struct vfio_group *group;
+	struct iommu_table *tbl;
+	bool dying;
+	struct notifier_block notifier;
+};
+
 struct kvmppc_spapr_tce_table {
 	struct list_head list;
 	struct kvm *kvm;
@@ -199,6 +209,7 @@ struct kvmppc_spapr_tce_table {
 	u32 page_shift;
 	u64 offset;		/* in pages */
 	u64 size;		/* window size in pages */
+	struct list_head iommu_tables;
 	struct page *pages[0];
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 0a21c8503974..113fb2db2de9 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -163,6 +163,10 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
 extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
 			struct kvm_memory_slot *memslot, unsigned long porder);
 extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
+extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm,
+		int tablefd,
+		struct vfio_group *group,
+		struct iommu_group *grp);
 
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce_64 *args);
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 810f74317987..4088da4a575f 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1068,6 +1068,7 @@ struct kvm_device_attr {
 #define  KVM_DEV_VFIO_GROUP			1
 #define   KVM_DEV_VFIO_GROUP_ADD			1
 #define   KVM_DEV_VFIO_GROUP_DEL			2
+#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
 
 enum kvm_device_type {
 	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
@@ -1089,6 +1090,13 @@ enum kvm_device_type {
 	KVM_DEV_TYPE_MAX,
 };
 
+struct kvm_vfio_spapr_tce {
+	__u32	argsz;
+	__u32	flags;
+	__s32	groupfd;
+	__s32	tablefd;
+};
+
 /*
  * ioctls for VM fds
  */
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 15df8ae627d9..9ab04cdcc578 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -27,6 +27,10 @@
 #include <linux/hugetlb.h>
 #include <linux/list.h>
 #include <linux/anon_inodes.h>
+#include <linux/iommu.h>
+#include <linux/file.h>
+#include <linux/vfio.h>
+#include <linux/module.h>
 
 #include <asm/tlbflush.h>
 #include <asm/kvm_ppc.h>
@@ -39,6 +43,74 @@
 #include <asm/udbg.h>
 #include <asm/iommu.h>
 #include <asm/tce.h>
+#include <asm/mmu_context.h>
+
+extern int kvm_vfio_iommu_group_register_notifier(
+		struct iommu_group *iommugroup, enum vfio_notify_type type,
+		unsigned long *events, struct notifier_block *nb)
+{
+	int ret;
+	int (*fn)(struct iommu_group *iommugroup,
+			enum vfio_notify_type type,
+			unsigned long *events, struct notifier_block *nb);
+
+	fn = symbol_get(vfio_iommu_group_register_notifier);
+	if (!fn)
+		return -EINVAL;
+
+	ret = fn(iommugroup, type, events, nb);
+
+	symbol_put(vfio_iommu_group_register_notifier);
+
+	return ret;
+}
+
+extern int kvm_vfio_iommu_group_unregister_notifier(
+		struct iommu_group *iommugroup, enum vfio_notify_type type,
+		struct notifier_block *nb)
+{
+	int ret;
+	int (*fn)(struct iommu_group *iommugroup,
+			enum vfio_notify_type type,
+			struct notifier_block *nb);
+
+	fn = symbol_get(vfio_iommu_group_unregister_notifier);
+	if (!fn)
+		return -EINVAL;
+
+	ret = fn(iommugroup, type, nb);
+
+	symbol_put(vfio_iommu_group_unregister_notifier);
+
+	return ret;
+}
+
+extern void kvm_vfio_iommu_group_set_kvm(struct iommu_group *grp,
+		struct kvm *kvm)
+{
+	void (*fn)(struct iommu_group *grp, struct kvm *kvm);
+
+	fn = symbol_get(vfio_iommu_group_set_kvm);
+	if (!fn)
+		return;
+
+	fn(grp, kvm);
+
+	symbol_put(vfio_iommu_group_set_kvm);
+}
+
+static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
+{
+	void (*fn)(struct vfio_group *);
+
+	fn = symbol_get(vfio_group_put_external_user);
+	if (!fn)
+		return;
+
+	fn(vfio_group);
+
+	symbol_put(vfio_group_put_external_user);
+}
 
 static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
 {
@@ -129,9 +201,14 @@ static int kvm_spapr_tce_mmap(struct file *file, struct vm_area_struct *vma)
 static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
 {
 	struct kvmppc_spapr_tce_table *stt = filp->private_data;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
+	kick_all_cpus_sync();
 	list_del_rcu(&stt->list);
 
+	list_for_each_entry_rcu(stit, &stt->iommu_tables, next)
+		kvm_vfio_iommu_group_set_kvm(stit->grp, NULL);
+
 	kvm_put_kvm(stt->kvm);
 
 	kvmppc_account_memlimit(
@@ -146,6 +223,108 @@ static const struct file_operations kvm_spapr_tce_fops = {
 	.release	= kvm_spapr_tce_release,
 };
 
+static void kvm_spapr_tce_iommu_notifier_free(struct rcu_head *head)
+{
+	struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
+			struct kvmppc_spapr_tce_iommu_table, rcu);
+
+	kvm_vfio_iommu_group_unregister_notifier(stit->grp, VFIO_GROUP_NOTIFY,
+			&stit->notifier);
+	iommu_table_put(stit->tbl);
+	kvm_vfio_group_put_external_user(stit->group);
+	kfree(stit);
+}
+
+static int kvm_spapr_tce_iommu_notifier(struct notifier_block *nb,
+		unsigned long action, void *data)
+{
+	struct kvmppc_spapr_tce_iommu_table *stit = container_of(nb,
+			struct kvmppc_spapr_tce_iommu_table, notifier);
+
+	if (data)
+		return 0;
+
+	if (stit->dying)
+		return 0;
+
+	stit->dying = true;
+
+	list_del_rcu(&stit->next);
+
+	call_rcu(&stit->rcu, kvm_spapr_tce_iommu_notifier_free);
+
+	return 0;
+}
+
+extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
+		struct vfio_group *group,
+		struct iommu_group *grp)
+{
+	struct kvmppc_spapr_tce_table *stt = NULL;
+	bool found = false;
+	struct iommu_table *tbl = NULL;
+	struct iommu_table_group *table_group;
+	long i, ret;
+	struct kvmppc_spapr_tce_iommu_table *stit;
+	unsigned long events;
+	struct fd f;
+
+	f = fdget(tablefd);
+	if (!f.file)
+		return -EBADF;
+
+	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
+		if (stt == f.file->private_data) {
+			found = true;
+			break;
+		}
+	}
+
+	fdput(f);
+
+	if (!found)
+		return -ENODEV;
+
+	table_group = iommu_group_get_iommudata(grp);
+	if (!table_group)
+		return -EFAULT;
+
+	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
+		struct iommu_table *tbltmp = table_group->tables[i];
+
+		if (!tbltmp)
+			continue;
+
+		if ((tbltmp->it_page_shift == stt->page_shift) &&
+				(tbltmp->it_offset == stt->offset)) {
+			tbl = tbltmp;
+			break;
+		}
+	}
+	if (!tbl)
+		return -ENODEV;
+
+	iommu_table_get(tbl);
+
+	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
+	stit->tbl = tbl;
+	stit->group = group;
+	stit->grp = grp;
+	stit->notifier.notifier_call = kvm_spapr_tce_iommu_notifier;
+
+	events = VFIO_GROUP_NOTIFY_SET_KVM;
+	ret = kvm_vfio_iommu_group_register_notifier(grp,
+			VFIO_GROUP_NOTIFY, &events, &stit->notifier);
+	if (ret) {
+		iommu_table_put(tbl);
+		kfree(stit);
+		return ret;
+	}
+	list_add_rcu(&stit->next, &stt->iommu_tables);
+
+	return 0;
+}
+
 long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				   struct kvm_create_spapr_tce_64 *args)
 {
@@ -181,6 +360,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 	stt->offset = args->offset;
 	stt->size = size;
 	stt->kvm = kvm;
+	INIT_LIST_HEAD_RCU(&stt->iommu_tables);
 
 	for (i = 0; i < npages; i++) {
 		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
@@ -209,11 +389,161 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 	return ret;
 }
 
+static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	struct mm_iommu_table_group_mem_t *mem = NULL;
+	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
+	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+
+	if (!pua)
+		return H_HARDWARE;
+
+	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
+	if (!mem)
+		return H_HARDWARE;
+
+	mm_iommu_mapped_dec(mem);
+
+	*pua = 0;
+
+	return H_SUCCESS;
+}
+
+static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	enum dma_data_direction dir = DMA_NONE;
+	unsigned long hpa = 0;
+
+	if (iommu_tce_xchg(tbl, entry, &hpa, &dir))
+		return H_HARDWARE;
+
+	if (dir == DMA_NONE)
+		return H_SUCCESS;
+
+	return kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
+}
+
+long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
+		unsigned long entry, unsigned long gpa,
+		enum dma_data_direction dir)
+{
+	long ret;
+	unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+	struct mm_iommu_table_group_mem_t *mem;
+
+	if (!pua)
+		/* it_userspace allocation might be delayed */
+		return H_TOO_HARD;
+
+	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
+		return H_HARDWARE;
+
+	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
+	if (!mem)
+		return H_HARDWARE;
+
+	if (mm_iommu_ua_to_hpa(mem, ua, &hpa))
+		return H_HARDWARE;
+
+	if (mm_iommu_mapped_inc(mem))
+		return H_HARDWARE;
+
+	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
+	if (ret) {
+		mm_iommu_mapped_dec(mem);
+		return H_TOO_HARD;
+	}
+
+	if (dir != DMA_NONE)
+		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
+
+	*pua = ua;
+
+	return 0;
+}
+
+long kvmppc_h_put_tce_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce)
+{
+	long idx, ret = H_HARDWARE;
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+	const unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+	const enum dma_data_direction dir = iommu_tce_direction(tce);
+
+	/* Clear TCE */
+	if (dir == DMA_NONE) {
+		if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
+			return H_PARAMETER;
+
+		return kvmppc_tce_iommu_unmap(vcpu->kvm, tbl, entry);
+	}
+
+	/* Put TCE */
+	if (iommu_tce_put_param_check(tbl, ioba, gpa))
+		return H_PARAMETER;
+
+	idx = srcu_read_lock(&vcpu->kvm->srcu);
+	ret = kvmppc_tce_iommu_map(vcpu->kvm, tbl, entry, gpa, dir);
+	srcu_read_unlock(&vcpu->kvm->srcu, idx);
+
+	return ret;
+}
+
+static long kvmppc_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long ioba,
+		u64 __user *tces, unsigned long npages)
+{
+	unsigned long i, ret, tce, gpa;
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+
+	for (i = 0; i < npages; ++i) {
+		gpa = be64_to_cpu(tces[i]) & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+		if (iommu_tce_put_param_check(tbl, ioba +
+				(i << tbl->it_page_shift), gpa))
+			return H_PARAMETER;
+	}
+
+	for (i = 0; i < npages; ++i) {
+		tce = be64_to_cpu(tces[i]);
+		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+		ret = kvmppc_tce_iommu_map(vcpu->kvm, tbl, entry + i, gpa,
+				iommu_tce_direction(tce));
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
+	return H_SUCCESS;
+}
+
+long kvmppc_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	unsigned long i;
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+
+	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i)
+		kvmppc_tce_iommu_unmap(vcpu->kvm, tbl, entry + i);
+
+	return H_SUCCESS;
+}
+
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba, unsigned long tce)
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long ret;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
 	/* 	    liobn, ioba, tce); */
@@ -230,6 +560,12 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (ret != H_SUCCESS)
 		return ret;
 
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		ret = kvmppc_h_put_tce_iommu(vcpu, stit->tbl, liobn, ioba, tce);
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
 	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
 
 	return H_SUCCESS;
@@ -245,6 +581,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	unsigned long entry, ua = 0;
 	u64 __user *tces;
 	u64 tce;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -272,6 +609,13 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	}
 	tces = (u64 __user *) ua;
 
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		ret = kvmppc_h_put_tce_indirect_iommu(vcpu,
+				stit->tbl, ioba, tces, npages);
+		if (ret != H_SUCCESS)
+			goto unlock_exit;
+	}
+
 	for (i = 0; i < npages; ++i) {
 		if (get_user(tce, tces + i)) {
 			ret = H_TOO_HARD;
@@ -299,6 +643,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -312,6 +657,13 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
 		return H_PARAMETER;
 
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		ret = kvmppc_h_stuff_tce_iommu(vcpu, stit->tbl, liobn, ioba,
+				tce_value, npages);
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
 	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
 		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
 
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 8a6834e6e1c8..4d6f01712a6d 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -190,11 +190,165 @@ static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
 	return mm_iommu_lookup_rm(vcpu->kvm->mm, ua, size);
 }
 
+static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	struct mm_iommu_table_group_mem_t *mem = NULL;
+	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
+	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+
+	if (!pua)
+		return H_SUCCESS;
+
+	pua = (void *) vmalloc_to_phys(pua);
+	if (!pua)
+		return H_SUCCESS;
+
+	mem = kvmppc_rm_iommu_lookup(vcpu, *pua, pgsize);
+	if (!mem)
+		return H_HARDWARE;
+
+	mm_iommu_mapped_dec(mem);
+
+	*pua = 0;
+
+	return H_SUCCESS;
+}
+
+static long kvmppc_rm_tce_iommu_unmap(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	enum dma_data_direction dir = DMA_NONE;
+	unsigned long hpa = 0;
+
+	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
+		return H_HARDWARE;
+
+	if (dir == DMA_NONE)
+		return H_SUCCESS;
+
+	return kvmppc_rm_tce_iommu_mapped_dec(vcpu, tbl, entry);
+}
+
+long kvmppc_rm_tce_iommu_map(struct kvm_vcpu *vcpu, struct iommu_table *tbl,
+		unsigned long entry, unsigned long gpa,
+		enum dma_data_direction dir)
+{
+	long ret;
+	unsigned long hpa = 0, ua;
+	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+	struct mm_iommu_table_group_mem_t *mem;
+
+	if (!pua)
+		/* it_userspace allocation might be delayed */
+		return H_TOO_HARD;
+
+	if (kvmppc_gpa_to_ua(vcpu->kvm, gpa, &ua, NULL))
+		return H_HARDWARE;
+
+	mem = kvmppc_rm_iommu_lookup(vcpu, ua, 1ULL << tbl->it_page_shift);
+	if (!mem)
+		return H_HARDWARE;
+
+	if (mm_iommu_ua_to_hpa_rm(mem, ua, &hpa))
+		return H_HARDWARE;
+
+	pua = (void *) vmalloc_to_phys(pua);
+	if (!pua)
+		return H_HARDWARE;
+
+	if (mm_iommu_mapped_inc(mem))
+		return H_HARDWARE;
+
+	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
+	if (ret) {
+		mm_iommu_mapped_dec(mem);
+		return H_TOO_HARD;
+	}
+
+	if (dir != DMA_NONE)
+		kvmppc_rm_tce_iommu_mapped_dec(vcpu, tbl, entry);
+
+	*pua = ua;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(kvmppc_rm_tce_iommu_map);
+
+static long kvmppc_rm_h_put_tce_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long liobn,
+		unsigned long ioba, unsigned long tce)
+{
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+	const unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+	const enum dma_data_direction dir = iommu_tce_direction(tce);
+
+	/* Clear TCE */
+	if (dir == DMA_NONE) {
+		if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
+			return H_PARAMETER;
+
+		return kvmppc_rm_tce_iommu_unmap(vcpu, tbl, entry);
+	}
+
+	/* Put TCE */
+	if (iommu_tce_put_param_check(tbl, ioba, gpa))
+		return H_PARAMETER;
+
+	return kvmppc_rm_tce_iommu_map(vcpu, tbl, entry, gpa, dir);
+}
+
+static long kvmppc_rm_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long ioba,
+		u64 *tces, unsigned long npages)
+{
+	unsigned long i, ret, tce, gpa;
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+
+	for (i = 0; i < npages; ++i) {
+		gpa = be64_to_cpu(tces[i]) & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+		if (iommu_tce_put_param_check(tbl, ioba +
+				(i << tbl->it_page_shift), gpa))
+			return H_PARAMETER;
+	}
+
+	for (i = 0; i < npages; ++i) {
+		tce = be64_to_cpu(tces[i]);
+		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+		ret = kvmppc_rm_tce_iommu_map(vcpu, tbl, entry + i, gpa,
+				iommu_tce_direction(tce));
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
+	return H_SUCCESS;
+}
+
+static long kvmppc_rm_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	unsigned long i;
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+
+	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i)
+		kvmppc_rm_tce_iommu_unmap(vcpu, tbl, entry + i);
+
+	return H_SUCCESS;
+}
+
 long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		unsigned long ioba, unsigned long tce)
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long ret;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
 	/* 	    liobn, ioba, tce); */
@@ -211,6 +365,13 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (ret != H_SUCCESS)
 		return ret;
 
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		ret = kvmppc_rm_h_put_tce_iommu(vcpu, stit->tbl,
+				liobn, ioba, tce);
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
 	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
 
 	return H_SUCCESS;
@@ -278,6 +439,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 		 * depend on hpt.
 		 */
 		struct mm_iommu_table_group_mem_t *mem;
+		struct kvmppc_spapr_tce_iommu_table *stit;
 
 		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
 			return H_TOO_HARD;
@@ -285,6 +447,13 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 		mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K);
 		if (!mem || mm_iommu_ua_to_hpa_rm(mem, ua, &tces))
 			return H_TOO_HARD;
+
+		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+			ret = kvmppc_rm_h_put_tce_indirect_iommu(vcpu,
+					stit->tbl, ioba, (u64 *)tces, npages);
+			if (ret != H_SUCCESS)
+				return ret;
+		}
 	} else {
 		/*
 		 * This is emulated devices case.
@@ -334,6 +503,8 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret;
+	struct kvmppc_spapr_tce_iommu_table *stit;
+
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -347,6 +518,13 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
 		return H_PARAMETER;
 
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		ret = kvmppc_rm_h_stuff_tce_iommu(vcpu, stit->tbl,
+				liobn, ioba, tce_value, npages);
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
 	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
 		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
 
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 70963c845e96..0e555ba998c0 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -536,6 +536,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 #ifdef CONFIG_PPC_BOOK3S_64
 	case KVM_CAP_SPAPR_TCE:
 	case KVM_CAP_SPAPR_TCE_64:
+		/* fallthrough */
+	case KVM_CAP_SPAPR_TCE_VFIO:
 	case KVM_CAP_PPC_ALLOC_HTAB:
 	case KVM_CAP_PPC_RTAS:
 	case KVM_CAP_PPC_FIXUP_HCALL:
diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
index d32f239eb471..6077c63e0235 100644
--- a/virt/kvm/vfio.c
+++ b/virt/kvm/vfio.c
@@ -20,6 +20,10 @@
 #include <linux/vfio.h>
 #include "vfio.h"
 
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+#include <asm/kvm_ppc.h>
+#endif
+
 struct kvm_vfio_group {
 	struct list_head node;
 	struct vfio_group *vfio_group;
@@ -89,6 +93,22 @@ static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
 	return ret > 0;
 }
 
+static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
+{
+	int (*fn)(struct vfio_group *);
+	int ret = -1;
+
+	fn = symbol_get(vfio_external_user_iommu_id);
+	if (!fn)
+		return ret;
+
+	ret = fn(vfio_group);
+
+	symbol_put(vfio_external_user_iommu_id);
+
+	return ret;
+}
+
 /*
  * Groups can use the same or different IOMMU domains.  If the same then
  * adding a new group may change the coherency of groups we've previously
@@ -218,6 +238,65 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
 		kvm_vfio_update_coherency(dev);
 
 		return ret;
+
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
+		struct kvm_vfio_spapr_tce param;
+		unsigned long minsz;
+		struct kvm_vfio *kv = dev->private;
+		struct vfio_group *vfio_group;
+		struct kvm_vfio_group *kvg;
+		struct fd f;
+
+		minsz = offsetofend(struct kvm_vfio_spapr_tce, tablefd);
+
+		if (copy_from_user(&param, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (param.argsz < minsz || param.flags)
+			return -EINVAL;
+
+		f = fdget(param.groupfd);
+		if (!f.file)
+			return -EBADF;
+
+		vfio_group = kvm_vfio_group_get_external_user(f.file);
+		fdput(f);
+
+		if (IS_ERR(vfio_group))
+			return PTR_ERR(vfio_group);
+
+		ret = -ENOENT;
+
+		mutex_lock(&kv->lock);
+
+		list_for_each_entry(kvg, &kv->group_list, node) {
+			int group_id;
+			struct iommu_group *grp;
+
+			if (kvg->vfio_group != vfio_group)
+				continue;
+
+			group_id = kvm_vfio_external_user_iommu_id(
+					kvg->vfio_group);
+			grp = iommu_group_get_by_id(group_id);
+			if (!grp) {
+				ret = -EFAULT;
+				break;
+			}
+
+			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
+					param.tablefd, vfio_group, grp);
+
+			iommu_group_put(grp);
+			break;
+		}
+
+		mutex_unlock(&kv->lock);
+
+		return ret;
+	}
+#endif /* CONFIG_SPAPR_TCE_IOMMU */
 	}
 
 	return -ENXIO;
@@ -242,6 +321,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
 		switch (attr->attr) {
 		case KVM_DEV_VFIO_GROUP_ADD:
 		case KVM_DEV_VFIO_GROUP_DEL:
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
+#endif
 			return 0;
 		}
 
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH kernel v2 11/11] KVM: PPC: Add in-kernel acceleration for VFIO
@ 2016-12-18  1:29   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 56+ messages in thread
From: Alexey Kardashevskiy @ 2016-12-18  1:29 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
without passing them to user space, which saves the time spent
switching to user space and back.

This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
KVM tries to handle a TCE request in real mode; if that fails,
it passes the request to the virtual mode handler to complete
the operation. If the virtual mode handler also fails, the request
is passed to user space; this is not expected to happen though.
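
(Not part of the patch: a minimal illustration of the staged fallback
described above. The real staging is spread across the HV hcall entry
code and kvmppc_pseries_do_hcall(); the wrapper below and its name are
hypothetical.)

/* Illustration only - not how the dispatch is actually wired up */
static long put_tce_staged(struct kvm_vcpu *vcpu, unsigned long liobn,
			   unsigned long ioba, unsigned long tce)
{
	/* Stage 1: real mode handler, cannot touch vmalloc'd structures */
	long ret = kvmppc_rm_h_put_tce(vcpu, liobn, ioba, tce);

	if (ret == H_TOO_HARD)
		/* Stage 2: virtual mode handler, full kernel services */
		ret = kvmppc_h_put_tce(vcpu, liobn, ioba, tce);

	/* Stage 3: anything still H_TOO_HARD is left to user space (QEMU) */
	return ret;
}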

To avoid dealing with page use counters (which is tricky in real mode),
this only accelerates SPAPR TCE IOMMU v2 clients which are required
to pre-register the userspace memory. The very first TCE request will
be handled in the VFIO SPAPR TCE driver anyway as the userspace view
of the TCE table (iommu_table::it_userspace) is not allocated till
the very first mapping happens and we cannot call vmalloc in real mode.

This adds a new attribute, KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE, to
the VFIO KVM device. It takes a VFIO group fd and an SPAPR TCE table fd
and associates a physical IOMMU table with the SPAPR TCE table (which
is a guest view of the hardware IOMMU table). The iommu_table object
is referenced so we do not have to look it up in real mode when
a hypercall happens.
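
(Not part of the patch: roughly how user space could use the new
attribute; vfio_kvm_device_fd, group_fd and table_fd are hypothetical
names for a KVM_DEV_TYPE_VFIO device fd, a VFIO group fd and a fd
returned by KVM_CREATE_SPAPR_TCE_64.)

#include <sys/ioctl.h>
#include <linux/kvm.h>

static int set_spapr_tce(int vfio_kvm_device_fd, int group_fd, int table_fd)
{
	struct kvm_vfio_spapr_tce param = {
		.argsz = sizeof(param),
		.groupfd = group_fd,
		.tablefd = table_fd,
	};
	struct kvm_device_attr attr = {
		.group = KVM_DEV_VFIO_GROUP,
		.attr = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE,
		.addr = (__u64)(unsigned long)&param,
	};

	/* On failure user space keeps emulating the TCE hcalls itself */
	return ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr);
}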

This does not implement the UNSET counterpart as there is no use for it -
once the acceleration is enabled, the existing userspace won't
disable it unless a VFIO container is destroyed, so this adds the
necessary cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.

This uses the kvm->lock mutex to protect against a race between
the VFIO KVM device's kvm_vfio_destroy() and SPAPR TCE table fd's
release() callback.

This uses the recently introduced VFIO group notifiers to do VFIO KVM
device cleanup. The KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE handler registers
a notifier which removes itself from the notifier chain when called
with kvm=NULL. Since a blocking notifier cannot remove itself while
it is executing, this uses RCU to do the actual removal. To make this
work, KVM holds an additional external user reference to the group (on
top of the one taken by KVM_DEV_VFIO_GROUP_ADD) and also relies on
"vfio: Check for unregistered notifiers when group is actually released".
While it works in most cases, there is a problem with VFIO PCI hotunplug:
QEMU calls vfio_kvm_device_del_group() and vfio_disconnect_container(),
and the VFIO_GROUP_UNSET_CONTAINER handler may fail if the RCU callback
has not run yet, so the group->container_users counter is still
non-zero; this is worked around by retrying VFIO_GROUP_UNSET_CONTAINER
in QEMU (this needs to be fixed, suggestions are welcome).
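
(Not part of the patch: the QEMU-side retry mentioned above could look
roughly like the sketch below; the retry count and delay are arbitrary
assumptions and group_fd is a hypothetical VFIO group fd.)

#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int unset_container_with_retry(int group_fd)
{
	int i, ret = -1;

	for (i = 0; i < 10; ++i) {
		ret = ioctl(group_fd, VFIO_GROUP_UNSET_CONTAINER);
		if (!ret)
			break;
		usleep(1000);	/* give the deferred (RCU) cleanup a chance to run */
	}

	return ret;
}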

As this creates a descriptor per IOMMU table-LIOBN couple (called
kvmppc_spapr_tce_iommu_table), it is possible to have several
descriptors with the same iommu_table (hardware IOMMU table) attached
to the same LIOBN; this is done to simplify the cleanup and can be
improved later (even though there does not seem to be a user for it
any time soon).

This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
space.

This finally makes use of vfio_external_user_iommu_id() which was
introduced quite some time ago and was considered for removal.

Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb Ethernet card).

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v2:
* reworked to use new VFIO notifiers
* now same iommu_table may appear in the list several times, to be fixed later
---
 Documentation/virtual/kvm/devices/vfio.txt |  22 +-
 arch/powerpc/include/asm/kvm_host.h        |  11 +
 arch/powerpc/include/asm/kvm_ppc.h         |   4 +
 include/uapi/linux/kvm.h                   |   8 +
 arch/powerpc/kvm/book3s_64_vio.c           | 352 +++++++++++++++++++++++++++++
 arch/powerpc/kvm/book3s_64_vio_hv.c        | 178 +++++++++++++++
 arch/powerpc/kvm/powerpc.c                 |   2 +
 virt/kvm/vfio.c                            |  82 +++++++
 8 files changed, 657 insertions(+), 2 deletions(-)

diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
index ef51740c67ca..f95d867168ea 100644
--- a/Documentation/virtual/kvm/devices/vfio.txt
+++ b/Documentation/virtual/kvm/devices/vfio.txt
@@ -16,7 +16,25 @@ Groups:
 
 KVM_DEV_VFIO_GROUP attributes:
   KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
+	kvm_device_attr.addr points to an int32_t file descriptor
+	for the VFIO group.
   KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
+	kvm_device_attr.addr points to an int32_t file descriptor
+	for the VFIO group.
+  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
+	allocated by sPAPR KVM.
+	kvm_device_attr.addr points to a struct:
 
-For each, kvm_device_attr.addr points to an int32_t file descriptor
-for the VFIO group.
+	struct kvm_vfio_spapr_tce {
+		__u32	argsz;
+		__u32	flags;
+		__s32	groupfd;
+		__s32	tablefd;
+	};
+
+	where
+	@argsz is the size of kvm_vfio_spapr_tce_liobn;
+	@flags are not supported now, must be zero;
+	@groupfd is a file descriptor for a VFIO group;
+	@tablefd is a file descriptor for a TCE table allocated via
+		KVM_CREATE_SPAPR_TCE.
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 28350a294b1e..1079c25c1973 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -191,6 +191,16 @@ struct kvmppc_pginfo {
 	atomic_t refcnt;
 };
 
+struct kvmppc_spapr_tce_iommu_table {
+	struct rcu_head rcu;
+	struct list_head next;
+	struct iommu_group *grp;
+	struct vfio_group *group;
+	struct iommu_table *tbl;
+	bool dying;
+	struct notifier_block notifier;
+};
+
 struct kvmppc_spapr_tce_table {
 	struct list_head list;
 	struct kvm *kvm;
@@ -199,6 +209,7 @@ struct kvmppc_spapr_tce_table {
 	u32 page_shift;
 	u64 offset;		/* in pages */
 	u64 size;		/* window size in pages */
+	struct list_head iommu_tables;
 	struct page *pages[0];
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 0a21c8503974..113fb2db2de9 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -163,6 +163,10 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
 extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
 			struct kvm_memory_slot *memslot, unsigned long porder);
 extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
+extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm,
+		int tablefd,
+		struct vfio_group *group,
+		struct iommu_group *grp);
 
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce_64 *args);
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 810f74317987..4088da4a575f 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1068,6 +1068,7 @@ struct kvm_device_attr {
 #define  KVM_DEV_VFIO_GROUP			1
 #define   KVM_DEV_VFIO_GROUP_ADD			1
 #define   KVM_DEV_VFIO_GROUP_DEL			2
+#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
 
 enum kvm_device_type {
 	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
@@ -1089,6 +1090,13 @@ enum kvm_device_type {
 	KVM_DEV_TYPE_MAX,
 };
 
+struct kvm_vfio_spapr_tce {
+	__u32	argsz;
+	__u32	flags;
+	__s32	groupfd;
+	__s32	tablefd;
+};
+
 /*
  * ioctls for VM fds
  */
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 15df8ae627d9..9ab04cdcc578 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -27,6 +27,10 @@
 #include <linux/hugetlb.h>
 #include <linux/list.h>
 #include <linux/anon_inodes.h>
+#include <linux/iommu.h>
+#include <linux/file.h>
+#include <linux/vfio.h>
+#include <linux/module.h>
 
 #include <asm/tlbflush.h>
 #include <asm/kvm_ppc.h>
@@ -39,6 +43,74 @@
 #include <asm/udbg.h>
 #include <asm/iommu.h>
 #include <asm/tce.h>
+#include <asm/mmu_context.h>
+
+extern int kvm_vfio_iommu_group_register_notifier(
+		struct iommu_group *iommugroup, enum vfio_notify_type type,
+		unsigned long *events, struct notifier_block *nb)
+{
+	int ret;
+	int (*fn)(struct iommu_group *iommugroup,
+			enum vfio_notify_type type,
+			unsigned long *events, struct notifier_block *nb);
+
+	fn = symbol_get(vfio_iommu_group_register_notifier);
+	if (!fn)
+		return -EINVAL;
+
+	ret = fn(iommugroup, type, events, nb);
+
+	symbol_put(vfio_iommu_group_register_notifier);
+
+	return ret;
+}
+
+extern int kvm_vfio_iommu_group_unregister_notifier(
+		struct iommu_group *iommugroup, enum vfio_notify_type type,
+		struct notifier_block *nb)
+{
+	int ret;
+	int (*fn)(struct iommu_group *iommugroup,
+			enum vfio_notify_type type,
+			struct notifier_block *nb);
+
+	fn = symbol_get(vfio_iommu_group_unregister_notifier);
+	if (!fn)
+		return -EINVAL;
+
+	ret = fn(iommugroup, type, nb);
+
+	symbol_put(vfio_iommu_group_unregister_notifier);
+
+	return ret;
+}
+
+extern void kvm_vfio_iommu_group_set_kvm(struct iommu_group *grp,
+		struct kvm *kvm)
+{
+	void (*fn)(struct iommu_group *grp, struct kvm *kvm);
+
+	fn = symbol_get(vfio_iommu_group_set_kvm);
+	if (!fn)
+		return;
+
+	fn(grp, kvm);
+
+	symbol_put(vfio_iommu_group_set_kvm);
+}
+
+static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
+{
+	void (*fn)(struct vfio_group *);
+
+	fn = symbol_get(vfio_group_put_external_user);
+	if (!fn)
+		return;
+
+	fn(vfio_group);
+
+	symbol_put(vfio_group_put_external_user);
+}
 
 static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
 {
@@ -129,9 +201,14 @@ static int kvm_spapr_tce_mmap(struct file *file, struct vm_area_struct *vma)
 static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
 {
 	struct kvmppc_spapr_tce_table *stt = filp->private_data;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
+	kick_all_cpus_sync();
 	list_del_rcu(&stt->list);
 
+	list_for_each_entry_rcu(stit, &stt->iommu_tables, next)
+		kvm_vfio_iommu_group_set_kvm(stit->grp, NULL);
+
 	kvm_put_kvm(stt->kvm);
 
 	kvmppc_account_memlimit(
@@ -146,6 +223,108 @@ static const struct file_operations kvm_spapr_tce_fops = {
 	.release	= kvm_spapr_tce_release,
 };
 
+static void kvm_spapr_tce_iommu_notifier_free(struct rcu_head *head)
+{
+	struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
+			struct kvmppc_spapr_tce_iommu_table, rcu);
+
+	kvm_vfio_iommu_group_unregister_notifier(stit->grp, VFIO_GROUP_NOTIFY,
+			&stit->notifier);
+	iommu_table_put(stit->tbl);
+	kvm_vfio_group_put_external_user(stit->group);
+	kfree(stit);
+}
+
+static int kvm_spapr_tce_iommu_notifier(struct notifier_block *nb,
+		unsigned long action, void *data)
+{
+	struct kvmppc_spapr_tce_iommu_table *stit = container_of(nb,
+			struct kvmppc_spapr_tce_iommu_table, notifier);
+
+	if (data)
+		return 0;
+
+	if (stit->dying)
+		return 0;
+
+	stit->dying = true;
+
+	list_del_rcu(&stit->next);
+
+	call_rcu(&stit->rcu, kvm_spapr_tce_iommu_notifier_free);
+
+	return 0;
+}
+
+extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
+		struct vfio_group *group,
+		struct iommu_group *grp)
+{
+	struct kvmppc_spapr_tce_table *stt = NULL;
+	bool found = false;
+	struct iommu_table *tbl = NULL;
+	struct iommu_table_group *table_group;
+	long i, ret;
+	struct kvmppc_spapr_tce_iommu_table *stit;
+	unsigned long events;
+	struct fd f;
+
+	f = fdget(tablefd);
+	if (!f.file)
+		return -EBADF;
+
+	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
+		if (stt == f.file->private_data) {
+			found = true;
+			break;
+		}
+	}
+
+	fdput(f);
+
+	if (!found)
+		return -ENODEV;
+
+	table_group = iommu_group_get_iommudata(grp);
+	if (!table_group)
+		return -EFAULT;
+
+	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
+		struct iommu_table *tbltmp = table_group->tables[i];
+
+		if (!tbltmp)
+			continue;
+
+		if ((tbltmp->it_page_shift == stt->page_shift) &&
+				(tbltmp->it_offset == stt->offset)) {
+			tbl = tbltmp;
+			break;
+		}
+	}
+	if (!tbl)
+		return -ENODEV;
+
+	iommu_table_get(tbl);
+
+	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
+	stit->tbl = tbl;
+	stit->group = group;
+	stit->grp = grp;
+	stit->notifier.notifier_call = kvm_spapr_tce_iommu_notifier;
+
+	events = VFIO_GROUP_NOTIFY_SET_KVM;
+	ret = kvm_vfio_iommu_group_register_notifier(grp,
+			VFIO_GROUP_NOTIFY, &events, &stit->notifier);
+	if (ret) {
+		iommu_table_put(tbl);
+		kfree(stit);
+		return ret;
+	}
+	list_add_rcu(&stit->next, &stt->iommu_tables);
+
+	return 0;
+}
+
 long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				   struct kvm_create_spapr_tce_64 *args)
 {
@@ -181,6 +360,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 	stt->offset = args->offset;
 	stt->size = size;
 	stt->kvm = kvm;
+	INIT_LIST_HEAD_RCU(&stt->iommu_tables);
 
 	for (i = 0; i < npages; i++) {
 		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
@@ -209,11 +389,161 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 	return ret;
 }
 
+static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	struct mm_iommu_table_group_mem_t *mem = NULL;
+	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
+	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+
+	if (!pua)
+		return H_HARDWARE;
+
+	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
+	if (!mem)
+		return H_HARDWARE;
+
+	mm_iommu_mapped_dec(mem);
+
+	*pua = 0;
+
+	return H_SUCCESS;
+}
+
+static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	enum dma_data_direction dir = DMA_NONE;
+	unsigned long hpa = 0;
+
+	if (iommu_tce_xchg(tbl, entry, &hpa, &dir))
+		return H_HARDWARE;
+
+	if (dir == DMA_NONE)
+		return H_SUCCESS;
+
+	return kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
+}
+
+long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
+		unsigned long entry, unsigned long gpa,
+		enum dma_data_direction dir)
+{
+	long ret;
+	unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+	struct mm_iommu_table_group_mem_t *mem;
+
+	if (!pua)
+		/* it_userspace allocation might be delayed */
+		return H_TOO_HARD;
+
+	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
+		return H_HARDWARE;
+
+	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
+	if (!mem)
+		return H_HARDWARE;
+
+	if (mm_iommu_ua_to_hpa(mem, ua, &hpa))
+		return H_HARDWARE;
+
+	if (mm_iommu_mapped_inc(mem))
+		return H_HARDWARE;
+
+	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
+	if (ret) {
+		mm_iommu_mapped_dec(mem);
+		return H_TOO_HARD;
+	}
+
+	if (dir != DMA_NONE)
+		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
+
+	*pua = ua;
+
+	return 0;
+}
+
+long kvmppc_h_put_tce_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce)
+{
+	long idx, ret = H_HARDWARE;
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+	const unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+	const enum dma_data_direction dir = iommu_tce_direction(tce);
+
+	/* Clear TCE */
+	if (dir == DMA_NONE) {
+		if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
+			return H_PARAMETER;
+
+		return kvmppc_tce_iommu_unmap(vcpu->kvm, tbl, entry);
+	}
+
+	/* Put TCE */
+	if (iommu_tce_put_param_check(tbl, ioba, gpa))
+		return H_PARAMETER;
+
+	idx = srcu_read_lock(&vcpu->kvm->srcu);
+	ret = kvmppc_tce_iommu_map(vcpu->kvm, tbl, entry, gpa, dir);
+	srcu_read_unlock(&vcpu->kvm->srcu, idx);
+
+	return ret;
+}
+
+static long kvmppc_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long ioba,
+		u64 __user *tces, unsigned long npages)
+{
+	unsigned long i, ret, tce, gpa;
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+
+	for (i = 0; i < npages; ++i) {
+		gpa = be64_to_cpu(tces[i]) & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+		if (iommu_tce_put_param_check(tbl, ioba +
+				(i << tbl->it_page_shift), gpa))
+			return H_PARAMETER;
+	}
+
+	for (i = 0; i < npages; ++i) {
+		tce = be64_to_cpu(tces[i]);
+		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+		ret = kvmppc_tce_iommu_map(vcpu->kvm, tbl, entry + i, gpa,
+				iommu_tce_direction(tce));
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
+	return H_SUCCESS;
+}
+
+long kvmppc_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	unsigned long i;
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+
+	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i)
+		kvmppc_tce_iommu_unmap(vcpu->kvm, tbl, entry + i);
+
+	return H_SUCCESS;
+}
+
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba, unsigned long tce)
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long ret;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
 	/* 	    liobn, ioba, tce); */
@@ -230,6 +560,12 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (ret != H_SUCCESS)
 		return ret;
 
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		ret = kvmppc_h_put_tce_iommu(vcpu, stit->tbl, liobn, ioba, tce);
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
 	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
 
 	return H_SUCCESS;
@@ -245,6 +581,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	unsigned long entry, ua = 0;
 	u64 __user *tces;
 	u64 tce;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -272,6 +609,13 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	}
 	tces = (u64 __user *) ua;
 
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		ret = kvmppc_h_put_tce_indirect_iommu(vcpu,
+				stit->tbl, ioba, tces, npages);
+		if (ret != H_SUCCESS)
+			goto unlock_exit;
+	}
+
 	for (i = 0; i < npages; ++i) {
 		if (get_user(tce, tces + i)) {
 			ret = H_TOO_HARD;
@@ -299,6 +643,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -312,6 +657,13 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
 		return H_PARAMETER;
 
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		ret = kvmppc_h_stuff_tce_iommu(vcpu, stit->tbl, liobn, ioba,
+				tce_value, npages);
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
 	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
 		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
 
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 8a6834e6e1c8..4d6f01712a6d 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -190,11 +190,165 @@ static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
 	return mm_iommu_lookup_rm(vcpu->kvm->mm, ua, size);
 }
 
+static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	struct mm_iommu_table_group_mem_t *mem = NULL;
+	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
+	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+
+	if (!pua)
+		return H_SUCCESS;
+
+	pua = (void *) vmalloc_to_phys(pua);
+	if (!pua)
+		return H_SUCCESS;
+
+	mem = kvmppc_rm_iommu_lookup(vcpu, *pua, pgsize);
+	if (!mem)
+		return H_HARDWARE;
+
+	mm_iommu_mapped_dec(mem);
+
+	*pua = 0;
+
+	return H_SUCCESS;
+}
+
+static long kvmppc_rm_tce_iommu_unmap(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	enum dma_data_direction dir = DMA_NONE;
+	unsigned long hpa = 0;
+
+	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
+		return H_HARDWARE;
+
+	if (dir == DMA_NONE)
+		return H_SUCCESS;
+
+	return kvmppc_rm_tce_iommu_mapped_dec(vcpu, tbl, entry);
+}
+
+long kvmppc_rm_tce_iommu_map(struct kvm_vcpu *vcpu, struct iommu_table *tbl,
+		unsigned long entry, unsigned long gpa,
+		enum dma_data_direction dir)
+{
+	long ret;
+	unsigned long hpa = 0, ua;
+	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+	struct mm_iommu_table_group_mem_t *mem;
+
+	if (!pua)
+		/* it_userspace allocation might be delayed */
+		return H_TOO_HARD;
+
+	if (kvmppc_gpa_to_ua(vcpu->kvm, gpa, &ua, NULL))
+		return H_HARDWARE;
+
+	mem = kvmppc_rm_iommu_lookup(vcpu, ua, 1ULL << tbl->it_page_shift);
+	if (!mem)
+		return H_HARDWARE;
+
+	if (mm_iommu_ua_to_hpa_rm(mem, ua, &hpa))
+		return H_HARDWARE;
+
+	pua = (void *) vmalloc_to_phys(pua);
+	if (!pua)
+		return H_HARDWARE;
+
+	if (mm_iommu_mapped_inc(mem))
+		return H_HARDWARE;
+
+	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
+	if (ret) {
+		mm_iommu_mapped_dec(mem);
+		return H_TOO_HARD;
+	}
+
+	if (dir != DMA_NONE)
+		kvmppc_rm_tce_iommu_mapped_dec(vcpu, tbl, entry);
+
+	*pua = ua;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(kvmppc_rm_tce_iommu_map);
+
+static long kvmppc_rm_h_put_tce_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long liobn,
+		unsigned long ioba, unsigned long tce)
+{
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+	const unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+	const enum dma_data_direction dir = iommu_tce_direction(tce);
+
+	/* Clear TCE */
+	if (dir == DMA_NONE) {
+		if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
+			return H_PARAMETER;
+
+		return kvmppc_rm_tce_iommu_unmap(vcpu, tbl, entry);
+	}
+
+	/* Put TCE */
+	if (iommu_tce_put_param_check(tbl, ioba, gpa))
+		return H_PARAMETER;
+
+	return kvmppc_rm_tce_iommu_map(vcpu, tbl, entry, gpa, dir);
+}
+
+static long kvmppc_rm_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long ioba,
+		u64 *tces, unsigned long npages)
+{
+	unsigned long i, ret, tce, gpa;
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+
+	for (i = 0; i < npages; ++i) {
+		gpa = be64_to_cpu(tces[i]) & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+		if (iommu_tce_put_param_check(tbl, ioba +
+				(i << tbl->it_page_shift), gpa))
+			return H_PARAMETER;
+	}
+
+	for (i = 0; i < npages; ++i) {
+		tce = be64_to_cpu(tces[i]);
+		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+		ret = kvmppc_rm_tce_iommu_map(vcpu, tbl, entry + i, gpa,
+				iommu_tce_direction(tce));
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
+	return H_SUCCESS;
+}
+
+static long kvmppc_rm_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	unsigned long i;
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+
+	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i)
+		kvmppc_rm_tce_iommu_unmap(vcpu, tbl, entry + i);
+
+	return H_SUCCESS;
+}
+
 long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		unsigned long ioba, unsigned long tce)
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long ret;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
 	/* 	    liobn, ioba, tce); */
@@ -211,6 +365,13 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (ret != H_SUCCESS)
 		return ret;
 
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		ret = kvmppc_rm_h_put_tce_iommu(vcpu, stit->tbl,
+				liobn, ioba, tce);
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
 	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
 
 	return H_SUCCESS;
@@ -278,6 +439,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 		 * depend on hpt.
 		 */
 		struct mm_iommu_table_group_mem_t *mem;
+		struct kvmppc_spapr_tce_iommu_table *stit;
 
 		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
 			return H_TOO_HARD;
@@ -285,6 +447,13 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 		mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K);
 		if (!mem || mm_iommu_ua_to_hpa_rm(mem, ua, &tces))
 			return H_TOO_HARD;
+
+		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+			ret = kvmppc_rm_h_put_tce_indirect_iommu(vcpu,
+					stit->tbl, ioba, (u64 *)tces, npages);
+			if (ret != H_SUCCESS)
+				return ret;
+		}
 	} else {
 		/*
 		 * This is emulated devices case.
@@ -334,6 +503,8 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret;
+	struct kvmppc_spapr_tce_iommu_table *stit;
+
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -347,6 +518,13 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
 		return H_PARAMETER;
 
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		ret = kvmppc_rm_h_stuff_tce_iommu(vcpu, stit->tbl,
+				liobn, ioba, tce_value, npages);
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
 	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
 		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
 
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 70963c845e96..0e555ba998c0 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -536,6 +536,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 #ifdef CONFIG_PPC_BOOK3S_64
 	case KVM_CAP_SPAPR_TCE:
 	case KVM_CAP_SPAPR_TCE_64:
+		/* fallthrough */
+	case KVM_CAP_SPAPR_TCE_VFIO:
 	case KVM_CAP_PPC_ALLOC_HTAB:
 	case KVM_CAP_PPC_RTAS:
 	case KVM_CAP_PPC_FIXUP_HCALL:
diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
index d32f239eb471..6077c63e0235 100644
--- a/virt/kvm/vfio.c
+++ b/virt/kvm/vfio.c
@@ -20,6 +20,10 @@
 #include <linux/vfio.h>
 #include "vfio.h"
 
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+#include <asm/kvm_ppc.h>
+#endif
+
 struct kvm_vfio_group {
 	struct list_head node;
 	struct vfio_group *vfio_group;
@@ -89,6 +93,22 @@ static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
 	return ret > 0;
 }
 
+static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
+{
+	int (*fn)(struct vfio_group *);
+	int ret = -1;
+
+	fn = symbol_get(vfio_external_user_iommu_id);
+	if (!fn)
+		return ret;
+
+	ret = fn(vfio_group);
+
+	symbol_put(vfio_external_user_iommu_id);
+
+	return ret;
+}
+
 /*
  * Groups can use the same or different IOMMU domains.  If the same then
  * adding a new group may change the coherency of groups we've previously
@@ -218,6 +238,65 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
 		kvm_vfio_update_coherency(dev);
 
 		return ret;
+
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
+		struct kvm_vfio_spapr_tce param;
+		unsigned long minsz;
+		struct kvm_vfio *kv = dev->private;
+		struct vfio_group *vfio_group;
+		struct kvm_vfio_group *kvg;
+		struct fd f;
+
+		minsz = offsetofend(struct kvm_vfio_spapr_tce, tablefd);
+
+		if (copy_from_user(&param, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (param.argsz < minsz || param.flags)
+			return -EINVAL;
+
+		f = fdget(param.groupfd);
+		if (!f.file)
+			return -EBADF;
+
+		vfio_group = kvm_vfio_group_get_external_user(f.file);
+		fdput(f);
+
+		if (IS_ERR(vfio_group))
+			return PTR_ERR(vfio_group);
+
+		ret = -ENOENT;
+
+		mutex_lock(&kv->lock);
+
+		list_for_each_entry(kvg, &kv->group_list, node) {
+			int group_id;
+			struct iommu_group *grp;
+
+			if (kvg->vfio_group != vfio_group)
+				continue;
+
+			group_id = kvm_vfio_external_user_iommu_id(
+					kvg->vfio_group);
+			grp = iommu_group_get_by_id(group_id);
+			if (!grp) {
+				ret = -EFAULT;
+				break;
+			}
+
+			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
+					param.tablefd, vfio_group, grp);
+
+			iommu_group_put(grp);
+			break;
+		}
+
+		mutex_unlock(&kv->lock);
+
+		return ret;
+	}
+#endif /* CONFIG_SPAPR_TCE_IOMMU */
 	}
 
 	return -ENXIO;
@@ -242,6 +321,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
 		switch (attr->attr) {
 		case KVM_DEV_VFIO_GROUP_ADD:
 		case KVM_DEV_VFIO_GROUP_DEL:
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
+#endif
 			return 0;
 		}
 
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH kernel v2 10/11] vfio: Check for unregistered notifiers when group is actually released
  2016-12-18  1:28   ` Alexey Kardashevskiy
@ 2016-12-19 10:41     ` Jike Song
  -1 siblings, 0 replies; 56+ messages in thread
From: Jike Song @ 2016-12-19 10:41 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Alex Williamson, David Gibson, Paul Mackerras,
	kvm-ppc, kvm

On 12/18/2016 09:28 AM, Alexey Kardashevskiy wrote:
> This moves a check for unregistered notifiers from fops release
> callback to the place where the group will actually be released.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> 
> This is going to be used in the following patch in cleanup
> path. Since the next patch is RFC, this one might not be needed.

Hi Alexey,

I didn't find any use in the following patch 11/11, did you mean
something else?

BTW the warning in vfio_group_release seems too late, the user
should actually unregister everything by close()?

--
Thanks,
Jike

> ---
>  drivers/vfio/vfio.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 6b9a98508939..083b581e87c0 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -403,6 +403,8 @@ static void vfio_group_release(struct kref *kref)
>  	struct iommu_group *iommu_group = group->iommu_group;
>  
>  	WARN_ON(!list_empty(&group->device_list));
> +	/* Any user didn't unregister? */
> +	WARN_ON(group->notifier.head);
>  
>  	list_for_each_entry_safe(unbound, tmp,
>  				 &group->unbound_list, unbound_next) {
> @@ -1584,9 +1586,6 @@ static int vfio_group_fops_release(struct inode *inode, struct file *filep)
>  
>  	filep->private_data = NULL;
>  
> -	/* Any user didn't unregister? */
> -	WARN_ON(group->notifier.head);
> -
>  	vfio_group_try_dissolve_container(group);
>  
>  	atomic_dec(&group->opened);
> 

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH kernel v2 10/11] vfio: Check for unregistered notifiers when group is actually released
@ 2016-12-19 10:41     ` Jike Song
  0 siblings, 0 replies; 56+ messages in thread
From: Jike Song @ 2016-12-19 10:41 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Alex Williamson, David Gibson, Paul Mackerras,
	kvm-ppc, kvm

On 12/18/2016 09:28 AM, Alexey Kardashevskiy wrote:
> This moves a check for unregistered notifiers from fops release
> callback to the place where the group will actually be released.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> 
> This is going to be used in the following patch in cleanup
> path. Since the next patch is RFC, this one might not be needed.

Hi Alexey,

I didn't find any use in the following patch 11/11, did you mean
something else?

BTW the warning in vfio_group_release seems too late, the user
should actually unregister everything by close()?

--
Thanks,
Jike

> ---
>  drivers/vfio/vfio.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 6b9a98508939..083b581e87c0 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -403,6 +403,8 @@ static void vfio_group_release(struct kref *kref)
>  	struct iommu_group *iommu_group = group->iommu_group;
>  
>  	WARN_ON(!list_empty(&group->device_list));
> +	/* Any user didn't unregister? */
> +	WARN_ON(group->notifier.head);
>  
>  	list_for_each_entry_safe(unbound, tmp,
>  				 &group->unbound_list, unbound_next) {
> @@ -1584,9 +1586,6 @@ static int vfio_group_fops_release(struct inode *inode, struct file *filep)
>  
>  	filep->private_data = NULL;
>  
> -	/* Any user didn't unregister? */
> -	WARN_ON(group->notifier.head);
> -
>  	vfio_group_try_dissolve_container(group);
>  
>  	atomic_dec(&group->opened);
> 

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH kernel v2 10/11] vfio: Check for unregistered notifiers when group is actually released
  2016-12-19 10:41     ` Jike Song
  (?)
@ 2016-12-19 16:28       ` Alex Williamson
  -1 siblings, 0 replies; 56+ messages in thread
From: Alex Williamson @ 2016-12-19 16:28 UTC (permalink / raw)
  To: Jike Song
  Cc: kvm, Alexey Kardashevskiy, kvm-ppc, Paul Mackerras, linuxppc-dev,
	David Gibson

On Mon, 19 Dec 2016 18:41:05 +0800
Jike Song <jike.song@intel.com> wrote:

> On 12/18/2016 09:28 AM, Alexey Kardashevskiy wrote:
> > This moves a check for unregistered notifiers from fops release
> > callback to the place where the group will actually be released.
> > 
> > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > ---
> > 
> > This is going to be used in the following patch in cleanup
> > path. Since the next patch is RFC, this one might not be needed.  

Alexey, this is intended to be a bug fix, it should be sent separately,
not buried in an unrelated patch series.

> I didn't find any use in the following patch 11/11, did you mean
> something else?
> 
> BTW the warning in vfio_group_release seems too late, the user
> should actually unregister everything by close()?

The thing is, it's not the user that registered the notifiers, it's the
vendor driver.  The vendor driver should know via the device release to
unregister the notifier, which we're counting on to happen before the
group release.  Can we rely on that ordering even in the case where a
user is SIGKILL'd?

> > ---
> >  drivers/vfio/vfio.c | 5 ++---
> >  1 file changed, 2 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > index 6b9a98508939..083b581e87c0 100644
> > --- a/drivers/vfio/vfio.c
> > +++ b/drivers/vfio/vfio.c
> > @@ -403,6 +403,8 @@ static void vfio_group_release(struct kref *kref)
> >  	struct iommu_group *iommu_group = group->iommu_group;
> >  
> >  	WARN_ON(!list_empty(&group->device_list));
> > +	/* Any user didn't unregister? */
> > +	WARN_ON(group->notifier.head);

Even if this is a bug, this is the wrong fix.  This is when the group
is being destroyed.  Yes, it would be a bug to still have any notifiers
here, but the intention is to make sure there are no notifiers when the
group is idle and unused.

> >  
> >  	list_for_each_entry_safe(unbound, tmp,
> >  				 &group->unbound_list, unbound_next) {
> > @@ -1584,9 +1586,6 @@ static int vfio_group_fops_release(struct inode *inode, struct file *filep)
> >  
> >  	filep->private_data = NULL;
> >  
> > -	/* Any user didn't unregister? */
> > -	WARN_ON(group->notifier.head);
> > -
> >  	vfio_group_try_dissolve_container(group);
> >  
> >  	atomic_dec(&group->opened);
> >   

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH kernel v2 10/11] vfio: Check for unregistered notifiers when group is actually released
@ 2016-12-19 16:28       ` Alex Williamson
  0 siblings, 0 replies; 56+ messages in thread
From: Alex Williamson @ 2016-12-19 16:28 UTC (permalink / raw)
  To: Jike Song
  Cc: Alexey Kardashevskiy, linuxppc-dev, David Gibson, Paul Mackerras,
	kvm-ppc, kvm

On Mon, 19 Dec 2016 18:41:05 +0800
Jike Song <jike.song@intel.com> wrote:

> On 12/18/2016 09:28 AM, Alexey Kardashevskiy wrote:
> > This moves a check for unregistered notifiers from fops release
> > callback to the place where the group will actually be released.
> > 
> > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > ---
> > 
> > This is going to be used in the following patch in cleanup
> > path. Since the next patch is RFC, this one might not be needed.  

Alexey, this is intended to be a bug fix, it should be sent separately,
not buried in an unrelated patch series.

> I didn't find any use in the following patch 11/11, did you mean
> something else?
> 
> BTW the warning in vfio_group_release seems too late, the user
> should actually unregister everything by close()?

The thing is, it's not the user that registered the notifiers, it's the
vendor driver.  The vendor driver should know via the device release to
unregister the notifier, which we're counting on to happen before the
group release.  Can we rely on that ordering even in the case where a
user is SIGKILL'd?

> > ---
> >  drivers/vfio/vfio.c | 5 ++---
> >  1 file changed, 2 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > index 6b9a98508939..083b581e87c0 100644
> > --- a/drivers/vfio/vfio.c
> > +++ b/drivers/vfio/vfio.c
> > @@ -403,6 +403,8 @@ static void vfio_group_release(struct kref *kref)
> >  	struct iommu_group *iommu_group = group->iommu_group;
> >  
> >  	WARN_ON(!list_empty(&group->device_list));
> > +	/* Any user didn't unregister? */
> > +	WARN_ON(group->notifier.head);

Even if this is a bug, this is the wrong fix.  This is when the group
is being destroyed.  Yes, it would be a bug to still have any notifiers
here, but the intention is to make sure there are no notifiers when the
group is idle and unused.

> >  
> >  	list_for_each_entry_safe(unbound, tmp,
> >  				 &group->unbound_list, unbound_next) {
> > @@ -1584,9 +1586,6 @@ static int vfio_group_fops_release(struct inode *inode, struct file *filep)
> >  
> >  	filep->private_data = NULL;
> >  
> > -	/* Any user didn't unregister? */
> > -	WARN_ON(group->notifier.head);
> > -
> >  	vfio_group_try_dissolve_container(group);
> >  
> >  	atomic_dec(&group->opened);
> >   

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH kernel v2 10/11] vfio: Check for unregistered notifiers when group is actually released
@ 2016-12-19 16:28       ` Alex Williamson
  0 siblings, 0 replies; 56+ messages in thread
From: Alex Williamson @ 2016-12-19 16:28 UTC (permalink / raw)
  To: Jike Song
  Cc: kvm, Alexey Kardashevskiy, kvm-ppc, Paul Mackerras, linuxppc-dev,
	David Gibson

On Mon, 19 Dec 2016 18:41:05 +0800
Jike Song <jike.song@intel.com> wrote:

> On 12/18/2016 09:28 AM, Alexey Kardashevskiy wrote:
> > This moves a check for unregistered notifiers from fops release
> > callback to the place where the group will actually be released.
> > 
> > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > ---
> > 
> > This is going to be used in the following patch in cleanup
> > path. Since the next patch is RFC, this one might not be needed.  

Alexey, this is intended to be a bug fix, it should be sent separately,
not buried in an unrelated patch series.

> I didn't find any use in the following patch 11/11, did you mean
> something else?
> 
> BTW the warning in vfio_group_release seems too late, the user
> should actually unregister everything by close()?

The thing is, it's not the user that registered the notifiers, it's the
vendor driver.  The vendor driver should know via the device release to
unregister the notifier, which we're counting on to happen before the
group release.  Can we rely on that ordering even in the case where a
user is SIGKILL'd?

> > ---
> >  drivers/vfio/vfio.c | 5 ++---
> >  1 file changed, 2 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > index 6b9a98508939..083b581e87c0 100644
> > --- a/drivers/vfio/vfio.c
> > +++ b/drivers/vfio/vfio.c
> > @@ -403,6 +403,8 @@ static void vfio_group_release(struct kref *kref)
> >  	struct iommu_group *iommu_group = group->iommu_group;
> >  
> >  	WARN_ON(!list_empty(&group->device_list));
> > +	/* Any user didn't unregister? */
> > +	WARN_ON(group->notifier.head);

Even if this is a bug, this is the wrong fix.  This is when the group
is being destroyed.  Yes, it would be a bug to still have any notifiers
here, but the intention is to make sure there are no notifiers when the
group is idle and unused.

> >  
> >  	list_for_each_entry_safe(unbound, tmp,
> >  				 &group->unbound_list, unbound_next) {
> > @@ -1584,9 +1586,6 @@ static int vfio_group_fops_release(struct inode *inode, struct file *filep)
> >  
> >  	filep->private_data = NULL;
> >  
> > -	/* Any user didn't unregister? */
> > -	WARN_ON(group->notifier.head);
> > -
> >  	vfio_group_try_dissolve_container(group);
> >  
> >  	atomic_dec(&group->opened);
> >   


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH kernel v2 10/11] vfio: Check for unregistered notifiers when group is actually released
  2016-12-19 16:28       ` Alex Williamson
@ 2016-12-19 22:41         ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 56+ messages in thread
From: Alexey Kardashevskiy @ 2016-12-19 22:41 UTC (permalink / raw)
  To: Alex Williamson, Jike Song
  Cc: linuxppc-dev, David Gibson, Paul Mackerras, kvm-ppc, kvm

On 20/12/16 03:28, Alex Williamson wrote:
> On Mon, 19 Dec 2016 18:41:05 +0800
> Jike Song <jike.song@intel.com> wrote:
> 
>> On 12/18/2016 09:28 AM, Alexey Kardashevskiy wrote:
>>> This moves a check for unregistered notifiers from fops release
>>> callback to the place where the group will actually be released.
>>>
>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>> ---
>>>
>>> This is going to be used in the following patch in cleanup
>>> path. Since the next patch is RFC, this one might not be needed.  
> 
> Alexey, this is intended to be a bug fix, it should be sent separately,
> not buried in an unrelated patch series.

Well, I was pretty unsure this is a correct fix and I was sure that it
would make sense without the context which is quite weird :)

> 
>> I didn't find any use in the following patch 11/11, did you mean
>> something else?
>>
>> BTW the warning in vfio_group_release seems too late, the user
>> should actually unregister everything by close()?
> 
> The thing is, it's not the user that registered the notifiers, it's the
> vendor driver.  The vendor driver should know via the device release to
> unregister the notifier, which we're counting on to happen before the
> group release.  Can we rely on that ordering even in the case where a
> user is SIGKILL'd?
> 
>>> ---
>>>  drivers/vfio/vfio.c | 5 ++---
>>>  1 file changed, 2 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>>> index 6b9a98508939..083b581e87c0 100644
>>> --- a/drivers/vfio/vfio.c
>>> +++ b/drivers/vfio/vfio.c
>>> @@ -403,6 +403,8 @@ static void vfio_group_release(struct kref *kref)
>>>  	struct iommu_group *iommu_group = group->iommu_group;
>>>  
>>>  	WARN_ON(!list_empty(&group->device_list));
>>> +	/* Any user didn't unregister? */
>>> +	WARN_ON(group->notifier.head);
> 
> Even if this is a bug, this is the wrong fix.  This is when the group
> is being destroyed.  Yes, it would be a bug to still have any notifiers
> here, but the intention is to make sure there are no notifiers when the
> group is idle and unused.

Out of curiosity - vendor drivers are supposed to hold a group file open,
not just reference vfio_group/iommu_group objects?



> 
>>>  
>>>  	list_for_each_entry_safe(unbound, tmp,
>>>  				 &group->unbound_list, unbound_next) {
>>> @@ -1584,9 +1586,6 @@ static int vfio_group_fops_release(struct inode *inode, struct file *filep)
>>>  
>>>  	filep->private_data = NULL;
>>>  
>>> -	/* Any user didn't unregister? */
>>> -	WARN_ON(group->notifier.head);
>>> -
>>>  	vfio_group_try_dissolve_container(group);
>>>  
>>>  	atomic_dec(&group->opened);
>>>   
> 


-- 
Alexey

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH kernel v2 10/11] vfio: Check for unregistered notifiers when group is actually released
@ 2016-12-19 22:41         ` Alexey Kardashevskiy
  0 siblings, 0 replies; 56+ messages in thread
From: Alexey Kardashevskiy @ 2016-12-19 22:41 UTC (permalink / raw)
  To: Alex Williamson, Jike Song
  Cc: linuxppc-dev, David Gibson, Paul Mackerras, kvm-ppc, kvm

On 20/12/16 03:28, Alex Williamson wrote:
> On Mon, 19 Dec 2016 18:41:05 +0800
> Jike Song <jike.song@intel.com> wrote:
> 
>> On 12/18/2016 09:28 AM, Alexey Kardashevskiy wrote:
>>> This moves a check for unregistered notifiers from fops release
>>> callback to the place where the group will actually be released.
>>>
>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>> ---
>>>
>>> This is going to be used in the following patch in cleanup
>>> path. Since the next patch is RFC, this one might not be needed.  
> 
> Alexey, this is intended to be a bug fix, it should be sent separately,
> not buried in an unrelated patch series.

Well, I was pretty unsure this is a correct fix and I was sure that it
would make sense without the context which is quite weird :)

> 
>> I didn't find any use in the following patch 11/11, did you mean
>> something else?
>>
>> BTW the warning in vfio_group_release seems too late, the user
>> should actually unregister everything by close()?
> 
> The thing is, it's not the user that registered the notifiers, it's the
> vendor driver.  The vendor driver should know via the device release to
> unregister the notifier, which we're counting on to happen before the
> group release.  Can we rely on that ordering even in the case where a
> user is SIGKILL'd?
> 
>>> ---
>>>  drivers/vfio/vfio.c | 5 ++---
>>>  1 file changed, 2 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>>> index 6b9a98508939..083b581e87c0 100644
>>> --- a/drivers/vfio/vfio.c
>>> +++ b/drivers/vfio/vfio.c
>>> @@ -403,6 +403,8 @@ static void vfio_group_release(struct kref *kref)
>>>  	struct iommu_group *iommu_group = group->iommu_group;
>>>  
>>>  	WARN_ON(!list_empty(&group->device_list));
>>> +	/* Any user didn't unregister? */
>>> +	WARN_ON(group->notifier.head);
> 
> Even if this is a bug, this is the wrong fix.  This is when the group
> is being destroyed.  Yes, it would be a bug to still have any notifiers
> here, but the intention is to make sure there are no notifiers when the
> group is idle and unused.

Out of curiosity - vendor drivers are supposed to hold a group file open,
not just reference vfio_group/iommu_group objects?



> 
>>>  
>>>  	list_for_each_entry_safe(unbound, tmp,
>>>  				 &group->unbound_list, unbound_next) {
>>> @@ -1584,9 +1586,6 @@ static int vfio_group_fops_release(struct inode *inode, struct file *filep)
>>>  
>>>  	filep->private_data = NULL;
>>>  
>>> -	/* Any user didn't unregister? */
>>> -	WARN_ON(group->notifier.head);
>>> -
>>>  	vfio_group_try_dissolve_container(group);
>>>  
>>>  	atomic_dec(&group->opened);
>>>   
> 


-- 
Alexey

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH kernel v3] KVM: PPC: Add in-kernel acceleration for VFIO
  2016-12-18  1:29   ` Alexey Kardashevskiy
  (?)
@ 2016-12-20  6:52     ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 56+ messages in thread
From: Alexey Kardashevskiy @ 2016-12-20  6:52 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kvm, Alexey Kardashevskiy, kvm-ppc, Alex Williamson,
	Paul Mackerras, David Gibson

This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
without passing them to user space, which saves the time spent
switching to user space and back.

This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
KVM tries to handle a TCE request in real mode; if that fails,
it passes the request to the virtual mode handler to complete
the operation. If the virtual mode handler also fails, the request
is passed to user space; this is not expected to happen though.

To avoid dealing with page use counters (which is tricky in real mode),
this only accelerates SPAPR TCE IOMMU v2 clients which are required
to pre-register the userspace memory. The very first TCE request will
be handled in the VFIO SPAPR TCE driver anyway as the userspace view
of the TCE table (iommu_table::it_userspace) is not allocated till
the very first mapping happens and we cannot call vmalloc in real mode.
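
(Not part of this patch: the pre-registration the text refers to is done
by the IOMMU v2 user, roughly as in the sketch below; container_fd is
assumed to be a VFIO container already switched to
VFIO_SPAPR_TCE_v2_IOMMU.)

#include <sys/ioctl.h>
#include <linux/vfio.h>

static int preregister_ram(int container_fd, void *ram, unsigned long size)
{
	struct vfio_iommu_spapr_register_memory reg = {
		.argsz = sizeof(reg),
		.flags = 0,
		.vaddr = (__u64)(unsigned long)ram,
		.size = size,
	};

	/* Pins and accounts the pages so later TCE updates can rely on them */
	return ioctl(container_fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
}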

This adds a new attribute, KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE, to
the VFIO KVM device. It takes a VFIO group fd and an SPAPR TCE table fd
and associates a physical IOMMU table with the SPAPR TCE table (which
is a guest view of the hardware IOMMU table). The iommu_table object
is cached and referenced so we do not have to look it up in real mode.
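
(Not part of this patch: the @tablefd passed to the new attribute comes
from KVM_CREATE_SPAPR_TCE_64, e.g. roughly as below; vm_fd and the
window parameters are hypothetical.)

#include <sys/ioctl.h>
#include <linux/kvm.h>

static int create_tce_table_fd(int vm_fd, unsigned long liobn,
			       unsigned int page_shift,
			       unsigned long window_size_pages)
{
	struct kvm_create_spapr_tce_64 args = {
		.liobn = liobn,
		.page_shift = page_shift,
		.offset = 0,			/* bus offset, in pages */
		.size = window_size_pages,	/* window size, in pages */
	};

	/* The returned fd is what KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE expects */
	return ioctl(vm_fd, KVM_CREATE_SPAPR_TCE_64, &args);
}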

This does not implement the UNSET counterpart as there is no use for it -
once the acceleration is enabled, the existing userspace won't
disable it unless a VFIO container is destroyed; this adds necessary
cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.

As this creates a descriptor per IOMMU table-LIOBN couple (called
kvmppc_spapr_tce_iommu_table), it is possible to have several
descriptors with the same iommu_table (hardware IOMMU table) attached
to the same LIOBN; this is done to simplify the cleanup and can be
improved later.

This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
space.

This finally makes use of vfio_external_user_iommu_id() which was
introduced quite some time ago and was considered for removal.

Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb Ethernet card).

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v3:
* simplified not to use VFIO group notifiers
* reworked cleanup, should be cleaner/simpler now

v2:
* reworked to use new VFIO notifiers
* now same iommu_table may appear in the list several times, to be fixed later
---

This obsoletes:

[PATCH kernel v2 08/11] KVM: PPC: Pass kvm* to kvmppc_find_table()
[PATCH kernel v2 09/11] vfio iommu: Add helpers to (un)register blocking notifiers per group
[PATCH kernel v2 11/11] KVM: PPC: Add in-kernel acceleration for VFIO


So I have not reposted the whole thing, should I have?


btw "F:     virt/kvm/vfio.*"  is missing MAINTAINERS.


---
 Documentation/virtual/kvm/devices/vfio.txt |  22 ++-
 arch/powerpc/include/asm/kvm_host.h        |   8 +
 arch/powerpc/include/asm/kvm_ppc.h         |   4 +
 include/uapi/linux/kvm.h                   |   8 +
 arch/powerpc/kvm/book3s_64_vio.c           | 286 +++++++++++++++++++++++++++++
 arch/powerpc/kvm/book3s_64_vio_hv.c        | 178 ++++++++++++++++++
 arch/powerpc/kvm/powerpc.c                 |   2 +
 virt/kvm/vfio.c                            |  88 +++++++++
 8 files changed, 594 insertions(+), 2 deletions(-)

diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
index ef51740c67ca..f95d867168ea 100644
--- a/Documentation/virtual/kvm/devices/vfio.txt
+++ b/Documentation/virtual/kvm/devices/vfio.txt
@@ -16,7 +16,25 @@ Groups:
 
 KVM_DEV_VFIO_GROUP attributes:
   KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
+	kvm_device_attr.addr points to an int32_t file descriptor
+	for the VFIO group.
   KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
+	kvm_device_attr.addr points to an int32_t file descriptor
+	for the VFIO group.
+  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
+	allocated by sPAPR KVM.
+	kvm_device_attr.addr points to a struct:
 
-For each, kvm_device_attr.addr points to an int32_t file descriptor
-for the VFIO group.
+	struct kvm_vfio_spapr_tce {
+		__u32	argsz;
+		__u32	flags;
+		__s32	groupfd;
+		__s32	tablefd;
+	};
+
+	where
+	@argsz is the size of kvm_vfio_spapr_tce_liobn;
+	@flags are not supported now, must be zero;
+	@groupfd is a file descriptor for a VFIO group;
+	@tablefd is a file descriptor for a TCE table allocated via
+		KVM_CREATE_SPAPR_TCE.
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 28350a294b1e..3d281b7ea369 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -191,6 +191,13 @@ struct kvmppc_pginfo {
 	atomic_t refcnt;
 };
 
+struct kvmppc_spapr_tce_iommu_table {
+	struct rcu_head rcu;
+	struct list_head next;
+	struct vfio_group *group;
+	struct iommu_table *tbl;
+};
+
 struct kvmppc_spapr_tce_table {
 	struct list_head list;
 	struct kvm *kvm;
@@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
 	u32 page_shift;
 	u64 offset;		/* in pages */
 	u64 size;		/* window size in pages */
+	struct list_head iommu_tables;
 	struct page *pages[0];
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 0a21c8503974..936138b866e7 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -163,6 +163,10 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
 extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
 			struct kvm_memory_slot *memslot, unsigned long porder);
 extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
+extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
+		struct vfio_group *group, struct iommu_group *grp);
+extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
+		struct vfio_group *group);
 
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce_64 *args);
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 810f74317987..4088da4a575f 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1068,6 +1068,7 @@ struct kvm_device_attr {
 #define  KVM_DEV_VFIO_GROUP			1
 #define   KVM_DEV_VFIO_GROUP_ADD			1
 #define   KVM_DEV_VFIO_GROUP_DEL			2
+#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
 
 enum kvm_device_type {
 	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
@@ -1089,6 +1090,13 @@ enum kvm_device_type {
 	KVM_DEV_TYPE_MAX,
 };
 
+struct kvm_vfio_spapr_tce {
+	__u32	argsz;
+	__u32	flags;
+	__s32	groupfd;
+	__s32	tablefd;
+};
+
 /*
  * ioctls for VM fds
  */
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 15df8ae627d9..008c4aee4df6 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -27,6 +27,10 @@
 #include <linux/hugetlb.h>
 #include <linux/list.h>
 #include <linux/anon_inodes.h>
+#include <linux/iommu.h>
+#include <linux/file.h>
+#include <linux/vfio.h>
+#include <linux/module.h>
 
 #include <asm/tlbflush.h>
 #include <asm/kvm_ppc.h>
@@ -39,6 +43,20 @@
 #include <asm/udbg.h>
 #include <asm/iommu.h>
 #include <asm/tce.h>
+#include <asm/mmu_context.h>
+
+static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
+{
+	void (*fn)(struct vfio_group *);
+
+	fn = symbol_get(vfio_group_put_external_user);
+	if (!fn)
+		return;
+
+	fn(vfio_group);
+
+	symbol_put(vfio_group_put_external_user);
+}
 
 static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
 {
@@ -90,6 +108,99 @@ static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
 	return ret;
 }
 
+static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
+{
+	struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
+			struct kvmppc_spapr_tce_iommu_table, rcu);
+
+	kfree(stit);
+}
+
+static void kvm_spapr_tce_liobn_release_iommu_group(
+		struct kvmppc_spapr_tce_table *stt,
+		struct vfio_group *group)
+{
+	struct kvmppc_spapr_tce_iommu_table *stit, *tmp;
+
+	list_for_each_entry_safe(stit, tmp, &stt->iommu_tables, next) {
+		if (group && (stit->group != group))
+			continue;
+
+		list_del_rcu(&stit->next);
+
+		iommu_table_put(stit->tbl);
+		kvm_vfio_group_put_external_user(stit->group);
+
+		call_rcu(&stit->rcu, kvm_spapr_tce_iommu_table_free);
+	}
+}
+
+extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
+		struct vfio_group *group)
+{
+	struct kvmppc_spapr_tce_table *stt;
+
+	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list)
+		kvm_spapr_tce_liobn_release_iommu_group(stt, group);
+}
+
+extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
+		struct vfio_group *group, struct iommu_group *grp)
+{
+	struct kvmppc_spapr_tce_table *stt = NULL;
+	bool found = false;
+	struct iommu_table *tbl = NULL;
+	struct iommu_table_group *table_group;
+	long i;
+	struct kvmppc_spapr_tce_iommu_table *stit;
+	struct fd f;
+
+	f = fdget(tablefd);
+	if (!f.file)
+		return -EBADF;
+
+	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
+		if (stt == f.file->private_data) {
+			found = true;
+			break;
+		}
+	}
+
+	fdput(f);
+
+	if (!found)
+		return -ENODEV;
+
+	table_group = iommu_group_get_iommudata(grp);
+	if (!table_group)
+		return -EFAULT;
+
+	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
+		struct iommu_table *tbltmp = table_group->tables[i];
+
+		if (!tbltmp)
+			continue;
+
+		if ((tbltmp->it_page_shift == stt->page_shift) &&
+				(tbltmp->it_offset == stt->offset)) {
+			tbl = tbltmp;
+			break;
+		}
+	}
+	if (!tbl)
+		return -ENODEV;
+
+	iommu_table_get(tbl);
+
+	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
+	stit->tbl = tbl;
+	stit->group = group;
+
+	list_add_rcu(&stit->next, &stt->iommu_tables);
+
+	return 0;
+}
+
 static void release_spapr_tce_table(struct rcu_head *head)
 {
 	struct kvmppc_spapr_tce_table *stt = container_of(head,
@@ -132,6 +243,8 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
 
 	list_del_rcu(&stt->list);
 
+	kvm_spapr_tce_liobn_release_iommu_group(stt, NULL /* release all */);
+
 	kvm_put_kvm(stt->kvm);
 
 	kvmppc_account_memlimit(
@@ -181,6 +294,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 	stt->offset = args->offset;
 	stt->size = size;
 	stt->kvm = kvm;
+	INIT_LIST_HEAD_RCU(&stt->iommu_tables);
 
 	for (i = 0; i < npages; i++) {
 		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
@@ -209,11 +323,161 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 	return ret;
 }
 
+static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	struct mm_iommu_table_group_mem_t *mem = NULL;
+	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
+	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+
+	if (!pua)
+		return H_HARDWARE;
+
+	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
+	if (!mem)
+		return H_HARDWARE;
+
+	mm_iommu_mapped_dec(mem);
+
+	*pua = 0;
+
+	return H_SUCCESS;
+}
+
+static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	enum dma_data_direction dir = DMA_NONE;
+	unsigned long hpa = 0;
+
+	if (iommu_tce_xchg(tbl, entry, &hpa, &dir))
+		return H_HARDWARE;
+
+	if (dir == DMA_NONE)
+		return H_SUCCESS;
+
+	return kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
+}
+
+long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
+		unsigned long entry, unsigned long gpa,
+		enum dma_data_direction dir)
+{
+	long ret;
+	unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+	struct mm_iommu_table_group_mem_t *mem;
+
+	if (!pua)
+		/* it_userspace allocation might be delayed */
+		return H_TOO_HARD;
+
+	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
+		return H_HARDWARE;
+
+	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
+	if (!mem)
+		return H_HARDWARE;
+
+	if (mm_iommu_ua_to_hpa(mem, ua, &hpa))
+		return H_HARDWARE;
+
+	if (mm_iommu_mapped_inc(mem))
+		return H_HARDWARE;
+
+	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
+	if (ret) {
+		mm_iommu_mapped_dec(mem);
+		return H_TOO_HARD;
+	}
+
+	if (dir != DMA_NONE)
+		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
+
+	*pua = ua;
+
+	return 0;
+}
+
+long kvmppc_h_put_tce_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce)
+{
+	long idx, ret = H_HARDWARE;
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+	const unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+	const enum dma_data_direction dir = iommu_tce_direction(tce);
+
+	/* Clear TCE */
+	if (dir == DMA_NONE) {
+		if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
+			return H_PARAMETER;
+
+		return kvmppc_tce_iommu_unmap(vcpu->kvm, tbl, entry);
+	}
+
+	/* Put TCE */
+	if (iommu_tce_put_param_check(tbl, ioba, gpa))
+		return H_PARAMETER;
+
+	idx = srcu_read_lock(&vcpu->kvm->srcu);
+	ret = kvmppc_tce_iommu_map(vcpu->kvm, tbl, entry, gpa, dir);
+	srcu_read_unlock(&vcpu->kvm->srcu, idx);
+
+	return ret;
+}
+
+static long kvmppc_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long ioba,
+		u64 __user *tces, unsigned long npages)
+{
+	unsigned long i, ret, tce, gpa;
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+
+	for (i = 0; i < npages; ++i) {
+		gpa = be64_to_cpu(tces[i]) & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+		if (iommu_tce_put_param_check(tbl, ioba +
+				(i << tbl->it_page_shift), gpa))
+			return H_PARAMETER;
+	}
+
+	for (i = 0; i < npages; ++i) {
+		tce = be64_to_cpu(tces[i]);
+		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+		ret = kvmppc_tce_iommu_map(vcpu->kvm, tbl, entry + i, gpa,
+				iommu_tce_direction(tce));
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
+	return H_SUCCESS;
+}
+
+long kvmppc_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	unsigned long i;
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+
+	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i)
+		kvmppc_tce_iommu_unmap(vcpu->kvm, tbl, entry + i);
+
+	return H_SUCCESS;
+}
+
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba, unsigned long tce)
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long ret;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
 	/* 	    liobn, ioba, tce); */
@@ -230,6 +494,12 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (ret != H_SUCCESS)
 		return ret;
 
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		ret = kvmppc_h_put_tce_iommu(vcpu, stit->tbl, liobn, ioba, tce);
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
 	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
 
 	return H_SUCCESS;
@@ -245,6 +515,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	unsigned long entry, ua = 0;
 	u64 __user *tces;
 	u64 tce;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -272,6 +543,13 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	}
 	tces = (u64 __user *) ua;
 
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		ret = kvmppc_h_put_tce_indirect_iommu(vcpu,
+				stit->tbl, ioba, tces, npages);
+		if (ret != H_SUCCESS)
+			goto unlock_exit;
+	}
+
 	for (i = 0; i < npages; ++i) {
 		if (get_user(tce, tces + i)) {
 			ret = H_TOO_HARD;
@@ -299,6 +577,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -312,6 +591,13 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
 		return H_PARAMETER;
 
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		ret = kvmppc_h_stuff_tce_iommu(vcpu, stit->tbl, liobn, ioba,
+				tce_value, npages);
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
 	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
 		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
 
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 8a6834e6e1c8..4d6f01712a6d 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -190,11 +190,165 @@ static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
 	return mm_iommu_lookup_rm(vcpu->kvm->mm, ua, size);
 }
 
+static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	struct mm_iommu_table_group_mem_t *mem = NULL;
+	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
+	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+
+	if (!pua)
+		return H_SUCCESS;
+
+	pua = (void *) vmalloc_to_phys(pua);
+	if (!pua)
+		return H_SUCCESS;
+
+	mem = kvmppc_rm_iommu_lookup(vcpu, *pua, pgsize);
+	if (!mem)
+		return H_HARDWARE;
+
+	mm_iommu_mapped_dec(mem);
+
+	*pua = 0;
+
+	return H_SUCCESS;
+}
+
+static long kvmppc_rm_tce_iommu_unmap(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	enum dma_data_direction dir = DMA_NONE;
+	unsigned long hpa = 0;
+
+	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
+		return H_HARDWARE;
+
+	if (dir == DMA_NONE)
+		return H_SUCCESS;
+
+	return kvmppc_rm_tce_iommu_mapped_dec(vcpu, tbl, entry);
+}
+
+long kvmppc_rm_tce_iommu_map(struct kvm_vcpu *vcpu, struct iommu_table *tbl,
+		unsigned long entry, unsigned long gpa,
+		enum dma_data_direction dir)
+{
+	long ret;
+	unsigned long hpa = 0, ua;
+	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+	struct mm_iommu_table_group_mem_t *mem;
+
+	if (!pua)
+		/* it_userspace allocation might be delayed */
+		return H_TOO_HARD;
+
+	if (kvmppc_gpa_to_ua(vcpu->kvm, gpa, &ua, NULL))
+		return H_HARDWARE;
+
+	mem = kvmppc_rm_iommu_lookup(vcpu, ua, 1ULL << tbl->it_page_shift);
+	if (!mem)
+		return H_HARDWARE;
+
+	if (mm_iommu_ua_to_hpa_rm(mem, ua, &hpa))
+		return H_HARDWARE;
+
+	pua = (void *) vmalloc_to_phys(pua);
+	if (!pua)
+		return H_HARDWARE;
+
+	if (mm_iommu_mapped_inc(mem))
+		return H_HARDWARE;
+
+	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
+	if (ret) {
+		mm_iommu_mapped_dec(mem);
+		return H_TOO_HARD;
+	}
+
+	if (dir != DMA_NONE)
+		kvmppc_rm_tce_iommu_mapped_dec(vcpu, tbl, entry);
+
+	*pua = ua;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(kvmppc_rm_tce_iommu_map);
+
+static long kvmppc_rm_h_put_tce_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long liobn,
+		unsigned long ioba, unsigned long tce)
+{
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+	const unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+	const enum dma_data_direction dir = iommu_tce_direction(tce);
+
+	/* Clear TCE */
+	if (dir == DMA_NONE) {
+		if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
+			return H_PARAMETER;
+
+		return kvmppc_rm_tce_iommu_unmap(vcpu, tbl, entry);
+	}
+
+	/* Put TCE */
+	if (iommu_tce_put_param_check(tbl, ioba, gpa))
+		return H_PARAMETER;
+
+	return kvmppc_rm_tce_iommu_map(vcpu, tbl, entry, gpa, dir);
+}
+
+static long kvmppc_rm_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long ioba,
+		u64 *tces, unsigned long npages)
+{
+	unsigned long i, ret, tce, gpa;
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+
+	for (i = 0; i < npages; ++i) {
+		gpa = be64_to_cpu(tces[i]) & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+		if (iommu_tce_put_param_check(tbl, ioba +
+				(i << tbl->it_page_shift), gpa))
+			return H_PARAMETER;
+	}
+
+	for (i = 0; i < npages; ++i) {
+		tce = be64_to_cpu(tces[i]);
+		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+		ret = kvmppc_rm_tce_iommu_map(vcpu, tbl, entry + i, gpa,
+				iommu_tce_direction(tce));
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
+	return H_SUCCESS;
+}
+
+static long kvmppc_rm_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	unsigned long i;
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+
+	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i)
+		kvmppc_rm_tce_iommu_unmap(vcpu, tbl, entry + i);
+
+	return H_SUCCESS;
+}
+
 long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		unsigned long ioba, unsigned long tce)
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long ret;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
 	/* 	    liobn, ioba, tce); */
@@ -211,6 +365,13 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (ret != H_SUCCESS)
 		return ret;
 
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		ret = kvmppc_rm_h_put_tce_iommu(vcpu, stit->tbl,
+				liobn, ioba, tce);
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
 	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
 
 	return H_SUCCESS;
@@ -278,6 +439,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 		 * depend on hpt.
 		 */
 		struct mm_iommu_table_group_mem_t *mem;
+		struct kvmppc_spapr_tce_iommu_table *stit;
 
 		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
 			return H_TOO_HARD;
@@ -285,6 +447,13 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 		mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K);
 		if (!mem || mm_iommu_ua_to_hpa_rm(mem, ua, &tces))
 			return H_TOO_HARD;
+
+		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+			ret = kvmppc_rm_h_put_tce_indirect_iommu(vcpu,
+					stit->tbl, ioba, (u64 *)tces, npages);
+			if (ret != H_SUCCESS)
+				return ret;
+		}
 	} else {
 		/*
 		 * This is emulated devices case.
@@ -334,6 +503,8 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret;
+	struct kvmppc_spapr_tce_iommu_table *stit;
+
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -347,6 +518,13 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
 		return H_PARAMETER;
 
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		ret = kvmppc_rm_h_stuff_tce_iommu(vcpu, stit->tbl,
+				liobn, ioba, tce_value, npages);
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
 	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
 		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
 
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 70963c845e96..0e555ba998c0 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -536,6 +536,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 #ifdef CONFIG_PPC_BOOK3S_64
 	case KVM_CAP_SPAPR_TCE:
 	case KVM_CAP_SPAPR_TCE_64:
+		/* fallthrough */
+	case KVM_CAP_SPAPR_TCE_VFIO:
 	case KVM_CAP_PPC_ALLOC_HTAB:
 	case KVM_CAP_PPC_RTAS:
 	case KVM_CAP_PPC_FIXUP_HCALL:
diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
index d32f239eb471..3181054c8ff7 100644
--- a/virt/kvm/vfio.c
+++ b/virt/kvm/vfio.c
@@ -20,6 +20,10 @@
 #include <linux/vfio.h>
 #include "vfio.h"
 
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+#include <asm/kvm_ppc.h>
+#endif
+
 struct kvm_vfio_group {
 	struct list_head node;
 	struct vfio_group *vfio_group;
@@ -89,6 +93,22 @@ static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
 	return ret > 0;
 }
 
+static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
+{
+	int (*fn)(struct vfio_group *);
+	int ret = -1;
+
+	fn = symbol_get(vfio_external_user_iommu_id);
+	if (!fn)
+		return ret;
+
+	ret = fn(vfio_group);
+
+	symbol_put(vfio_external_user_iommu_id);
+
+	return ret;
+}
+
 /*
  * Groups can use the same or different IOMMU domains.  If the same then
  * adding a new group may change the coherency of groups we've previously
@@ -211,6 +231,9 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
 
 		mutex_unlock(&kv->lock);
 
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+		kvm_spapr_tce_release_iommu_group(dev->kvm, vfio_group);
+#endif
 		kvm_vfio_group_set_kvm(vfio_group, NULL);
 
 		kvm_vfio_group_put_external_user(vfio_group);
@@ -218,6 +241,65 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
 		kvm_vfio_update_coherency(dev);
 
 		return ret;
+
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
+		struct kvm_vfio_spapr_tce param;
+		unsigned long minsz;
+		struct kvm_vfio *kv = dev->private;
+		struct vfio_group *vfio_group;
+		struct kvm_vfio_group *kvg;
+		struct fd f;
+
+		minsz = offsetofend(struct kvm_vfio_spapr_tce, tablefd);
+
+		if (copy_from_user(&param, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (param.argsz < minsz || param.flags)
+			return -EINVAL;
+
+		f = fdget(param.groupfd);
+		if (!f.file)
+			return -EBADF;
+
+		vfio_group = kvm_vfio_group_get_external_user(f.file);
+		fdput(f);
+
+		if (IS_ERR(vfio_group))
+			return PTR_ERR(vfio_group);
+
+		ret = -ENOENT;
+
+		mutex_lock(&kv->lock);
+
+		list_for_each_entry(kvg, &kv->group_list, node) {
+			int group_id;
+			struct iommu_group *grp;
+
+			if (kvg->vfio_group != vfio_group)
+				continue;
+
+			group_id = kvm_vfio_external_user_iommu_id(
+					kvg->vfio_group);
+			grp = iommu_group_get_by_id(group_id);
+			if (!grp) {
+				ret = -EFAULT;
+				break;
+			}
+
+			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
+					param.tablefd, vfio_group, grp);
+
+			iommu_group_put(grp);
+			break;
+		}
+
+		mutex_unlock(&kv->lock);
+
+		return ret;
+	}
+#endif /* CONFIG_SPAPR_TCE_IOMMU */
 	}
 
 	return -ENXIO;
@@ -242,6 +324,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
 		switch (attr->attr) {
 		case KVM_DEV_VFIO_GROUP_ADD:
 		case KVM_DEV_VFIO_GROUP_DEL:
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
+#endif
 			return 0;
 		}
 
@@ -257,6 +342,9 @@ static void kvm_vfio_destroy(struct kvm_device *dev)
 	struct kvm_vfio_group *kvg, *tmp;
 
 	list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+		kvm_spapr_tce_release_iommu_group(dev->kvm, kvg->vfio_group);
+#endif
 		kvm_vfio_group_set_kvm(kvg->vfio_group, NULL);
 		kvm_vfio_group_put_external_user(kvg->vfio_group);
 		list_del(&kvg->node);
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH kernel v3] KVM: PPC: Add in-kernel acceleration for VFIO
@ 2016-12-20  6:52     ` Alexey Kardashevskiy
  0 siblings, 0 replies; 56+ messages in thread
From: Alexey Kardashevskiy @ 2016-12-20  6:52 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Paul Mackerras, kvm-ppc, kvm,
	Alex Williamson, David Gibson

This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
without passing them to user space, which saves the time spent on
switching to user space and back.

This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
KVM first tries to handle a TCE request in real mode; if that fails,
it passes the request to the virtual mode handler to complete the
operation. If the virtual mode handler fails as well, the request is
passed on to user space; this is not expected to happen though.

To avoid dealing with page use counters (which is tricky in real mode),
this only accelerates SPAPR TCE IOMMU v2 clients which are required
to pre-register the userspace memory. The very first TCE request will
be handled by the VFIO SPAPR TCE driver anyway as the userspace view
of the TCE table (iommu_table::it_userspace) is not allocated until
the very first mapping happens and we cannot call vmalloc in real mode.

This adds a new attribute, KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE, to
the VFIO KVM device. It takes a VFIO group fd and an SPAPR TCE table fd
and associates a physical IOMMU table with the SPAPR TCE table (which
is a guest view of the hardware IOMMU table). The iommu_table object
is cached and referenced so we do not have to look it up in real mode.

This does not implement the UNSET counterpart as there is no use for it -
once the acceleration is enabled, the existing userspace won't
disable it unless a VFIO container is destroyed; this adds necessary
cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.

As this creates a descriptor per IOMMU table-LIOBN pair (called
kvmppc_spapr_tce_iommu_table), it is possible to have several
descriptors with the same iommu_table (hardware IOMMU table) attached
to the same LIOBN; this is done to simplify the cleanup and can be
improved later.

This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
space.

This finally makes use of vfio_external_user_iommu_id() which was
introduced quite some time ago and was considered for removal.

Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb ethernet card).

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v3:
* simplified not to use VFIO group notifiers
* reworked cleanup, should be cleaner/simpler now

v2:
* reworked to use new VFIO notifiers
* now same iommu_table may appear in the list several times, to be fixed later
---

This obsoletes:

[PATCH kernel v2 08/11] KVM: PPC: Pass kvm* to kvmppc_find_table()
[PATCH kernel v2 09/11] vfio iommu: Add helpers to (un)register blocking notifiers per group
[PATCH kernel v2 11/11] KVM: PPC: Add in-kernel acceleration for VFIO


So I have not reposted the whole thing; should I have?


btw "F:     virt/kvm/vfio.*"  is missing MAINTAINERS.


---
 Documentation/virtual/kvm/devices/vfio.txt |  22 ++-
 arch/powerpc/include/asm/kvm_host.h        |   8 +
 arch/powerpc/include/asm/kvm_ppc.h         |   4 +
 include/uapi/linux/kvm.h                   |   8 +
 arch/powerpc/kvm/book3s_64_vio.c           | 286 +++++++++++++++++++++++++++++
 arch/powerpc/kvm/book3s_64_vio_hv.c        | 178 ++++++++++++++++++
 arch/powerpc/kvm/powerpc.c                 |   2 +
 virt/kvm/vfio.c                            |  88 +++++++++
 8 files changed, 594 insertions(+), 2 deletions(-)

diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
index ef51740c67ca..f95d867168ea 100644
--- a/Documentation/virtual/kvm/devices/vfio.txt
+++ b/Documentation/virtual/kvm/devices/vfio.txt
@@ -16,7 +16,25 @@ Groups:
 
 KVM_DEV_VFIO_GROUP attributes:
   KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
+	kvm_device_attr.addr points to an int32_t file descriptor
+	for the VFIO group.
   KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
+	kvm_device_attr.addr points to an int32_t file descriptor
+	for the VFIO group.
+  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
+	allocated by sPAPR KVM.
+	kvm_device_attr.addr points to a struct:
 
-For each, kvm_device_attr.addr points to an int32_t file descriptor
-for the VFIO group.
+	struct kvm_vfio_spapr_tce {
+		__u32	argsz;
+		__u32	flags;
+		__s32	groupfd;
+		__s32	tablefd;
+	};
+
+	where
+	@argsz is the size of struct kvm_vfio_spapr_tce;
+	@flags are not supported now, must be zero;
+	@groupfd is a file descriptor for a VFIO group;
+	@tablefd is a file descriptor for a TCE table allocated via
+		KVM_CREATE_SPAPR_TCE.
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 28350a294b1e..3d281b7ea369 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -191,6 +191,13 @@ struct kvmppc_pginfo {
 	atomic_t refcnt;
 };
 
+struct kvmppc_spapr_tce_iommu_table {
+	struct rcu_head rcu;
+	struct list_head next;
+	struct vfio_group *group;
+	struct iommu_table *tbl;
+};
+
 struct kvmppc_spapr_tce_table {
 	struct list_head list;
 	struct kvm *kvm;
@@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
 	u32 page_shift;
 	u64 offset;		/* in pages */
 	u64 size;		/* window size in pages */
+	struct list_head iommu_tables;
 	struct page *pages[0];
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 0a21c8503974..936138b866e7 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -163,6 +163,10 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
 extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
 			struct kvm_memory_slot *memslot, unsigned long porder);
 extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
+extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
+		struct vfio_group *group, struct iommu_group *grp);
+extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
+		struct vfio_group *group);
 
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce_64 *args);
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 810f74317987..4088da4a575f 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1068,6 +1068,7 @@ struct kvm_device_attr {
 #define  KVM_DEV_VFIO_GROUP			1
 #define   KVM_DEV_VFIO_GROUP_ADD			1
 #define   KVM_DEV_VFIO_GROUP_DEL			2
+#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
 
 enum kvm_device_type {
 	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
@@ -1089,6 +1090,13 @@ enum kvm_device_type {
 	KVM_DEV_TYPE_MAX,
 };
 
+struct kvm_vfio_spapr_tce {
+	__u32	argsz;
+	__u32	flags;
+	__s32	groupfd;
+	__s32	tablefd;
+};
+
 /*
  * ioctls for VM fds
  */
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 15df8ae627d9..008c4aee4df6 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -27,6 +27,10 @@
 #include <linux/hugetlb.h>
 #include <linux/list.h>
 #include <linux/anon_inodes.h>
+#include <linux/iommu.h>
+#include <linux/file.h>
+#include <linux/vfio.h>
+#include <linux/module.h>
 
 #include <asm/tlbflush.h>
 #include <asm/kvm_ppc.h>
@@ -39,6 +43,20 @@
 #include <asm/udbg.h>
 #include <asm/iommu.h>
 #include <asm/tce.h>
+#include <asm/mmu_context.h>
+
+static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
+{
+	void (*fn)(struct vfio_group *);
+
+	fn = symbol_get(vfio_group_put_external_user);
+	if (!fn)
+		return;
+
+	fn(vfio_group);
+
+	symbol_put(vfio_group_put_external_user);
+}
 
 static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
 {
@@ -90,6 +108,99 @@ static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
 	return ret;
 }
 
+static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
+{
+	struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
+			struct kvmppc_spapr_tce_iommu_table, rcu);
+
+	kfree(stit);
+}
+
+static void kvm_spapr_tce_liobn_release_iommu_group(
+		struct kvmppc_spapr_tce_table *stt,
+		struct vfio_group *group)
+{
+	struct kvmppc_spapr_tce_iommu_table *stit, *tmp;
+
+	list_for_each_entry_safe(stit, tmp, &stt->iommu_tables, next) {
+		if (group && (stit->group != group))
+			continue;
+
+		list_del_rcu(&stit->next);
+
+		iommu_table_put(stit->tbl);
+		kvm_vfio_group_put_external_user(stit->group);
+
+		call_rcu(&stit->rcu, kvm_spapr_tce_iommu_table_free);
+	}
+}
+
+extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
+		struct vfio_group *group)
+{
+	struct kvmppc_spapr_tce_table *stt;
+
+	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list)
+		kvm_spapr_tce_liobn_release_iommu_group(stt, group);
+}
+
+extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
+		struct vfio_group *group, struct iommu_group *grp)
+{
+	struct kvmppc_spapr_tce_table *stt = NULL;
+	bool found = false;
+	struct iommu_table *tbl = NULL;
+	struct iommu_table_group *table_group;
+	long i;
+	struct kvmppc_spapr_tce_iommu_table *stit;
+	struct fd f;
+
+	f = fdget(tablefd);
+	if (!f.file)
+		return -EBADF;
+
+	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
+		if (stt == f.file->private_data) {
+			found = true;
+			break;
+		}
+	}
+
+	fdput(f);
+
+	if (!found)
+		return -ENODEV;
+
+	table_group = iommu_group_get_iommudata(grp);
+	if (!table_group)
+		return -EFAULT;
+
+	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
+		struct iommu_table *tbltmp = table_group->tables[i];
+
+		if (!tbltmp)
+			continue;
+
+		if ((tbltmp->it_page_shift == stt->page_shift) &&
+				(tbltmp->it_offset == stt->offset)) {
+			tbl = tbltmp;
+			break;
+		}
+	}
+	if (!tbl)
+		return -ENODEV;
+
+	iommu_table_get(tbl);
+
+	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
+	stit->tbl = tbl;
+	stit->group = group;
+
+	list_add_rcu(&stit->next, &stt->iommu_tables);
+
+	return 0;
+}
+
 static void release_spapr_tce_table(struct rcu_head *head)
 {
 	struct kvmppc_spapr_tce_table *stt = container_of(head,
@@ -132,6 +243,8 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
 
 	list_del_rcu(&stt->list);
 
+	kvm_spapr_tce_liobn_release_iommu_group(stt, NULL /* release all */);
+
 	kvm_put_kvm(stt->kvm);
 
 	kvmppc_account_memlimit(
@@ -181,6 +294,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 	stt->offset = args->offset;
 	stt->size = size;
 	stt->kvm = kvm;
+	INIT_LIST_HEAD_RCU(&stt->iommu_tables);
 
 	for (i = 0; i < npages; i++) {
 		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
@@ -209,11 +323,161 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 	return ret;
 }
 
+static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	struct mm_iommu_table_group_mem_t *mem = NULL;
+	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
+	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+
+	if (!pua)
+		return H_HARDWARE;
+
+	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
+	if (!mem)
+		return H_HARDWARE;
+
+	mm_iommu_mapped_dec(mem);
+
+	*pua = 0;
+
+	return H_SUCCESS;
+}
+
+static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	enum dma_data_direction dir = DMA_NONE;
+	unsigned long hpa = 0;
+
+	if (iommu_tce_xchg(tbl, entry, &hpa, &dir))
+		return H_HARDWARE;
+
+	if (dir == DMA_NONE)
+		return H_SUCCESS;
+
+	return kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
+}
+
+long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
+		unsigned long entry, unsigned long gpa,
+		enum dma_data_direction dir)
+{
+	long ret;
+	unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+	struct mm_iommu_table_group_mem_t *mem;
+
+	if (!pua)
+		/* it_userspace allocation might be delayed */
+		return H_TOO_HARD;
+
+	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
+		return H_HARDWARE;
+
+	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
+	if (!mem)
+		return H_HARDWARE;
+
+	if (mm_iommu_ua_to_hpa(mem, ua, &hpa))
+		return H_HARDWARE;
+
+	if (mm_iommu_mapped_inc(mem))
+		return H_HARDWARE;
+
+	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
+	if (ret) {
+		mm_iommu_mapped_dec(mem);
+		return H_TOO_HARD;
+	}
+
+	if (dir != DMA_NONE)
+		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
+
+	*pua = ua;
+
+	return 0;
+}
+
+long kvmppc_h_put_tce_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce)
+{
+	long idx, ret = H_HARDWARE;
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+	const unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+	const enum dma_data_direction dir = iommu_tce_direction(tce);
+
+	/* Clear TCE */
+	if (dir == DMA_NONE) {
+		if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
+			return H_PARAMETER;
+
+		return kvmppc_tce_iommu_unmap(vcpu->kvm, tbl, entry);
+	}
+
+	/* Put TCE */
+	if (iommu_tce_put_param_check(tbl, ioba, gpa))
+		return H_PARAMETER;
+
+	idx = srcu_read_lock(&vcpu->kvm->srcu);
+	ret = kvmppc_tce_iommu_map(vcpu->kvm, tbl, entry, gpa, dir);
+	srcu_read_unlock(&vcpu->kvm->srcu, idx);
+
+	return ret;
+}
+
+static long kvmppc_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long ioba,
+		u64 __user *tces, unsigned long npages)
+{
+	unsigned long i, ret, tce, gpa;
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+
+	for (i = 0; i < npages; ++i) {
+		gpa = be64_to_cpu(tces[i]) & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+		if (iommu_tce_put_param_check(tbl, ioba +
+				(i << tbl->it_page_shift), gpa))
+			return H_PARAMETER;
+	}
+
+	for (i = 0; i < npages; ++i) {
+		tce = be64_to_cpu(tces[i]);
+		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+		ret = kvmppc_tce_iommu_map(vcpu->kvm, tbl, entry + i, gpa,
+				iommu_tce_direction(tce));
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
+	return H_SUCCESS;
+}
+
+long kvmppc_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	unsigned long i;
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+
+	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i)
+		kvmppc_tce_iommu_unmap(vcpu->kvm, tbl, entry + i);
+
+	return H_SUCCESS;
+}
+
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba, unsigned long tce)
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long ret;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
 	/* 	    liobn, ioba, tce); */
@@ -230,6 +494,12 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (ret != H_SUCCESS)
 		return ret;
 
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		ret = kvmppc_h_put_tce_iommu(vcpu, stit->tbl, liobn, ioba, tce);
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
 	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
 
 	return H_SUCCESS;
@@ -245,6 +515,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	unsigned long entry, ua = 0;
 	u64 __user *tces;
 	u64 tce;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -272,6 +543,13 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	}
 	tces = (u64 __user *) ua;
 
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		ret = kvmppc_h_put_tce_indirect_iommu(vcpu,
+				stit->tbl, ioba, tces, npages);
+		if (ret != H_SUCCESS)
+			goto unlock_exit;
+	}
+
 	for (i = 0; i < npages; ++i) {
 		if (get_user(tce, tces + i)) {
 			ret = H_TOO_HARD;
@@ -299,6 +577,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -312,6 +591,13 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
 		return H_PARAMETER;
 
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		ret = kvmppc_h_stuff_tce_iommu(vcpu, stit->tbl, liobn, ioba,
+				tce_value, npages);
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
 	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
 		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
 
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 8a6834e6e1c8..4d6f01712a6d 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -190,11 +190,165 @@ static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
 	return mm_iommu_lookup_rm(vcpu->kvm->mm, ua, size);
 }
 
+static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	struct mm_iommu_table_group_mem_t *mem = NULL;
+	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
+	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+
+	if (!pua)
+		return H_SUCCESS;
+
+	pua = (void *) vmalloc_to_phys(pua);
+	if (!pua)
+		return H_SUCCESS;
+
+	mem = kvmppc_rm_iommu_lookup(vcpu, *pua, pgsize);
+	if (!mem)
+		return H_HARDWARE;
+
+	mm_iommu_mapped_dec(mem);
+
+	*pua = 0;
+
+	return H_SUCCESS;
+}
+
+static long kvmppc_rm_tce_iommu_unmap(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	enum dma_data_direction dir = DMA_NONE;
+	unsigned long hpa = 0;
+
+	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
+		return H_HARDWARE;
+
+	if (dir == DMA_NONE)
+		return H_SUCCESS;
+
+	return kvmppc_rm_tce_iommu_mapped_dec(vcpu, tbl, entry);
+}
+
+long kvmppc_rm_tce_iommu_map(struct kvm_vcpu *vcpu, struct iommu_table *tbl,
+		unsigned long entry, unsigned long gpa,
+		enum dma_data_direction dir)
+{
+	long ret;
+	unsigned long hpa = 0, ua;
+	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+	struct mm_iommu_table_group_mem_t *mem;
+
+	if (!pua)
+		/* it_userspace allocation might be delayed */
+		return H_TOO_HARD;
+
+	if (kvmppc_gpa_to_ua(vcpu->kvm, gpa, &ua, NULL))
+		return H_HARDWARE;
+
+	mem = kvmppc_rm_iommu_lookup(vcpu, ua, 1ULL << tbl->it_page_shift);
+	if (!mem)
+		return H_HARDWARE;
+
+	if (mm_iommu_ua_to_hpa_rm(mem, ua, &hpa))
+		return H_HARDWARE;
+
+	pua = (void *) vmalloc_to_phys(pua);
+	if (!pua)
+		return H_HARDWARE;
+
+	if (mm_iommu_mapped_inc(mem))
+		return H_HARDWARE;
+
+	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
+	if (ret) {
+		mm_iommu_mapped_dec(mem);
+		return H_TOO_HARD;
+	}
+
+	if (dir != DMA_NONE)
+		kvmppc_rm_tce_iommu_mapped_dec(vcpu, tbl, entry);
+
+	*pua = ua;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(kvmppc_rm_tce_iommu_map);
+
+static long kvmppc_rm_h_put_tce_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long liobn,
+		unsigned long ioba, unsigned long tce)
+{
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+	const unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+	const enum dma_data_direction dir = iommu_tce_direction(tce);
+
+	/* Clear TCE */
+	if (dir == DMA_NONE) {
+		if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
+			return H_PARAMETER;
+
+		return kvmppc_rm_tce_iommu_unmap(vcpu, tbl, entry);
+	}
+
+	/* Put TCE */
+	if (iommu_tce_put_param_check(tbl, ioba, gpa))
+		return H_PARAMETER;
+
+	return kvmppc_rm_tce_iommu_map(vcpu, tbl, entry, gpa, dir);
+}
+
+static long kvmppc_rm_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long ioba,
+		u64 *tces, unsigned long npages)
+{
+	unsigned long i, ret, tce, gpa;
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+
+	for (i = 0; i < npages; ++i) {
+		gpa = be64_to_cpu(tces[i]) & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+		if (iommu_tce_put_param_check(tbl, ioba +
+				(i << tbl->it_page_shift), gpa))
+			return H_PARAMETER;
+	}
+
+	for (i = 0; i < npages; ++i) {
+		tce = be64_to_cpu(tces[i]);
+		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+		ret = kvmppc_rm_tce_iommu_map(vcpu, tbl, entry + i, gpa,
+				iommu_tce_direction(tce));
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
+	return H_SUCCESS;
+}
+
+static long kvmppc_rm_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	unsigned long i;
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+
+	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i)
+		kvmppc_rm_tce_iommu_unmap(vcpu, tbl, entry + i);
+
+	return H_SUCCESS;
+}
+
 long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		unsigned long ioba, unsigned long tce)
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long ret;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
 	/* 	    liobn, ioba, tce); */
@@ -211,6 +365,13 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (ret != H_SUCCESS)
 		return ret;
 
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		ret = kvmppc_rm_h_put_tce_iommu(vcpu, stit->tbl,
+				liobn, ioba, tce);
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
 	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
 
 	return H_SUCCESS;
@@ -278,6 +439,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 		 * depend on hpt.
 		 */
 		struct mm_iommu_table_group_mem_t *mem;
+		struct kvmppc_spapr_tce_iommu_table *stit;
 
 		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
 			return H_TOO_HARD;
@@ -285,6 +447,13 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 		mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K);
 		if (!mem || mm_iommu_ua_to_hpa_rm(mem, ua, &tces))
 			return H_TOO_HARD;
+
+		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+			ret = kvmppc_rm_h_put_tce_indirect_iommu(vcpu,
+					stit->tbl, ioba, (u64 *)tces, npages);
+			if (ret != H_SUCCESS)
+				return ret;
+		}
 	} else {
 		/*
 		 * This is emulated devices case.
@@ -334,6 +503,8 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret;
+	struct kvmppc_spapr_tce_iommu_table *stit;
+
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -347,6 +518,13 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
 		return H_PARAMETER;
 
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		ret = kvmppc_rm_h_stuff_tce_iommu(vcpu, stit->tbl,
+				liobn, ioba, tce_value, npages);
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
 	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
 		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
 
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 70963c845e96..0e555ba998c0 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -536,6 +536,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 #ifdef CONFIG_PPC_BOOK3S_64
 	case KVM_CAP_SPAPR_TCE:
 	case KVM_CAP_SPAPR_TCE_64:
+		/* fallthrough */
+	case KVM_CAP_SPAPR_TCE_VFIO:
 	case KVM_CAP_PPC_ALLOC_HTAB:
 	case KVM_CAP_PPC_RTAS:
 	case KVM_CAP_PPC_FIXUP_HCALL:
diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
index d32f239eb471..3181054c8ff7 100644
--- a/virt/kvm/vfio.c
+++ b/virt/kvm/vfio.c
@@ -20,6 +20,10 @@
 #include <linux/vfio.h>
 #include "vfio.h"
 
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+#include <asm/kvm_ppc.h>
+#endif
+
 struct kvm_vfio_group {
 	struct list_head node;
 	struct vfio_group *vfio_group;
@@ -89,6 +93,22 @@ static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
 	return ret > 0;
 }
 
+static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
+{
+	int (*fn)(struct vfio_group *);
+	int ret = -1;
+
+	fn = symbol_get(vfio_external_user_iommu_id);
+	if (!fn)
+		return ret;
+
+	ret = fn(vfio_group);
+
+	symbol_put(vfio_external_user_iommu_id);
+
+	return ret;
+}
+
 /*
  * Groups can use the same or different IOMMU domains.  If the same then
  * adding a new group may change the coherency of groups we've previously
@@ -211,6 +231,9 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
 
 		mutex_unlock(&kv->lock);
 
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+		kvm_spapr_tce_release_iommu_group(dev->kvm, vfio_group);
+#endif
 		kvm_vfio_group_set_kvm(vfio_group, NULL);
 
 		kvm_vfio_group_put_external_user(vfio_group);
@@ -218,6 +241,65 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
 		kvm_vfio_update_coherency(dev);
 
 		return ret;
+
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
+		struct kvm_vfio_spapr_tce param;
+		unsigned long minsz;
+		struct kvm_vfio *kv = dev->private;
+		struct vfio_group *vfio_group;
+		struct kvm_vfio_group *kvg;
+		struct fd f;
+
+		minsz = offsetofend(struct kvm_vfio_spapr_tce, tablefd);
+
+		if (copy_from_user(&param, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (param.argsz < minsz || param.flags)
+			return -EINVAL;
+
+		f = fdget(param.groupfd);
+		if (!f.file)
+			return -EBADF;
+
+		vfio_group = kvm_vfio_group_get_external_user(f.file);
+		fdput(f);
+
+		if (IS_ERR(vfio_group))
+			return PTR_ERR(vfio_group);
+
+		ret = -ENOENT;
+
+		mutex_lock(&kv->lock);
+
+		list_for_each_entry(kvg, &kv->group_list, node) {
+			int group_id;
+			struct iommu_group *grp;
+
+			if (kvg->vfio_group != vfio_group)
+				continue;
+
+			group_id = kvm_vfio_external_user_iommu_id(
+					kvg->vfio_group);
+			grp = iommu_group_get_by_id(group_id);
+			if (!grp) {
+				ret = -EFAULT;
+				break;
+			}
+
+			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
+					param.tablefd, vfio_group, grp);
+
+			iommu_group_put(grp);
+			break;
+		}
+
+		mutex_unlock(&kv->lock);
+
+		return ret;
+	}
+#endif /* CONFIG_SPAPR_TCE_IOMMU */
 	}
 
 	return -ENXIO;
@@ -242,6 +324,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
 		switch (attr->attr) {
 		case KVM_DEV_VFIO_GROUP_ADD:
 		case KVM_DEV_VFIO_GROUP_DEL:
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
+#endif
 			return 0;
 		}
 
@@ -257,6 +342,9 @@ static void kvm_vfio_destroy(struct kvm_device *dev)
 	struct kvm_vfio_group *kvg, *tmp;
 
 	list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+		kvm_spapr_tce_release_iommu_group(dev->kvm, kvg->vfio_group);
+#endif
 		kvm_vfio_group_set_kvm(kvg->vfio_group, NULL);
 		kvm_vfio_group_put_external_user(kvg->vfio_group);
 		list_del(&kvg->node);
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH kernel v3] KVM: PPC: Add in-kernel acceleration for VFIO
@ 2016-12-20  6:52     ` Alexey Kardashevskiy
  0 siblings, 0 replies; 56+ messages in thread
From: Alexey Kardashevskiy @ 2016-12-20  6:52 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kvm, Alexey Kardashevskiy, kvm-ppc, Alex Williamson,
	Paul Mackerras, David Gibson

This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO
without passing them to user space which saves time on switching
to user space and back.

This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
KVM tries to handle a TCE request in the real mode, if failed
it passes the request to the virtual mode to complete the operation.
If it a virtual mode handler fails, the request is passed to
the user space; this is not expected to happen though.

To avoid dealing with page use counters (which is tricky in real mode),
this only accelerates SPAPR TCE IOMMU v2 clients which are required
to pre-register the userspace memory. The very first TCE request will
be handled in the VFIO SPAPR TCE driver anyway as the userspace view
of the TCE table (iommu_table::it_userspace) is not allocated till
the very first mapping happens and we cannot call vmalloc in real mode.

This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
and associates a physical IOMMU table with the SPAPR TCE table (which
is a guest view of the hardware IOMMU table). The iommu_table object
is cached and referenced so we do not have to look up for it in real mode.

This does not implement the UNSET counterpart as there is no use for it -
once the acceleration is enabled, the existing userspace won't
disable it unless a VFIO container is destroyed; this adds necessary
cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.

As this creates a descriptor per IOMMU table-LIOBN couple (called
kvmppc_spapr_tce_iommu_table), it is possible to have several
descriptors with the same iommu_table (hardware IOMMU table) attached
to the same LIOBN, this is done to simplify the cleanup and can be
improved later.

This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
space.

This finally makes use of vfio_external_user_iommu_id() which was
introduced quite some time ago and was considered for removal.

Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v3:
* simplified not to use VFIO group notifiers
* reworked cleanup, should be cleaner/simpler now

v2:
* reworked to use new VFIO notifiers
* now same iommu_table may appear in the list several times, to be fixed later
---

This obsoletes:

[PATCH kernel v2 08/11] KVM: PPC: Pass kvm* to kvmppc_find_table()
[PATCH kernel v2 09/11] vfio iommu: Add helpers to (un)register blocking notifiers per group
[PATCH kernel v2 11/11] KVM: PPC: Add in-kernel acceleration for VFIO


So I have not reposted the whole thing, should have I?


btw "F:     virt/kvm/vfio.*"  is missing MAINTAINERS.


---
 Documentation/virtual/kvm/devices/vfio.txt |  22 ++-
 arch/powerpc/include/asm/kvm_host.h        |   8 +
 arch/powerpc/include/asm/kvm_ppc.h         |   4 +
 include/uapi/linux/kvm.h                   |   8 +
 arch/powerpc/kvm/book3s_64_vio.c           | 286 +++++++++++++++++++++++++++++
 arch/powerpc/kvm/book3s_64_vio_hv.c        | 178 ++++++++++++++++++
 arch/powerpc/kvm/powerpc.c                 |   2 +
 virt/kvm/vfio.c                            |  88 +++++++++
 8 files changed, 594 insertions(+), 2 deletions(-)

diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
index ef51740c67ca..f95d867168ea 100644
--- a/Documentation/virtual/kvm/devices/vfio.txt
+++ b/Documentation/virtual/kvm/devices/vfio.txt
@@ -16,7 +16,25 @@ Groups:
 
 KVM_DEV_VFIO_GROUP attributes:
   KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
+	kvm_device_attr.addr points to an int32_t file descriptor
+	for the VFIO group.
   KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
+	kvm_device_attr.addr points to an int32_t file descriptor
+	for the VFIO group.
+  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
+	allocated by sPAPR KVM.
+	kvm_device_attr.addr points to a struct:
 
-For each, kvm_device_attr.addr points to an int32_t file descriptor
-for the VFIO group.
+	struct kvm_vfio_spapr_tce {
+		__u32	argsz;
+		__u32	flags;
+		__s32	groupfd;
+		__s32	tablefd;
+	};
+
+	where
+	@argsz is the size of kvm_vfio_spapr_tce_liobn;
+	@flags are not supported now, must be zero;
+	@groupfd is a file descriptor for a VFIO group;
+	@tablefd is a file descriptor for a TCE table allocated via
+		KVM_CREATE_SPAPR_TCE.
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 28350a294b1e..3d281b7ea369 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -191,6 +191,13 @@ struct kvmppc_pginfo {
 	atomic_t refcnt;
 };
 
+struct kvmppc_spapr_tce_iommu_table {
+	struct rcu_head rcu;
+	struct list_head next;
+	struct vfio_group *group;
+	struct iommu_table *tbl;
+};
+
 struct kvmppc_spapr_tce_table {
 	struct list_head list;
 	struct kvm *kvm;
@@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
 	u32 page_shift;
 	u64 offset;		/* in pages */
 	u64 size;		/* window size in pages */
+	struct list_head iommu_tables;
 	struct page *pages[0];
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 0a21c8503974..936138b866e7 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -163,6 +163,10 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
 extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
 			struct kvm_memory_slot *memslot, unsigned long porder);
 extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
+extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
+		struct vfio_group *group, struct iommu_group *grp);
+extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
+		struct vfio_group *group);
 
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce_64 *args);
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 810f74317987..4088da4a575f 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1068,6 +1068,7 @@ struct kvm_device_attr {
 #define  KVM_DEV_VFIO_GROUP			1
 #define   KVM_DEV_VFIO_GROUP_ADD			1
 #define   KVM_DEV_VFIO_GROUP_DEL			2
+#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
 
 enum kvm_device_type {
 	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
@@ -1089,6 +1090,13 @@ enum kvm_device_type {
 	KVM_DEV_TYPE_MAX,
 };
 
+struct kvm_vfio_spapr_tce {
+	__u32	argsz;
+	__u32	flags;
+	__s32	groupfd;
+	__s32	tablefd;
+};
+
 /*
  * ioctls for VM fds
  */
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 15df8ae627d9..008c4aee4df6 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -27,6 +27,10 @@
 #include <linux/hugetlb.h>
 #include <linux/list.h>
 #include <linux/anon_inodes.h>
+#include <linux/iommu.h>
+#include <linux/file.h>
+#include <linux/vfio.h>
+#include <linux/module.h>
 
 #include <asm/tlbflush.h>
 #include <asm/kvm_ppc.h>
@@ -39,6 +43,20 @@
 #include <asm/udbg.h>
 #include <asm/iommu.h>
 #include <asm/tce.h>
+#include <asm/mmu_context.h>
+
+static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
+{
+	void (*fn)(struct vfio_group *);
+
+	fn = symbol_get(vfio_group_put_external_user);
+	if (!fn)
+		return;
+
+	fn(vfio_group);
+
+	symbol_put(vfio_group_put_external_user);
+}
 
 static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
 {
@@ -90,6 +108,99 @@ static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
 	return ret;
 }
 
+static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
+{
+	struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
+			struct kvmppc_spapr_tce_iommu_table, rcu);
+
+	kfree(stit);
+}
+
+static void kvm_spapr_tce_liobn_release_iommu_group(
+		struct kvmppc_spapr_tce_table *stt,
+		struct vfio_group *group)
+{
+	struct kvmppc_spapr_tce_iommu_table *stit, *tmp;
+
+	list_for_each_entry_safe(stit, tmp, &stt->iommu_tables, next) {
+		if (group && (stit->group != group))
+			continue;
+
+		list_del_rcu(&stit->next);
+
+		iommu_table_put(stit->tbl);
+		kvm_vfio_group_put_external_user(stit->group);
+
+		call_rcu(&stit->rcu, kvm_spapr_tce_iommu_table_free);
+	}
+}
+
+extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
+		struct vfio_group *group)
+{
+	struct kvmppc_spapr_tce_table *stt;
+
+	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list)
+		kvm_spapr_tce_liobn_release_iommu_group(stt, group);
+}
+
+extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
+		struct vfio_group *group, struct iommu_group *grp)
+{
+	struct kvmppc_spapr_tce_table *stt = NULL;
+	bool found = false;
+	struct iommu_table *tbl = NULL;
+	struct iommu_table_group *table_group;
+	long i;
+	struct kvmppc_spapr_tce_iommu_table *stit;
+	struct fd f;
+
+	f = fdget(tablefd);
+	if (!f.file)
+		return -EBADF;
+
+	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
+		if (stt = f.file->private_data) {
+			found = true;
+			break;
+		}
+	}
+
+	fdput(f);
+
+	if (!found)
+		return -ENODEV;
+
+	table_group = iommu_group_get_iommudata(grp);
+	if (!table_group)
+		return -EFAULT;
+
+	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
+		struct iommu_table *tbltmp = table_group->tables[i];
+
+		if (!tbltmp)
+			continue;
+
+		if ((tbltmp->it_page_shift = stt->page_shift) &&
+				(tbltmp->it_offset = stt->offset)) {
+			tbl = tbltmp;
+			break;
+		}
+	}
+	if (!tbl)
+		return -ENODEV;
+
+	iommu_table_get(tbl);
+
+	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
+	stit->tbl = tbl;
+	stit->group = group;
+
+	list_add_rcu(&stit->next, &stt->iommu_tables);
+
+	return 0;
+}
+
 static void release_spapr_tce_table(struct rcu_head *head)
 {
 	struct kvmppc_spapr_tce_table *stt = container_of(head,
@@ -132,6 +243,8 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
 
 	list_del_rcu(&stt->list);
 
+	kvm_spapr_tce_liobn_release_iommu_group(stt, NULL /* release all */);
+
 	kvm_put_kvm(stt->kvm);
 
 	kvmppc_account_memlimit(
@@ -181,6 +294,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 	stt->offset = args->offset;
 	stt->size = size;
 	stt->kvm = kvm;
+	INIT_LIST_HEAD_RCU(&stt->iommu_tables);
 
 	for (i = 0; i < npages; i++) {
 		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
@@ -209,11 +323,161 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 	return ret;
 }
 
+static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	struct mm_iommu_table_group_mem_t *mem = NULL;
+	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
+	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+
+	if (!pua)
+		return H_HARDWARE;
+
+	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
+	if (!mem)
+		return H_HARDWARE;
+
+	mm_iommu_mapped_dec(mem);
+
+	*pua = 0;
+
+	return H_SUCCESS;
+}
+
+static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	enum dma_data_direction dir = DMA_NONE;
+	unsigned long hpa = 0;
+
+	if (iommu_tce_xchg(tbl, entry, &hpa, &dir))
+		return H_HARDWARE;
+
+	if (dir == DMA_NONE)
+		return H_SUCCESS;
+
+	return kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
+}
+
+long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
+		unsigned long entry, unsigned long gpa,
+		enum dma_data_direction dir)
+{
+	long ret;
+	unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+	struct mm_iommu_table_group_mem_t *mem;
+
+	if (!pua)
+		/* it_userspace allocation might be delayed */
+		return H_TOO_HARD;
+
+	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
+		return H_HARDWARE;
+
+	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
+	if (!mem)
+		return H_HARDWARE;
+
+	if (mm_iommu_ua_to_hpa(mem, ua, &hpa))
+		return H_HARDWARE;
+
+	if (mm_iommu_mapped_inc(mem))
+		return H_HARDWARE;
+
+	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
+	if (ret) {
+		mm_iommu_mapped_dec(mem);
+		return H_TOO_HARD;
+	}
+
+	if (dir != DMA_NONE)
+		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
+
+	*pua = ua;
+
+	return 0;
+}
+
+long kvmppc_h_put_tce_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce)
+{
+	long idx, ret = H_HARDWARE;
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+	const unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+	const enum dma_data_direction dir = iommu_tce_direction(tce);
+
+	/* Clear TCE */
+	if (dir == DMA_NONE) {
+		if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
+			return H_PARAMETER;
+
+		return kvmppc_tce_iommu_unmap(vcpu->kvm, tbl, entry);
+	}
+
+	/* Put TCE */
+	if (iommu_tce_put_param_check(tbl, ioba, gpa))
+		return H_PARAMETER;
+
+	idx = srcu_read_lock(&vcpu->kvm->srcu);
+	ret = kvmppc_tce_iommu_map(vcpu->kvm, tbl, entry, gpa, dir);
+	srcu_read_unlock(&vcpu->kvm->srcu, idx);
+
+	return ret;
+}
+
+static long kvmppc_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long ioba,
+		u64 __user *tces, unsigned long npages)
+{
+	unsigned long i, ret, tce, gpa;
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+
+	for (i = 0; i < npages; ++i) {
+		gpa = be64_to_cpu(tces[i]) & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+		if (iommu_tce_put_param_check(tbl, ioba +
+				(i << tbl->it_page_shift), gpa))
+			return H_PARAMETER;
+	}
+
+	for (i = 0; i < npages; ++i) {
+		tce = be64_to_cpu(tces[i]);
+		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+		ret = kvmppc_tce_iommu_map(vcpu->kvm, tbl, entry + i, gpa,
+				iommu_tce_direction(tce));
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
+	return H_SUCCESS;
+}
+
+long kvmppc_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	unsigned long i;
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+
+	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i)
+		kvmppc_tce_iommu_unmap(vcpu->kvm, tbl, entry + i);
+
+	return H_SUCCESS;
+}
+
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba, unsigned long tce)
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long ret;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
 	/* 	    liobn, ioba, tce); */
@@ -230,6 +494,12 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (ret != H_SUCCESS)
 		return ret;
 
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		ret = kvmppc_h_put_tce_iommu(vcpu, stit->tbl, liobn, ioba, tce);
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
 	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
 
 	return H_SUCCESS;
@@ -245,6 +515,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	unsigned long entry, ua = 0;
 	u64 __user *tces;
 	u64 tce;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -272,6 +543,13 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	}
 	tces = (u64 __user *) ua;
 
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		ret = kvmppc_h_put_tce_indirect_iommu(vcpu,
+				stit->tbl, ioba, tces, npages);
+		if (ret != H_SUCCESS)
+			goto unlock_exit;
+	}
+
 	for (i = 0; i < npages; ++i) {
 		if (get_user(tce, tces + i)) {
 			ret = H_TOO_HARD;
@@ -299,6 +577,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -312,6 +591,13 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
 		return H_PARAMETER;
 
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		ret = kvmppc_h_stuff_tce_iommu(vcpu, stit->tbl, liobn, ioba,
+				tce_value, npages);
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
 	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
 		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
 
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 8a6834e6e1c8..4d6f01712a6d 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -190,11 +190,165 @@ static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
 	return mm_iommu_lookup_rm(vcpu->kvm->mm, ua, size);
 }
 
+static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	struct mm_iommu_table_group_mem_t *mem = NULL;
+	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
+	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+
+	if (!pua)
+		return H_SUCCESS;
+
+	pua = (void *) vmalloc_to_phys(pua);
+	if (!pua)
+		return H_SUCCESS;
+
+	mem = kvmppc_rm_iommu_lookup(vcpu, *pua, pgsize);
+	if (!mem)
+		return H_HARDWARE;
+
+	mm_iommu_mapped_dec(mem);
+
+	*pua = 0;
+
+	return H_SUCCESS;
+}
+
+static long kvmppc_rm_tce_iommu_unmap(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	enum dma_data_direction dir = DMA_NONE;
+	unsigned long hpa = 0;
+
+	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
+		return H_HARDWARE;
+
+	if (dir == DMA_NONE)
+		return H_SUCCESS;
+
+	return kvmppc_rm_tce_iommu_mapped_dec(vcpu, tbl, entry);
+}
+
+long kvmppc_rm_tce_iommu_map(struct kvm_vcpu *vcpu, struct iommu_table *tbl,
+		unsigned long entry, unsigned long gpa,
+		enum dma_data_direction dir)
+{
+	long ret;
+	unsigned long hpa = 0, ua;
+	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+	struct mm_iommu_table_group_mem_t *mem;
+
+	if (!pua)
+		/* it_userspace allocation might be delayed */
+		return H_TOO_HARD;
+
+	if (kvmppc_gpa_to_ua(vcpu->kvm, gpa, &ua, NULL))
+		return H_HARDWARE;
+
+	mem = kvmppc_rm_iommu_lookup(vcpu, ua, 1ULL << tbl->it_page_shift);
+	if (!mem)
+		return H_HARDWARE;
+
+	if (mm_iommu_ua_to_hpa_rm(mem, ua, &hpa))
+		return H_HARDWARE;
+
+	pua = (void *) vmalloc_to_phys(pua);
+	if (!pua)
+		return H_HARDWARE;
+
+	if (mm_iommu_mapped_inc(mem))
+		return H_HARDWARE;
+
+	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
+	if (ret) {
+		mm_iommu_mapped_dec(mem);
+		return H_TOO_HARD;
+	}
+
+	if (dir != DMA_NONE)
+		kvmppc_rm_tce_iommu_mapped_dec(vcpu, tbl, entry);
+
+	*pua = ua;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(kvmppc_rm_tce_iommu_map);
+
+static long kvmppc_rm_h_put_tce_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long liobn,
+		unsigned long ioba, unsigned long tce)
+{
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+	const unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+	const enum dma_data_direction dir = iommu_tce_direction(tce);
+
+	/* Clear TCE */
+	if (dir == DMA_NONE) {
+		if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
+			return H_PARAMETER;
+
+		return kvmppc_rm_tce_iommu_unmap(vcpu, tbl, entry);
+	}
+
+	/* Put TCE */
+	if (iommu_tce_put_param_check(tbl, ioba, gpa))
+		return H_PARAMETER;
+
+	return kvmppc_rm_tce_iommu_map(vcpu, tbl, entry, gpa, dir);
+}
+
+static long kvmppc_rm_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long ioba,
+		u64 *tces, unsigned long npages)
+{
+	unsigned long i, ret, tce, gpa;
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+
+	for (i = 0; i < npages; ++i) {
+		gpa = be64_to_cpu(tces[i]) & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+		if (iommu_tce_put_param_check(tbl, ioba +
+				(i << tbl->it_page_shift), gpa))
+			return H_PARAMETER;
+	}
+
+	for (i = 0; i < npages; ++i) {
+		tce = be64_to_cpu(tces[i]);
+		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+		ret = kvmppc_rm_tce_iommu_map(vcpu, tbl, entry + i, gpa,
+				iommu_tce_direction(tce));
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
+	return H_SUCCESS;
+}
+
+static long kvmppc_rm_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	unsigned long i;
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+
+	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i)
+		kvmppc_rm_tce_iommu_unmap(vcpu, tbl, entry + i);
+
+	return H_SUCCESS;
+}
+
 long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		unsigned long ioba, unsigned long tce)
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long ret;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
 	/* 	    liobn, ioba, tce); */
@@ -211,6 +365,13 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (ret != H_SUCCESS)
 		return ret;
 
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		ret = kvmppc_rm_h_put_tce_iommu(vcpu, stit->tbl,
+				liobn, ioba, tce);
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
 	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
 
 	return H_SUCCESS;
@@ -278,6 +439,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 		 * depend on hpt.
 		 */
 		struct mm_iommu_table_group_mem_t *mem;
+		struct kvmppc_spapr_tce_iommu_table *stit;
 
 		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
 			return H_TOO_HARD;
@@ -285,6 +447,13 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 		mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K);
 		if (!mem || mm_iommu_ua_to_hpa_rm(mem, ua, &tces))
 			return H_TOO_HARD;
+
+		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+			ret = kvmppc_rm_h_put_tce_indirect_iommu(vcpu,
+					stit->tbl, ioba, (u64 *)tces, npages);
+			if (ret != H_SUCCESS)
+				return ret;
+		}
 	} else {
 		/*
 		 * This is emulated devices case.
@@ -334,6 +503,8 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret;
+	struct kvmppc_spapr_tce_iommu_table *stit;
+
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -347,6 +518,13 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
 		return H_PARAMETER;
 
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		ret = kvmppc_rm_h_stuff_tce_iommu(vcpu, stit->tbl,
+				liobn, ioba, tce_value, npages);
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
 	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
 		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
 
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 70963c845e96..0e555ba998c0 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -536,6 +536,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 #ifdef CONFIG_PPC_BOOK3S_64
 	case KVM_CAP_SPAPR_TCE:
 	case KVM_CAP_SPAPR_TCE_64:
+		/* fallthrough */
+	case KVM_CAP_SPAPR_TCE_VFIO:
 	case KVM_CAP_PPC_ALLOC_HTAB:
 	case KVM_CAP_PPC_RTAS:
 	case KVM_CAP_PPC_FIXUP_HCALL:
diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
index d32f239eb471..3181054c8ff7 100644
--- a/virt/kvm/vfio.c
+++ b/virt/kvm/vfio.c
@@ -20,6 +20,10 @@
 #include <linux/vfio.h>
 #include "vfio.h"
 
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+#include <asm/kvm_ppc.h>
+#endif
+
 struct kvm_vfio_group {
 	struct list_head node;
 	struct vfio_group *vfio_group;
@@ -89,6 +93,22 @@ static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
 	return ret > 0;
 }
 
+static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
+{
+	int (*fn)(struct vfio_group *);
+	int ret = -1;
+
+	fn = symbol_get(vfio_external_user_iommu_id);
+	if (!fn)
+		return ret;
+
+	ret = fn(vfio_group);
+
+	symbol_put(vfio_external_user_iommu_id);
+
+	return ret;
+}
+
 /*
  * Groups can use the same or different IOMMU domains.  If the same then
  * adding a new group may change the coherency of groups we've previously
@@ -211,6 +231,9 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
 
 		mutex_unlock(&kv->lock);
 
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+		kvm_spapr_tce_release_iommu_group(dev->kvm, vfio_group);
+#endif
 		kvm_vfio_group_set_kvm(vfio_group, NULL);
 
 		kvm_vfio_group_put_external_user(vfio_group);
@@ -218,6 +241,65 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
 		kvm_vfio_update_coherency(dev);
 
 		return ret;
+
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
+		struct kvm_vfio_spapr_tce param;
+		unsigned long minsz;
+		struct kvm_vfio *kv = dev->private;
+		struct vfio_group *vfio_group;
+		struct kvm_vfio_group *kvg;
+		struct fd f;
+
+		minsz = offsetofend(struct kvm_vfio_spapr_tce, tablefd);
+
+		if (copy_from_user(&param, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (param.argsz < minsz || param.flags)
+			return -EINVAL;
+
+		f = fdget(param.groupfd);
+		if (!f.file)
+			return -EBADF;
+
+		vfio_group = kvm_vfio_group_get_external_user(f.file);
+		fdput(f);
+
+		if (IS_ERR(vfio_group))
+			return PTR_ERR(vfio_group);
+
+		ret = -ENOENT;
+
+		mutex_lock(&kv->lock);
+
+		list_for_each_entry(kvg, &kv->group_list, node) {
+			int group_id;
+			struct iommu_group *grp;
+
+			if (kvg->vfio_group != vfio_group)
+				continue;
+
+			group_id = kvm_vfio_external_user_iommu_id(
+					kvg->vfio_group);
+			grp = iommu_group_get_by_id(group_id);
+			if (!grp) {
+				ret = -EFAULT;
+				break;
+			}
+
+			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
+					param.tablefd, vfio_group, grp);
+
+			iommu_group_put(grp);
+			break;
+		}
+
+		mutex_unlock(&kv->lock);
+
+		return ret;
+	}
+#endif /* CONFIG_SPAPR_TCE_IOMMU */
 	}
 
 	return -ENXIO;
@@ -242,6 +324,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
 		switch (attr->attr) {
 		case KVM_DEV_VFIO_GROUP_ADD:
 		case KVM_DEV_VFIO_GROUP_DEL:
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
+#endif
 			return 0;
 		}
 
@@ -257,6 +342,9 @@ static void kvm_vfio_destroy(struct kvm_device *dev)
 	struct kvm_vfio_group *kvg, *tmp;
 
 	list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+		kvm_spapr_tce_release_iommu_group(dev->kvm, kvg->vfio_group);
+#endif
 		kvm_vfio_group_set_kvm(kvg->vfio_group, NULL);
 		kvm_vfio_group_put_external_user(kvg->vfio_group);
 		list_del(&kvg->node);
-- 
2.11.0
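
(Not part of the patch: a hypothetical userspace sketch of how the new
KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE attribute could be driven, assuming only
the uapi fields visible in the virt/kvm/vfio.c hunk above; the fd
variables are placeholders.)

	struct kvm_vfio_spapr_tce param = {
		.argsz   = sizeof(param),
		.flags   = 0,
		.groupfd = vfio_group_fd,	/* fd of /dev/vfio/<group> */
		.tablefd = tce_table_fd,	/* fd from KVM_CREATE_SPAPR_TCE{,_64} */
	};
	struct kvm_device_attr attr = {
		.group = KVM_DEV_VFIO_GROUP,
		.attr  = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE,
		.addr  = (__u64)(unsigned long)&param,
	};

	if (ioctl(kvm_vfio_device_fd, KVM_SET_DEVICE_ATTR, &attr))
		err(1, "KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE");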


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH kernel v2 05/11] KVM: PPC: Use preregistered memory API to access TCE list
  2016-12-18  1:28   ` Alexey Kardashevskiy
@ 2016-12-21  4:08     ` David Gibson
  -1 siblings, 0 replies; 56+ messages in thread
From: David Gibson @ 2016-12-21  4:08 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Alex Williamson, Paul Mackerras, kvm-ppc, kvm

[-- Attachment #1: Type: text/plain, Size: 5832 bytes --]

On Sun, Dec 18, 2016 at 12:28:54PM +1100, Alexey Kardashevskiy wrote:
> VFIO on sPAPR already implements guest memory pre-registration
> when the entire guest RAM gets pinned. This can be used to translate
> the physical address of a guest page containing the TCE list
> from H_PUT_TCE_INDIRECT.
> 
> This makes use of the pre-registered memory API to access TCE list
> pages in order to avoid unnecessary locking on the KVM memory
> reverse map as we know that all of guest memory is pinned and
> we have a flat array mapping GPA to HPA which makes it simpler and
> quicker to index into that array (even with looking up the
> kernel page tables in vmalloc_to_phys) than it is to find the memslot,
> lock the rmap entry, look up the user page tables, and unlock the rmap
> entry. Note that the rmap pointer is initialized to NULL
> where declared (not in this patch).
> 
> If a requested chunk of memory has not been preregistered,
> this will fail with H_TOO_HARD so the virtual mode handler can
> handle the request.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v2:
> * updated the commit log with David's comment
> ---
>  arch/powerpc/kvm/book3s_64_vio_hv.c | 65 ++++++++++++++++++++++++++++---------
>  1 file changed, 49 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> index d461c440889a..a3be4bd6188f 100644
> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -180,6 +180,17 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
>  EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
>  
>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> +static inline bool kvmppc_preregistered(struct kvm_vcpu *vcpu)
> +{
> +	return mm_iommu_preregistered(vcpu->kvm->mm);
> +}
> +
> +static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
> +		struct kvm_vcpu *vcpu, unsigned long ua, unsigned long size)
> +{
> +	return mm_iommu_lookup_rm(vcpu->kvm->mm, ua, size);
> +}

I don't see that there's much point to these inlines.
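
Presumably the callers could just use the mm_iommu API directly, e.g.
(hypothetical, untested sketch built only from the calls quoted below):

	if (mm_iommu_preregistered(vcpu->kvm->mm)) {
		struct mm_iommu_table_group_mem_t *mem;

		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
			return H_TOO_HARD;

		mem = mm_iommu_lookup_rm(vcpu->kvm->mm, ua,
				IOMMU_PAGE_SIZE_4K);
		if (!mem || mm_iommu_ua_to_hpa_rm(mem, ua, &tces))
			return H_TOO_HARD;
	}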

>  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		unsigned long ioba, unsigned long tce)
>  {
> @@ -260,23 +271,44 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	if (ret != H_SUCCESS)
>  		return ret;
>  
> -	if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
> -		return H_TOO_HARD;
> +	if (kvmppc_preregistered(vcpu)) {
> +		/*
> +		 * We get here if guest memory was pre-registered which
> +		 * is normally VFIO case and gpa->hpa translation does not
> +		 * depend on hpt.
> +		 */
> +		struct mm_iommu_table_group_mem_t *mem;
>  
> -	rmap = (void *) vmalloc_to_phys(rmap);
> +		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
> +			return H_TOO_HARD;
>  
> -	/*
> -	 * Synchronize with the MMU notifier callbacks in
> -	 * book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
> -	 * While we have the rmap lock, code running on other CPUs
> -	 * cannot finish unmapping the host real page that backs
> -	 * this guest real page, so we are OK to access the host
> -	 * real page.
> -	 */
> -	lock_rmap(rmap);
> -	if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
> -		ret = H_TOO_HARD;
> -		goto unlock_exit;
> +		mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K);
> +		if (!mem || mm_iommu_ua_to_hpa_rm(mem, ua, &tces))
> +			return H_TOO_HARD;
> +	} else {
> +		/*
> +		 * This is emulated devices case.
> +		 * We do not require memory to be preregistered in this case
> +		 * so lock rmap and do __find_linux_pte_or_hugepte().
> +		 */

Hmm.  So this isn't wrong as such, but the logic and comments are
both misleading.  The 'if' here isn't really about VFIO vs. emulated -
it's about whether the mm has *any* preregistered chunks, without any
regard to which particular device you're talking about.  For example
if your guest has two PHBs, one with VFIO devices and the other with
emulated devices, then the emulated devices will still go through the
"VFIO" case here.

Really what you have here is a fast case when the tce_list is in
preregistered memory, and a fallback case when it isn't.  But that's
obscured by the fact that if for whatever reason you have some
preregistered memory but it doesn't cover the tce_list, then you don't
go to the fallback case here, but instead fall right back to the
virtual mode handler.

So, I think you should either:
    1) Fallback to the code below whenever you can't access the
       tce_list via prereg memory, regardless of whether there's any
       _other_ prereg memory
or
    2) Drop the code below entirely and always return H_TOO_HARD if
       you can't get the tce_list from prereg.
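
A rough, untested sketch of option 1), built only from the helpers already
used in the hunk below (exact placement and error paths are hypothetical):

	struct mm_iommu_table_group_mem_t *mem;

	if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
		return H_TOO_HARD;

	mem = mm_iommu_lookup_rm(vcpu->kvm->mm, ua, IOMMU_PAGE_SIZE_4K);
	if (mem && !mm_iommu_ua_to_hpa_rm(mem, ua, &tces)) {
		/* Fast path: tce_list is covered by preregistered memory */
		rmap = NULL;
	} else {
		/* Fallback: lock the rmap and translate via the HPT */
		rmap = (void *) vmalloc_to_phys(rmap);
		lock_rmap(rmap);
		if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
			ret = H_TOO_HARD;
			goto unlock_exit;
		}
	}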

> +		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
> +			return H_TOO_HARD;
> +
> +		rmap = (void *) vmalloc_to_phys(rmap);
> +
> +		/*
> +		 * Synchronize with the MMU notifier callbacks in
> +		 * book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
> +		 * While we have the rmap lock, code running on other CPUs
> +		 * cannot finish unmapping the host real page that backs
> +		 * this guest real page, so we are OK to access the host
> +		 * real page.
> +		 */
> +		lock_rmap(rmap);
> +		if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
> +			ret = H_TOO_HARD;
> +			goto unlock_exit;
> +		}
>  	}
>  
>  	for (i = 0; i < npages; ++i) {
> @@ -290,7 +322,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	}
>  
>  unlock_exit:
> -	unlock_rmap(rmap);
> +	if (rmap)
> +		unlock_rmap(rmap);
>  
>  	return ret;
>  }

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH kernel v2 09/11] vfio iommu: Add helpers to (un)register blocking notifiers per group
  2016-12-18  1:28   ` Alexey Kardashevskiy
@ 2016-12-21  6:04     ` David Gibson
  -1 siblings, 0 replies; 56+ messages in thread
From: David Gibson @ 2016-12-21  6:04 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Alex Williamson, Paul Mackerras, kvm-ppc, kvm

[-- Attachment #1: Type: text/plain, Size: 4501 bytes --]

On Sun, Dec 18, 2016 at 12:28:58PM +1100, Alexey Kardashevskiy wrote:
> c086de81 "vfio iommu: Add blocking notifier to notify DMA_UNMAP" added
> notifiers to a VFIO group. However even though the code underneath
> uses groups, the API takes device struct pointers.
> 
> This adds helpers which do the same thing but take IOMMU groups instead.
> 
> This adds vfio_iommu_group_set_kvm() which is a wrapper on top of
> vfio_group_set_kvm() but also takes an iommu_group.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Adding a second interface in parallel seems dubious.

Should the existing interface just be replaced with this one?

Or can the existing interface be re-implemented in terms of this one?
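
For instance (rough, untested sketch; it assumes iommu_group_get() /
iommu_group_put() are usable at this point), the device-based variant
could become a thin wrapper over the group-based one:

	int vfio_register_notifier(struct device *dev, enum vfio_notify_type type,
				   unsigned long *events, struct notifier_block *nb)
	{
		struct iommu_group *grp = iommu_group_get(dev);
		int ret;

		if (!grp)
			return -ENODEV;

		ret = vfio_iommu_group_register_notifier(grp, type, events, nb);
		iommu_group_put(grp);

		return ret;
	}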

> ---
>  include/linux/vfio.h |  6 +++++
>  drivers/vfio/vfio.c  | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 82 insertions(+)
> 
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index edf9b2cad277..8a3488ba4732 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -127,9 +127,15 @@ extern int vfio_register_notifier(struct device *dev,
>  extern int vfio_unregister_notifier(struct device *dev,
>  				    enum vfio_notify_type type,
>  				    struct notifier_block *nb);
> +extern int vfio_iommu_group_register_notifier(struct iommu_group *grp,
> +		enum vfio_notify_type type, unsigned long *events,
> +		struct notifier_block *nb);
> +extern int vfio_iommu_group_unregister_notifier(struct iommu_group *grp,
> +		enum vfio_notify_type type, struct notifier_block *nb);
>  
>  struct kvm;
>  extern void vfio_group_set_kvm(struct vfio_group *group, struct kvm *kvm);
> +extern void vfio_iommu_group_set_kvm(struct iommu_group *grp, struct kvm *kvm);
>  
>  /*
>   * Sub-module helpers
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 9901c4671e2f..6b9a98508939 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -2077,6 +2077,23 @@ void vfio_group_set_kvm(struct vfio_group *group, struct kvm *kvm)
>  }
>  EXPORT_SYMBOL_GPL(vfio_group_set_kvm);
>  
> +void vfio_iommu_group_set_kvm(struct iommu_group *grp, struct kvm *kvm)
> +{
> +	struct vfio_group *group;
> +
> +	if (!grp)
> +		return;
> +
> +	group = vfio_group_get_from_iommu(grp);
> +	if (!group)
> +		return;
> +
> +	vfio_group_set_kvm(group, kvm);
> +
> +	vfio_group_put(group);
> +}
> +EXPORT_SYMBOL_GPL(vfio_iommu_group_set_kvm);
> +
>  static int vfio_register_group_notifier(struct vfio_group *group,
>  					unsigned long *events,
>  					struct notifier_block *nb)
> @@ -2197,6 +2214,65 @@ int vfio_unregister_notifier(struct device *dev, enum vfio_notify_type type,
>  }
>  EXPORT_SYMBOL(vfio_unregister_notifier);
>  
> +int vfio_iommu_group_register_notifier(struct iommu_group *iommugroup,
> +		enum vfio_notify_type type,
> +		unsigned long *events, struct notifier_block *nb)
> +{
> +	struct vfio_group *group;
> +	int ret;
> +
> +	if (!iommugroup || !nb || !events || (*events == 0))
> +		return -EINVAL;
> +
> +	group = vfio_group_get_from_iommu(iommugroup);
> +	if (!group)
> +		return -ENODEV;
> +
> +	switch (type) {
> +	case VFIO_IOMMU_NOTIFY:
> +		ret = vfio_register_iommu_notifier(group, events, nb);
> +		break;
> +	case VFIO_GROUP_NOTIFY:
> +		ret = vfio_register_group_notifier(group, events, nb);
> +		break;
> +	default:
> +		ret = -EINVAL;
> +	}
> +
> +	vfio_group_put(group);
> +	return ret;
> +}
> +EXPORT_SYMBOL(vfio_iommu_group_register_notifier);
> +
> +int vfio_iommu_group_unregister_notifier(struct iommu_group *grp,
> +		enum vfio_notify_type type, struct notifier_block *nb)
> +{
> +	struct vfio_group *group;
> +	int ret;
> +
> +	if (!grp || !nb)
> +		return -EINVAL;
> +
> +	group = vfio_group_get_from_iommu(grp);
> +	if (!group)
> +		return -ENODEV;
> +
> +	switch (type) {
> +	case VFIO_IOMMU_NOTIFY:
> +		ret = vfio_unregister_iommu_notifier(group, nb);
> +		break;
> +	case VFIO_GROUP_NOTIFY:
> +		ret = vfio_unregister_group_notifier(group, nb);
> +		break;
> +	default:
> +		ret = -EINVAL;
> +	}
> +
> +	vfio_group_put(group);
> +	return ret;
> +}
> +EXPORT_SYMBOL(vfio_iommu_group_unregister_notifier);
> +
>  /**
>   * Module/class support
>   */

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH kernel v2 05/11] KVM: PPC: Use preregistered memory API to access TCE list
  2016-12-21  4:08     ` David Gibson
@ 2016-12-21  8:57       ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 56+ messages in thread
From: Alexey Kardashevskiy @ 2016-12-21  8:57 UTC (permalink / raw)
  To: David Gibson; +Cc: linuxppc-dev, Alex Williamson, Paul Mackerras, kvm-ppc, kvm


[-- Attachment #1.1: Type: text/plain, Size: 7149 bytes --]

On 21/12/16 15:08, David Gibson wrote:
> On Sun, Dec 18, 2016 at 12:28:54PM +1100, Alexey Kardashevskiy wrote:
>> VFIO on sPAPR already implements guest memory pre-registration
>> when the entire guest RAM gets pinned. This can be used to translate
>> the physical address of a guest page containing the TCE list
>> from H_PUT_TCE_INDIRECT.
>>
>> This makes use of the pre-registered memory API to access TCE list
>> pages in order to avoid unnecessary locking on the KVM memory
>> reverse map as we know that all of guest memory is pinned and
>> we have a flat array mapping GPA to HPA which makes it simpler and
>> quicker to index into that array (even with looking up the
>> kernel page tables in vmalloc_to_phys) than it is to find the memslot,
>> lock the rmap entry, look up the user page tables, and unlock the rmap
>> entry. Note that the rmap pointer is initialized to NULL
>> where declared (not in this patch).
>>
>> If a requested chunk of memory has not been preregistered,
>> this will fail with H_TOO_HARD so the virtual mode handler can
>> handle the request.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>> Changes:
>> v2:
>> * updated the commit log with David's comment
>> ---
>>  arch/powerpc/kvm/book3s_64_vio_hv.c | 65 ++++++++++++++++++++++++++++---------
>>  1 file changed, 49 insertions(+), 16 deletions(-)
>>
>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
>> index d461c440889a..a3be4bd6188f 100644
>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
>> @@ -180,6 +180,17 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
>>  EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
>>  
>>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
>> +static inline bool kvmppc_preregistered(struct kvm_vcpu *vcpu)
>> +{
>> +	return mm_iommu_preregistered(vcpu->kvm->mm);
>> +}
>> +
>> +static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
>> +		struct kvm_vcpu *vcpu, unsigned long ua, unsigned long size)
>> +{
>> +	return mm_iommu_lookup_rm(vcpu->kvm->mm, ua, size);
>> +}
> 
> I don't see that there's much point to these inlines.
> 
>>  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>  		unsigned long ioba, unsigned long tce)
>>  {
>> @@ -260,23 +271,44 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>  	if (ret != H_SUCCESS)
>>  		return ret;
>>  
>> -	if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
>> -		return H_TOO_HARD;
>> +	if (kvmppc_preregistered(vcpu)) {
>> +		/*
>> +		 * We get here if guest memory was pre-registered which
>> +		 * is normally VFIO case and gpa->hpa translation does not
>> +		 * depend on hpt.
>> +		 */
>> +		struct mm_iommu_table_group_mem_t *mem;
>>  
>> -	rmap = (void *) vmalloc_to_phys(rmap);
>> +		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
>> +			return H_TOO_HARD;
>>  
>> -	/*
>> -	 * Synchronize with the MMU notifier callbacks in
>> -	 * book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
>> -	 * While we have the rmap lock, code running on other CPUs
>> -	 * cannot finish unmapping the host real page that backs
>> -	 * this guest real page, so we are OK to access the host
>> -	 * real page.
>> -	 */
>> -	lock_rmap(rmap);
>> -	if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
>> -		ret = H_TOO_HARD;
>> -		goto unlock_exit;
>> +		mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K);
>> +		if (!mem || mm_iommu_ua_to_hpa_rm(mem, ua, &tces))
>> +			return H_TOO_HARD;
>> +	} else {
>> +		/*
>> +		 * This is emulated devices case.
>> +		 * We do not require memory to be preregistered in this case
>> +		 * so lock rmap and do __find_linux_pte_or_hugepte().
>> +		 */
> 
> Hmm.  So this isn't wrong as such, but the logic and comments are
> both misleading.  The 'if' here isn't really about VFIO vs. emulated -
> it's about whether the mm has *any* preregistered chunks, without any
> regard to which particular device you're talking about.  For example
> if your guest has two PHBs, one with VFIO devices and the other with
> emulated devices, then the emulated devices will still go through the
> "VFIO" case here.

kvmppc_preregistered() just checks a single pointer, whereas
kvmppc_rm_ua_to_hpa() goes through __find_linux_pte_or_hugepte(), which is
an unnecessary complication here.

s/emulated devices case/case of a guest with emulated devices only/ ?


> Really what you have here is a fast case when the tce_list is in
> preregistered memory, and a fallback case when it isn't.  But that's
> obscured by the fact that if for whatever reason you have some
> preregistered memory but it doesn't cover the tce_list, then you don't
> go to the fallback case here, but instead fall right back to the
> virtual mode handler.

This is purely an acceleration: I am trying to make the obvious cases
faster and the other cases safer. If some chunks are preregistered but
others are not, and an H_PUT_TCE_INDIRECT comes in with a tce_list in
non-preregistered memory, then I have no idea what that userspace is or
what it is doing, so I just do not want to accelerate anything for it in
real mode (I have poor imagination and since I cannot test it - I would
rather drop it).


> So, I think you should either:
>     1) Fallback to the code below whenever you can't access the
>        tce_list via prereg memory, regardless of whether there's any
>        _other_ prereg memory

Using prereg was the entire point here: if something goes wrong (i.e.
some TCE fails to translate), I may stop in the middle of the TCE list and
will have to do a complicated rollback before passing the request to the
virtual mode handler to finish processing (note that there is nothing to
revert when it is an emulated-devices-only guest).


> or
>     2) Drop the code below entirely and always return H_TOO_HARD if
>        you can't get the tce_list from prereg.

This path cannot fail for emulated devices and it is a really fast path,
so why drop it?


I am _really_ confused now. In the last few respins this was not a
concern, now it is - is the patch so bad that it needs to be 100%
reworked? I have been trying to push it for 3 years now :(



> 
>> +		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
>> +			return H_TOO_HARD;
>> +
>> +		rmap = (void *) vmalloc_to_phys(rmap);
>> +
>> +		/*
>> +		 * Synchronize with the MMU notifier callbacks in
>> +		 * book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
>> +		 * While we have the rmap lock, code running on other CPUs
>> +		 * cannot finish unmapping the host real page that backs
>> +		 * this guest real page, so we are OK to access the host
>> +		 * real page.
>> +		 */
>> +		lock_rmap(rmap);
>> +		if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
>> +			ret = H_TOO_HARD;
>> +			goto unlock_exit;
>> +		}
>>  	}
>>  
>>  	for (i = 0; i < npages; ++i) {
>> @@ -290,7 +322,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>  	}
>>  
>>  unlock_exit:
>> -	unlock_rmap(rmap);
>> +	if (rmap)
>> +		unlock_rmap(rmap);
>>  
>>  	return ret;
>>  }
> 


-- 
Alexey


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 839 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH kernel v2 09/11] vfio iommu: Add helpers to (un)register blocking notifiers per group
  2016-12-21  6:04     ` David Gibson
@ 2016-12-22  1:25       ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 56+ messages in thread
From: Alexey Kardashevskiy @ 2016-12-22  1:25 UTC (permalink / raw)
  To: David Gibson; +Cc: linuxppc-dev, Alex Williamson, Paul Mackerras, kvm-ppc, kvm


[-- Attachment #1.1: Type: text/plain, Size: 1148 bytes --]

On 21/12/16 17:04, David Gibson wrote:
> On Sun, Dec 18, 2016 at 12:28:58PM +1100, Alexey Kardashevskiy wrote:
>> c086de81 "vfio iommu: Add blocking notifier to notify DMA_UNMAP" added
>> notifiers to a VFIO group. However even though the code underneath
>> uses groups, the API takes device struct pointers.
>>
>> This adds helpers which do the same thing but take IOMMU groups instead.
>>
>> This adds vfio_iommu_group_set_kvm() which is a wrapper on top of
>> vfio_group_set_kvm() but also takes an iommu_group.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> 
> Adding a second interface in parallel seems dubious.
> 
> Should the existing interface just be replaced with this one?
> 
> Or can the existing interface be re-implemented in terms of this one?

imho this should have been done in the first place but since Alex and I
came to the conclusion that this does not simplify anything in my patchset
(rather the opposite), I am not going to push it further now.

09/11, 10/11, 11/11 from this patchset are superseded by:
[PATCH kernel v3] KVM: PPC: Add in-kernel acceleration for VFIO


-- 
Alexey


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 839 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH kernel v2 05/11] KVM: PPC: Use preregistered memory API to access TCE list
  2016-12-21  8:57       ` Alexey Kardashevskiy
@ 2017-01-11  6:35         ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 56+ messages in thread
From: Alexey Kardashevskiy @ 2017-01-11  6:35 UTC (permalink / raw)
  To: David Gibson; +Cc: linuxppc-dev, Alex Williamson, Paul Mackerras, kvm-ppc, kvm


[-- Attachment #1.1: Type: text/plain, Size: 7431 bytes --]

On 21/12/16 19:57, Alexey Kardashevskiy wrote:
> On 21/12/16 15:08, David Gibson wrote:
>> On Sun, Dec 18, 2016 at 12:28:54PM +1100, Alexey Kardashevskiy wrote:
>>> VFIO on sPAPR already implements guest memory pre-registration
>>> when the entire guest RAM gets pinned. This can be used to translate
>>> the physical address of a guest page containing the TCE list
>>> from H_PUT_TCE_INDIRECT.
>>>
>>> This makes use of the pre-registered memory API to access TCE list
>>> pages in order to avoid unnecessary locking on the KVM memory
>>> reverse map as we know that all of guest memory is pinned and
>>> we have a flat array mapping GPA to HPA which makes it simpler and
>>> quicker to index into that array (even with looking up the
>>> kernel page tables in vmalloc_to_phys) than it is to find the memslot,
>>> lock the rmap entry, look up the user page tables, and unlock the rmap
>>> entry. Note that the rmap pointer is initialized to NULL
>>> where declared (not in this patch).
>>>
>>> If a requested chunk of memory has not been preregistered,
>>> this will fail with H_TOO_HARD so the virtual mode handler can
>>> handle the request.
>>>
>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>> ---
>>> Changes:
>>> v2:
>>> * updated the commit log with David's comment
>>> ---
>>>  arch/powerpc/kvm/book3s_64_vio_hv.c | 65 ++++++++++++++++++++++++++++---------
>>>  1 file changed, 49 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
>>> index d461c440889a..a3be4bd6188f 100644
>>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
>>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
>>> @@ -180,6 +180,17 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
>>>  EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
>>>  
>>>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
>>> +static inline bool kvmppc_preregistered(struct kvm_vcpu *vcpu)
>>> +{
>>> +	return mm_iommu_preregistered(vcpu->kvm->mm);
>>> +}
>>> +
>>> +static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
>>> +		struct kvm_vcpu *vcpu, unsigned long ua, unsigned long size)
>>> +{
>>> +	return mm_iommu_lookup_rm(vcpu->kvm->mm, ua, size);
>>> +}
>>
>> I don't see that there's much point to these inlines.
>>
>>>  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>>  		unsigned long ioba, unsigned long tce)
>>>  {
>>> @@ -260,23 +271,44 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>  	if (ret != H_SUCCESS)
>>>  		return ret;
>>>  
>>> -	if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
>>> -		return H_TOO_HARD;
>>> +	if (kvmppc_preregistered(vcpu)) {
>>> +		/*
>>> +		 * We get here if guest memory was pre-registered which
>>> +		 * is normally VFIO case and gpa->hpa translation does not
>>> +		 * depend on hpt.
>>> +		 */
>>> +		struct mm_iommu_table_group_mem_t *mem;
>>>  
>>> -	rmap = (void *) vmalloc_to_phys(rmap);
>>> +		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
>>> +			return H_TOO_HARD;
>>>  
>>> -	/*
>>> -	 * Synchronize with the MMU notifier callbacks in
>>> -	 * book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
>>> -	 * While we have the rmap lock, code running on other CPUs
>>> -	 * cannot finish unmapping the host real page that backs
>>> -	 * this guest real page, so we are OK to access the host
>>> -	 * real page.
>>> -	 */
>>> -	lock_rmap(rmap);
>>> -	if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
>>> -		ret = H_TOO_HARD;
>>> -		goto unlock_exit;
>>> +		mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K);
>>> +		if (!mem || mm_iommu_ua_to_hpa_rm(mem, ua, &tces))
>>> +			return H_TOO_HARD;
>>> +	} else {
>>> +		/*
>>> +		 * This is emulated devices case.
>>> +		 * We do not require memory to be preregistered in this case
>>> +		 * so lock rmap and do __find_linux_pte_or_hugepte().
>>> +		 */
>>
>> Hmm.  So this isn't wrong as such, but the logic and comments are
>> both misleading.  The 'if' here isn't really about VFIO vs. emulated -
>> it's about whether the mm has *any* preregistered chunks, without any
>> regard to which particular device you're talking about.  For example
>> if your guest has two PHBs, one with VFIO devices and the other with
>> emulated devices, then the emulated devices will still go through the
>> "VFIO" case here.
> 
> kvmppc_preregistered() checks for a single pointer, kvmppc_rm_ua_to_hpa()
> goes through __find_linux_pte_or_hugepte() which is unnecessary
> complication here.
> 
> s/emulated devices case/case of a guest with emulated devices only/ ?
> 
> 
>> Really what you have here is a fast case when the tce_list is in
>> preregistered memory, and a fallback case when it isn't.  But that's
>> obscured by the fact that if for whatever reason you have some
>> preregistered memory but it doesn't cover the tce_list, then you don't
>> go to the fallback case here, but instead fall right back to the
>> virtual mode handler.
> 
> This is purely acceleration, I am trying to make obvious cases faster, and
> other cases safer. If some chunk is not preregistered but others are and
> there is H_PUT_TCE_INDIRECT with tce_list from non-preregistered memory,
> then I have no idea what this userspace is and what it is doing, so I just
> do not want to accelerate anything for it in real mode (I have poor
> imagination and since I cannot test it - I better drop it).
> 
> 
>> So, I think you should either:
>>     1) Fallback to the code below whenever you can't access the
>>        tce_list via prereg memory, regardless of whether there's any
>>        _other_ prereg memory
> 
> Using prereg was the entire point here as if something goes wrong (i.e.
> some TCE fails to translate), I may stop in the middle of the TCE list and will
> have to do a complicated rollback to pass the request to virtual mode to
> finish processing (note that there is nothing to revert when it is an
> emulated-devices-only guest).
> 
> 
>> or
>>     2) Drop the code below entirely and always return H_TOO_HARD if
>>        you can't get the tce_list from prereg.
> 
> This path cannot fail for emulated devices and it is a really fast path, so
> why drop it?
> 
> 
> I am _really_ confused now. In the last few respins this was not a concern, now
> it is - is the patch so bad that it 100% needs to be reworked? I have been
> trying to push it for the last 3 years now :(
> 


Ping?


> 
>>
>>> +		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
>>> +			return H_TOO_HARD;
>>> +
>>> +		rmap = (void *) vmalloc_to_phys(rmap);
>>> +
>>> +		/*
>>> +		 * Synchronize with the MMU notifier callbacks in
>>> +		 * book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
>>> +		 * While we have the rmap lock, code running on other CPUs
>>> +		 * cannot finish unmapping the host real page that backs
>>> +		 * this guest real page, so we are OK to access the host
>>> +		 * real page.
>>> +		 */
>>> +		lock_rmap(rmap);
>>> +		if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
>>> +			ret = H_TOO_HARD;
>>> +			goto unlock_exit;
>>> +		}
>>>  	}
>>>  
>>>  	for (i = 0; i < npages; ++i) {
>>> @@ -290,7 +322,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>  	}
>>>  
>>>  unlock_exit:
>>> -	unlock_rmap(rmap);
>>> +	if (rmap)
>>> +		unlock_rmap(rmap);
>>>  
>>>  	return ret;
>>>  }
>>
> 
> 


-- 
Alexey


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 839 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH kernel v3] KVM: PPC: Add in-kernel acceleration for VFIO
  2016-12-20  6:52     ` Alexey Kardashevskiy
@ 2017-01-12  5:04       ` David Gibson
  -1 siblings, 0 replies; 56+ messages in thread
From: David Gibson @ 2017-01-12  5:04 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Paul Mackerras, kvm-ppc, kvm, Alex Williamson

[-- Attachment #1: Type: text/plain, Size: 32135 bytes --]

On Tue, Dec 20, 2016 at 05:52:29PM +1100, Alexey Kardashevskiy wrote:
> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
> without passing them to user space which saves time on switching
> to user space and back.
> 
> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> KVM tries to handle a TCE request in the real mode, if failed
> it passes the request to the virtual mode to complete the operation.
> If the virtual mode handler fails, the request is passed to
> the user space; this is not expected to happen though.
> 
> To avoid dealing with page use counters (which is tricky in real mode),
> this only accelerates SPAPR TCE IOMMU v2 clients which are required
> to pre-register the userspace memory. The very first TCE request will
> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> of the TCE table (iommu_table::it_userspace) is not allocated till
> the very first mapping happens and we cannot call vmalloc in real mode.
> 
> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> and associates a physical IOMMU table with the SPAPR TCE table (which
> is a guest view of the hardware IOMMU table). The iommu_table object
> is cached and referenced so we do not have to look up for it in real mode.
> 
> This does not implement the UNSET counterpart as there is no use for it -
> once the acceleration is enabled, the existing userspace won't
> disable it unless a VFIO container is destroyed; this adds necessary
> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> 
> As this creates a descriptor per IOMMU table-LIOBN couple (called
> kvmppc_spapr_tce_iommu_table), it is possible to have several
> descriptors with the same iommu_table (hardware IOMMU table) attached
> to the same LIOBN, this is done to simplify the cleanup and can be
> improved later.
> 
> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> space.
> 
> This finally makes use of vfio_external_user_iommu_id() which was
> introduced quite some time ago and was considered for removal.
> 
> Tests show that this patch increases transmission speed from 220MB/s
> to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb ethernet card).
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v3:
> * simplified not to use VFIO group notifiers
> * reworked cleanup, should be cleaner/simpler now
> 
> v2:
> * reworked to use new VFIO notifiers
> * now same iommu_table may appear in the list several times, to be fixed later
> ---
> 
> This obsoletes:
> 
> [PATCH kernel v2 08/11] KVM: PPC: Pass kvm* to kvmppc_find_table()
> [PATCH kernel v2 09/11] vfio iommu: Add helpers to (un)register blocking notifiers per group
> [PATCH kernel v2 11/11] KVM: PPC: Add in-kernel acceleration for VFIO
> 
> 
> So I have not reposted the whole thing, should have I?
> 
> 
> btw "F:     virt/kvm/vfio.*"  is missing MAINTAINERS.
> 
> 
> ---
>  Documentation/virtual/kvm/devices/vfio.txt |  22 ++-
>  arch/powerpc/include/asm/kvm_host.h        |   8 +
>  arch/powerpc/include/asm/kvm_ppc.h         |   4 +
>  include/uapi/linux/kvm.h                   |   8 +
>  arch/powerpc/kvm/book3s_64_vio.c           | 286 +++++++++++++++++++++++++++++
>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 178 ++++++++++++++++++
>  arch/powerpc/kvm/powerpc.c                 |   2 +
>  virt/kvm/vfio.c                            |  88 +++++++++
>  8 files changed, 594 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> index ef51740c67ca..f95d867168ea 100644
> --- a/Documentation/virtual/kvm/devices/vfio.txt
> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> @@ -16,7 +16,25 @@ Groups:
>  
>  KVM_DEV_VFIO_GROUP attributes:
>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> +	kvm_device_attr.addr points to an int32_t file descriptor
> +	for the VFIO group.
>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
> +	kvm_device_attr.addr points to an int32_t file descriptor
> +	for the VFIO group.
> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
> +	allocated by sPAPR KVM.
> +	kvm_device_attr.addr points to a struct:
>  
> -For each, kvm_device_attr.addr points to an int32_t file descriptor
> -for the VFIO group.
> +	struct kvm_vfio_spapr_tce {
> +		__u32	argsz;
> +		__u32	flags;
> +		__s32	groupfd;
> +		__s32	tablefd;
> +	};
> +
> +	where
> +	@argsz is the size of kvm_vfio_spapr_tce_liobn;
> +	@flags are not supported now, must be zero;
> +	@groupfd is a file descriptor for a VFIO group;
> +	@tablefd is a file descriptor for a TCE table allocated via
> +		KVM_CREATE_SPAPR_TCE.
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index 28350a294b1e..3d281b7ea369 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
>  	atomic_t refcnt;
>  };
>  
> +struct kvmppc_spapr_tce_iommu_table {
> +	struct rcu_head rcu;
> +	struct list_head next;
> +	struct vfio_group *group;
> +	struct iommu_table *tbl;
> +};
> +
>  struct kvmppc_spapr_tce_table {
>  	struct list_head list;
>  	struct kvm *kvm;
> @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
>  	u32 page_shift;
>  	u64 offset;		/* in pages */
>  	u64 size;		/* window size in pages */
> +	struct list_head iommu_tables;
>  	struct page *pages[0];
>  };
>  
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> index 0a21c8503974..936138b866e7 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -163,6 +163,10 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
>  extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
>  			struct kvm_memory_slot *memslot, unsigned long porder);
>  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
> +		struct vfio_group *group, struct iommu_group *grp);
> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
> +		struct vfio_group *group);
>  
>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  				struct kvm_create_spapr_tce_64 *args);
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 810f74317987..4088da4a575f 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1068,6 +1068,7 @@ struct kvm_device_attr {
>  #define  KVM_DEV_VFIO_GROUP			1
>  #define   KVM_DEV_VFIO_GROUP_ADD			1
>  #define   KVM_DEV_VFIO_GROUP_DEL			2
> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
>  
>  enum kvm_device_type {
>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
> @@ -1089,6 +1090,13 @@ enum kvm_device_type {
>  	KVM_DEV_TYPE_MAX,
>  };
>  
> +struct kvm_vfio_spapr_tce {
> +	__u32	argsz;
> +	__u32	flags;
> +	__s32	groupfd;
> +	__s32	tablefd;
> +};
> +
>  /*
>   * ioctls for VM fds
>   */
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> index 15df8ae627d9..008c4aee4df6 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -27,6 +27,10 @@
>  #include <linux/hugetlb.h>
>  #include <linux/list.h>
>  #include <linux/anon_inodes.h>
> +#include <linux/iommu.h>
> +#include <linux/file.h>
> +#include <linux/vfio.h>
> +#include <linux/module.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/kvm_ppc.h>
> @@ -39,6 +43,20 @@
>  #include <asm/udbg.h>
>  #include <asm/iommu.h>
>  #include <asm/tce.h>
> +#include <asm/mmu_context.h>
> +
> +static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
> +{
> +	void (*fn)(struct vfio_group *);
> +
> +	fn = symbol_get(vfio_group_put_external_user);
> +	if (!fn)

I think this should have a WARN_ON().  If the vfio module is gone
while you still have VFIO groups attached to a KVM table, something
has gone horribly wrong.
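
(Concretely, the suggestion amounts to something like this sketch:)

	fn = symbol_get(vfio_group_put_external_user);
	if (WARN_ON(!fn))
		return;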

> +		return;
> +
> +	fn(vfio_group);
> +
> +	symbol_put(vfio_group_put_external_user);
> +}
>  
>  static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
>  {
> @@ -90,6 +108,99 @@ static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
>  	return ret;
>  }
>  
> +static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
> +{
> +	struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
> +			struct kvmppc_spapr_tce_iommu_table, rcu);
> +
> +	kfree(stit);
> +}
> +
> +static void kvm_spapr_tce_liobn_release_iommu_group(
> +		struct kvmppc_spapr_tce_table *stt,
> +		struct vfio_group *group)
> +{
> +	struct kvmppc_spapr_tce_iommu_table *stit, *tmp;
> +
> +	list_for_each_entry_safe(stit, tmp, &stt->iommu_tables, next) {
> +		if (group && (stit->group != group))
> +			continue;
> +
> +		list_del_rcu(&stit->next);
> +
> +		iommu_table_put(stit->tbl);
> +		kvm_vfio_group_put_external_user(stit->group);
> +
> +		call_rcu(&stit->rcu, kvm_spapr_tce_iommu_table_free);
> +	}
> +}
> +
> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
> +		struct vfio_group *group)
> +{
> +	struct kvmppc_spapr_tce_table *stt;
> +
> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list)
> +		kvm_spapr_tce_liobn_release_iommu_group(stt, group);
> +}
> +
> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
> +		struct vfio_group *group, struct iommu_group *grp)

Isn't passing both the vfio_group and the iommu_group redundant?

> +{
> +	struct kvmppc_spapr_tce_table *stt = NULL;
> +	bool found = false;
> +	struct iommu_table *tbl = NULL;
> +	struct iommu_table_group *table_group;
> +	long i;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
> +	struct fd f;
> +
> +	f = fdget(tablefd);
> +	if (!f.file)
> +		return -EBADF;
> +
> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
> +		if (stt == f.file->private_data) {
> +			found = true;
> +			break;
> +		}
> +	}
> +
> +	fdput(f);
> +
> +	if (!found)
> +		return -ENODEV;

Not entirely sure if ENODEV is the right error, but I can't
immediately think of a better one.

> +	table_group = iommu_group_get_iommudata(grp);
> +	if (!table_group)
> +		return -EFAULT;

EFAULT is usually only returned when you pass a syscall a bad pointer,
which doesn't look to be the case here.  What situation does this
error path actually represent?

> +
> +	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
> +		struct iommu_table *tbltmp = table_group->tables[i];
> +
> +		if (!tbltmp)
> +			continue;
> +
> +		if ((tbltmp->it_page_shift == stt->page_shift) &&
> +				(tbltmp->it_offset == stt->offset)) {
> +			tbl = tbltmp;
> +			break;
> +		}
> +	}
> +	if (!tbl)
> +		return -ENODEV;
> +
> +	iommu_table_get(tbl);
> +
> +	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
> +	stit->tbl = tbl;
> +	stit->group = group;
> +
> +	list_add_rcu(&stit->next, &stt->iommu_tables);

Won't this add a separate stit entry for each group attached to the
LIOBN, even if those groups share a single hardware iommu table -
which is the likely case if those groups have all been put into the
same container?
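
(A rough sketch of the de-duplication hinted at here, done before taking the
iommu_table reference; it assumes attaches are already serialised by the
kvm-vfio device lock:)

	/* Sketch: do not attach the same hardware table to this LIOBN twice. */
	list_for_each_entry(stit, &stt->iommu_tables, next) {
		if (stit->tbl == tbl)
			return 0;	/* or track a per-entry refcount instead */
	}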

> +	return 0;
> +}
> +
>  static void release_spapr_tce_table(struct rcu_head *head)
>  {
>  	struct kvmppc_spapr_tce_table *stt = container_of(head,
> @@ -132,6 +243,8 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
>  
>  	list_del_rcu(&stt->list);
>  
> +	kvm_spapr_tce_liobn_release_iommu_group(stt, NULL /* release all */);
> +
>  	kvm_put_kvm(stt->kvm);
>  
>  	kvmppc_account_memlimit(
> @@ -181,6 +294,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  	stt->offset = args->offset;
>  	stt->size = size;
>  	stt->kvm = kvm;
> +	INIT_LIST_HEAD_RCU(&stt->iommu_tables);
>  
>  	for (i = 0; i < npages; i++) {
>  		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
> @@ -209,11 +323,161 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  	return ret;
>  }
>  
> +static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	struct mm_iommu_table_group_mem_t *mem = NULL;
> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +
> +	if (!pua)
> +		return H_HARDWARE;
> +
> +	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
> +	if (!mem)
> +		return H_HARDWARE;

IIUC, this error represents trying to unmap a page from the vIOMMU,
and discovering that it wasn't preregistered in the first place, which
shouldn't happen.  So would a WARN_ON() make sense here as well as the
H_HARDWARE?

> +	mm_iommu_mapped_dec(mem);
> +
> +	*pua = 0;
> +
> +	return H_SUCCESS;
> +}
> +
> +static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	enum dma_data_direction dir = DMA_NONE;
> +	unsigned long hpa = 0;
> +
> +	if (iommu_tce_xchg(tbl, entry, &hpa, &dir))
> +		return H_HARDWARE;
> +
> +	if (dir == DMA_NONE)
> +		return H_SUCCESS;
> +
> +	return kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
> +}
> +
> +long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
> +		unsigned long entry, unsigned long gpa,
> +		enum dma_data_direction dir)
> +{
> +	long ret;
> +	unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +	struct mm_iommu_table_group_mem_t *mem;
> +
> +	if (!pua)
> +		/* it_userspace allocation might be delayed */
> +		return H_TOO_HARD;
> +
> +	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
> +		return H_HARDWARE;

This would represent the guest trying to map a bad GPA, yes?  In which
case H_HARDWARE doesn't seem right.  H_PARAMETER or H_PERMISSION, maybe.

> +	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
> +	if (!mem)
> +		return H_HARDWARE;

Here H_HARDWARE seems right. IIUC this represents the guest trying to
map an address which wasn't pre-registered.  That would indicate a bug
in qemu, which is hardware as far as the guest is concerned.
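
(Put together, the error-code split being suggested would look roughly like:)

	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
		return H_PARAMETER;	/* the guest passed a bogus GPA */

	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
	if (!mem)
		return H_HARDWARE;	/* userspace did not preregister this memory */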

> +
> +	if (mm_iommu_ua_to_hpa(mem, ua, &hpa))
> +		return H_HARDWARE;

Not sure what this case represents.

> +	if (mm_iommu_mapped_inc(mem))
> +		return H_HARDWARE;

Or this.

> +	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
> +	if (ret) {
> +		mm_iommu_mapped_dec(mem);
> +		return H_TOO_HARD;
> +	}
> +
> +	if (dir != DMA_NONE)
> +		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
> +
> +	*pua = ua;
> +
> +	return 0;
> +}
> +
> +long kvmppc_h_put_tce_iommu(struct kvm_vcpu *vcpu,
> +		struct iommu_table *tbl,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce)
> +{
> +	long idx, ret = H_HARDWARE;
> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> +	const unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +	const enum dma_data_direction dir = iommu_tce_direction(tce);
> +
> +	/* Clear TCE */
> +	if (dir == DMA_NONE) {
> +		if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
> +			return H_PARAMETER;
> +
> +		return kvmppc_tce_iommu_unmap(vcpu->kvm, tbl, entry);
> +	}
> +
> +	/* Put TCE */
> +	if (iommu_tce_put_param_check(tbl, ioba, gpa))
> +		return H_PARAMETER;
> +
> +	idx = srcu_read_lock(&vcpu->kvm->srcu);
> +	ret = kvmppc_tce_iommu_map(vcpu->kvm, tbl, entry, gpa, dir);
> +	srcu_read_unlock(&vcpu->kvm->srcu, idx);
> +
> +	return ret;
> +}
> +
> +static long kvmppc_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
> +		struct iommu_table *tbl, unsigned long ioba,
> +		u64 __user *tces, unsigned long npages)
> +{
> +	unsigned long i, ret, tce, gpa;
> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> +
> +	for (i = 0; i < npages; ++i) {
> +		gpa = be64_to_cpu(tces[i]) & ~(TCE_PCI_READ | TCE_PCI_WRITE);

IIUC this is the virtual mode, not the real mode version.  In which
case you shouldn't be accessing tces[i] (a userspace pointer) directly
but should instead be using get_user().
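
(i.e. roughly what the existing virtual-mode H_PUT_TCE_INDIRECT handler
already does; a sketch only, with local declarations elided:)

	for (i = 0; i < npages; ++i) {
		u64 tce;

		if (get_user(tce, tces + i))
			return H_TOO_HARD;

		gpa = be64_to_cpu(tce) & ~(TCE_PCI_READ | TCE_PCI_WRITE);
		if (iommu_tce_put_param_check(tbl,
				ioba + (i << tbl->it_page_shift), gpa))
			return H_PARAMETER;
	}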

> +		if (iommu_tce_put_param_check(tbl, ioba +
> +				(i << tbl->it_page_shift), gpa))
> +			return H_PARAMETER;
> +	}
> +
> +	for (i = 0; i < npages; ++i) {
> +		tce = be64_to_cpu(tces[i]);
> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +
> +		ret = kvmppc_tce_iommu_map(vcpu->kvm, tbl, entry + i, gpa,
> +				iommu_tce_direction(tce));
> +		if (ret != H_SUCCESS)
> +			return ret;
> +	}
> +
> +	return H_SUCCESS;
> +}
> +
> +long kvmppc_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
> +		struct iommu_table *tbl,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_value, unsigned long npages)
> +{
> +	unsigned long i;
> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> +
> +	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
> +		return H_PARAMETER;
> +
> +	for (i = 0; i < npages; ++i)
> +		kvmppc_tce_iommu_unmap(vcpu->kvm, tbl, entry + i);
> +
> +	return H_SUCCESS;
> +}
> +
>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		      unsigned long ioba, unsigned long tce)
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long ret;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
>  
>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>  	/* 	    liobn, ioba, tce); */
> @@ -230,6 +494,12 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  	if (ret != H_SUCCESS)
>  		return ret;
>  
> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {

As noted above, AFAICT there is one stit per group, rather than per
backend IOMMU table, so if there are multiple groups in the same
container (and therefore attached to the same LIOBN), won't this mean
we duplicate this operation a bunch of times?

> +		ret = kvmppc_h_put_tce_iommu(vcpu, stit->tbl, liobn, ioba, tce);
> +		if (ret != H_SUCCESS)
> +			return ret;
> +	}
> +
>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
>  
>  	return H_SUCCESS;
> @@ -245,6 +515,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	unsigned long entry, ua = 0;
>  	u64 __user *tces;
>  	u64 tce;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -272,6 +543,13 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	}
>  	tces = (u64 __user *) ua;
>  
> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +		ret = kvmppc_h_put_tce_indirect_iommu(vcpu,
> +				stit->tbl, ioba, tces, npages);
> +		if (ret != H_SUCCESS)
> +			goto unlock_exit;

Hmm, I don't suppose you could simplify things by not having a
put_tce_indirect() version of the whole backend iommu mapping
function, but just a single-TCE version, and instead looping across
the backend IOMMU tables as you put each indirect entry in.
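
(Sketch of that restructure in the virtual-mode handler: one get_user() and
validation per entry, with the per-entry map fanned out over the attached
backend tables; variable declarations elided, names as in the patch:)

	for (i = 0; i < npages; ++i) {
		if (get_user(tce, tces + i)) {
			ret = H_TOO_HARD;
			goto unlock_exit;
		}
		tce = be64_to_cpu(tce);
		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);

		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
			ret = kvmppc_tce_iommu_map(vcpu->kvm, stit->tbl,
					entry + i, gpa,
					iommu_tce_direction(tce));
			if (ret != H_SUCCESS)
				goto unlock_exit;
		}

		kvmppc_tce_put(stt, entry + i, tce);
	}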

> +	}
> +
>  	for (i = 0; i < npages; ++i) {
>  		if (get_user(tce, tces + i)) {
>  			ret = H_TOO_HARD;
> @@ -299,6 +577,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -312,6 +591,13 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>  		return H_PARAMETER;
>  
> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +		ret = kvmppc_h_stuff_tce_iommu(vcpu, stit->tbl, liobn, ioba,
> +				tce_value, npages);
> +		if (ret != H_SUCCESS)
> +			return ret;
> +	}
> +
>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>  
> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> index 8a6834e6e1c8..4d6f01712a6d 100644
> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -190,11 +190,165 @@ static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
>  	return mm_iommu_lookup_rm(vcpu->kvm->mm, ua, size);
>  }
>  
> +static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm_vcpu *vcpu,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	struct mm_iommu_table_group_mem_t *mem = NULL;
> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +
> +	if (!pua)
> +		return H_SUCCESS;

What case is this?  Not being able to find the userspace entry doesn't sound
like a success.

> +	pua = (void *) vmalloc_to_phys(pua);
> +	if (!pua)
> +		return H_SUCCESS;

And again..

> +	mem = kvmppc_rm_iommu_lookup(vcpu, *pua, pgsize);
> +	if (!mem)
> +		return H_HARDWARE;

Should this have a WARN_ON?

> +	mm_iommu_mapped_dec(mem);
> +
> +	*pua = 0;
> +
> +	return H_SUCCESS;
> +}
> +
> +static long kvmppc_rm_tce_iommu_unmap(struct kvm_vcpu *vcpu,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	enum dma_data_direction dir = DMA_NONE;
> +	unsigned long hpa = 0;
> +
> +	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
> +		return H_HARDWARE;
> +
> +	if (dir == DMA_NONE)
> +		return H_SUCCESS;
> +
> +	return kvmppc_rm_tce_iommu_mapped_dec(vcpu, tbl, entry);
> +}
> +
> +long kvmppc_rm_tce_iommu_map(struct kvm_vcpu *vcpu, struct iommu_table *tbl,
> +		unsigned long entry, unsigned long gpa,
> +		enum dma_data_direction dir)
> +{
> +	long ret;
> +	unsigned long hpa = 0, ua;
> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +	struct mm_iommu_table_group_mem_t *mem;
> +
> +	if (!pua)
> +		/* it_userspace allocation might be delayed */
> +		return H_TOO_HARD;
> +
> +	if (kvmppc_gpa_to_ua(vcpu->kvm, gpa, &ua, NULL))
> +		return H_HARDWARE;

Again H_HARDWARE doesn't seem right here.

> +	mem = kvmppc_rm_iommu_lookup(vcpu, ua, 1ULL << tbl->it_page_shift);
> +	if (!mem)
> +		return H_HARDWARE;
> +
> +	if (mm_iommu_ua_to_hpa_rm(mem, ua, &hpa))
> +		return H_HARDWARE;
> +
> +	pua = (void *) vmalloc_to_phys(pua);
> +	if (!pua)
> +		return H_HARDWARE;
> +
> +	if (mm_iommu_mapped_inc(mem))
> +		return H_HARDWARE;
> +
> +	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> +	if (ret) {
> +		mm_iommu_mapped_dec(mem);
> +		return H_TOO_HARD;
> +	}
> +
> +	if (dir != DMA_NONE)
> +		kvmppc_rm_tce_iommu_mapped_dec(vcpu, tbl, entry);
> +
> +	*pua = ua;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(kvmppc_rm_tce_iommu_map);
> +
> +static long kvmppc_rm_h_put_tce_iommu(struct kvm_vcpu *vcpu,
> +		struct iommu_table *tbl, unsigned long liobn,
> +		unsigned long ioba, unsigned long tce)
> +{
> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> +	const unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +	const enum dma_data_direction dir = iommu_tce_direction(tce);
> +
> +	/* Clear TCE */
> +	if (dir == DMA_NONE) {
> +		if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
> +			return H_PARAMETER;
> +
> +		return kvmppc_rm_tce_iommu_unmap(vcpu, tbl, entry);
> +	}
> +
> +	/* Put TCE */
> +	if (iommu_tce_put_param_check(tbl, ioba, gpa))
> +		return H_PARAMETER;
> +
> +	return kvmppc_rm_tce_iommu_map(vcpu, tbl, entry, gpa, dir);
> +}
> +
> +static long kvmppc_rm_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
> +		struct iommu_table *tbl, unsigned long ioba,
> +		u64 *tces, unsigned long npages)
> +{
> +	unsigned long i, ret, tce, gpa;
> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> +
> +	for (i = 0; i < npages; ++i) {
> +		gpa = be64_to_cpu(tces[i]) & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +
> +		if (iommu_tce_put_param_check(tbl, ioba +
> +				(i << tbl->it_page_shift), gpa))
> +			return H_PARAMETER;
> +	}
> +
> +	for (i = 0; i < npages; ++i) {
> +		tce = be64_to_cpu(tces[i]);
> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +
> +		ret = kvmppc_rm_tce_iommu_map(vcpu, tbl, entry + i, gpa,
> +				iommu_tce_direction(tce));
> +		if (ret != H_SUCCESS)
> +			return ret;
> +	}
> +
> +	return H_SUCCESS;
> +}
> +
> +static long kvmppc_rm_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
> +		struct iommu_table *tbl,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_value, unsigned long npages)
> +{
> +	unsigned long i;
> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> +
> +	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
> +		return H_PARAMETER;
> +
> +	for (i = 0; i < npages; ++i)
> +		kvmppc_rm_tce_iommu_unmap(vcpu, tbl, entry + i);
> +
> +	return H_SUCCESS;
> +}
> +
>  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		unsigned long ioba, unsigned long tce)
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long ret;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
>  
>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>  	/* 	    liobn, ioba, tce); */
> @@ -211,6 +365,13 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  	if (ret != H_SUCCESS)
>  		return ret;
>  
> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +		ret = kvmppc_rm_h_put_tce_iommu(vcpu, stit->tbl,
> +				liobn, ioba, tce);
> +		if (ret != H_SUCCESS)
> +			return ret;
> +	}
> +
>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
>  
>  	return H_SUCCESS;
> @@ -278,6 +439,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  		 * depend on hpt.
>  		 */
>  		struct mm_iommu_table_group_mem_t *mem;
> +		struct kvmppc_spapr_tce_iommu_table *stit;
>  
>  		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
>  			return H_TOO_HARD;
> @@ -285,6 +447,13 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  		mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K);
>  		if (!mem || mm_iommu_ua_to_hpa_rm(mem, ua, &tces))
>  			return H_TOO_HARD;
> +
> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +			ret = kvmppc_rm_h_put_tce_indirect_iommu(vcpu,
> +					stit->tbl, ioba, (u64 *)tces, npages);
> +			if (ret != H_SUCCESS)
> +				return ret;
> +		}
>  	} else {
>  		/*
>  		 * This is emulated devices case.
> @@ -334,6 +503,8 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
> +
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -347,6 +518,13 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>  		return H_PARAMETER;
>  
> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +		ret = kvmppc_rm_h_stuff_tce_iommu(vcpu, stit->tbl,
> +				liobn, ioba, tce_value, npages);
> +		if (ret != H_SUCCESS)
> +			return ret;
> +	}
> +
>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>  
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index 70963c845e96..0e555ba998c0 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -536,6 +536,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  #ifdef CONFIG_PPC_BOOK3S_64
>  	case KVM_CAP_SPAPR_TCE:
>  	case KVM_CAP_SPAPR_TCE_64:
> +		/* fallthrough */
> +	case KVM_CAP_SPAPR_TCE_VFIO:
>  	case KVM_CAP_PPC_ALLOC_HTAB:
>  	case KVM_CAP_PPC_RTAS:
>  	case KVM_CAP_PPC_FIXUP_HCALL:
> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> index d32f239eb471..3181054c8ff7 100644
> --- a/virt/kvm/vfio.c
> +++ b/virt/kvm/vfio.c
> @@ -20,6 +20,10 @@
>  #include <linux/vfio.h>
>  #include "vfio.h"
>  
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +#include <asm/kvm_ppc.h>
> +#endif
> +
>  struct kvm_vfio_group {
>  	struct list_head node;
>  	struct vfio_group *vfio_group;
> @@ -89,6 +93,22 @@ static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
>  	return ret > 0;
>  }
>  
> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> +{
> +	int (*fn)(struct vfio_group *);
> +	int ret = -1;
> +
> +	fn = symbol_get(vfio_external_user_iommu_id);
> +	if (!fn)
> +		return ret;
> +
> +	ret = fn(vfio_group);
> +
> +	symbol_put(vfio_external_user_iommu_id);
> +
> +	return ret;
> +}
> +
>  /*
>   * Groups can use the same or different IOMMU domains.  If the same then
>   * adding a new group may change the coherency of groups we've previously
> @@ -211,6 +231,9 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>  
>  		mutex_unlock(&kv->lock);
>  
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +		kvm_spapr_tce_release_iommu_group(dev->kvm, vfio_group);
> +#endif
>  		kvm_vfio_group_set_kvm(vfio_group, NULL);
>  
>  		kvm_vfio_group_put_external_user(vfio_group);
> @@ -218,6 +241,65 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>  		kvm_vfio_update_coherency(dev);
>  
>  		return ret;
> +
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
> +		struct kvm_vfio_spapr_tce param;
> +		unsigned long minsz;
> +		struct kvm_vfio *kv = dev->private;
> +		struct vfio_group *vfio_group;
> +		struct kvm_vfio_group *kvg;
> +		struct fd f;
> +
> +		minsz = offsetofend(struct kvm_vfio_spapr_tce, tablefd);
> +
> +		if (copy_from_user(&param, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (param.argsz < minsz || param.flags)
> +			return -EINVAL;
> +
> +		f = fdget(param.groupfd);
> +		if (!f.file)
> +			return -EBADF;
> +
> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
> +		fdput(f);
> +
> +		if (IS_ERR(vfio_group))
> +			return PTR_ERR(vfio_group);
> +
> +		ret = -ENOENT;
> +
> +		mutex_lock(&kv->lock);
> +
> +		list_for_each_entry(kvg, &kv->group_list, node) {
> +			int group_id;
> +			struct iommu_group *grp;
> +
> +			if (kvg->vfio_group != vfio_group)
> +				continue;
> +
> +			group_id = kvm_vfio_external_user_iommu_id(
> +					kvg->vfio_group);
> +			grp = iommu_group_get_by_id(group_id);
> +			if (!grp) {
> +				ret = -EFAULT;
> +				break;
> +			}
> +
> +			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
> +					param.tablefd, vfio_group, grp);
> +
> +			iommu_group_put(grp);
> +			break;
> +		}
> +
> +		mutex_unlock(&kv->lock);
> +
> +		return ret;
> +	}
> +#endif /* CONFIG_SPAPR_TCE_IOMMU */
>  	}

Don't you also need to add something to the KVM_DEV_VFIO_GROUP_DEL
path to detach the group from all LIOBNs?  Or else just fail if
there are LIOBNs attached.  I think it would be a qemu bug not to
detach the LIOBNs before removing the group, but we still need to
protect the host in that case.
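
(For context, the userspace side described in the documentation hunk above
would look roughly like the sketch below; the fd variable names are
placeholders, and the point is that SET_SPAPR_TCE happens after GROUP_ADD
and is implicitly undone by GROUP_DEL or container teardown:)

	struct kvm_vfio_spapr_tce param = {
		.argsz   = sizeof(param),
		.flags   = 0,
		.groupfd = vfio_group_fd,	/* fd of /dev/vfio/<group> */
		.tablefd = spapr_tce_fd,	/* from KVM_CREATE_SPAPR_TCE_64 */
	};
	struct kvm_device_attr attr = {
		.group = KVM_DEV_VFIO_GROUP,
		.attr  = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE,
		.addr  = (__u64)(unsigned long)&param,
	};

	ioctl(kvm_vfio_device_fd, KVM_SET_DEVICE_ATTR, &attr);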

>  
>  	return -ENXIO;
> @@ -242,6 +324,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
>  		switch (attr->attr) {
>  		case KVM_DEV_VFIO_GROUP_ADD:
>  		case KVM_DEV_VFIO_GROUP_DEL:
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
> +#endif
>  			return 0;
>  		}
>  
> @@ -257,6 +342,9 @@ static void kvm_vfio_destroy(struct kvm_device *dev)
>  	struct kvm_vfio_group *kvg, *tmp;
>  
>  	list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +		kvm_spapr_tce_release_iommu_group(dev->kvm, kvg->vfio_group);
> +#endif
>  		kvm_vfio_group_set_kvm(kvg->vfio_group, NULL);
>  		kvm_vfio_group_put_external_user(kvg->vfio_group);
>  		list_del(&kvg->node);

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH kernel v3] KVM: PPC: Add in-kernel acceleration for VFIO
@ 2017-01-12  5:04       ` David Gibson
  0 siblings, 0 replies; 56+ messages in thread
From: David Gibson @ 2017-01-12  5:04 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Paul Mackerras, kvm-ppc, kvm, Alex Williamson

[-- Attachment #1: Type: text/plain, Size: 32135 bytes --]

On Tue, Dec 20, 2016 at 05:52:29PM +1100, Alexey Kardashevskiy wrote:
> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO
> without passing them to user space which saves time on switching
> to user space and back.
> 
> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> KVM tries to handle a TCE request in the real mode, if failed
> it passes the request to the virtual mode to complete the operation.
> If it a virtual mode handler fails, the request is passed to
> the user space; this is not expected to happen though.
> 
> To avoid dealing with page use counters (which is tricky in real mode),
> this only accelerates SPAPR TCE IOMMU v2 clients which are required
> to pre-register the userspace memory. The very first TCE request will
> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> of the TCE table (iommu_table::it_userspace) is not allocated till
> the very first mapping happens and we cannot call vmalloc in real mode.
> 
> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> and associates a physical IOMMU table with the SPAPR TCE table (which
> is a guest view of the hardware IOMMU table). The iommu_table object
> is cached and referenced so we do not have to look up for it in real mode.
> 
> This does not implement the UNSET counterpart as there is no use for it -
> once the acceleration is enabled, the existing userspace won't
> disable it unless a VFIO container is destroyed; this adds necessary
> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> 
> As this creates a descriptor per IOMMU table-LIOBN couple (called
> kvmppc_spapr_tce_iommu_table), it is possible to have several
> descriptors with the same iommu_table (hardware IOMMU table) attached
> to the same LIOBN, this is done to simplify the cleanup and can be
> improved later.
> 
> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> space.
> 
> This finally makes use of vfio_external_user_iommu_id() which was
> introduced quite some time ago and was considered for removal.
> 
> Tests show that this patch increases transmission speed from 220MB/s
> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v3:
> * simplified not to use VFIO group notifiers
> * reworked cleanup, should be cleaner/simpler now
> 
> v2:
> * reworked to use new VFIO notifiers
> * now same iommu_table may appear in the list several times, to be fixed later
> ---
> 
> This obsoletes:
> 
> [PATCH kernel v2 08/11] KVM: PPC: Pass kvm* to kvmppc_find_table()
> [PATCH kernel v2 09/11] vfio iommu: Add helpers to (un)register blocking notifiers per group
> [PATCH kernel v2 11/11] KVM: PPC: Add in-kernel acceleration for VFIO
> 
> 
> So I have not reposted the whole thing, should have I?
> 
> 
> btw "F:     virt/kvm/vfio.*"  is missing MAINTAINERS.
> 
> 
> ---
>  Documentation/virtual/kvm/devices/vfio.txt |  22 ++-
>  arch/powerpc/include/asm/kvm_host.h        |   8 +
>  arch/powerpc/include/asm/kvm_ppc.h         |   4 +
>  include/uapi/linux/kvm.h                   |   8 +
>  arch/powerpc/kvm/book3s_64_vio.c           | 286 +++++++++++++++++++++++++++++
>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 178 ++++++++++++++++++
>  arch/powerpc/kvm/powerpc.c                 |   2 +
>  virt/kvm/vfio.c                            |  88 +++++++++
>  8 files changed, 594 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> index ef51740c67ca..f95d867168ea 100644
> --- a/Documentation/virtual/kvm/devices/vfio.txt
> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> @@ -16,7 +16,25 @@ Groups:
>  
>  KVM_DEV_VFIO_GROUP attributes:
>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> +	kvm_device_attr.addr points to an int32_t file descriptor
> +	for the VFIO group.
>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
> +	kvm_device_attr.addr points to an int32_t file descriptor
> +	for the VFIO group.
> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
> +	allocated by sPAPR KVM.
> +	kvm_device_attr.addr points to a struct:
>  
> -For each, kvm_device_attr.addr points to an int32_t file descriptor
> -for the VFIO group.
> +	struct kvm_vfio_spapr_tce {
> +		__u32	argsz;
> +		__u32	flags;
> +		__s32	groupfd;
> +		__s32	tablefd;
> +	};
> +
> +	where
> +	@argsz is the size of kvm_vfio_spapr_tce_liobn;
> +	@flags are not supported now, must be zero;
> +	@groupfd is a file descriptor for a VFIO group;
> +	@tablefd is a file descriptor for a TCE table allocated via
> +		KVM_CREATE_SPAPR_TCE.
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index 28350a294b1e..3d281b7ea369 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
>  	atomic_t refcnt;
>  };
>  
> +struct kvmppc_spapr_tce_iommu_table {
> +	struct rcu_head rcu;
> +	struct list_head next;
> +	struct vfio_group *group;
> +	struct iommu_table *tbl;
> +};
> +
>  struct kvmppc_spapr_tce_table {
>  	struct list_head list;
>  	struct kvm *kvm;
> @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
>  	u32 page_shift;
>  	u64 offset;		/* in pages */
>  	u64 size;		/* window size in pages */
> +	struct list_head iommu_tables;
>  	struct page *pages[0];
>  };
>  
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> index 0a21c8503974..936138b866e7 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -163,6 +163,10 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
>  extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
>  			struct kvm_memory_slot *memslot, unsigned long porder);
>  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
> +		struct vfio_group *group, struct iommu_group *grp);
> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
> +		struct vfio_group *group);
>  
>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  				struct kvm_create_spapr_tce_64 *args);
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 810f74317987..4088da4a575f 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1068,6 +1068,7 @@ struct kvm_device_attr {
>  #define  KVM_DEV_VFIO_GROUP			1
>  #define   KVM_DEV_VFIO_GROUP_ADD			1
>  #define   KVM_DEV_VFIO_GROUP_DEL			2
> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
>  
>  enum kvm_device_type {
>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
> @@ -1089,6 +1090,13 @@ enum kvm_device_type {
>  	KVM_DEV_TYPE_MAX,
>  };
>  
> +struct kvm_vfio_spapr_tce {
> +	__u32	argsz;
> +	__u32	flags;
> +	__s32	groupfd;
> +	__s32	tablefd;
> +};
> +
>  /*
>   * ioctls for VM fds
>   */
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> index 15df8ae627d9..008c4aee4df6 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -27,6 +27,10 @@
>  #include <linux/hugetlb.h>
>  #include <linux/list.h>
>  #include <linux/anon_inodes.h>
> +#include <linux/iommu.h>
> +#include <linux/file.h>
> +#include <linux/vfio.h>
> +#include <linux/module.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/kvm_ppc.h>
> @@ -39,6 +43,20 @@
>  #include <asm/udbg.h>
>  #include <asm/iommu.h>
>  #include <asm/tce.h>
> +#include <asm/mmu_context.h>
> +
> +static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
> +{
> +	void (*fn)(struct vfio_group *);
> +
> +	fn = symbol_get(vfio_group_put_external_user);
> +	if (!fn)

I think this should have a WARN_ON().  If the vfio module is gone
while you still have VFIO groups attached to a KVM table, something
has gone horribly wrong.

> +		return;
> +
> +	fn(vfio_group);
> +
> +	symbol_put(vfio_group_put_external_user);
> +}
>  
>  static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
>  {
> @@ -90,6 +108,99 @@ static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
>  	return ret;
>  }
>  
> +static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
> +{
> +	struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
> +			struct kvmppc_spapr_tce_iommu_table, rcu);
> +
> +	kfree(stit);
> +}
> +
> +static void kvm_spapr_tce_liobn_release_iommu_group(
> +		struct kvmppc_spapr_tce_table *stt,
> +		struct vfio_group *group)
> +{
> +	struct kvmppc_spapr_tce_iommu_table *stit, *tmp;
> +
> +	list_for_each_entry_safe(stit, tmp, &stt->iommu_tables, next) {
> +		if (group && (stit->group != group))
> +			continue;
> +
> +		list_del_rcu(&stit->next);
> +
> +		iommu_table_put(stit->tbl);
> +		kvm_vfio_group_put_external_user(stit->group);
> +
> +		call_rcu(&stit->rcu, kvm_spapr_tce_iommu_table_free);
> +	}
> +}
> +
> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
> +		struct vfio_group *group)
> +{
> +	struct kvmppc_spapr_tce_table *stt;
> +
> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list)
> +		kvm_spapr_tce_liobn_release_iommu_group(stt, group);
> +}
> +
> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
> +		struct vfio_group *group, struct iommu_group *grp)

Isn't passing both the vfio_group and the iommu_group redundant?

> +{
> +	struct kvmppc_spapr_tce_table *stt = NULL;
> +	bool found = false;
> +	struct iommu_table *tbl = NULL;
> +	struct iommu_table_group *table_group;
> +	long i;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
> +	struct fd f;
> +
> +	f = fdget(tablefd);
> +	if (!f.file)
> +		return -EBADF;
> +
> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
> +		if (stt == f.file->private_data) {
> +			found = true;
> +			break;
> +		}
> +	}
> +
> +	fdput(f);
> +
> +	if (!found)
> +		return -ENODEV;

Not entirely sure if ENODEV is the right error, but I can't
immediately think of a better one.

> +	table_group = iommu_group_get_iommudata(grp);
> +	if (!table_group)
> +		return -EFAULT;

EFAULT is usually only returned when you pass a syscall a bad pointer,
which doesn't look to be the case here.  What situation does this
error path actually represent?

> +
> +	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
> +		struct iommu_table *tbltmp = table_group->tables[i];
> +
> +		if (!tbltmp)
> +			continue;
> +
> +		if ((tbltmp->it_page_shift == stt->page_shift) &&
> +				(tbltmp->it_offset == stt->offset)) {
> +			tbl = tbltmp;
> +			break;
> +		}
> +	}
> +	if (!tbl)
> +		return -ENODEV;
> +
> +	iommu_table_get(tbl);
> +
> +	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
> +	stit->tbl = tbl;
> +	stit->group = group;
> +
> +	list_add_rcu(&stit->next, &stt->iommu_tables);

Won't this add a separate stit entry for each group attached to the
LIOBN, even if those groups share a single hardware iommu table -
which is the likely case if those groups have all been put into the
same container.
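
Just to illustrate (completely untested, and it ignores the per-group
reference you take, which I guess is part of why you did it this way) -
something like this before the iommu_table_get()/kzalloc() would avoid
the duplicates:

	list_for_each_entry_rcu(stit, &stt->iommu_tables, next) {
		if (stit->tbl == tbl)
			/* this backend table is already attached to the LIOBN */
			return 0;
	}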

> +	return 0;
> +}
> +
>  static void release_spapr_tce_table(struct rcu_head *head)
>  {
>  	struct kvmppc_spapr_tce_table *stt = container_of(head,
> @@ -132,6 +243,8 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
>  
>  	list_del_rcu(&stt->list);
>  
> +	kvm_spapr_tce_liobn_release_iommu_group(stt, NULL /* release all */);
> +
>  	kvm_put_kvm(stt->kvm);
>  
>  	kvmppc_account_memlimit(
> @@ -181,6 +294,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  	stt->offset = args->offset;
>  	stt->size = size;
>  	stt->kvm = kvm;
> +	INIT_LIST_HEAD_RCU(&stt->iommu_tables);
>  
>  	for (i = 0; i < npages; i++) {
>  		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
> @@ -209,11 +323,161 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  	return ret;
>  }
>  
> +static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	struct mm_iommu_table_group_mem_t *mem = NULL;
> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +
> +	if (!pua)
> +		return H_HARDWARE;
> +
> +	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
> +	if (!mem)
> +		return H_HARDWARE;

IIUC, this error represents trying to unmap a page from the vIOMMU,
and discovering that it wasn't preregistered in the first place, which
shouldn't happen.  So would a WARN_ON() make sense here as well as the
H_HARDWARE.
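
i.e. something like (untested):

	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
	if (WARN_ON_ONCE(!mem))
		return H_HARDWARE;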

> +	mm_iommu_mapped_dec(mem);
> +
> +	*pua = 0;
> +
> +	return H_SUCCESS;
> +}
> +
> +static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	enum dma_data_direction dir = DMA_NONE;
> +	unsigned long hpa = 0;
> +
> +	if (iommu_tce_xchg(tbl, entry, &hpa, &dir))
> +		return H_HARDWARE;
> +
> +	if (dir == DMA_NONE)
> +		return H_SUCCESS;
> +
> +	return kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
> +}
> +
> +long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
> +		unsigned long entry, unsigned long gpa,
> +		enum dma_data_direction dir)
> +{
> +	long ret;
> +	unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +	struct mm_iommu_table_group_mem_t *mem;
> +
> +	if (!pua)
> +		/* it_userspace allocation might be delayed */
> +		return H_TOO_HARD;
> +
> +	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
> +		return H_HARDWARE;

This would represent the guest trying to map a bad GPA, yes?  In which
case H_HARDWARE doesn't seem right.  H_PARAMETER or H_PERMISSION, maybe.

> +	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
> +	if (!mem)
> +		return H_HARDWARE;

Here H_HARDWARE seems right. IIUC this represents the guest trying to
map an address which wasn't pre-registered.  That would indicate a bug
in qemu, which is hardware as far as the guest is concerned.

> +
> +	if (mm_iommu_ua_to_hpa(mem, ua, &hpa))
> +		return H_HARDWARE;

Not sure what this case represents.

> +	if (mm_iommu_mapped_inc(mem))
> +		return H_HARDWARE;

Or this.

> +	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
> +	if (ret) {
> +		mm_iommu_mapped_dec(mem);
> +		return H_TOO_HARD;
> +	}
> +
> +	if (dir != DMA_NONE)
> +		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
> +
> +	*pua = ua;
> +
> +	return 0;
> +}
> +
> +long kvmppc_h_put_tce_iommu(struct kvm_vcpu *vcpu,
> +		struct iommu_table *tbl,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce)
> +{
> +	long idx, ret = H_HARDWARE;
> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> +	const unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +	const enum dma_data_direction dir = iommu_tce_direction(tce);
> +
> +	/* Clear TCE */
> +	if (dir == DMA_NONE) {
> +		if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
> +			return H_PARAMETER;
> +
> +		return kvmppc_tce_iommu_unmap(vcpu->kvm, tbl, entry);
> +	}
> +
> +	/* Put TCE */
> +	if (iommu_tce_put_param_check(tbl, ioba, gpa))
> +		return H_PARAMETER;
> +
> +	idx = srcu_read_lock(&vcpu->kvm->srcu);
> +	ret = kvmppc_tce_iommu_map(vcpu->kvm, tbl, entry, gpa, dir);
> +	srcu_read_unlock(&vcpu->kvm->srcu, idx);
> +
> +	return ret;
> +}
> +
> +static long kvmppc_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
> +		struct iommu_table *tbl, unsigned long ioba,
> +		u64 __user *tces, unsigned long npages)
> +{
> +	unsigned long i, ret, tce, gpa;
> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> +
> +	for (i = 0; i < npages; ++i) {
> +		gpa = be64_to_cpu(tces[i]) & ~(TCE_PCI_READ | TCE_PCI_WRITE);

IIUC this is the virtual mode, not the real mode version.  In which
case you shouldn't be accessing tces[i] (a userspace pointer) directly
but should instead be using get_user().
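
i.e. roughly (untested):

	for (i = 0; i < npages; ++i) {
		if (get_user(tce, tces + i))
			return H_TOO_HARD;

		gpa = be64_to_cpu(tce) & ~(TCE_PCI_READ | TCE_PCI_WRITE);
		if (iommu_tce_put_param_check(tbl, ioba +
				(i << tbl->it_page_shift), gpa))
			return H_PARAMETER;
	}

	/* ... and the same get_user() in the second loop that does the map */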

> +		if (iommu_tce_put_param_check(tbl, ioba +
> +				(i << tbl->it_page_shift), gpa))
> +			return H_PARAMETER;
> +	}
> +
> +	for (i = 0; i < npages; ++i) {
> +		tce = be64_to_cpu(tces[i]);
> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +
> +		ret = kvmppc_tce_iommu_map(vcpu->kvm, tbl, entry + i, gpa,
> +				iommu_tce_direction(tce));
> +		if (ret != H_SUCCESS)
> +			return ret;
> +	}
> +
> +	return H_SUCCESS;
> +}
> +
> +long kvmppc_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
> +		struct iommu_table *tbl,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_value, unsigned long npages)
> +{
> +	unsigned long i;
> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> +
> +	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
> +		return H_PARAMETER;
> +
> +	for (i = 0; i < npages; ++i)
> +		kvmppc_tce_iommu_unmap(vcpu->kvm, tbl, entry + i);
> +
> +	return H_SUCCESS;
> +}
> +
>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		      unsigned long ioba, unsigned long tce)
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long ret;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
>  
>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>  	/* 	    liobn, ioba, tce); */
> @@ -230,6 +494,12 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  	if (ret != H_SUCCESS)
>  		return ret;
>  
> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {

As noted above, AFAICT there is one stit per group, rather than per
backend IOMMU table, so if there are multiple groups in the same
container (and therefore attached to the same LIOBN), won't this mean
we duplicate this operation a bunch of times?

> +		ret = kvmppc_h_put_tce_iommu(vcpu, stit->tbl, liobn, ioba, tce);
> +		if (ret != H_SUCCESS)
> +			return ret;
> +	}
> +
>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
>  
>  	return H_SUCCESS;
> @@ -245,6 +515,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	unsigned long entry, ua = 0;
>  	u64 __user *tces;
>  	u64 tce;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -272,6 +543,13 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	}
>  	tces = (u64 __user *) ua;
>  
> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +		ret = kvmppc_h_put_tce_indirect_iommu(vcpu,
> +				stit->tbl, ioba, tces, npages);
> +		if (ret != H_SUCCESS)
> +			goto unlock_exit;

Hmm, I don't suppose you could simplify things by not having a
put_tce_indirect() version of the whole backend iommu mapping
function, but just a single-TCE version, and instead looping across
the backend IOMMU tables as you put each indirect entry in.
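
Something like this in kvmppc_h_put_tce_indirect() itself (untested
sketch, and it does change what the guest sees on a partially failed
list):

	for (i = 0; i < npages; ++i) {
		if (get_user(tce, tces + i)) {
			ret = H_TOO_HARD;
			goto unlock_exit;
		}
		tce = be64_to_cpu(tce);

		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
			ret = kvmppc_h_put_tce_iommu(vcpu, stit->tbl, liobn,
					ioba + (i << stt->page_shift), tce);
			if (ret != H_SUCCESS)
				goto unlock_exit;
		}

		kvmppc_tce_put(stt, entry + i, tce);
	}

which would also avoid the double pass over the TCE list.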

> +	}
> +
>  	for (i = 0; i < npages; ++i) {
>  		if (get_user(tce, tces + i)) {
>  			ret = H_TOO_HARD;
> @@ -299,6 +577,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -312,6 +591,13 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>  		return H_PARAMETER;
>  
> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +		ret = kvmppc_h_stuff_tce_iommu(vcpu, stit->tbl, liobn, ioba,
> +				tce_value, npages);
> +		if (ret != H_SUCCESS)
> +			return ret;
> +	}
> +
>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>  
> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> index 8a6834e6e1c8..4d6f01712a6d 100644
> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -190,11 +190,165 @@ static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
>  	return mm_iommu_lookup_rm(vcpu->kvm->mm, ua, size);
>  }
>  
> +static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm_vcpu *vcpu,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	struct mm_iommu_table_group_mem_t *mem = NULL;
> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +
> +	if (!pua)
> +		return H_SUCCESS;

What case is this?  Not being able to find the userspace entry doesn't
sound like a success.

> +	pua = (void *) vmalloc_to_phys(pua);
> +	if (!pua)
> +		return H_SUCCESS;

And again..

> +	mem = kvmppc_rm_iommu_lookup(vcpu, *pua, pgsize);
> +	if (!mem)
> +		return H_HARDWARE;

Should this have a WARN_ON?

> +	mm_iommu_mapped_dec(mem);
> +
> +	*pua = 0;
> +
> +	return H_SUCCESS;
> +}
> +
> +static long kvmppc_rm_tce_iommu_unmap(struct kvm_vcpu *vcpu,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	enum dma_data_direction dir = DMA_NONE;
> +	unsigned long hpa = 0;
> +
> +	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
> +		return H_HARDWARE;
> +
> +	if (dir == DMA_NONE)
> +		return H_SUCCESS;
> +
> +	return kvmppc_rm_tce_iommu_mapped_dec(vcpu, tbl, entry);
> +}
> +
> +long kvmppc_rm_tce_iommu_map(struct kvm_vcpu *vcpu, struct iommu_table *tbl,
> +		unsigned long entry, unsigned long gpa,
> +		enum dma_data_direction dir)
> +{
> +	long ret;
> +	unsigned long hpa = 0, ua;
> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +	struct mm_iommu_table_group_mem_t *mem;
> +
> +	if (!pua)
> +		/* it_userspace allocation might be delayed */
> +		return H_TOO_HARD;
> +
> +	if (kvmppc_gpa_to_ua(vcpu->kvm, gpa, &ua, NULL))
> +		return H_HARDWARE;

Again H_HARDWARE doesn't seem right here.

> +	mem = kvmppc_rm_iommu_lookup(vcpu, ua, 1ULL << tbl->it_page_shift);
> +	if (!mem)
> +		return H_HARDWARE;
> +
> +	if (mm_iommu_ua_to_hpa_rm(mem, ua, &hpa))
> +		return H_HARDWARE;
> +
> +	pua = (void *) vmalloc_to_phys(pua);
> +	if (!pua)
> +		return H_HARDWARE;
> +
> +	if (mm_iommu_mapped_inc(mem))
> +		return H_HARDWARE;
> +
> +	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> +	if (ret) {
> +		mm_iommu_mapped_dec(mem);
> +		return H_TOO_HARD;
> +	}
> +
> +	if (dir != DMA_NONE)
> +		kvmppc_rm_tce_iommu_mapped_dec(vcpu, tbl, entry);
> +
> +	*pua = ua;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(kvmppc_rm_tce_iommu_map);
> +
> +static long kvmppc_rm_h_put_tce_iommu(struct kvm_vcpu *vcpu,
> +		struct iommu_table *tbl, unsigned long liobn,
> +		unsigned long ioba, unsigned long tce)
> +{
> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> +	const unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +	const enum dma_data_direction dir = iommu_tce_direction(tce);
> +
> +	/* Clear TCE */
> +	if (dir == DMA_NONE) {
> +		if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
> +			return H_PARAMETER;
> +
> +		return kvmppc_rm_tce_iommu_unmap(vcpu, tbl, entry);
> +	}
> +
> +	/* Put TCE */
> +	if (iommu_tce_put_param_check(tbl, ioba, gpa))
> +		return H_PARAMETER;
> +
> +	return kvmppc_rm_tce_iommu_map(vcpu, tbl, entry, gpa, dir);
> +}
> +
> +static long kvmppc_rm_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
> +		struct iommu_table *tbl, unsigned long ioba,
> +		u64 *tces, unsigned long npages)
> +{
> +	unsigned long i, ret, tce, gpa;
> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> +
> +	for (i = 0; i < npages; ++i) {
> +		gpa = be64_to_cpu(tces[i]) & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +
> +		if (iommu_tce_put_param_check(tbl, ioba +
> +				(i << tbl->it_page_shift), gpa))
> +			return H_PARAMETER;
> +	}
> +
> +	for (i = 0; i < npages; ++i) {
> +		tce = be64_to_cpu(tces[i]);
> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +
> +		ret = kvmppc_rm_tce_iommu_map(vcpu, tbl, entry + i, gpa,
> +				iommu_tce_direction(tce));
> +		if (ret != H_SUCCESS)
> +			return ret;
> +	}
> +
> +	return H_SUCCESS;
> +}
> +
> +static long kvmppc_rm_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
> +		struct iommu_table *tbl,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_value, unsigned long npages)
> +{
> +	unsigned long i;
> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> +
> +	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
> +		return H_PARAMETER;
> +
> +	for (i = 0; i < npages; ++i)
> +		kvmppc_rm_tce_iommu_unmap(vcpu, tbl, entry + i);
> +
> +	return H_SUCCESS;
> +}
> +
>  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		unsigned long ioba, unsigned long tce)
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long ret;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
>  
>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>  	/* 	    liobn, ioba, tce); */
> @@ -211,6 +365,13 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  	if (ret != H_SUCCESS)
>  		return ret;
>  
> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +		ret = kvmppc_rm_h_put_tce_iommu(vcpu, stit->tbl,
> +				liobn, ioba, tce);
> +		if (ret != H_SUCCESS)
> +			return ret;
> +	}
> +
>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
>  
>  	return H_SUCCESS;
> @@ -278,6 +439,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  		 * depend on hpt.
>  		 */
>  		struct mm_iommu_table_group_mem_t *mem;
> +		struct kvmppc_spapr_tce_iommu_table *stit;
>  
>  		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
>  			return H_TOO_HARD;
> @@ -285,6 +447,13 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  		mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K);
>  		if (!mem || mm_iommu_ua_to_hpa_rm(mem, ua, &tces))
>  			return H_TOO_HARD;
> +
> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +			ret = kvmppc_rm_h_put_tce_indirect_iommu(vcpu,
> +					stit->tbl, ioba, (u64 *)tces, npages);
> +			if (ret != H_SUCCESS)
> +				return ret;
> +		}
>  	} else {
>  		/*
>  		 * This is emulated devices case.
> @@ -334,6 +503,8 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
> +
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -347,6 +518,13 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>  		return H_PARAMETER;
>  
> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +		ret = kvmppc_rm_h_stuff_tce_iommu(vcpu, stit->tbl,
> +				liobn, ioba, tce_value, npages);
> +		if (ret != H_SUCCESS)
> +			return ret;
> +	}
> +
>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>  
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index 70963c845e96..0e555ba998c0 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -536,6 +536,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  #ifdef CONFIG_PPC_BOOK3S_64
>  	case KVM_CAP_SPAPR_TCE:
>  	case KVM_CAP_SPAPR_TCE_64:
> +		/* fallthrough */
> +	case KVM_CAP_SPAPR_TCE_VFIO:
>  	case KVM_CAP_PPC_ALLOC_HTAB:
>  	case KVM_CAP_PPC_RTAS:
>  	case KVM_CAP_PPC_FIXUP_HCALL:
> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> index d32f239eb471..3181054c8ff7 100644
> --- a/virt/kvm/vfio.c
> +++ b/virt/kvm/vfio.c
> @@ -20,6 +20,10 @@
>  #include <linux/vfio.h>
>  #include "vfio.h"
>  
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +#include <asm/kvm_ppc.h>
> +#endif
> +
>  struct kvm_vfio_group {
>  	struct list_head node;
>  	struct vfio_group *vfio_group;
> @@ -89,6 +93,22 @@ static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
>  	return ret > 0;
>  }
>  
> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> +{
> +	int (*fn)(struct vfio_group *);
> +	int ret = -1;
> +
> +	fn = symbol_get(vfio_external_user_iommu_id);
> +	if (!fn)
> +		return ret;
> +
> +	ret = fn(vfio_group);
> +
> +	symbol_put(vfio_external_user_iommu_id);
> +
> +	return ret;
> +}
> +
>  /*
>   * Groups can use the same or different IOMMU domains.  If the same then
>   * adding a new group may change the coherency of groups we've previously
> @@ -211,6 +231,9 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>  
>  		mutex_unlock(&kv->lock);
>  
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +		kvm_spapr_tce_release_iommu_group(dev->kvm, vfio_group);
> +#endif
>  		kvm_vfio_group_set_kvm(vfio_group, NULL);
>  
>  		kvm_vfio_group_put_external_user(vfio_group);
> @@ -218,6 +241,65 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>  		kvm_vfio_update_coherency(dev);
>  
>  		return ret;
> +
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
> +		struct kvm_vfio_spapr_tce param;
> +		unsigned long minsz;
> +		struct kvm_vfio *kv = dev->private;
> +		struct vfio_group *vfio_group;
> +		struct kvm_vfio_group *kvg;
> +		struct fd f;
> +
> +		minsz = offsetofend(struct kvm_vfio_spapr_tce, tablefd);
> +
> +		if (copy_from_user(&param, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (param.argsz < minsz || param.flags)
> +			return -EINVAL;
> +
> +		f = fdget(param.groupfd);
> +		if (!f.file)
> +			return -EBADF;
> +
> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
> +		fdput(f);
> +
> +		if (IS_ERR(vfio_group))
> +			return PTR_ERR(vfio_group);
> +
> +		ret = -ENOENT;
> +
> +		mutex_lock(&kv->lock);
> +
> +		list_for_each_entry(kvg, &kv->group_list, node) {
> +			int group_id;
> +			struct iommu_group *grp;
> +
> +			if (kvg->vfio_group != vfio_group)
> +				continue;
> +
> +			group_id = kvm_vfio_external_user_iommu_id(
> +					kvg->vfio_group);
> +			grp = iommu_group_get_by_id(group_id);
> +			if (!grp) {
> +				ret = -EFAULT;
> +				break;
> +			}
> +
> +			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
> +					param.tablefd, vfio_group, grp);
> +
> +			iommu_group_put(grp);
> +			break;
> +		}
> +
> +		mutex_unlock(&kv->lock);
> +
> +		return ret;
> +	}
> +#endif /* CONFIG_SPAPR_TCE_IOMMU */
>  	}

Don't you also need to add something to the KVM_DEV_VFIO_GROUP_DEL
path to detach the group from all LIOBNs?  Or else just fail if
there are LIOBNs attached.  I think it would be a qemu bug not to
detach the LIOBNs before removing the group, but we still need to
protect the host in that case.

>  
>  	return -ENXIO;
> @@ -242,6 +324,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
>  		switch (attr->attr) {
>  		case KVM_DEV_VFIO_GROUP_ADD:
>  		case KVM_DEV_VFIO_GROUP_DEL:
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
> +#endif
>  			return 0;
>  		}
>  
> @@ -257,6 +342,9 @@ static void kvm_vfio_destroy(struct kvm_device *dev)
>  	struct kvm_vfio_group *kvg, *tmp;
>  
>  	list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +		kvm_spapr_tce_release_iommu_group(dev->kvm, kvg->vfio_group);
> +#endif
>  		kvm_vfio_group_set_kvm(kvg->vfio_group, NULL);
>  		kvm_vfio_group_put_external_user(kvg->vfio_group);
>  		list_del(&kvg->node);

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH kernel v2 05/11] KVM: PPC: Use preregistered memory API to access TCE list
  2017-01-11  6:35         ` Alexey Kardashevskiy
@ 2017-01-12  5:49           ` David Gibson
  -1 siblings, 0 replies; 56+ messages in thread
From: David Gibson @ 2017-01-12  5:49 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Alex Williamson, Paul Mackerras, kvm-ppc, kvm

[-- Attachment #1: Type: text/plain, Size: 8434 bytes --]

On Wed, Jan 11, 2017 at 05:35:21PM +1100, Alexey Kardashevskiy wrote:
> On 21/12/16 19:57, Alexey Kardashevskiy wrote:
> > On 21/12/16 15:08, David Gibson wrote:
> >> On Sun, Dec 18, 2016 at 12:28:54PM +1100, Alexey Kardashevskiy wrote:
> >>> VFIO on sPAPR already implements guest memory pre-registration
> >>> when the entire guest RAM gets pinned. This can be used to translate
> >>> the physical address of a guest page containing the TCE list
> >>> from H_PUT_TCE_INDIRECT.
> >>>
> >>> This makes use of the pre-registered memory API to access TCE list
> >>> pages in order to avoid unnecessary locking on the KVM memory
> >>> reverse map as we know that all of guest memory is pinned and
> >>> we have a flat array mapping GPA to HPA which makes it simpler and
> >>> quicker to index into that array (even with looking up the
> >>> kernel page tables in vmalloc_to_phys) than it is to find the memslot,
> >>> lock the rmap entry, look up the user page tables, and unlock the rmap
> >>> entry. Note that the rmap pointer is initialized to NULL
> >>> where declared (not in this patch).
> >>>
> >>> If a requested chunk of memory has not been preregistered,
> >>> this will fail with H_TOO_HARD so the virtual mode handler can
> >>> handle the request.
> >>>
> >>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>> ---
> >>> Changes:
> >>> v2:
> >>> * updated the commit log with David's comment
> >>> ---
> >>>  arch/powerpc/kvm/book3s_64_vio_hv.c | 65 ++++++++++++++++++++++++++++---------
> >>>  1 file changed, 49 insertions(+), 16 deletions(-)
> >>>
> >>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> >>> index d461c440889a..a3be4bd6188f 100644
> >>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> >>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> >>> @@ -180,6 +180,17 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
> >>>  EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
> >>>  
> >>>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> >>> +static inline bool kvmppc_preregistered(struct kvm_vcpu *vcpu)
> >>> +{
> >>> +	return mm_iommu_preregistered(vcpu->kvm->mm);
> >>> +}
> >>> +
> >>> +static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
> >>> +		struct kvm_vcpu *vcpu, unsigned long ua, unsigned long size)
> >>> +{
> >>> +	return mm_iommu_lookup_rm(vcpu->kvm->mm, ua, size);
> >>> +}
> >>
> >> I don't see that there's much point to these inlines.
> >>
> >>>  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> >>>  		unsigned long ioba, unsigned long tce)
> >>>  {
> >>> @@ -260,23 +271,44 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>>  	if (ret != H_SUCCESS)
> >>>  		return ret;
> >>>  
> >>> -	if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
> >>> -		return H_TOO_HARD;
> >>> +	if (kvmppc_preregistered(vcpu)) {
> >>> +		/*
> >>> +		 * We get here if guest memory was pre-registered which
> >>> +		 * is normally VFIO case and gpa->hpa translation does not
> >>> +		 * depend on hpt.
> >>> +		 */
> >>> +		struct mm_iommu_table_group_mem_t *mem;
> >>>  
> >>> -	rmap = (void *) vmalloc_to_phys(rmap);
> >>> +		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
> >>> +			return H_TOO_HARD;
> >>>  
> >>> -	/*
> >>> -	 * Synchronize with the MMU notifier callbacks in
> >>> -	 * book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
> >>> -	 * While we have the rmap lock, code running on other CPUs
> >>> -	 * cannot finish unmapping the host real page that backs
> >>> -	 * this guest real page, so we are OK to access the host
> >>> -	 * real page.
> >>> -	 */
> >>> -	lock_rmap(rmap);
> >>> -	if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
> >>> -		ret = H_TOO_HARD;
> >>> -		goto unlock_exit;
> >>> +		mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K);
> >>> +		if (!mem || mm_iommu_ua_to_hpa_rm(mem, ua, &tces))
> >>> +			return H_TOO_HARD;
> >>> +	} else {
> >>> +		/*
> >>> +		 * This is emulated devices case.
> >>> +		 * We do not require memory to be preregistered in this case
> >>> +		 * so lock rmap and do __find_linux_pte_or_hugepte().
> >>> +		 */
> >>
> >> Hmm.  So this isn't wrong as such, but the logic and comments are
> >> both misleading.  The 'if' here isn't really about VFIO vs. emulated -
> >> it's about whether the mm has *any* preregistered chunks, without any
> >> regard to which particular device you're talking about.  For example
> >> if your guest has two PHBs, one with VFIO devices and the other with
> >> emulated devices, then the emulated devices will still go through the
> >> "VFIO" case here.
> > 
> > kvmppc_preregistered() checks for a single pointer, kvmppc_rm_ua_to_hpa()
> > goes through __find_linux_pte_or_hugepte() which is unnecessary
> > complication here.

Except that you're going to call kvmppc_rm_ua_to_hpa() eventually anyway.

> > s/emulated devices case/case of a guest with emulated devices
> only/ ?

Changing that in the comments would help, yes.

> > 
> > 
> >> Really what you have here is a fast case when the tce_list is in
> >> preregistered memory, and a fallback case when it isn't.  But that's
> >> obscured by the fact that if for whatever reason you have some
> >> preregistered memory but it doesn't cover the tce_list, then you don't
> >> go to the fallback case here, but instead fall right back to the
> >> virtual mode handler.
> > 
> > This is purely acceleration, I am trying to make obvious cases faster, and
> > other cases safer. If some chunk is not preregistered but others are and
> > there is H_PUT_TCE_INDIRECT with tce_list from non-preregistered memory,
> > then I have no idea what this userspace is and what it is doing, so I just
> > do not want to accelerate anything for it in real mode (I have poor
> > imagination and since I cannot test it - I better drop it).

You have this all backwards.  The kernel SHOULD NOT have a
preconceived idea of what userspace is and how it's using the kernel
facilities.  It should just provide the kernel interfaces with their
defined semantics, and userspace can use them however it wants.

This is a frequent cause of problems in your patches: you base
comments and code around what you imagine to be the usage model in
userspace.  This makes the comments misleading, and the code fragile
against future changes in use cases.  Userspace and the kernel should
always be isolated from each other by a well defined API, and not go
making assumptions about each other's behaviour beyond those defined
API semantics.

> >> So, I think you should either:
> >>     1) Fallback to the code below whenever you can't access the
> >>        tce_list via prereg memory, regardless of whether there's any
> >>        _other_ prereg memory
> > 
> > Using prereg was the entire point here as if something goes wrong (i.e.
> > some TCE fails to translate), I may stop in a middle of TCE list and will
> > have to do complicated rollback to pass the request in the virtual mode to
> > finish processing (note that there is nothing to revert when it is
> > emulated-devices-only-guest).

But you *still* have that problem with the code above.  If the
userspace has preregistered memory chunks you still won't know until
you look closer whether the indirect table specifically is covered by
the prereg region.  You lose nothing by checking *that* and falling
back to the slow path if the prereg lookup fails.
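
i.e. roughly, with a new local flag (say "prereg"), untested:

	if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
		return H_TOO_HARD;

	mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K);
	prereg = mem && !mm_iommu_ua_to_hpa_rm(mem, ua, &tces);
	if (!prereg) {
		/* tce_list is not covered by prereg memory, use the rmap path */
		rmap = (void *) vmalloc_to_phys(rmap);
		lock_rmap(rmap);
		if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
			ret = H_TOO_HARD;
			goto unlock_exit;
		}
	}
	/* ... and only unlock_rmap() at the end when !prereg */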

> >> or
> >>     2) Drop the code below entirely and always return H_TOO_HARD if
> >>        you can't get the tce_list from prereg.
> > 
> > This path cannot fail for emulated device and it is really fast path, why
> > to drop it?

Because it makes the RM code simpler.  If dropping this fallback from
RM is too much of a performance hit, then go for option (1) instead.

> > I am _really_ confused now. In few last respins this was not a concern, now
> > it is - is the patch this bad and 100% needs to be reworked? I am trying to
> > push it last 3 years now :(

3 years, and yet it still has muddled concepts.  Postings have been
infrequent enough that I do tend to forget my context from one round
to the next.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH kernel v3] KVM: PPC: Add in-kernel acceleration for VFIO
  2017-01-12  5:04       ` David Gibson
@ 2017-01-12  8:09         ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 56+ messages in thread
From: Alexey Kardashevskiy @ 2017-01-12  8:09 UTC (permalink / raw)
  To: David Gibson; +Cc: linuxppc-dev, Paul Mackerras, kvm-ppc, kvm, Alex Williamson


[-- Attachment #1.1: Type: text/plain, Size: 33868 bytes --]

On 12/01/17 16:04, David Gibson wrote:
> On Tue, Dec 20, 2016 at 05:52:29PM +1100, Alexey Kardashevskiy wrote:
>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
>> and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO
>> without passing them to user space which saves time on switching
>> to user space and back.
>>
>> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
>> KVM tries to handle a TCE request in the real mode, if failed
>> it passes the request to the virtual mode to complete the operation.
>> If the virtual mode handler fails, the request is passed to
>> the user space; this is not expected to happen though.
>>
>> To avoid dealing with page use counters (which is tricky in real mode),
>> this only accelerates SPAPR TCE IOMMU v2 clients which are required
>> to pre-register the userspace memory. The very first TCE request will
>> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
>> of the TCE table (iommu_table::it_userspace) is not allocated till
>> the very first mapping happens and we cannot call vmalloc in real mode.
>>
>> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
>> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
>> and associates a physical IOMMU table with the SPAPR TCE table (which
>> is a guest view of the hardware IOMMU table). The iommu_table object
>> is cached and referenced so we do not have to look up for it in real mode.
>>
>> This does not implement the UNSET counterpart as there is no use for it -
>> once the acceleration is enabled, the existing userspace won't
>> disable it unless a VFIO container is destroyed; this adds necessary
>> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
>>
>> As this creates a descriptor per IOMMU table-LIOBN couple (called
>> kvmppc_spapr_tce_iommu_table), it is possible to have several
>> descriptors with the same iommu_table (hardware IOMMU table) attached
>> to the same LIOBN, this is done to simplify the cleanup and can be
>> improved later.
>>
>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
>> space.
>>
>> This finally makes use of vfio_external_user_iommu_id() which was
>> introduced quite some time ago and was considered for removal.
>>
>> Tests show that this patch increases transmission speed from 220MB/s
>> to 750..1020MB/s on 10Gb network (Chelsio CXGB3 10Gb ethernet card).
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>> Changes:
>> v3:
>> * simplified not to use VFIO group notifiers
>> * reworked cleanup, should be cleaner/simpler now
>>
>> v2:
>> * reworked to use new VFIO notifiers
>> * now same iommu_table may appear in the list several times, to be fixed later
>> ---
>>
>> This obsoletes:
>>
>> [PATCH kernel v2 08/11] KVM: PPC: Pass kvm* to kvmppc_find_table()
>> [PATCH kernel v2 09/11] vfio iommu: Add helpers to (un)register blocking notifiers per group
>> [PATCH kernel v2 11/11] KVM: PPC: Add in-kernel acceleration for VFIO
>>
>>
>> So I have not reposted the whole thing, should have I?
>>
>>
>> btw "F:     virt/kvm/vfio.*"  is missing MAINTAINERS.
>>
>>
>> ---
>>  Documentation/virtual/kvm/devices/vfio.txt |  22 ++-
>>  arch/powerpc/include/asm/kvm_host.h        |   8 +
>>  arch/powerpc/include/asm/kvm_ppc.h         |   4 +
>>  include/uapi/linux/kvm.h                   |   8 +
>>  arch/powerpc/kvm/book3s_64_vio.c           | 286 +++++++++++++++++++++++++++++
>>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 178 ++++++++++++++++++
>>  arch/powerpc/kvm/powerpc.c                 |   2 +
>>  virt/kvm/vfio.c                            |  88 +++++++++
>>  8 files changed, 594 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
>> index ef51740c67ca..f95d867168ea 100644
>> --- a/Documentation/virtual/kvm/devices/vfio.txt
>> +++ b/Documentation/virtual/kvm/devices/vfio.txt
>> @@ -16,7 +16,25 @@ Groups:
>>  
>>  KVM_DEV_VFIO_GROUP attributes:
>>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
>> +	kvm_device_attr.addr points to an int32_t file descriptor
>> +	for the VFIO group.
>>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
>> +	kvm_device_attr.addr points to an int32_t file descriptor
>> +	for the VFIO group.
>> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
>> +	allocated by sPAPR KVM.
>> +	kvm_device_attr.addr points to a struct:
>>  
>> -For each, kvm_device_attr.addr points to an int32_t file descriptor
>> -for the VFIO group.
>> +	struct kvm_vfio_spapr_tce {
>> +		__u32	argsz;
>> +		__u32	flags;
>> +		__s32	groupfd;
>> +		__s32	tablefd;
>> +	};
>> +
>> +	where
>> +	@argsz is the size of kvm_vfio_spapr_tce_liobn;
>> +	@flags are not supported now, must be zero;
>> +	@groupfd is a file descriptor for a VFIO group;
>> +	@tablefd is a file descriptor for a TCE table allocated via
>> +		KVM_CREATE_SPAPR_TCE.
>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
>> index 28350a294b1e..3d281b7ea369 100644
>> --- a/arch/powerpc/include/asm/kvm_host.h
>> +++ b/arch/powerpc/include/asm/kvm_host.h
>> @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
>>  	atomic_t refcnt;
>>  };
>>  
>> +struct kvmppc_spapr_tce_iommu_table {
>> +	struct rcu_head rcu;
>> +	struct list_head next;
>> +	struct vfio_group *group;
>> +	struct iommu_table *tbl;
>> +};
>> +
>>  struct kvmppc_spapr_tce_table {
>>  	struct list_head list;
>>  	struct kvm *kvm;
>> @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
>>  	u32 page_shift;
>>  	u64 offset;		/* in pages */
>>  	u64 size;		/* window size in pages */
>> +	struct list_head iommu_tables;
>>  	struct page *pages[0];
>>  };
>>  
>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
>> index 0a21c8503974..936138b866e7 100644
>> --- a/arch/powerpc/include/asm/kvm_ppc.h
>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
>> @@ -163,6 +163,10 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
>>  extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
>>  			struct kvm_memory_slot *memslot, unsigned long porder);
>>  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
>> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
>> +		struct vfio_group *group, struct iommu_group *grp);
>> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
>> +		struct vfio_group *group);
>>  
>>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>  				struct kvm_create_spapr_tce_64 *args);
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index 810f74317987..4088da4a575f 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -1068,6 +1068,7 @@ struct kvm_device_attr {
>>  #define  KVM_DEV_VFIO_GROUP			1
>>  #define   KVM_DEV_VFIO_GROUP_ADD			1
>>  #define   KVM_DEV_VFIO_GROUP_DEL			2
>> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
>>  
>>  enum kvm_device_type {
>>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
>> @@ -1089,6 +1090,13 @@ enum kvm_device_type {
>>  	KVM_DEV_TYPE_MAX,
>>  };
>>  
>> +struct kvm_vfio_spapr_tce {
>> +	__u32	argsz;
>> +	__u32	flags;
>> +	__s32	groupfd;
>> +	__s32	tablefd;
>> +};
>> +
>>  /*
>>   * ioctls for VM fds
>>   */
>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
>> index 15df8ae627d9..008c4aee4df6 100644
>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>> @@ -27,6 +27,10 @@
>>  #include <linux/hugetlb.h>
>>  #include <linux/list.h>
>>  #include <linux/anon_inodes.h>
>> +#include <linux/iommu.h>
>> +#include <linux/file.h>
>> +#include <linux/vfio.h>
>> +#include <linux/module.h>
>>  
>>  #include <asm/tlbflush.h>
>>  #include <asm/kvm_ppc.h>
>> @@ -39,6 +43,20 @@
>>  #include <asm/udbg.h>
>>  #include <asm/iommu.h>
>>  #include <asm/tce.h>
>> +#include <asm/mmu_context.h>
>> +
>> +static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
>> +{
>> +	void (*fn)(struct vfio_group *);
>> +
>> +	fn = symbol_get(vfio_group_put_external_user);
>> +	if (!fn)
> 
> I think this should have a WARN_ON().  If the vfio module is gone
> while you still have VFIO groups attached to a KVM table, something
> has gone horribly wrong.
> 
>> +		return;
>> +
>> +	fn(vfio_group);
>> +
>> +	symbol_put(vfio_group_put_external_user);
>> +}
>>  
>>  static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
>>  {
>> @@ -90,6 +108,99 @@ static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
>>  	return ret;
>>  }
>>  
>> +static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
>> +{
>> +	struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
>> +			struct kvmppc_spapr_tce_iommu_table, rcu);
>> +
>> +	kfree(stit);
>> +}
>> +
>> +static void kvm_spapr_tce_liobn_release_iommu_group(
>> +		struct kvmppc_spapr_tce_table *stt,
>> +		struct vfio_group *group)
>> +{
>> +	struct kvmppc_spapr_tce_iommu_table *stit, *tmp;
>> +
>> +	list_for_each_entry_safe(stit, tmp, &stt->iommu_tables, next) {
>> +		if (group && (stit->group != group))
>> +			continue;
>> +
>> +		list_del_rcu(&stit->next);
>> +
>> +		iommu_table_put(stit->tbl);
>> +		kvm_vfio_group_put_external_user(stit->group);
>> +
>> +		call_rcu(&stit->rcu, kvm_spapr_tce_iommu_table_free);
>> +	}
>> +}
>> +
>> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
>> +		struct vfio_group *group)
>> +{
>> +	struct kvmppc_spapr_tce_table *stt;
>> +
>> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list)
>> +		kvm_spapr_tce_liobn_release_iommu_group(stt, group);
>> +}
>> +
>> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
>> +		struct vfio_group *group, struct iommu_group *grp)
> 
> Isn't passing both the vfio_group and the iommu_group redundant?


The vfio_group struct is internal to the vfio driver and there is no API
to get the iommu_group pointer from it.



> 
>> +{
>> +	struct kvmppc_spapr_tce_table *stt = NULL;
>> +	bool found = false;
>> +	struct iommu_table *tbl = NULL;
>> +	struct iommu_table_group *table_group;
>> +	long i;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>> +	struct fd f;
>> +
>> +	f = fdget(tablefd);
>> +	if (!f.file)
>> +		return -EBADF;
>> +
>> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
>> +		if (stt == f.file->private_data) {
>> +			found = true;
>> +			break;
>> +		}
>> +	}
>> +
>> +	fdput(f);
>> +
>> +	if (!found)
>> +		return -ENODEV;
> 
> Not entirely sure if ENODEV is the right error, but I can't
> immediately think of a better one.
> 
>> +	table_group = iommu_group_get_iommudata(grp);
>> +	if (!table_group)
>> +		return -EFAULT;
> 
> EFAULT is usually only returned when you pass a syscall a bad pointer,
> which doesn't look to be the case here.  What situation does this
> error path actually represent?


"something went terribly wrong".


> 
>> +
>> +	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
>> +		struct iommu_table *tbltmp = table_group->tables[i];
>> +
>> +		if (!tbltmp)
>> +			continue;
>> +
>> +		if ((tbltmp->it_page_shift == stt->page_shift) &&
>> +				(tbltmp->it_offset == stt->offset)) {
>> +			tbl = tbltmp;
>> +			break;
>> +		}
>> +	}
>> +	if (!tbl)
>> +		return -ENODEV;
>> +
>> +	iommu_table_get(tbl);
>> +
>> +	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
>> +	stit->tbl = tbl;
>> +	stit->group = group;
>> +
>> +	list_add_rcu(&stit->next, &stt->iommu_tables);
> 
> Won't this add a separate stit entry for each group attached to the
> LIOBN, even if those groups share a single hardware iommu table -
> which is the likely case if those groups have all been put into the
> same container.


Correct. I am planning on optimizing this later.



> 
>> +	return 0;
>> +}
>> +
>>  static void release_spapr_tce_table(struct rcu_head *head)
>>  {
>>  	struct kvmppc_spapr_tce_table *stt = container_of(head,
>> @@ -132,6 +243,8 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
>>  
>>  	list_del_rcu(&stt->list);
>>  
>> +	kvm_spapr_tce_liobn_release_iommu_group(stt, NULL /* release all */);
>> +
>>  	kvm_put_kvm(stt->kvm);
>>  
>>  	kvmppc_account_memlimit(
>> @@ -181,6 +294,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>  	stt->offset = args->offset;
>>  	stt->size = size;
>>  	stt->kvm = kvm;
>> +	INIT_LIST_HEAD_RCU(&stt->iommu_tables);
>>  
>>  	for (i = 0; i < npages; i++) {
>>  		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
>> @@ -209,11 +323,161 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>  	return ret;
>>  }
>>  
>> +static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
>> +		struct iommu_table *tbl, unsigned long entry)
>> +{
>> +	struct mm_iommu_table_group_mem_t *mem = NULL;
>> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
>> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>> +
>> +	if (!pua)
>> +		return H_HARDWARE;
>> +
>> +	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
>> +	if (!mem)
>> +		return H_HARDWARE;
> 
> IIUC, this error represents trying to unmap a page from the vIOMMU,
> and discovering that it wasn't preregistered in the first place, which
> shouldn't happen.  So would a WARN_ON() make sense here as well as the
> H_HARDWARE.
>
>> +	mm_iommu_mapped_dec(mem);
>> +
>> +	*pua = 0;
>> +
>> +	return H_SUCCESS;
>> +}
>> +
>> +static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
>> +		struct iommu_table *tbl, unsigned long entry)
>> +{
>> +	enum dma_data_direction dir = DMA_NONE;
>> +	unsigned long hpa = 0;
>> +
>> +	if (iommu_tce_xchg(tbl, entry, &hpa, &dir))
>> +		return H_HARDWARE;
>> +
>> +	if (dir == DMA_NONE)
>> +		return H_SUCCESS;
>> +
>> +	return kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
>> +}
>> +
>> +long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
>> +		unsigned long entry, unsigned long gpa,
>> +		enum dma_data_direction dir)
>> +{
>> +	long ret;
>> +	unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>> +	struct mm_iommu_table_group_mem_t *mem;
>> +
>> +	if (!pua)
>> +		/* it_userspace allocation might be delayed */
>> +		return H_TOO_HARD;
>> +
>> +	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
>> +		return H_HARDWARE;
> 
> This would represent the guest trying to map a bad GPA, yes?  In which
> case H_HARDWARE doesn't seem right.  H_PARAMETER or H_PERMISSION, maybe.
> 
>> +	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
>> +	if (!mem)
>> +		return H_HARDWARE;
> 
> Here H_HARDWARE seems right. IIUC this represents the guest trying to
> map an address which wasn't pre-registered.  That would indicate a bug
> in qemu, which is hardware as far as the guest is concerned.
> 
>> +
>> +	if (mm_iommu_ua_to_hpa(mem, ua, &hpa))
>> +		return H_HARDWARE;
> 
> Not sure what this case represents.

Preregistered memory not being able to translate a userspace address to a
host physical address. In virtual mode it is a simple bounds check, in real
mode it also includes vmalloc_to_phys() failure.
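
The virtual mode helper is basically just a bounds check plus an array
lookup, i.e. (from memory, field names may be slightly off):

long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
		unsigned long ua, unsigned long *hpa)
{
	const long entry = (ua - mem->ua) >> PAGE_SHIFT;

	if (entry >= mem->entries)
		return -EFAULT;

	*hpa = mem->hpas[entry] | (ua & ~PAGE_MASK);

	return 0;
}

so in virtual mode it only fails if @ua is outside the registered chunk.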


> 
>> +	if (mm_iommu_mapped_inc(mem))
>> +		return H_HARDWARE;
> 
> Or this.

A preregistered memory area is in the process of disposal; no new mappings
are allowed.
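
mm_iommu_mapped_inc() is roughly:

	if (atomic64_inc_not_zero(&mem->mapped))
		return 0;

	/* Last mm_iommu_put() has been called, no more mappings allowed */
	return -ENXIO;

so it only fails when the last reference to the area is being dropped.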


> 
>> +	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
>> +	if (ret) {
>> +		mm_iommu_mapped_dec(mem);
>> +		return H_TOO_HARD;
>> +	}
>> +
>> +	if (dir != DMA_NONE)
>> +		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
>> +
>> +	*pua = ua;
>> +
>> +	return 0;
>> +}
>> +
>> +long kvmppc_h_put_tce_iommu(struct kvm_vcpu *vcpu,
>> +		struct iommu_table *tbl,
>> +		unsigned long liobn, unsigned long ioba,
>> +		unsigned long tce)
>> +{
>> +	long idx, ret = H_HARDWARE;
>> +	const unsigned long entry = ioba >> tbl->it_page_shift;
>> +	const unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>> +	const enum dma_data_direction dir = iommu_tce_direction(tce);
>> +
>> +	/* Clear TCE */
>> +	if (dir == DMA_NONE) {
>> +		if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
>> +			return H_PARAMETER;
>> +
>> +		return kvmppc_tce_iommu_unmap(vcpu->kvm, tbl, entry);
>> +	}
>> +
>> +	/* Put TCE */
>> +	if (iommu_tce_put_param_check(tbl, ioba, gpa))
>> +		return H_PARAMETER;
>> +
>> +	idx = srcu_read_lock(&vcpu->kvm->srcu);
>> +	ret = kvmppc_tce_iommu_map(vcpu->kvm, tbl, entry, gpa, dir);
>> +	srcu_read_unlock(&vcpu->kvm->srcu, idx);
>> +
>> +	return ret;
>> +}
>> +
>> +static long kvmppc_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
>> +		struct iommu_table *tbl, unsigned long ioba,
>> +		u64 __user *tces, unsigned long npages)
>> +{
>> +	unsigned long i, ret, tce, gpa;
>> +	const unsigned long entry = ioba >> tbl->it_page_shift;
>> +
>> +	for (i = 0; i < npages; ++i) {
>> +		gpa = be64_to_cpu(tces[i]) & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> 
> IIUC this is the virtual mode, not the real mode version.  In which
> case you shouldn't be accessing tces[i] (a userspace pointer) directly
> but should instead be using get_user().
> 
>> +		if (iommu_tce_put_param_check(tbl, ioba +
>> +				(i << tbl->it_page_shift), gpa))
>> +			return H_PARAMETER;
>> +	}
>> +
>> +	for (i = 0; i < npages; ++i) {
>> +		tce = be64_to_cpu(tces[i]);
>> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>> +
>> +		ret = kvmppc_tce_iommu_map(vcpu->kvm, tbl, entry + i, gpa,
>> +				iommu_tce_direction(tce));
>> +		if (ret != H_SUCCESS)
>> +			return ret;
>> +	}
>> +
>> +	return H_SUCCESS;
>> +}
>> +
>> +long kvmppc_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
>> +		struct iommu_table *tbl,
>> +		unsigned long liobn, unsigned long ioba,
>> +		unsigned long tce_value, unsigned long npages)
>> +{
>> +	unsigned long i;
>> +	const unsigned long entry = ioba >> tbl->it_page_shift;
>> +
>> +	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
>> +		return H_PARAMETER;
>> +
>> +	for (i = 0; i < npages; ++i)
>> +		kvmppc_tce_iommu_unmap(vcpu->kvm, tbl, entry + i);
>> +
>> +	return H_SUCCESS;
>> +}
>> +
>>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>  		      unsigned long ioba, unsigned long tce)
>>  {
>>  	struct kvmppc_spapr_tce_table *stt;
>>  	long ret;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>  
>>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>>  	/* 	    liobn, ioba, tce); */
>> @@ -230,6 +494,12 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>  	if (ret != H_SUCCESS)
>>  		return ret;
>>  
>> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> 
> As noted above, AFAICT there is one stit per group, rather than per
> backend IOMMU table, so if there are multiple groups in the same
> container (and therefore attached to the same LIOBN), won't this mean
> we duplicate this operation a bunch of times?
> 
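
If the duplication turns out to matter in practice, one way to avoid it would
be to skip hardware tables that are already attached, e.g. (sketch only;
the per-group reference accounting would also need rework):

	/*
	 * In kvm_spapr_tce_attach_iommu_group(), before taking a new
	 * iommu_table reference: bail out if this hardware table is
	 * already on the LIOBN's list.
	 */
	list_for_each_entry_rcu(stit, &stt->iommu_tables, next) {
		if (stit->tbl == tbl)
			return 0;
	}
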
>> +		ret = kvmppc_h_put_tce_iommu(vcpu, stit->tbl, liobn, ioba, tce);
>> +		if (ret != H_SUCCESS)
>> +			return ret;
>> +	}
>> +
>>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
>>  
>>  	return H_SUCCESS;
>> @@ -245,6 +515,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>  	unsigned long entry, ua = 0;
>>  	u64 __user *tces;
>>  	u64 tce;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>  
>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>>  	if (!stt)
>> @@ -272,6 +543,13 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>  	}
>>  	tces = (u64 __user *) ua;
>>  
>> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>> +		ret = kvmppc_h_put_tce_indirect_iommu(vcpu,
>> +				stit->tbl, ioba, tces, npages);
>> +		if (ret != H_SUCCESS)
>> +			goto unlock_exit;
> 
> Hmm, I don't suppose you could simplify things by not having a
> put_tce_indirect() version of the whole backend iommu mapping
> function, but just a single-TCE version, and instead looping across
> the backend IOMMU tables as you put each indirect entry in.
> 
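
Roughly this shape, presumably (a sketch of the suggestion only, using the
names already present in kvmppc_h_put_tce_indirect(); validation and error
unwinding are omitted):

	for (i = 0; i < npages; ++i) {
		if (get_user(tce, tces + i)) {
			ret = H_TOO_HARD;
			goto unlock_exit;
		}
		tce = be64_to_cpu(tce);

		/* apply this one TCE to every backend table of the LIOBN */
		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
			ret = kvmppc_h_put_tce_iommu(vcpu, stit->tbl, liobn,
					ioba + (i << stt->page_shift), tce);
			if (ret != H_SUCCESS)
				goto unlock_exit;
		}

		kvmppc_tce_put(stt, entry + i, tce);
	}
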
>> +	}
>> +
>>  	for (i = 0; i < npages; ++i) {
>>  		if (get_user(tce, tces + i)) {
>>  			ret = H_TOO_HARD;
>> @@ -299,6 +577,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>>  {
>>  	struct kvmppc_spapr_tce_table *stt;
>>  	long i, ret;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>  
>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>>  	if (!stt)
>> @@ -312,6 +591,13 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>>  		return H_PARAMETER;
>>  
>> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>> +		ret = kvmppc_h_stuff_tce_iommu(vcpu, stit->tbl, liobn, ioba,
>> +				tce_value, npages);
>> +		if (ret != H_SUCCESS)
>> +			return ret;
>> +	}
>> +
>>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>>  
>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
>> index 8a6834e6e1c8..4d6f01712a6d 100644
>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
>> @@ -190,11 +190,165 @@ static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
>>  	return mm_iommu_lookup_rm(vcpu->kvm->mm, ua, size);
>>  }
>>  
>> +static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm_vcpu *vcpu,
>> +		struct iommu_table *tbl, unsigned long entry)
>> +{
>> +	struct mm_iommu_table_group_mem_t *mem = NULL;
>> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
>> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>> +
>> +	if (!pua)
>> +		return H_SUCCESS;
> 
> What case is this?  Not being able to find the userspace entry doesn't sound
> like a success.
> 
>> +	pua = (void *) vmalloc_to_phys(pua);
>> +	if (!pua)
>> +		return H_SUCCESS;
> 
> And again..
> 
>> +	mem = kvmppc_rm_iommu_lookup(vcpu, *pua, pgsize);
>> +	if (!mem)
>> +		return H_HARDWARE;
> 
> Should this have a WARN_ON?
> 
>> +	mm_iommu_mapped_dec(mem);
>> +
>> +	*pua = 0;
>> +
>> +	return H_SUCCESS;
>> +}
>> +
>> +static long kvmppc_rm_tce_iommu_unmap(struct kvm_vcpu *vcpu,
>> +		struct iommu_table *tbl, unsigned long entry)
>> +{
>> +	enum dma_data_direction dir = DMA_NONE;
>> +	unsigned long hpa = 0;
>> +
>> +	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
>> +		return H_HARDWARE;
>> +
>> +	if (dir == DMA_NONE)
>> +		return H_SUCCESS;
>> +
>> +	return kvmppc_rm_tce_iommu_mapped_dec(vcpu, tbl, entry);
>> +}
>> +
>> +long kvmppc_rm_tce_iommu_map(struct kvm_vcpu *vcpu, struct iommu_table *tbl,
>> +		unsigned long entry, unsigned long gpa,
>> +		enum dma_data_direction dir)
>> +{
>> +	long ret;
>> +	unsigned long hpa = 0, ua;
>> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>> +	struct mm_iommu_table_group_mem_t *mem;
>> +
>> +	if (!pua)
>> +		/* it_userspace allocation might be delayed */
>> +		return H_TOO_HARD;
>> +
>> +	if (kvmppc_gpa_to_ua(vcpu->kvm, gpa, &ua, NULL))
>> +		return H_HARDWARE;
> 
> Again H_HARDWARE doesn't seem right here.
> 
>> +	mem = kvmppc_rm_iommu_lookup(vcpu, ua, 1ULL << tbl->it_page_shift);
>> +	if (!mem)
>> +		return H_HARDWARE;
>> +
>> +	if (mm_iommu_ua_to_hpa_rm(mem, ua, &hpa))
>> +		return H_HARDWARE;
>> +
>> +	pua = (void *) vmalloc_to_phys(pua);
>> +	if (!pua)
>> +		return H_HARDWARE;
>> +
>> +	if (mm_iommu_mapped_inc(mem))
>> +		return H_HARDWARE;
>> +
>> +	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
>> +	if (ret) {
>> +		mm_iommu_mapped_dec(mem);
>> +		return H_TOO_HARD;
>> +	}
>> +
>> +	if (dir != DMA_NONE)
>> +		kvmppc_rm_tce_iommu_mapped_dec(vcpu, tbl, entry);
>> +
>> +	*pua = ua;
>> +
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(kvmppc_rm_tce_iommu_map);
>> +
>> +static long kvmppc_rm_h_put_tce_iommu(struct kvm_vcpu *vcpu,
>> +		struct iommu_table *tbl, unsigned long liobn,
>> +		unsigned long ioba, unsigned long tce)
>> +{
>> +	const unsigned long entry = ioba >> tbl->it_page_shift;
>> +	const unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>> +	const enum dma_data_direction dir = iommu_tce_direction(tce);
>> +
>> +	/* Clear TCE */
>> +	if (dir == DMA_NONE) {
>> +		if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
>> +			return H_PARAMETER;
>> +
>> +		return kvmppc_rm_tce_iommu_unmap(vcpu, tbl, entry);
>> +	}
>> +
>> +	/* Put TCE */
>> +	if (iommu_tce_put_param_check(tbl, ioba, gpa))
>> +		return H_PARAMETER;
>> +
>> +	return kvmppc_rm_tce_iommu_map(vcpu, tbl, entry, gpa, dir);
>> +}
>> +
>> +static long kvmppc_rm_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
>> +		struct iommu_table *tbl, unsigned long ioba,
>> +		u64 *tces, unsigned long npages)
>> +{
>> +	unsigned long i, ret, tce, gpa;
>> +	const unsigned long entry = ioba >> tbl->it_page_shift;
>> +
>> +	for (i = 0; i < npages; ++i) {
>> +		gpa = be64_to_cpu(tces[i]) & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>> +
>> +		if (iommu_tce_put_param_check(tbl, ioba +
>> +				(i << tbl->it_page_shift), gpa))
>> +			return H_PARAMETER;
>> +	}
>> +
>> +	for (i = 0; i < npages; ++i) {
>> +		tce = be64_to_cpu(tces[i]);
>> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>> +
>> +		ret = kvmppc_rm_tce_iommu_map(vcpu, tbl, entry + i, gpa,
>> +				iommu_tce_direction(tce));
>> +		if (ret != H_SUCCESS)
>> +			return ret;
>> +	}
>> +
>> +	return H_SUCCESS;
>> +}
>> +
>> +static long kvmppc_rm_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
>> +		struct iommu_table *tbl,
>> +		unsigned long liobn, unsigned long ioba,
>> +		unsigned long tce_value, unsigned long npages)
>> +{
>> +	unsigned long i;
>> +	const unsigned long entry = ioba >> tbl->it_page_shift;
>> +
>> +	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
>> +		return H_PARAMETER;
>> +
>> +	for (i = 0; i < npages; ++i)
>> +		kvmppc_rm_tce_iommu_unmap(vcpu, tbl, entry + i);
>> +
>> +	return H_SUCCESS;
>> +}
>> +
>>  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>  		unsigned long ioba, unsigned long tce)
>>  {
>>  	struct kvmppc_spapr_tce_table *stt;
>>  	long ret;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>  
>>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>>  	/* 	    liobn, ioba, tce); */
>> @@ -211,6 +365,13 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>  	if (ret != H_SUCCESS)
>>  		return ret;
>>  
>> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>> +		ret = kvmppc_rm_h_put_tce_iommu(vcpu, stit->tbl,
>> +				liobn, ioba, tce);
>> +		if (ret != H_SUCCESS)
>> +			return ret;
>> +	}
>> +
>>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
>>  
>>  	return H_SUCCESS;
>> @@ -278,6 +439,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>  		 * depend on hpt.
>>  		 */
>>  		struct mm_iommu_table_group_mem_t *mem;
>> +		struct kvmppc_spapr_tce_iommu_table *stit;
>>  
>>  		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
>>  			return H_TOO_HARD;
>> @@ -285,6 +447,13 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>  		mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K);
>>  		if (!mem || mm_iommu_ua_to_hpa_rm(mem, ua, &tces))
>>  			return H_TOO_HARD;
>> +
>> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>> +			ret = kvmppc_rm_h_put_tce_indirect_iommu(vcpu,
>> +					stit->tbl, ioba, (u64 *)tces, npages);
>> +			if (ret != H_SUCCESS)
>> +				return ret;
>> +		}
>>  	} else {
>>  		/*
>>  		 * This is emulated devices case.
>> @@ -334,6 +503,8 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>>  {
>>  	struct kvmppc_spapr_tce_table *stt;
>>  	long i, ret;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>> +
>>  
>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>>  	if (!stt)
>> @@ -347,6 +518,13 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>>  		return H_PARAMETER;
>>  
>> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>> +		ret = kvmppc_rm_h_stuff_tce_iommu(vcpu, stit->tbl,
>> +				liobn, ioba, tce_value, npages);
>> +		if (ret != H_SUCCESS)
>> +			return ret;
>> +	}
>> +
>>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>>  
>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>> index 70963c845e96..0e555ba998c0 100644
>> --- a/arch/powerpc/kvm/powerpc.c
>> +++ b/arch/powerpc/kvm/powerpc.c
>> @@ -536,6 +536,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>>  #ifdef CONFIG_PPC_BOOK3S_64
>>  	case KVM_CAP_SPAPR_TCE:
>>  	case KVM_CAP_SPAPR_TCE_64:
>> +		/* fallthrough */
>> +	case KVM_CAP_SPAPR_TCE_VFIO:
>>  	case KVM_CAP_PPC_ALLOC_HTAB:
>>  	case KVM_CAP_PPC_RTAS:
>>  	case KVM_CAP_PPC_FIXUP_HCALL:
>> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
>> index d32f239eb471..3181054c8ff7 100644
>> --- a/virt/kvm/vfio.c
>> +++ b/virt/kvm/vfio.c
>> @@ -20,6 +20,10 @@
>>  #include <linux/vfio.h>
>>  #include "vfio.h"
>>  
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +#include <asm/kvm_ppc.h>
>> +#endif
>> +
>>  struct kvm_vfio_group {
>>  	struct list_head node;
>>  	struct vfio_group *vfio_group;
>> @@ -89,6 +93,22 @@ static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
>>  	return ret > 0;
>>  }
>>  
>> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
>> +{
>> +	int (*fn)(struct vfio_group *);
>> +	int ret = -1;
>> +
>> +	fn = symbol_get(vfio_external_user_iommu_id);
>> +	if (!fn)
>> +		return ret;
>> +
>> +	ret = fn(vfio_group);
>> +
>> +	symbol_put(vfio_external_user_iommu_id);
>> +
>> +	return ret;
>> +}
>> +
>>  /*
>>   * Groups can use the same or different IOMMU domains.  If the same then
>>   * adding a new group may change the coherency of groups we've previously
>> @@ -211,6 +231,9 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>>  
>>  		mutex_unlock(&kv->lock);
>>  
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +		kvm_spapr_tce_release_iommu_group(dev->kvm, vfio_group);
>> +#endif
>>  		kvm_vfio_group_set_kvm(vfio_group, NULL);
>>  
>>  		kvm_vfio_group_put_external_user(vfio_group);
>> @@ -218,6 +241,65 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>>  		kvm_vfio_update_coherency(dev);
>>  
>>  		return ret;
>> +
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
>> +		struct kvm_vfio_spapr_tce param;
>> +		unsigned long minsz;
>> +		struct kvm_vfio *kv = dev->private;
>> +		struct vfio_group *vfio_group;
>> +		struct kvm_vfio_group *kvg;
>> +		struct fd f;
>> +
>> +		minsz = offsetofend(struct kvm_vfio_spapr_tce, tablefd);
>> +
>> +		if (copy_from_user(&param, (void __user *)arg, minsz))
>> +			return -EFAULT;
>> +
>> +		if (param.argsz < minsz || param.flags)
>> +			return -EINVAL;
>> +
>> +		f = fdget(param.groupfd);
>> +		if (!f.file)
>> +			return -EBADF;
>> +
>> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
>> +		fdput(f);
>> +
>> +		if (IS_ERR(vfio_group))
>> +			return PTR_ERR(vfio_group);
>> +
>> +		ret = -ENOENT;
>> +
>> +		mutex_lock(&kv->lock);
>> +
>> +		list_for_each_entry(kvg, &kv->group_list, node) {
>> +			int group_id;
>> +			struct iommu_group *grp;
>> +
>> +			if (kvg->vfio_group != vfio_group)
>> +				continue;
>> +
>> +			group_id = kvm_vfio_external_user_iommu_id(
>> +					kvg->vfio_group);
>> +			grp = iommu_group_get_by_id(group_id);
>> +			if (!grp) {
>> +				ret = -EFAULT;
>> +				break;
>> +			}
>> +
>> +			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
>> +					param.tablefd, vfio_group, grp);
>> +
>> +			iommu_group_put(grp);
>> +			break;
>> +		}
>> +
>> +		mutex_unlock(&kv->lock);
>> +
>> +		return ret;
>> +	}
>> +#endif /* CONFIG_SPAPR_TCE_IOMMU */
>>  	}
> 
> Don't you also need to add something to the KVM_DEV_VFIO_GROUP_DEL
> path to detach the group from all LIOBNs?  Or else just fail if
> there are LIOBNs attached.  I think it would be a qemu bug not to
> detach the LIOBNs before removing the group, but we still need to
> protect the host in that case.


Yeah, this bit is a bit tricky/ugly.

kvm_spapr_tce_liobn_release_iommu_group() (called both from the
KVM_DEV_VFIO_GROUP_DEL handler and from kvm_spapr_tce_fops::release())
drops the references when a group is removed from the VFIO KVM device,
so there is no KVM_DEV_VFIO_GROUP_UNSET_SPAPR_TCE and no action from QEMU
is required.



>>  
>>  	return -ENXIO;
>> @@ -242,6 +324,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
>>  		switch (attr->attr) {
>>  		case KVM_DEV_VFIO_GROUP_ADD:
>>  		case KVM_DEV_VFIO_GROUP_DEL:
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
>> +#endif
>>  			return 0;
>>  		}
>>  
>> @@ -257,6 +342,9 @@ static void kvm_vfio_destroy(struct kvm_device *dev)
>>  	struct kvm_vfio_group *kvg, *tmp;
>>  
>>  	list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +		kvm_spapr_tce_release_iommu_group(dev->kvm, kvg->vfio_group);
>> +#endif
>>  		kvm_vfio_group_set_kvm(kvg->vfio_group, NULL);
>>  		kvm_vfio_group_put_external_user(kvg->vfio_group);
>>  		list_del(&kvg->node);
> 


-- 
Alexey


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 839 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH kernel v3] KVM: PPC: Add in-kernel acceleration for VFIO
  2017-01-12  8:09         ` Alexey Kardashevskiy
@ 2017-01-12 23:53           ` David Gibson
  -1 siblings, 0 replies; 56+ messages in thread
From: David Gibson @ 2017-01-12 23:53 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Paul Mackerras, kvm-ppc, kvm, Alex Williamson

[-- Attachment #1: Type: text/plain, Size: 36945 bytes --]

On Thu, Jan 12, 2017 at 07:09:01PM +1100, Alexey Kardashevskiy wrote:
> On 12/01/17 16:04, David Gibson wrote:
> > On Tue, Dec 20, 2016 at 05:52:29PM +1100, Alexey Kardashevskiy wrote:
> >> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> >> and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
> >> without passing them to user space which saves time on switching
> >> to user space and back.
> >>
> >> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> >> KVM tries to handle a TCE request in real mode; if that fails,
> >> it passes the request to the virtual mode handler to complete the operation.
> >> If the virtual mode handler also fails, the request is passed to
> >> user space; this is not expected to happen though.
> >>
> >> To avoid dealing with page use counters (which is tricky in real mode),
> >> this only accelerates SPAPR TCE IOMMU v2 clients which are required
> >> to pre-register the userspace memory. The very first TCE request will
> >> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> >> of the TCE table (iommu_table::it_userspace) is not allocated till
> >> the very first mapping happens and we cannot call vmalloc in real mode.
> >>
> >> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> >> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> >> and associates a physical IOMMU table with the SPAPR TCE table (which
> >> is a guest view of the hardware IOMMU table). The iommu_table object
> >> is cached and referenced so we do not have to look up for it in real mode.
> >>
> >> This does not implement the UNSET counterpart as there is no use for it -
> >> once the acceleration is enabled, the existing userspace won't
> >> disable it unless a VFIO container is destroyed; this adds necessary
> >> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> >>
> >> As this creates a descriptor per IOMMU table-LIOBN couple (called
> >> kvmppc_spapr_tce_iommu_table), it is possible to have several
> >> descriptors with the same iommu_table (hardware IOMMU table) attached
> >> to the same LIOBN, this is done to simplify the cleanup and can be
> >> improved later.
> >>
> >> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> >> space.
> >>
> >> This finally makes use of vfio_external_user_iommu_id() which was
> >> introduced quite some time ago and was considered for removal.
> >>
> >> Tests show that this patch increases transmission speed from 220MB/s
> >> to 750..1020MB/s on 10Gb network (Chelsio CXGB3 10Gb ethernet card).
> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> ---
> >> Changes:
> >> v3:
> >> * simplified not to use VFIO group notifiers
> >> * reworked cleanup, should be cleaner/simpler now
> >>
> >> v2:
> >> * reworked to use new VFIO notifiers
> >> * now same iommu_table may appear in the list several times, to be fixed later
> >> ---
> >>
> >> This obsoletes:
> >>
> >> [PATCH kernel v2 08/11] KVM: PPC: Pass kvm* to kvmppc_find_table()
> >> [PATCH kernel v2 09/11] vfio iommu: Add helpers to (un)register blocking notifiers per group
> >> [PATCH kernel v2 11/11] KVM: PPC: Add in-kernel acceleration for VFIO
> >>
> >>
> >> So I have not reposted the whole thing, should have I?
> >>
> >>
> >> btw "F:     virt/kvm/vfio.*"  is missing MAINTAINERS.
> >>
> >>
> >> ---
> >>  Documentation/virtual/kvm/devices/vfio.txt |  22 ++-
> >>  arch/powerpc/include/asm/kvm_host.h        |   8 +
> >>  arch/powerpc/include/asm/kvm_ppc.h         |   4 +
> >>  include/uapi/linux/kvm.h                   |   8 +
> >>  arch/powerpc/kvm/book3s_64_vio.c           | 286 +++++++++++++++++++++++++++++
> >>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 178 ++++++++++++++++++
> >>  arch/powerpc/kvm/powerpc.c                 |   2 +
> >>  virt/kvm/vfio.c                            |  88 +++++++++
> >>  8 files changed, 594 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> >> index ef51740c67ca..f95d867168ea 100644
> >> --- a/Documentation/virtual/kvm/devices/vfio.txt
> >> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> >> @@ -16,7 +16,25 @@ Groups:
> >>  
> >>  KVM_DEV_VFIO_GROUP attributes:
> >>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> >> +	kvm_device_attr.addr points to an int32_t file descriptor
> >> +	for the VFIO group.
> >>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
> >> +	kvm_device_attr.addr points to an int32_t file descriptor
> >> +	for the VFIO group.
> >> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
> >> +	allocated by sPAPR KVM.
> >> +	kvm_device_attr.addr points to a struct:
> >>  
> >> -For each, kvm_device_attr.addr points to an int32_t file descriptor
> >> -for the VFIO group.
> >> +	struct kvm_vfio_spapr_tce {
> >> +		__u32	argsz;
> >> +		__u32	flags;
> >> +		__s32	groupfd;
> >> +		__s32	tablefd;
> >> +	};
> >> +
> >> +	where
> >> +	@argsz is the size of struct kvm_vfio_spapr_tce;
> >> +	@flags are not supported now, must be zero;
> >> +	@groupfd is a file descriptor for a VFIO group;
> >> +	@tablefd is a file descriptor for a TCE table allocated via
> >> +		KVM_CREATE_SPAPR_TCE.
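
For context, a minimal sketch of how userspace (e.g. QEMU) might wire this
attribute up; the fd variable names here are illustrative, not taken from any
existing code:

#include <sys/ioctl.h>
#include <linux/kvm.h>

static int enable_tce_acceleration(int vfio_kvm_dev_fd, int group_fd,
				   int table_fd)
{
	/* group_fd: an open /dev/vfio/<group> fd;
	 * table_fd: the fd returned by KVM_CREATE_SPAPR_TCE_64 */
	struct kvm_vfio_spapr_tce param = {
		.argsz = sizeof(param),
		.flags = 0,
		.groupfd = group_fd,
		.tablefd = table_fd,
	};
	struct kvm_device_attr attr = {
		.group = KVM_DEV_VFIO_GROUP,
		.attr = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE,
		.addr = (__u64)(unsigned long)&param,
	};

	/* Failure is not fatal: TCE hcalls then keep going via userspace */
	return ioctl(vfio_kvm_dev_fd, KVM_SET_DEVICE_ATTR, &attr);
}
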
> >> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> >> index 28350a294b1e..3d281b7ea369 100644
> >> --- a/arch/powerpc/include/asm/kvm_host.h
> >> +++ b/arch/powerpc/include/asm/kvm_host.h
> >> @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
> >>  	atomic_t refcnt;
> >>  };
> >>  
> >> +struct kvmppc_spapr_tce_iommu_table {
> >> +	struct rcu_head rcu;
> >> +	struct list_head next;
> >> +	struct vfio_group *group;
> >> +	struct iommu_table *tbl;
> >> +};
> >> +
> >>  struct kvmppc_spapr_tce_table {
> >>  	struct list_head list;
> >>  	struct kvm *kvm;
> >> @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
> >>  	u32 page_shift;
> >>  	u64 offset;		/* in pages */
> >>  	u64 size;		/* window size in pages */
> >> +	struct list_head iommu_tables;
> >>  	struct page *pages[0];
> >>  };
> >>  
> >> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> >> index 0a21c8503974..936138b866e7 100644
> >> --- a/arch/powerpc/include/asm/kvm_ppc.h
> >> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> >> @@ -163,6 +163,10 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
> >>  extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
> >>  			struct kvm_memory_slot *memslot, unsigned long porder);
> >>  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
> >> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
> >> +		struct vfio_group *group, struct iommu_group *grp);
> >> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
> >> +		struct vfio_group *group);
> >>  
> >>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> >>  				struct kvm_create_spapr_tce_64 *args);
> >> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >> index 810f74317987..4088da4a575f 100644
> >> --- a/include/uapi/linux/kvm.h
> >> +++ b/include/uapi/linux/kvm.h
> >> @@ -1068,6 +1068,7 @@ struct kvm_device_attr {
> >>  #define  KVM_DEV_VFIO_GROUP			1
> >>  #define   KVM_DEV_VFIO_GROUP_ADD			1
> >>  #define   KVM_DEV_VFIO_GROUP_DEL			2
> >> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
> >>  
> >>  enum kvm_device_type {
> >>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
> >> @@ -1089,6 +1090,13 @@ enum kvm_device_type {
> >>  	KVM_DEV_TYPE_MAX,
> >>  };
> >>  
> >> +struct kvm_vfio_spapr_tce {
> >> +	__u32	argsz;
> >> +	__u32	flags;
> >> +	__s32	groupfd;
> >> +	__s32	tablefd;
> >> +};
> >> +
> >>  /*
> >>   * ioctls for VM fds
> >>   */
> >> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> >> index 15df8ae627d9..008c4aee4df6 100644
> >> --- a/arch/powerpc/kvm/book3s_64_vio.c
> >> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> >> @@ -27,6 +27,10 @@
> >>  #include <linux/hugetlb.h>
> >>  #include <linux/list.h>
> >>  #include <linux/anon_inodes.h>
> >> +#include <linux/iommu.h>
> >> +#include <linux/file.h>
> >> +#include <linux/vfio.h>
> >> +#include <linux/module.h>
> >>  
> >>  #include <asm/tlbflush.h>
> >>  #include <asm/kvm_ppc.h>
> >> @@ -39,6 +43,20 @@
> >>  #include <asm/udbg.h>
> >>  #include <asm/iommu.h>
> >>  #include <asm/tce.h>
> >> +#include <asm/mmu_context.h>
> >> +
> >> +static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
> >> +{
> >> +	void (*fn)(struct vfio_group *);
> >> +
> >> +	fn = symbol_get(vfio_group_put_external_user);
> >> +	if (!fn)
> > 
> > I think this should have a WARN_ON().  If the vfio module is gone
> > while you still have VFIO groups attached to a KVM table, something
> > has gone horribly wrong.
> > 
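i.e. roughly (sketch only, WARN_ON() returns the condition so this keeps the
early return):

	fn = symbol_get(vfio_group_put_external_user);
	if (WARN_ON(!fn))
		return;
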
> >> +		return;
> >> +
> >> +	fn(vfio_group);
> >> +
> >> +	symbol_put(vfio_group_put_external_user);
> >> +}
> >>  
> >>  static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
> >>  {
> >> @@ -90,6 +108,99 @@ static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
> >>  	return ret;
> >>  }
> >>  
> >> +static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
> >> +{
> >> +	struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
> >> +			struct kvmppc_spapr_tce_iommu_table, rcu);
> >> +
> >> +	kfree(stit);
> >> +}
> >> +
> >> +static void kvm_spapr_tce_liobn_release_iommu_group(
> >> +		struct kvmppc_spapr_tce_table *stt,
> >> +		struct vfio_group *group)
> >> +{
> >> +	struct kvmppc_spapr_tce_iommu_table *stit, *tmp;
> >> +
> >> +	list_for_each_entry_safe(stit, tmp, &stt->iommu_tables, next) {
> >> +		if (group && (stit->group != group))
> >> +			continue;
> >> +
> >> +		list_del_rcu(&stit->next);
> >> +
> >> +		iommu_table_put(stit->tbl);
> >> +		kvm_vfio_group_put_external_user(stit->group);
> >> +
> >> +		call_rcu(&stit->rcu, kvm_spapr_tce_iommu_table_free);
> >> +	}
> >> +}
> >> +
> >> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
> >> +		struct vfio_group *group)
> >> +{
> >> +	struct kvmppc_spapr_tce_table *stt;
> >> +
> >> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list)
> >> +		kvm_spapr_tce_liobn_release_iommu_group(stt, group);
> >> +}
> >> +
> >> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
> >> +		struct vfio_group *group, struct iommu_group *grp)
> > 
> > Isn't passing both the vfio_group and the iommu_group redundant?
> 
> vfio_group struct is internal to the vfio driver and there is no API to get
> the iommu_group pointer from it.

But in the caller you *do* derive the iommu group from the vfio group by
going via its id (ugly, but workable I guess).  Why not fold that logic
into this function?
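
I.e. have kvm_spapr_tce_attach_iommu_group() take just the vfio_group and do
the lookup internally, something like (rough sketch, untested; assumes an
equivalent symbol_get() wrapper for vfio_external_user_iommu_id() is
available on this side):

	static struct iommu_group *kvm_spapr_vfio_to_iommu_group(
			struct vfio_group *group)
	{
		int group_id = kvm_vfio_external_user_iommu_id(group);

		return iommu_group_get_by_id(group_id);
	}

with the -ENODEV/-EFAULT handling moving here as well.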

> >> +{
> >> +	struct kvmppc_spapr_tce_table *stt = NULL;
> >> +	bool found = false;
> >> +	struct iommu_table *tbl = NULL;
> >> +	struct iommu_table_group *table_group;
> >> +	long i;
> >> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >> +	struct fd f;
> >> +
> >> +	f = fdget(tablefd);
> >> +	if (!f.file)
> >> +		return -EBADF;
> >> +
> >> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
> >> +		if (stt == f.file->private_data) {
> >> +			found = true;
> >> +			break;
> >> +		}
> >> +	}
> >> +
> >> +	fdput(f);
> >> +
> >> +	if (!found)
> >> +		return -ENODEV;
> > 
> > Not entirely sure if ENODEV is the right error, but I can't
> > immediately think of a better one.
> > 
> >> +	table_group = iommu_group_get_iommudata(grp);
> >> +	if (!table_group)
> >> +		return -EFAULT;
> > 
> > EFAULT is usually only returned when you pass a syscall a bad pointer,
> > which doesn't look to be the case here.  What situation does this
> > error path actually represent?
> 
> 
> "something went terribly wrong".

In that case there should be a WARN_ON().  As long as it's something
terribly wrong that can't be the user's fault.

> >> +
> >> +	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
> >> +		struct iommu_table *tbltmp = table_group->tables[i];
> >> +
> >> +		if (!tbltmp)
> >> +			continue;
> >> +
> >> +		if ((tbltmp->it_page_shift == stt->page_shift) &&
> >> +				(tbltmp->it_offset == stt->offset)) {
> >> +			tbl = tbltmp;
> >> +			break;
> >> +		}
> >> +	}
> >> +	if (!tbl)
> >> +		return -ENODEV;
> >> +
> >> +	iommu_table_get(tbl);
> >> +
> >> +	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
> >> +	stit->tbl = tbl;
> >> +	stit->group = group;
> >> +
> >> +	list_add_rcu(&stit->next, &stt->iommu_tables);
> > 
> > Won't this add a separate stit entry for each group attached to the
> > LIOBN, even if those groups share a single hardware iommu table -
> > which is the likely case if those groups have all been put into the
> > same container.
> 
> 
> Correct. I am planning on optimizing this later.

Hmm, ok.

> >> +	return 0;
> >> +}
> >> +
> >>  static void release_spapr_tce_table(struct rcu_head *head)
> >>  {
> >>  	struct kvmppc_spapr_tce_table *stt = container_of(head,
> >> @@ -132,6 +243,8 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
> >>  
> >>  	list_del_rcu(&stt->list);
> >>  
> >> +	kvm_spapr_tce_liobn_release_iommu_group(stt, NULL /* release all */);
> >> +
> >>  	kvm_put_kvm(stt->kvm);
> >>  
> >>  	kvmppc_account_memlimit(
> >> @@ -181,6 +294,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> >>  	stt->offset = args->offset;
> >>  	stt->size = size;
> >>  	stt->kvm = kvm;
> >> +	INIT_LIST_HEAD_RCU(&stt->iommu_tables);
> >>  
> >>  	for (i = 0; i < npages; i++) {
> >>  		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
> >> @@ -209,11 +323,161 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> >>  	return ret;
> >>  }
> >>  
> >> +static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
> >> +		struct iommu_table *tbl, unsigned long entry)
> >> +{
> >> +	struct mm_iommu_table_group_mem_t *mem = NULL;
> >> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> >> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> >> +
> >> +	if (!pua)
> >> +		return H_HARDWARE;
> >> +
> >> +	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
> >> +	if (!mem)
> >> +		return H_HARDWARE;
> > 
> > IIUC, this error represents trying to unmap a page from the vIOMMU,
> > and discovering that it wasn't preregistered in the first place, which
> > shouldn't happen.  So would a WARN_ON() make sense here as well as the
> > H_HARDWARE.
> >
> >> +	mm_iommu_mapped_dec(mem);
> >> +
> >> +	*pua = 0;
> >> +
> >> +	return H_SUCCESS;
> >> +}
> >> +
> >> +static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
> >> +		struct iommu_table *tbl, unsigned long entry)
> >> +{
> >> +	enum dma_data_direction dir = DMA_NONE;
> >> +	unsigned long hpa = 0;
> >> +
> >> +	if (iommu_tce_xchg(tbl, entry, &hpa, &dir))
> >> +		return H_HARDWARE;
> >> +
> >> +	if (dir == DMA_NONE)
> >> +		return H_SUCCESS;
> >> +
> >> +	return kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
> >> +}
> >> +
> >> +long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
> >> +		unsigned long entry, unsigned long gpa,
> >> +		enum dma_data_direction dir)
> >> +{
> >> +	long ret;
> >> +	unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> >> +	struct mm_iommu_table_group_mem_t *mem;
> >> +
> >> +	if (!pua)
> >> +		/* it_userspace allocation might be delayed */
> >> +		return H_TOO_HARD;
> >> +
> >> +	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
> >> +		return H_HARDWARE;
> > 
> > This would represent the guest trying to map a bad GPA, yes?  In which
> > case H_HARDWARE doesn't seem right.  H_PARAMETER or H_PERMISSION, maybe.
> > 
> >> +	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
> >> +	if (!mem)
> >> +		return H_HARDWARE;
> > 
> > Here H_HARDWARE seems right. IIUC this represents the guest trying to
> > map an address which wasn't pre-registered.  That would indicate a bug
> > in qemu, which is hardware as far as the guest is concerned.
> > 
> >> +
> >> +	if (mm_iommu_ua_to_hpa(mem, ua, &hpa))
> >> +		return H_HARDWARE;
> > 
> > Not sure what this case represents.
> 
> Preregistered memory not being able to translate a userspace address to a
> host physical address. In virtual mode it is a simple bounds check; in real mode
> it also includes vmalloc_to_phys() failure.

Ok.  This caller is virtual mode only, isn't it?  If we fail the
bounds check, that sounds like a WARN_ON() + H_HARDWARE, since it
means we've translated the GPA to an insane UA.

If the translation just fails, that sounds like an H_TOO_HARD.

> >> +	if (mm_iommu_mapped_inc(mem))
> >> +		return H_HARDWARE;
> > 
> > Or this.
> 
> A preregistered memory area is in the process of being disposed of; no new
> mappings are allowed.

Ok, again under control of qemu, so H_HARDWARE is reasonable.

> >> +	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
> >> +	if (ret) {
> >> +		mm_iommu_mapped_dec(mem);
> >> +		return H_TOO_HARD;
> >> +	}
> >> +
> >> +	if (dir != DMA_NONE)
> >> +		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
> >> +
> >> +	*pua = ua;
> >> +
> >> +	return 0;
> >> +}
> >> +
> >> +long kvmppc_h_put_tce_iommu(struct kvm_vcpu *vcpu,
> >> +		struct iommu_table *tbl,
> >> +		unsigned long liobn, unsigned long ioba,
> >> +		unsigned long tce)
> >> +{
> >> +	long idx, ret = H_HARDWARE;
> >> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> >> +	const unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >> +	const enum dma_data_direction dir = iommu_tce_direction(tce);
> >> +
> >> +	/* Clear TCE */
> >> +	if (dir == DMA_NONE) {
> >> +		if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
> >> +			return H_PARAMETER;
> >> +
> >> +		return kvmppc_tce_iommu_unmap(vcpu->kvm, tbl, entry);
> >> +	}
> >> +
> >> +	/* Put TCE */
> >> +	if (iommu_tce_put_param_check(tbl, ioba, gpa))
> >> +		return H_PARAMETER;
> >> +
> >> +	idx = srcu_read_lock(&vcpu->kvm->srcu);
> >> +	ret = kvmppc_tce_iommu_map(vcpu->kvm, tbl, entry, gpa, dir);
> >> +	srcu_read_unlock(&vcpu->kvm->srcu, idx);
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +static long kvmppc_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
> >> +		struct iommu_table *tbl, unsigned long ioba,
> >> +		u64 __user *tces, unsigned long npages)
> >> +{
> >> +	unsigned long i, ret, tce, gpa;
> >> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> >> +
> >> +	for (i = 0; i < npages; ++i) {
> >> +		gpa = be64_to_cpu(tces[i]) & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> > 
> > IIUC this is the virtual mode, not the real mode version.  In which
> > case you shouldn't be accessing tces[i] (a userspace pointer) directly
> > but should instead be using get_user().
> > 
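I.e. something along these lines for the checking loop (rough sketch,
untested; uses the function's existing locals):

	for (i = 0; i < npages; ++i) {
		if (get_user(tce, tces + i))
			return H_TOO_HARD;

		gpa = be64_to_cpu(tce) & ~(TCE_PCI_READ | TCE_PCI_WRITE);
		if (iommu_tce_put_param_check(tbl, ioba +
				(i << tbl->it_page_shift), gpa))
			return H_PARAMETER;
	}
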
> >> +		if (iommu_tce_put_param_check(tbl, ioba +
> >> +				(i << tbl->it_page_shift), gpa))
> >> +			return H_PARAMETER;
> >> +	}
> >> +
> >> +	for (i = 0; i < npages; ++i) {
> >> +		tce = be64_to_cpu(tces[i]);
> >> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >> +
> >> +		ret = kvmppc_tce_iommu_map(vcpu->kvm, tbl, entry + i, gpa,
> >> +				iommu_tce_direction(tce));
> >> +		if (ret != H_SUCCESS)
> >> +			return ret;
> >> +	}
> >> +
> >> +	return H_SUCCESS;
> >> +}
> >> +
> >> +long kvmppc_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
> >> +		struct iommu_table *tbl,
> >> +		unsigned long liobn, unsigned long ioba,
> >> +		unsigned long tce_value, unsigned long npages)
> >> +{
> >> +	unsigned long i;
> >> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> >> +
> >> +	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
> >> +		return H_PARAMETER;
> >> +
> >> +	for (i = 0; i < npages; ++i)
> >> +		kvmppc_tce_iommu_unmap(vcpu->kvm, tbl, entry + i);
> >> +
> >> +	return H_SUCCESS;
> >> +}
> >> +
> >>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> >>  		      unsigned long ioba, unsigned long tce)
> >>  {
> >>  	struct kvmppc_spapr_tce_table *stt;
> >>  	long ret;
> >> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >>  
> >>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
> >>  	/* 	    liobn, ioba, tce); */
> >> @@ -230,6 +494,12 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> >>  	if (ret != H_SUCCESS)
> >>  		return ret;
> >>  
> >> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> > 
> > As noted above, AFAICT there is one stit per group, rather than per
> > backend IOMMU table, so if there are multiple groups in the same
> > container (and therefore attached to the same LIOBN), won't this mean
> > we duplicate this operation a bunch of times?
> > 
> >> +		ret = kvmppc_h_put_tce_iommu(vcpu, stit->tbl, liobn, ioba, tce);
> >> +		if (ret != H_SUCCESS)
> >> +			return ret;
> >> +	}
> >> +
> >>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
> >>  
> >>  	return H_SUCCESS;
> >> @@ -245,6 +515,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>  	unsigned long entry, ua = 0;
> >>  	u64 __user *tces;
> >>  	u64 tce;
> >> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >>  
> >>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >>  	if (!stt)
> >> @@ -272,6 +543,13 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>  	}
> >>  	tces = (u64 __user *) ua;
> >>  
> >> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> >> +		ret = kvmppc_h_put_tce_indirect_iommu(vcpu,
> >> +				stit->tbl, ioba, tces, npages);
> >> +		if (ret != H_SUCCESS)
> >> +			goto unlock_exit;
> > 
> > Hmm, I don't suppose you could simplify things by not having a
> > put_tce_indirect() version of the whole backend iommu mapping
> > function, but just a single-TCE version, and instead looping across
> > the backend IOMMU tables as you put each indirect entry in.
> > 
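Something like this directly in kvmppc_h_put_tce_indirect()'s main loop
instead (rough sketch, untested; the existing per-TCE validation is elided):

	for (i = 0; i < npages; ++i) {
		if (get_user(tce, tces + i)) {
			ret = H_TOO_HARD;
			goto unlock_exit;
		}
		tce = be64_to_cpu(tce);

		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
			ret = kvmppc_h_put_tce_iommu(vcpu, stit->tbl, liobn,
					ioba + (i << stt->page_shift), tce);
			if (ret != H_SUCCESS)
				goto unlock_exit;
		}

		kvmppc_tce_put(stt, entry + i, tce);
	}
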
> >> +	}
> >> +
> >>  	for (i = 0; i < npages; ++i) {
> >>  		if (get_user(tce, tces + i)) {
> >>  			ret = H_TOO_HARD;
> >> @@ -299,6 +577,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
> >>  {
> >>  	struct kvmppc_spapr_tce_table *stt;
> >>  	long i, ret;
> >> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >>  
> >>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >>  	if (!stt)
> >> @@ -312,6 +591,13 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
> >>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
> >>  		return H_PARAMETER;
> >>  
> >> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> >> +		ret = kvmppc_h_stuff_tce_iommu(vcpu, stit->tbl, liobn, ioba,
> >> +				tce_value, npages);
> >> +		if (ret != H_SUCCESS)
> >> +			return ret;
> >> +	}
> >> +
> >>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
> >>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
> >>  
> >> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> >> index 8a6834e6e1c8..4d6f01712a6d 100644
> >> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> >> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> >> @@ -190,11 +190,165 @@ static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
> >>  	return mm_iommu_lookup_rm(vcpu->kvm->mm, ua, size);
> >>  }
> >>  
> >> +static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm_vcpu *vcpu,
> >> +		struct iommu_table *tbl, unsigned long entry)
> >> +{
> >> +	struct mm_iommu_table_group_mem_t *mem = NULL;
> >> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> >> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> >> +
> >> +	if (!pua)
> >> +		return H_SUCCESS;
> > 
> > What case is this?  Not being able to find the userspace doesn't sound
> > like a success.
> > 
> >> +	pua = (void *) vmalloc_to_phys(pua);
> >> +	if (!pua)
> >> +		return H_SUCCESS;
> > 
> > And again..
> > 
> >> +	mem = kvmppc_rm_iommu_lookup(vcpu, *pua, pgsize);
> >> +	if (!mem)
> >> +		return H_HARDWARE;
> > 
> > Should this have a WARN_ON?
> > 
> >> +	mm_iommu_mapped_dec(mem);
> >> +
> >> +	*pua = 0;
> >> +
> >> +	return H_SUCCESS;
> >> +}
> >> +
> >> +static long kvmppc_rm_tce_iommu_unmap(struct kvm_vcpu *vcpu,
> >> +		struct iommu_table *tbl, unsigned long entry)
> >> +{
> >> +	enum dma_data_direction dir = DMA_NONE;
> >> +	unsigned long hpa = 0;
> >> +
> >> +	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
> >> +		return H_HARDWARE;
> >> +
> >> +	if (dir == DMA_NONE)
> >> +		return H_SUCCESS;
> >> +
> >> +	return kvmppc_rm_tce_iommu_mapped_dec(vcpu, tbl, entry);
> >> +}
> >> +
> >> +long kvmppc_rm_tce_iommu_map(struct kvm_vcpu *vcpu, struct iommu_table *tbl,
> >> +		unsigned long entry, unsigned long gpa,
> >> +		enum dma_data_direction dir)
> >> +{
> >> +	long ret;
> >> +	unsigned long hpa = 0, ua;
> >> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> >> +	struct mm_iommu_table_group_mem_t *mem;
> >> +
> >> +	if (!pua)
> >> +		/* it_userspace allocation might be delayed */
> >> +		return H_TOO_HARD;
> >> +
> >> +	if (kvmppc_gpa_to_ua(vcpu->kvm, gpa, &ua, NULL))
> >> +		return H_HARDWARE;
> > 
> > Again H_HARDWARE doesn't seem right here.
> > 
> >> +	mem = kvmppc_rm_iommu_lookup(vcpu, ua, 1ULL << tbl->it_page_shift);
> >> +	if (!mem)
> >> +		return H_HARDWARE;
> >> +
> >> +	if (mm_iommu_ua_to_hpa_rm(mem, ua, &hpa))
> >> +		return H_HARDWARE;
> >> +
> >> +	pua = (void *) vmalloc_to_phys(pua);
> >> +	if (!pua)
> >> +		return H_HARDWARE;
> >> +
> >> +	if (mm_iommu_mapped_inc(mem))
> >> +		return H_HARDWARE;
> >> +
> >> +	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> >> +	if (ret) {
> >> +		mm_iommu_mapped_dec(mem);
> >> +		return H_TOO_HARD;
> >> +	}
> >> +
> >> +	if (dir != DMA_NONE)
> >> +		kvmppc_rm_tce_iommu_mapped_dec(vcpu, tbl, entry);
> >> +
> >> +	*pua = ua;
> >> +
> >> +	return 0;
> >> +}
> >> +EXPORT_SYMBOL_GPL(kvmppc_rm_tce_iommu_map);
> >> +
> >> +static long kvmppc_rm_h_put_tce_iommu(struct kvm_vcpu *vcpu,
> >> +		struct iommu_table *tbl, unsigned long liobn,
> >> +		unsigned long ioba, unsigned long tce)
> >> +{
> >> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> >> +	const unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >> +	const enum dma_data_direction dir = iommu_tce_direction(tce);
> >> +
> >> +	/* Clear TCE */
> >> +	if (dir == DMA_NONE) {
> >> +		if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
> >> +			return H_PARAMETER;
> >> +
> >> +		return kvmppc_rm_tce_iommu_unmap(vcpu, tbl, entry);
> >> +	}
> >> +
> >> +	/* Put TCE */
> >> +	if (iommu_tce_put_param_check(tbl, ioba, gpa))
> >> +		return H_PARAMETER;
> >> +
> >> +	return kvmppc_rm_tce_iommu_map(vcpu, tbl, entry, gpa, dir);
> >> +}
> >> +
> >> +static long kvmppc_rm_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
> >> +		struct iommu_table *tbl, unsigned long ioba,
> >> +		u64 *tces, unsigned long npages)
> >> +{
> >> +	unsigned long i, ret, tce, gpa;
> >> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> >> +
> >> +	for (i = 0; i < npages; ++i) {
> >> +		gpa = be64_to_cpu(tces[i]) & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >> +
> >> +		if (iommu_tce_put_param_check(tbl, ioba +
> >> +				(i << tbl->it_page_shift), gpa))
> >> +			return H_PARAMETER;
> >> +	}
> >> +
> >> +	for (i = 0; i < npages; ++i) {
> >> +		tce = be64_to_cpu(tces[i]);
> >> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >> +
> >> +		ret = kvmppc_rm_tce_iommu_map(vcpu, tbl, entry + i, gpa,
> >> +				iommu_tce_direction(tce));
> >> +		if (ret != H_SUCCESS)
> >> +			return ret;
> >> +	}
> >> +
> >> +	return H_SUCCESS;
> >> +}
> >> +
> >> +static long kvmppc_rm_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
> >> +		struct iommu_table *tbl,
> >> +		unsigned long liobn, unsigned long ioba,
> >> +		unsigned long tce_value, unsigned long npages)
> >> +{
> >> +	unsigned long i;
> >> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> >> +
> >> +	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
> >> +		return H_PARAMETER;
> >> +
> >> +	for (i = 0; i < npages; ++i)
> >> +		kvmppc_rm_tce_iommu_unmap(vcpu, tbl, entry + i);
> >> +
> >> +	return H_SUCCESS;
> >> +}
> >> +
> >>  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> >>  		unsigned long ioba, unsigned long tce)
> >>  {
> >>  	struct kvmppc_spapr_tce_table *stt;
> >>  	long ret;
> >> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >>  
> >>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
> >>  	/* 	    liobn, ioba, tce); */
> >> @@ -211,6 +365,13 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> >>  	if (ret != H_SUCCESS)
> >>  		return ret;
> >>  
> >> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> >> +		ret = kvmppc_rm_h_put_tce_iommu(vcpu, stit->tbl,
> >> +				liobn, ioba, tce);
> >> +		if (ret != H_SUCCESS)
> >> +			return ret;
> >> +	}
> >> +
> >>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
> >>  
> >>  	return H_SUCCESS;
> >> @@ -278,6 +439,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>  		 * depend on hpt.
> >>  		 */
> >>  		struct mm_iommu_table_group_mem_t *mem;
> >> +		struct kvmppc_spapr_tce_iommu_table *stit;
> >>  
> >>  		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
> >>  			return H_TOO_HARD;
> >> @@ -285,6 +447,13 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>  		mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K);
> >>  		if (!mem || mm_iommu_ua_to_hpa_rm(mem, ua, &tces))
> >>  			return H_TOO_HARD;
> >> +
> >> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> >> +			ret = kvmppc_rm_h_put_tce_indirect_iommu(vcpu,
> >> +					stit->tbl, ioba, (u64 *)tces, npages);
> >> +			if (ret != H_SUCCESS)
> >> +				return ret;
> >> +		}
> >>  	} else {
> >>  		/*
> >>  		 * This is emulated devices case.
> >> @@ -334,6 +503,8 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
> >>  {
> >>  	struct kvmppc_spapr_tce_table *stt;
> >>  	long i, ret;
> >> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >> +
> >>  
> >>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >>  	if (!stt)
> >> @@ -347,6 +518,13 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
> >>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
> >>  		return H_PARAMETER;
> >>  
> >> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> >> +		ret = kvmppc_rm_h_stuff_tce_iommu(vcpu, stit->tbl,
> >> +				liobn, ioba, tce_value, npages);
> >> +		if (ret != H_SUCCESS)
> >> +			return ret;
> >> +	}
> >> +
> >>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
> >>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
> >>  
> >> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> >> index 70963c845e96..0e555ba998c0 100644
> >> --- a/arch/powerpc/kvm/powerpc.c
> >> +++ b/arch/powerpc/kvm/powerpc.c
> >> @@ -536,6 +536,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> >>  #ifdef CONFIG_PPC_BOOK3S_64
> >>  	case KVM_CAP_SPAPR_TCE:
> >>  	case KVM_CAP_SPAPR_TCE_64:
> >> +		/* fallthrough */
> >> +	case KVM_CAP_SPAPR_TCE_VFIO:
> >>  	case KVM_CAP_PPC_ALLOC_HTAB:
> >>  	case KVM_CAP_PPC_RTAS:
> >>  	case KVM_CAP_PPC_FIXUP_HCALL:
> >> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> >> index d32f239eb471..3181054c8ff7 100644
> >> --- a/virt/kvm/vfio.c
> >> +++ b/virt/kvm/vfio.c
> >> @@ -20,6 +20,10 @@
> >>  #include <linux/vfio.h>
> >>  #include "vfio.h"
> >>  
> >> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >> +#include <asm/kvm_ppc.h>
> >> +#endif
> >> +
> >>  struct kvm_vfio_group {
> >>  	struct list_head node;
> >>  	struct vfio_group *vfio_group;
> >> @@ -89,6 +93,22 @@ static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
> >>  	return ret > 0;
> >>  }
> >>  
> >> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> >> +{
> >> +	int (*fn)(struct vfio_group *);
> >> +	int ret = -1;
> >> +
> >> +	fn = symbol_get(vfio_external_user_iommu_id);
> >> +	if (!fn)
> >> +		return ret;
> >> +
> >> +	ret = fn(vfio_group);
> >> +
> >> +	symbol_put(vfio_external_user_iommu_id);
> >> +
> >> +	return ret;
> >> +}
> >> +
> >>  /*
> >>   * Groups can use the same or different IOMMU domains.  If the same then
> >>   * adding a new group may change the coherency of groups we've previously
> >> @@ -211,6 +231,9 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >>  
> >>  		mutex_unlock(&kv->lock);
> >>  
> >> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >> +		kvm_spapr_tce_release_iommu_group(dev->kvm, vfio_group);
> >> +#endif
> >>  		kvm_vfio_group_set_kvm(vfio_group, NULL);
> >>  
> >>  		kvm_vfio_group_put_external_user(vfio_group);
> >> @@ -218,6 +241,65 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >>  		kvm_vfio_update_coherency(dev);
> >>  
> >>  		return ret;
> >> +
> >> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
> >> +		struct kvm_vfio_spapr_tce param;
> >> +		unsigned long minsz;
> >> +		struct kvm_vfio *kv = dev->private;
> >> +		struct vfio_group *vfio_group;
> >> +		struct kvm_vfio_group *kvg;
> >> +		struct fd f;
> >> +
> >> +		minsz = offsetofend(struct kvm_vfio_spapr_tce, tablefd);
> >> +
> >> +		if (copy_from_user(&param, (void __user *)arg, minsz))
> >> +			return -EFAULT;
> >> +
> >> +		if (param.argsz < minsz || param.flags)
> >> +			return -EINVAL;
> >> +
> >> +		f = fdget(param.groupfd);
> >> +		if (!f.file)
> >> +			return -EBADF;
> >> +
> >> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
> >> +		fdput(f);
> >> +
> >> +		if (IS_ERR(vfio_group))
> >> +			return PTR_ERR(vfio_group);
> >> +
> >> +		ret = -ENOENT;
> >> +
> >> +		mutex_lock(&kv->lock);
> >> +
> >> +		list_for_each_entry(kvg, &kv->group_list, node) {
> >> +			int group_id;
> >> +			struct iommu_group *grp;
> >> +
> >> +			if (kvg->vfio_group != vfio_group)
> >> +				continue;
> >> +
> >> +			group_id = kvm_vfio_external_user_iommu_id(
> >> +					kvg->vfio_group);
> >> +			grp = iommu_group_get_by_id(group_id);
> >> +			if (!grp) {
> >> +				ret = -EFAULT;
> >> +				break;
> >> +			}
> >> +
> >> +			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
> >> +					param.tablefd, vfio_group, grp);
> >> +
> >> +			iommu_group_put(grp);
> >> +			break;
> >> +		}
> >> +
> >> +		mutex_unlock(&kv->lock);
> >> +
> >> +		return ret;
> >> +	}
> >> +#endif /* CONFIG_SPAPR_TCE_IOMMU */
> >>  	}
> > 
> > Don't you also need to add something to the KVM_DEV_VFIO_GROUP_DEL
> > path to detach the group from all LIOBNs?  Or else just fail if
> > there are LIOBNs attached.  I think it would be a qemu bug not to
> > detach the LIOBNs before removing the group, but we still need to
> > protect the host in that case.
> 
> 
> Yeah, this bit is a bit tricky/ugly.
> 
> kvm_spapr_tce_liobn_release_iommu_group() (called from
> kvm_spapr_tce_fops::release()) drops references when a group is removed
> from the VFIO KVM device so there is no KVM_DEV_VFIO_GROUP_UNSET_SPAPR_TCE
> and no action from QEMU is required.

IF qemu simply closes the group fd.  Which it does now, but might not
always.  You still need to deal with the case where userspace does a
KVM_DEV_VFIO_GROUP_DEL instead of closing the group fd.

> 
> 
> 
> >>  
> >>  	return -ENXIO;
> >> @@ -242,6 +324,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
> >>  		switch (attr->attr) {
> >>  		case KVM_DEV_VFIO_GROUP_ADD:
> >>  		case KVM_DEV_VFIO_GROUP_DEL:
> >> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >> +		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
> >> +#endif
> >>  			return 0;
> >>  		}
> >>  
> >> @@ -257,6 +342,9 @@ static void kvm_vfio_destroy(struct kvm_device *dev)
> >>  	struct kvm_vfio_group *kvg, *tmp;
> >>  
> >>  	list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
> >> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >> +		kvm_spapr_tce_release_iommu_group(dev->kvm, kvg->vfio_group);
> >> +#endif
> >>  		kvm_vfio_group_set_kvm(kvg->vfio_group, NULL);
> >>  		kvm_vfio_group_put_external_user(kvg->vfio_group);
> >>  		list_del(&kvg->node);
> > 
> 
> 




-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH kernel v3] KVM: PPC: Add in-kernel acceleration for VFIO
@ 2017-01-12 23:53           ` David Gibson
  0 siblings, 0 replies; 56+ messages in thread
From: David Gibson @ 2017-01-12 23:53 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Paul Mackerras, kvm-ppc, kvm, Alex Williamson


On Thu, Jan 12, 2017 at 07:09:01PM +1100, Alexey Kardashevskiy wrote:
> On 12/01/17 16:04, David Gibson wrote:
> > On Tue, Dec 20, 2016 at 05:52:29PM +1100, Alexey Kardashevskiy wrote:
> >> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> >> and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
> >> without passing them to user space, which saves time on switching
> >> to user space and back.
> >>
> >> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> >> KVM tries to handle a TCE request in real mode; if that fails,
> >> it passes the request to the virtual mode handler to complete the operation.
> >> If the virtual mode handler also fails, the request is passed to
> >> user space; this is not expected to happen though.
> >>
> >> To avoid dealing with page use counters (which is tricky in real mode),
> >> this only accelerates SPAPR TCE IOMMU v2 clients which are required
> >> to pre-register the userspace memory. The very first TCE request will
> >> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> >> of the TCE table (iommu_table::it_userspace) is not allocated till
> >> the very first mapping happens and we cannot call vmalloc in real mode.
> >>
> >> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> >> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> >> and associates a physical IOMMU table with the SPAPR TCE table (which
> >> is a guest view of the hardware IOMMU table). The iommu_table object
> >> is cached and referenced so we do not have to look up for it in real mode.
> >>
> >> This does not implement the UNSET counterpart as there is no use for it -
> >> once the acceleration is enabled, the existing userspace won't
> >> disable it unless a VFIO container is destroyed; this adds necessary
> >> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> >>
> >> As this creates a descriptor per IOMMU table-LIOBN couple (called
> >> kvmppc_spapr_tce_iommu_table), it is possible to have several
> >> descriptors with the same iommu_table (hardware IOMMU table) attached
> >> to the same LIOBN; this is done to simplify the cleanup and can be
> >> improved later.
> >>
> >> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> >> space.
> >>
> >> This finally makes use of vfio_external_user_iommu_id() which was
> >> introduced quite some time ago and was considered for removal.
> >>
> >> Tests show that this patch increases transmission speed from 220MB/s
> >> to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb ethernet card).
> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> ---
> >> Changes:
> >> v3:
> >> * simplified not to use VFIO group notifiers
> >> * reworked cleanup, should be cleaner/simpler now
> >>
> >> v2:
> >> * reworked to use new VFIO notifiers
> >> * now same iommu_table may appear in the list several times, to be fixed later
> >> ---
> >>
> >> This obsoletes:
> >>
> >> [PATCH kernel v2 08/11] KVM: PPC: Pass kvm* to kvmppc_find_table()
> >> [PATCH kernel v2 09/11] vfio iommu: Add helpers to (un)register blocking notifiers per group
> >> [PATCH kernel v2 11/11] KVM: PPC: Add in-kernel acceleration for VFIO
> >>
> >>
> >> So I have not reposted the whole thing, should I have?
> >>
> >>
> >> btw "F:     virt/kvm/vfio.*" is missing from MAINTAINERS.
> >>
> >>
> >> ---
> >>  Documentation/virtual/kvm/devices/vfio.txt |  22 ++-
> >>  arch/powerpc/include/asm/kvm_host.h        |   8 +
> >>  arch/powerpc/include/asm/kvm_ppc.h         |   4 +
> >>  include/uapi/linux/kvm.h                   |   8 +
> >>  arch/powerpc/kvm/book3s_64_vio.c           | 286 +++++++++++++++++++++++++++++
> >>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 178 ++++++++++++++++++
> >>  arch/powerpc/kvm/powerpc.c                 |   2 +
> >>  virt/kvm/vfio.c                            |  88 +++++++++
> >>  8 files changed, 594 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> >> index ef51740c67ca..f95d867168ea 100644
> >> --- a/Documentation/virtual/kvm/devices/vfio.txt
> >> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> >> @@ -16,7 +16,25 @@ Groups:
> >>  
> >>  KVM_DEV_VFIO_GROUP attributes:
> >>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> >> +	kvm_device_attr.addr points to an int32_t file descriptor
> >> +	for the VFIO group.
> >>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
> >> +	kvm_device_attr.addr points to an int32_t file descriptor
> >> +	for the VFIO group.
> >> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
> >> +	allocated by sPAPR KVM.
> >> +	kvm_device_attr.addr points to a struct:
> >>  
> >> -For each, kvm_device_attr.addr points to an int32_t file descriptor
> >> -for the VFIO group.
> >> +	struct kvm_vfio_spapr_tce {
> >> +		__u32	argsz;
> >> +		__u32	flags;
> >> +		__s32	groupfd;
> >> +		__s32	tablefd;
> >> +	};
> >> +
> >> +	where
> >> +	@argsz is the size of struct kvm_vfio_spapr_tce;
> >> +	@flags are not supported now, must be zero;
> >> +	@groupfd is a file descriptor for a VFIO group;
> >> +	@tablefd is a file descriptor for a TCE table allocated via
> >> +		KVM_CREATE_SPAPR_TCE.
> >> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> >> index 28350a294b1e..3d281b7ea369 100644
> >> --- a/arch/powerpc/include/asm/kvm_host.h
> >> +++ b/arch/powerpc/include/asm/kvm_host.h
> >> @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
> >>  	atomic_t refcnt;
> >>  };
> >>  
> >> +struct kvmppc_spapr_tce_iommu_table {
> >> +	struct rcu_head rcu;
> >> +	struct list_head next;
> >> +	struct vfio_group *group;
> >> +	struct iommu_table *tbl;
> >> +};
> >> +
> >>  struct kvmppc_spapr_tce_table {
> >>  	struct list_head list;
> >>  	struct kvm *kvm;
> >> @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
> >>  	u32 page_shift;
> >>  	u64 offset;		/* in pages */
> >>  	u64 size;		/* window size in pages */
> >> +	struct list_head iommu_tables;
> >>  	struct page *pages[0];
> >>  };
> >>  
> >> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> >> index 0a21c8503974..936138b866e7 100644
> >> --- a/arch/powerpc/include/asm/kvm_ppc.h
> >> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> >> @@ -163,6 +163,10 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
> >>  extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
> >>  			struct kvm_memory_slot *memslot, unsigned long porder);
> >>  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
> >> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
> >> +		struct vfio_group *group, struct iommu_group *grp);
> >> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
> >> +		struct vfio_group *group);
> >>  
> >>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> >>  				struct kvm_create_spapr_tce_64 *args);
> >> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >> index 810f74317987..4088da4a575f 100644
> >> --- a/include/uapi/linux/kvm.h
> >> +++ b/include/uapi/linux/kvm.h
> >> @@ -1068,6 +1068,7 @@ struct kvm_device_attr {
> >>  #define  KVM_DEV_VFIO_GROUP			1
> >>  #define   KVM_DEV_VFIO_GROUP_ADD			1
> >>  #define   KVM_DEV_VFIO_GROUP_DEL			2
> >> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
> >>  
> >>  enum kvm_device_type {
> >>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
> >> @@ -1089,6 +1090,13 @@ enum kvm_device_type {
> >>  	KVM_DEV_TYPE_MAX,
> >>  };
> >>  
> >> +struct kvm_vfio_spapr_tce {
> >> +	__u32	argsz;
> >> +	__u32	flags;
> >> +	__s32	groupfd;
> >> +	__s32	tablefd;
> >> +};
> >> +
> >>  /*
> >>   * ioctls for VM fds
> >>   */
> >> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> >> index 15df8ae627d9..008c4aee4df6 100644
> >> --- a/arch/powerpc/kvm/book3s_64_vio.c
> >> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> >> @@ -27,6 +27,10 @@
> >>  #include <linux/hugetlb.h>
> >>  #include <linux/list.h>
> >>  #include <linux/anon_inodes.h>
> >> +#include <linux/iommu.h>
> >> +#include <linux/file.h>
> >> +#include <linux/vfio.h>
> >> +#include <linux/module.h>
> >>  
> >>  #include <asm/tlbflush.h>
> >>  #include <asm/kvm_ppc.h>
> >> @@ -39,6 +43,20 @@
> >>  #include <asm/udbg.h>
> >>  #include <asm/iommu.h>
> >>  #include <asm/tce.h>
> >> +#include <asm/mmu_context.h>
> >> +
> >> +static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
> >> +{
> >> +	void (*fn)(struct vfio_group *);
> >> +
> >> +	fn = symbol_get(vfio_group_put_external_user);
> >> +	if (!fn)
> > 
> > I think this should have a WARN_ON().  If the vfio module is gone
> > while you still have VFIO groups attached to a KVM table, something
> > has gone horribly wrong.
> > 
> >> +		return;
> >> +
> >> +	fn(vfio_group);
> >> +
> >> +	symbol_put(vfio_group_put_external_user);
> >> +}
> >>  
> >>  static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
> >>  {
> >> @@ -90,6 +108,99 @@ static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
> >>  	return ret;
> >>  }
> >>  
> >> +static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
> >> +{
> >> +	struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
> >> +			struct kvmppc_spapr_tce_iommu_table, rcu);
> >> +
> >> +	kfree(stit);
> >> +}
> >> +
> >> +static void kvm_spapr_tce_liobn_release_iommu_group(
> >> +		struct kvmppc_spapr_tce_table *stt,
> >> +		struct vfio_group *group)
> >> +{
> >> +	struct kvmppc_spapr_tce_iommu_table *stit, *tmp;
> >> +
> >> +	list_for_each_entry_safe(stit, tmp, &stt->iommu_tables, next) {
> >> +		if (group && (stit->group != group))
> >> +			continue;
> >> +
> >> +		list_del_rcu(&stit->next);
> >> +
> >> +		iommu_table_put(stit->tbl);
> >> +		kvm_vfio_group_put_external_user(stit->group);
> >> +
> >> +		call_rcu(&stit->rcu, kvm_spapr_tce_iommu_table_free);
> >> +	}
> >> +}
> >> +
> >> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
> >> +		struct vfio_group *group)
> >> +{
> >> +	struct kvmppc_spapr_tce_table *stt;
> >> +
> >> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list)
> >> +		kvm_spapr_tce_liobn_release_iommu_group(stt, group);
> >> +}
> >> +
> >> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
> >> +		struct vfio_group *group, struct iommu_group *grp)
> > 
> > Isn't passing both the vfio_group and the iommu_group redundant?
> 
> vfio_group struct is internal to the vfio driver and there is no API to get
> the iommu_group pointer from it.

But in the caller you *do* derive the iommu group from the vfio group by
going via its id (ugly, but workable I guess).  Why not fold that logic
into this function?

> >> +{
> >> +	struct kvmppc_spapr_tce_table *stt = NULL;
> >> +	bool found = false;
> >> +	struct iommu_table *tbl = NULL;
> >> +	struct iommu_table_group *table_group;
> >> +	long i;
> >> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >> +	struct fd f;
> >> +
> >> +	f = fdget(tablefd);
> >> +	if (!f.file)
> >> +		return -EBADF;
> >> +
> >> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
> >> +		if (stt == f.file->private_data) {
> >> +			found = true;
> >> +			break;
> >> +		}
> >> +	}
> >> +
> >> +	fdput(f);
> >> +
> >> +	if (!found)
> >> +		return -ENODEV;
> > 
> > Not entirely sure if ENODEV is the right error, but I can't
> > immediately think of a better one.
> > 
> >> +	table_group = iommu_group_get_iommudata(grp);
> >> +	if (!table_group)
> >> +		return -EFAULT;
> > 
> > EFAULT is usually only returned when you pass a syscall a bad pointer,
> > which doesn't look to be the case here.  What situation does this
> > error path actually represent?
> 
> 
> "something went terribly wrong".

In that case there should be a WARN_ON().  As long as it's something
terribly wrong that can't be the user's fault.

> >> +
> >> +	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
> >> +		struct iommu_table *tbltmp = table_group->tables[i];
> >> +
> >> +		if (!tbltmp)
> >> +			continue;
> >> +
> >> +		if ((tbltmp->it_page_shift == stt->page_shift) &&
> >> +				(tbltmp->it_offset == stt->offset)) {
> >> +			tbl = tbltmp;
> >> +			break;
> >> +		}
> >> +	}
> >> +	if (!tbl)
> >> +		return -ENODEV;
> >> +
> >> +	iommu_table_get(tbl);
> >> +
> >> +	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
> >> +	stit->tbl = tbl;
> >> +	stit->group = group;
> >> +
> >> +	list_add_rcu(&stit->next, &stt->iommu_tables);
> > 
> > Won't this add a separate stit entry for each group attached to the
> > LIOBN, even if those groups share a single hardware iommu table -
> > which is the likely case if those groups have all been put into the
> > same container.
> 
> 
> Correct. I am planning on optimizing this later.

Hmm, ok.

> >> +	return 0;
> >> +}
> >> +
> >>  static void release_spapr_tce_table(struct rcu_head *head)
> >>  {
> >>  	struct kvmppc_spapr_tce_table *stt = container_of(head,
> >> @@ -132,6 +243,8 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
> >>  
> >>  	list_del_rcu(&stt->list);
> >>  
> >> +	kvm_spapr_tce_liobn_release_iommu_group(stt, NULL /* release all */);
> >> +
> >>  	kvm_put_kvm(stt->kvm);
> >>  
> >>  	kvmppc_account_memlimit(
> >> @@ -181,6 +294,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> >>  	stt->offset = args->offset;
> >>  	stt->size = size;
> >>  	stt->kvm = kvm;
> >> +	INIT_LIST_HEAD_RCU(&stt->iommu_tables);
> >>  
> >>  	for (i = 0; i < npages; i++) {
> >>  		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
> >> @@ -209,11 +323,161 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> >>  	return ret;
> >>  }
> >>  
> >> +static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
> >> +		struct iommu_table *tbl, unsigned long entry)
> >> +{
> >> +	struct mm_iommu_table_group_mem_t *mem = NULL;
> >> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> >> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> >> +
> >> +	if (!pua)
> >> +		return H_HARDWARE;
> >> +
> >> +	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
> >> +	if (!mem)
> >> +		return H_HARDWARE;
> > 
> > IIUC, this error represents trying to unmap a page from the vIOMMU,
> > and discovering that it wasn't preregistered in the first place, which
> > shouldn't happen.  So would a WARN_ON() make sense here as well as the
> > H_HARDWARE.
> >
> >> +	mm_iommu_mapped_dec(mem);
> >> +
> >> +	*pua = 0;
> >> +
> >> +	return H_SUCCESS;
> >> +}
> >> +
> >> +static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
> >> +		struct iommu_table *tbl, unsigned long entry)
> >> +{
> >> +	enum dma_data_direction dir = DMA_NONE;
> >> +	unsigned long hpa = 0;
> >> +
> >> +	if (iommu_tce_xchg(tbl, entry, &hpa, &dir))
> >> +		return H_HARDWARE;
> >> +
> >> +	if (dir == DMA_NONE)
> >> +		return H_SUCCESS;
> >> +
> >> +	return kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
> >> +}
> >> +
> >> +long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
> >> +		unsigned long entry, unsigned long gpa,
> >> +		enum dma_data_direction dir)
> >> +{
> >> +	long ret;
> >> +	unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> >> +	struct mm_iommu_table_group_mem_t *mem;
> >> +
> >> +	if (!pua)
> >> +		/* it_userspace allocation might be delayed */
> >> +		return H_TOO_HARD;
> >> +
> >> +	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
> >> +		return H_HARDWARE;
> > 
> > This would represent the guest trying to map a bad GPA, yes?  In which
> > case H_HARDWARE doesn't seem right.  H_PARAMETER or H_PERMISSION, maybe.
> > 
> >> +	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
> >> +	if (!mem)
> >> +		return H_HARDWARE;
> > 
> > Here H_HARDWARE seems right. IIUC this represents the guest trying to
> > map an address which wasn't pre-registered.  That would indicate a bug
> > in qemu, which is hardware as far as the guest is concerned.
> > 
> >> +
> >> +	if (mm_iommu_ua_to_hpa(mem, ua, &hpa))
> >> +		return H_HARDWARE;
> > 
> > Not sure what this case represents.
> 
> Preregistered memory not being able to translate a userspace address to a
> host physical address. In virtual mode it is a simple bounds check; in real mode
> it also includes vmalloc_to_phys() failure.

Ok.  This caller is virtual mode only, isn't it?  If we fail the
bounds check, that sounds like a WARN_ON() + H_HARDWARE, since it
means we've translated the GPA to an insane UA.

If the translation just fails, that sounds like an H_TOO_HARD.

> >> +	if (mm_iommu_mapped_inc(mem))
> >> +		return H_HARDWARE;
> > 
> > Or this.
> 
> A preregistered memory area is in the process of being disposed of; no new
> mappings are allowed.

Ok, again under control of qemu, so H_HARDWARE is reasonable.

> >> +	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
> >> +	if (ret) {
> >> +		mm_iommu_mapped_dec(mem);
> >> +		return H_TOO_HARD;
> >> +	}
> >> +
> >> +	if (dir != DMA_NONE)
> >> +		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
> >> +
> >> +	*pua = ua;
> >> +
> >> +	return 0;
> >> +}
> >> +
> >> +long kvmppc_h_put_tce_iommu(struct kvm_vcpu *vcpu,
> >> +		struct iommu_table *tbl,
> >> +		unsigned long liobn, unsigned long ioba,
> >> +		unsigned long tce)
> >> +{
> >> +	long idx, ret = H_HARDWARE;
> >> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> >> +	const unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >> +	const enum dma_data_direction dir = iommu_tce_direction(tce);
> >> +
> >> +	/* Clear TCE */
> >> +	if (dir == DMA_NONE) {
> >> +		if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
> >> +			return H_PARAMETER;
> >> +
> >> +		return kvmppc_tce_iommu_unmap(vcpu->kvm, tbl, entry);
> >> +	}
> >> +
> >> +	/* Put TCE */
> >> +	if (iommu_tce_put_param_check(tbl, ioba, gpa))
> >> +		return H_PARAMETER;
> >> +
> >> +	idx = srcu_read_lock(&vcpu->kvm->srcu);
> >> +	ret = kvmppc_tce_iommu_map(vcpu->kvm, tbl, entry, gpa, dir);
> >> +	srcu_read_unlock(&vcpu->kvm->srcu, idx);
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +static long kvmppc_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
> >> +		struct iommu_table *tbl, unsigned long ioba,
> >> +		u64 __user *tces, unsigned long npages)
> >> +{
> >> +	unsigned long i, ret, tce, gpa;
> >> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> >> +
> >> +	for (i = 0; i < npages; ++i) {
> >> +		gpa = be64_to_cpu(tces[i]) & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> > 
> > IIUC this is the virtual mode, not the real mode version.  In which
> > case you shouldn't be accessing tces[i] (a userspace pointer) directly
> > but should instead be using get_user().
> > 
> >> +		if (iommu_tce_put_param_check(tbl, ioba +
> >> +				(i << tbl->it_page_shift), gpa))
> >> +			return H_PARAMETER;
> >> +	}
> >> +
> >> +	for (i = 0; i < npages; ++i) {
> >> +		tce = be64_to_cpu(tces[i]);
> >> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >> +
> >> +		ret = kvmppc_tce_iommu_map(vcpu->kvm, tbl, entry + i, gpa,
> >> +				iommu_tce_direction(tce));
> >> +		if (ret != H_SUCCESS)
> >> +			return ret;
> >> +	}
> >> +
> >> +	return H_SUCCESS;
> >> +}
> >> +
> >> +long kvmppc_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
> >> +		struct iommu_table *tbl,
> >> +		unsigned long liobn, unsigned long ioba,
> >> +		unsigned long tce_value, unsigned long npages)
> >> +{
> >> +	unsigned long i;
> >> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> >> +
> >> +	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
> >> +		return H_PARAMETER;
> >> +
> >> +	for (i = 0; i < npages; ++i)
> >> +		kvmppc_tce_iommu_unmap(vcpu->kvm, tbl, entry + i);
> >> +
> >> +	return H_SUCCESS;
> >> +}
> >> +
> >>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> >>  		      unsigned long ioba, unsigned long tce)
> >>  {
> >>  	struct kvmppc_spapr_tce_table *stt;
> >>  	long ret;
> >> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >>  
> >>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
> >>  	/* 	    liobn, ioba, tce); */
> >> @@ -230,6 +494,12 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> >>  	if (ret != H_SUCCESS)
> >>  		return ret;
> >>  
> >> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> > 
> > As noted above, AFAICT there is one stit per group, rather than per
> > backend IOMMU table, so if there are multiple groups in the same
> > container (and therefore attached to the same LIOBN), won't this mean
> > we duplicate this operation a bunch of times?
> > 
> >> +		ret = kvmppc_h_put_tce_iommu(vcpu, stit->tbl, liobn, ioba, tce);
> >> +		if (ret != H_SUCCESS)
> >> +			return ret;
> >> +	}
> >> +
> >>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
> >>  
> >>  	return H_SUCCESS;
> >> @@ -245,6 +515,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>  	unsigned long entry, ua = 0;
> >>  	u64 __user *tces;
> >>  	u64 tce;
> >> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >>  
> >>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >>  	if (!stt)
> >> @@ -272,6 +543,13 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>  	}
> >>  	tces = (u64 __user *) ua;
> >>  
> >> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> >> +		ret = kvmppc_h_put_tce_indirect_iommu(vcpu,
> >> +				stit->tbl, ioba, tces, npages);
> >> +		if (ret != H_SUCCESS)
> >> +			goto unlock_exit;
> > 
> > Hmm, I don't suppose you could simplify things by not having a
> > put_tce_indirect() version of the whole backend iommu mapping
> > function, but just a single-TCE version, and instead looping across
> > the backend IOMMU tables as you put each indirect entry in.
> > 
> >> +	}
> >> +
> >>  	for (i = 0; i < npages; ++i) {
> >>  		if (get_user(tce, tces + i)) {
> >>  			ret = H_TOO_HARD;
> >> @@ -299,6 +577,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
> >>  {
> >>  	struct kvmppc_spapr_tce_table *stt;
> >>  	long i, ret;
> >> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >>  
> >>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >>  	if (!stt)
> >> @@ -312,6 +591,13 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
> >>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
> >>  		return H_PARAMETER;
> >>  
> >> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> >> +		ret = kvmppc_h_stuff_tce_iommu(vcpu, stit->tbl, liobn, ioba,
> >> +				tce_value, npages);
> >> +		if (ret != H_SUCCESS)
> >> +			return ret;
> >> +	}
> >> +
> >>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
> >>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
> >>  
> >> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> >> index 8a6834e6e1c8..4d6f01712a6d 100644
> >> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> >> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> >> @@ -190,11 +190,165 @@ static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
> >>  	return mm_iommu_lookup_rm(vcpu->kvm->mm, ua, size);
> >>  }
> >>  
> >> +static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm_vcpu *vcpu,
> >> +		struct iommu_table *tbl, unsigned long entry)
> >> +{
> >> +	struct mm_iommu_table_group_mem_t *mem = NULL;
> >> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> >> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> >> +
> >> +	if (!pua)
> >> +		return H_SUCCESS;
> > 
> > What case is this?  Not being able to find the userspace doesn't sound
> > like a success.
> > 
> >> +	pua = (void *) vmalloc_to_phys(pua);
> >> +	if (!pua)
> >> +		return H_SUCCESS;
> > 
> > And again..
> > 
> >> +	mem = kvmppc_rm_iommu_lookup(vcpu, *pua, pgsize);
> >> +	if (!mem)
> >> +		return H_HARDWARE;
> > 
> > Should this have a WARN_ON?
> > 
> >> +	mm_iommu_mapped_dec(mem);
> >> +
> >> +	*pua = 0;
> >> +
> >> +	return H_SUCCESS;
> >> +}
> >> +
> >> +static long kvmppc_rm_tce_iommu_unmap(struct kvm_vcpu *vcpu,
> >> +		struct iommu_table *tbl, unsigned long entry)
> >> +{
> >> +	enum dma_data_direction dir = DMA_NONE;
> >> +	unsigned long hpa = 0;
> >> +
> >> +	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
> >> +		return H_HARDWARE;
> >> +
> >> +	if (dir == DMA_NONE)
> >> +		return H_SUCCESS;
> >> +
> >> +	return kvmppc_rm_tce_iommu_mapped_dec(vcpu, tbl, entry);
> >> +}
> >> +
> >> +long kvmppc_rm_tce_iommu_map(struct kvm_vcpu *vcpu, struct iommu_table *tbl,
> >> +		unsigned long entry, unsigned long gpa,
> >> +		enum dma_data_direction dir)
> >> +{
> >> +	long ret;
> >> +	unsigned long hpa = 0, ua;
> >> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> >> +	struct mm_iommu_table_group_mem_t *mem;
> >> +
> >> +	if (!pua)
> >> +		/* it_userspace allocation might be delayed */
> >> +		return H_TOO_HARD;
> >> +
> >> +	if (kvmppc_gpa_to_ua(vcpu->kvm, gpa, &ua, NULL))
> >> +		return H_HARDWARE;
> > 
> > Again H_HARDWARE doesn't seem right here.
> > 
> >> +	mem = kvmppc_rm_iommu_lookup(vcpu, ua, 1ULL << tbl->it_page_shift);
> >> +	if (!mem)
> >> +		return H_HARDWARE;
> >> +
> >> +	if (mm_iommu_ua_to_hpa_rm(mem, ua, &hpa))
> >> +		return H_HARDWARE;
> >> +
> >> +	pua = (void *) vmalloc_to_phys(pua);
> >> +	if (!pua)
> >> +		return H_HARDWARE;
> >> +
> >> +	if (mm_iommu_mapped_inc(mem))
> >> +		return H_HARDWARE;
> >> +
> >> +	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> >> +	if (ret) {
> >> +		mm_iommu_mapped_dec(mem);
> >> +		return H_TOO_HARD;
> >> +	}
> >> +
> >> +	if (dir != DMA_NONE)
> >> +		kvmppc_rm_tce_iommu_mapped_dec(vcpu, tbl, entry);
> >> +
> >> +	*pua = ua;
> >> +
> >> +	return 0;
> >> +}
> >> +EXPORT_SYMBOL_GPL(kvmppc_rm_tce_iommu_map);
> >> +
> >> +static long kvmppc_rm_h_put_tce_iommu(struct kvm_vcpu *vcpu,
> >> +		struct iommu_table *tbl, unsigned long liobn,
> >> +		unsigned long ioba, unsigned long tce)
> >> +{
> >> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> >> +	const unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >> +	const enum dma_data_direction dir = iommu_tce_direction(tce);
> >> +
> >> +	/* Clear TCE */
> >> +	if (dir == DMA_NONE) {
> >> +		if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
> >> +			return H_PARAMETER;
> >> +
> >> +		return kvmppc_rm_tce_iommu_unmap(vcpu, tbl, entry);
> >> +	}
> >> +
> >> +	/* Put TCE */
> >> +	if (iommu_tce_put_param_check(tbl, ioba, gpa))
> >> +		return H_PARAMETER;
> >> +
> >> +	return kvmppc_rm_tce_iommu_map(vcpu, tbl, entry, gpa, dir);
> >> +}
> >> +
> >> +static long kvmppc_rm_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
> >> +		struct iommu_table *tbl, unsigned long ioba,
> >> +		u64 *tces, unsigned long npages)
> >> +{
> >> +	unsigned long i, ret, tce, gpa;
> >> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> >> +
> >> +	for (i = 0; i < npages; ++i) {
> >> +		gpa = be64_to_cpu(tces[i]) & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >> +
> >> +		if (iommu_tce_put_param_check(tbl, ioba +
> >> +				(i << tbl->it_page_shift), gpa))
> >> +			return H_PARAMETER;
> >> +	}
> >> +
> >> +	for (i = 0; i < npages; ++i) {
> >> +		tce = be64_to_cpu(tces[i]);
> >> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >> +
> >> +		ret = kvmppc_rm_tce_iommu_map(vcpu, tbl, entry + i, gpa,
> >> +				iommu_tce_direction(tce));
> >> +		if (ret != H_SUCCESS)
> >> +			return ret;
> >> +	}
> >> +
> >> +	return H_SUCCESS;
> >> +}
> >> +
> >> +static long kvmppc_rm_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
> >> +		struct iommu_table *tbl,
> >> +		unsigned long liobn, unsigned long ioba,
> >> +		unsigned long tce_value, unsigned long npages)
> >> +{
> >> +	unsigned long i;
> >> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> >> +
> >> +	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
> >> +		return H_PARAMETER;
> >> +
> >> +	for (i = 0; i < npages; ++i)
> >> +		kvmppc_rm_tce_iommu_unmap(vcpu, tbl, entry + i);
> >> +
> >> +	return H_SUCCESS;
> >> +}
> >> +
> >>  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> >>  		unsigned long ioba, unsigned long tce)
> >>  {
> >>  	struct kvmppc_spapr_tce_table *stt;
> >>  	long ret;
> >> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >>  
> >>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
> >>  	/* 	    liobn, ioba, tce); */
> >> @@ -211,6 +365,13 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> >>  	if (ret != H_SUCCESS)
> >>  		return ret;
> >>  
> >> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> >> +		ret = kvmppc_rm_h_put_tce_iommu(vcpu, stit->tbl,
> >> +				liobn, ioba, tce);
> >> +		if (ret != H_SUCCESS)
> >> +			return ret;
> >> +	}
> >> +
> >>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
> >>  
> >>  	return H_SUCCESS;
> >> @@ -278,6 +439,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>  		 * depend on hpt.
> >>  		 */
> >>  		struct mm_iommu_table_group_mem_t *mem;
> >> +		struct kvmppc_spapr_tce_iommu_table *stit;
> >>  
> >>  		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
> >>  			return H_TOO_HARD;
> >> @@ -285,6 +447,13 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>  		mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K);
> >>  		if (!mem || mm_iommu_ua_to_hpa_rm(mem, ua, &tces))
> >>  			return H_TOO_HARD;
> >> +
> >> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> >> +			ret = kvmppc_rm_h_put_tce_indirect_iommu(vcpu,
> >> +					stit->tbl, ioba, (u64 *)tces, npages);
> >> +			if (ret != H_SUCCESS)
> >> +				return ret;
> >> +		}
> >>  	} else {
> >>  		/*
> >>  		 * This is emulated devices case.
> >> @@ -334,6 +503,8 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
> >>  {
> >>  	struct kvmppc_spapr_tce_table *stt;
> >>  	long i, ret;
> >> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >> +
> >>  
> >>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >>  	if (!stt)
> >> @@ -347,6 +518,13 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
> >>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
> >>  		return H_PARAMETER;
> >>  
> >> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> >> +		ret = kvmppc_rm_h_stuff_tce_iommu(vcpu, stit->tbl,
> >> +				liobn, ioba, tce_value, npages);
> >> +		if (ret != H_SUCCESS)
> >> +			return ret;
> >> +	}
> >> +
> >>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
> >>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
> >>  
> >> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> >> index 70963c845e96..0e555ba998c0 100644
> >> --- a/arch/powerpc/kvm/powerpc.c
> >> +++ b/arch/powerpc/kvm/powerpc.c
> >> @@ -536,6 +536,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> >>  #ifdef CONFIG_PPC_BOOK3S_64
> >>  	case KVM_CAP_SPAPR_TCE:
> >>  	case KVM_CAP_SPAPR_TCE_64:
> >> +		/* fallthrough */
> >> +	case KVM_CAP_SPAPR_TCE_VFIO:
> >>  	case KVM_CAP_PPC_ALLOC_HTAB:
> >>  	case KVM_CAP_PPC_RTAS:
> >>  	case KVM_CAP_PPC_FIXUP_HCALL:
> >> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> >> index d32f239eb471..3181054c8ff7 100644
> >> --- a/virt/kvm/vfio.c
> >> +++ b/virt/kvm/vfio.c
> >> @@ -20,6 +20,10 @@
> >>  #include <linux/vfio.h>
> >>  #include "vfio.h"
> >>  
> >> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >> +#include <asm/kvm_ppc.h>
> >> +#endif
> >> +
> >>  struct kvm_vfio_group {
> >>  	struct list_head node;
> >>  	struct vfio_group *vfio_group;
> >> @@ -89,6 +93,22 @@ static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
> >>  	return ret > 0;
> >>  }
> >>  
> >> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> >> +{
> >> +	int (*fn)(struct vfio_group *);
> >> +	int ret = -1;
> >> +
> >> +	fn = symbol_get(vfio_external_user_iommu_id);
> >> +	if (!fn)
> >> +		return ret;
> >> +
> >> +	ret = fn(vfio_group);
> >> +
> >> +	symbol_put(vfio_external_user_iommu_id);
> >> +
> >> +	return ret;
> >> +}
> >> +
> >>  /*
> >>   * Groups can use the same or different IOMMU domains.  If the same then
> >>   * adding a new group may change the coherency of groups we've previously
> >> @@ -211,6 +231,9 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >>  
> >>  		mutex_unlock(&kv->lock);
> >>  
> >> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >> +		kvm_spapr_tce_release_iommu_group(dev->kvm, vfio_group);
> >> +#endif
> >>  		kvm_vfio_group_set_kvm(vfio_group, NULL);
> >>  
> >>  		kvm_vfio_group_put_external_user(vfio_group);
> >> @@ -218,6 +241,65 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >>  		kvm_vfio_update_coherency(dev);
> >>  
> >>  		return ret;
> >> +
> >> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
> >> +		struct kvm_vfio_spapr_tce param;
> >> +		unsigned long minsz;
> >> +		struct kvm_vfio *kv = dev->private;
> >> +		struct vfio_group *vfio_group;
> >> +		struct kvm_vfio_group *kvg;
> >> +		struct fd f;
> >> +
> >> +		minsz = offsetofend(struct kvm_vfio_spapr_tce, tablefd);
> >> +
> >> +		if (copy_from_user(&param, (void __user *)arg, minsz))
> >> +			return -EFAULT;
> >> +
> >> +		if (param.argsz < minsz || param.flags)
> >> +			return -EINVAL;
> >> +
> >> +		f = fdget(param.groupfd);
> >> +		if (!f.file)
> >> +			return -EBADF;
> >> +
> >> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
> >> +		fdput(f);
> >> +
> >> +		if (IS_ERR(vfio_group))
> >> +			return PTR_ERR(vfio_group);
> >> +
> >> +		ret = -ENOENT;
> >> +
> >> +		mutex_lock(&kv->lock);
> >> +
> >> +		list_for_each_entry(kvg, &kv->group_list, node) {
> >> +			int group_id;
> >> +			struct iommu_group *grp;
> >> +
> >> +			if (kvg->vfio_group != vfio_group)
> >> +				continue;
> >> +
> >> +			group_id = kvm_vfio_external_user_iommu_id(
> >> +					kvg->vfio_group);
> >> +			grp = iommu_group_get_by_id(group_id);
> >> +			if (!grp) {
> >> +				ret = -EFAULT;
> >> +				break;
> >> +			}
> >> +
> >> +			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
> >> +					param.tablefd, vfio_group, grp);
> >> +
> >> +			iommu_group_put(grp);
> >> +			break;
> >> +		}
> >> +
> >> +		mutex_unlock(&kv->lock);
> >> +
> >> +		return ret;
> >> +	}
> >> +#endif /* CONFIG_SPAPR_TCE_IOMMU */
> >>  	}
> > 
> > Don't you also need to add something to the KVM_DEV_VFIO_GROUP_DEL
> > path to detach the group from all LIOBNs?  Or else just fail if
> > there are LIOBNs attached.  I think it would be a qemu bug not to
> > detach the LIOBNs before removing the group, but we still need to
> > protect the host in that case.
> 
> 
> Yeah, this bit is a bit tricky/ugly.
> 
> kvm_spapr_tce_liobn_release_iommu_group() (called from
> kvm_spapr_tce_fops::release()) drops references when a group is removed
> from the VFIO KVM device so there is no KVM_DEV_VFIO_GROUP_UNSET_SPAPR_TCE
> and no action from QEMU is required.

IF qemu simply closes the group fd.  Which it does now, but might not
always.  You still need to deal with the case where userspace does a
KVM_DEV_VFIO_GROUP_DEL instead of closing the group fd.
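
For reference (illustration only, not from the series): the userspace sequence under
discussion would look roughly like the sketch below. The fd names and the helper are
made up; struct kvm_vfio_spapr_tce and KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE are what this
patch adds, the rest is the existing KVM device-attr API.

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Hypothetical helper: wire a VFIO group to a TCE table fd, then later
 * detach the group with GROUP_DEL instead of just close(group_fd). */
static int set_spapr_tce_then_del(int vfio_kvm_dev_fd, int group_fd, int table_fd)
{
	struct kvm_vfio_spapr_tce spapr = {
		.argsz   = sizeof(spapr),
		.groupfd = group_fd,	/* VFIO group fd */
		.tablefd = table_fd,	/* fd from KVM_CREATE_SPAPR_TCE */
	};
	struct kvm_device_attr attr = {
		.group = KVM_DEV_VFIO_GROUP,
		.attr  = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE,
		.addr  = (__u64)(unsigned long)&spapr,
	};

	if (ioctl(vfio_kvm_dev_fd, KVM_SET_DEVICE_ATTR, &attr))
		return -1;

	/* ... later, instead of close(group_fd): */
	attr.attr = KVM_DEV_VFIO_GROUP_DEL;
	attr.addr = (__u64)(unsigned long)&group_fd;
	return ioctl(vfio_kvm_dev_fd, KVM_SET_DEVICE_ATTR, &attr);
}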

> 
> 
> 
> >>  
> >>  	return -ENXIO;
> >> @@ -242,6 +324,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
> >>  		switch (attr->attr) {
> >>  		case KVM_DEV_VFIO_GROUP_ADD:
> >>  		case KVM_DEV_VFIO_GROUP_DEL:
> >> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >> +		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
> >> +#endif
> >>  			return 0;
> >>  		}
> >>  
> >> @@ -257,6 +342,9 @@ static void kvm_vfio_destroy(struct kvm_device *dev)
> >>  	struct kvm_vfio_group *kvg, *tmp;
> >>  
> >>  	list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
> >> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >> +		kvm_spapr_tce_release_iommu_group(dev->kvm, kvg->vfio_group);
> >> +#endif
> >>  		kvm_vfio_group_set_kvm(kvg->vfio_group, NULL);
> >>  		kvm_vfio_group_put_external_user(kvg->vfio_group);
> >>  		list_del(&kvg->node);
> > 
> 
> 




-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH kernel v3] KVM: PPC: Add in-kernel acceleration for VFIO
  2017-01-12 23:53           ` David Gibson
@ 2017-01-13  2:23             ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 56+ messages in thread
From: Alexey Kardashevskiy @ 2017-01-13  2:23 UTC (permalink / raw)
  To: David Gibson; +Cc: linuxppc-dev, Paul Mackerras, kvm-ppc, kvm, Alex Williamson



On 13/01/17 10:53, David Gibson wrote:
> On Thu, Jan 12, 2017 at 07:09:01PM +1100, Alexey Kardashevskiy wrote:
>> On 12/01/17 16:04, David Gibson wrote:
>>> On Tue, Dec 20, 2016 at 05:52:29PM +1100, Alexey Kardashevskiy wrote:
>>>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
>>>> and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
>>>> without passing them to user space which saves time on switching
>>>> to user space and back.
>>>>
>>>> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
>>>> KVM tries to handle a TCE request in real mode; if that fails,
>>>> it passes the request to the virtual mode handler to complete the operation.
>>>> If the virtual mode handler fails as well, the request is passed to
>>>> user space; this is not expected to happen though.
>>>>
>>>> To avoid dealing with page use counters (which is tricky in real mode),
>>>> this only accelerates SPAPR TCE IOMMU v2 clients which are required
>>>> to pre-register the userspace memory. The very first TCE request will
>>>> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
>>>> of the TCE table (iommu_table::it_userspace) is not allocated till
>>>> the very first mapping happens and we cannot call vmalloc in real mode.
>>>>
>>>> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
>>>> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
>>>> and associates a physical IOMMU table with the SPAPR TCE table (which
>>>> is a guest view of the hardware IOMMU table). The iommu_table object
>>>> is cached and referenced so we do not have to look up for it in real mode.
>>>>
>>>> This does not implement the UNSET counterpart as there is no use for it -
>>>> once the acceleration is enabled, the existing userspace won't
>>>> disable it unless a VFIO container is destroyed; this adds necessary
>>>> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
>>>>
>>>> As this creates a descriptor per IOMMU table-LIOBN couple (called
>>>> kvmppc_spapr_tce_iommu_table), it is possible to have several
>>>> descriptors with the same iommu_table (hardware IOMMU table) attached
>>>> to the same LIOBN, this is done to simplify the cleanup and can be
>>>> improved later.
>>>>
>>>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
>>>> space.
>>>>
>>>> This finally makes use of vfio_external_user_iommu_id() which was
>>>> introduced quite some time ago and was considered for removal.
>>>>
>>>> Tests show that this patch increases transmission speed from 220MB/s
>>>> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> ---
>>>> Changes:
>>>> v3:
>>>> * simplified not to use VFIO group notifiers
>>>> * reworked cleanup, should be cleaner/simpler now
>>>>
>>>> v2:
>>>> * reworked to use new VFIO notifiers
>>>> * now same iommu_table may appear in the list several times, to be fixed later
>>>> ---
>>>>
>>>> This obsoletes:
>>>>
>>>> [PATCH kernel v2 08/11] KVM: PPC: Pass kvm* to kvmppc_find_table()
>>>> [PATCH kernel v2 09/11] vfio iommu: Add helpers to (un)register blocking notifiers per group
>>>> [PATCH kernel v2 11/11] KVM: PPC: Add in-kernel acceleration for VFIO
>>>>
>>>>
>>>> So I have not reposted the whole thing, should I have?
>>>>
>>>>
>>>> btw "F:     virt/kvm/vfio.*"  is missing MAINTAINERS.
>>>>
>>>>
>>>> ---
>>>>  Documentation/virtual/kvm/devices/vfio.txt |  22 ++-
>>>>  arch/powerpc/include/asm/kvm_host.h        |   8 +
>>>>  arch/powerpc/include/asm/kvm_ppc.h         |   4 +
>>>>  include/uapi/linux/kvm.h                   |   8 +
>>>>  arch/powerpc/kvm/book3s_64_vio.c           | 286 +++++++++++++++++++++++++++++
>>>>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 178 ++++++++++++++++++
>>>>  arch/powerpc/kvm/powerpc.c                 |   2 +
>>>>  virt/kvm/vfio.c                            |  88 +++++++++
>>>>  8 files changed, 594 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
>>>> index ef51740c67ca..f95d867168ea 100644
>>>> --- a/Documentation/virtual/kvm/devices/vfio.txt
>>>> +++ b/Documentation/virtual/kvm/devices/vfio.txt
>>>> @@ -16,7 +16,25 @@ Groups:
>>>>  
>>>>  KVM_DEV_VFIO_GROUP attributes:
>>>>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
>>>> +	kvm_device_attr.addr points to an int32_t file descriptor
>>>> +	for the VFIO group.
>>>>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
>>>> +	kvm_device_attr.addr points to an int32_t file descriptor
>>>> +	for the VFIO group.
>>>> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
>>>> +	allocated by sPAPR KVM.
>>>> +	kvm_device_attr.addr points to a struct:
>>>>  
>>>> -For each, kvm_device_attr.addr points to an int32_t file descriptor
>>>> -for the VFIO group.
>>>> +	struct kvm_vfio_spapr_tce {
>>>> +		__u32	argsz;
>>>> +		__u32	flags;
>>>> +		__s32	groupfd;
>>>> +		__s32	tablefd;
>>>> +	};
>>>> +
>>>> +	where
>>>> +	@argsz is the size of kvm_vfio_spapr_tce_liobn;
>>>> +	@flags are not supported now, must be zero;
>>>> +	@groupfd is a file descriptor for a VFIO group;
>>>> +	@tablefd is a file descriptor for a TCE table allocated via
>>>> +		KVM_CREATE_SPAPR_TCE.
>>>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
>>>> index 28350a294b1e..3d281b7ea369 100644
>>>> --- a/arch/powerpc/include/asm/kvm_host.h
>>>> +++ b/arch/powerpc/include/asm/kvm_host.h
>>>> @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
>>>>  	atomic_t refcnt;
>>>>  };
>>>>  
>>>> +struct kvmppc_spapr_tce_iommu_table {
>>>> +	struct rcu_head rcu;
>>>> +	struct list_head next;
>>>> +	struct vfio_group *group;
>>>> +	struct iommu_table *tbl;
>>>> +};
>>>> +
>>>>  struct kvmppc_spapr_tce_table {
>>>>  	struct list_head list;
>>>>  	struct kvm *kvm;
>>>> @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
>>>>  	u32 page_shift;
>>>>  	u64 offset;		/* in pages */
>>>>  	u64 size;		/* window size in pages */
>>>> +	struct list_head iommu_tables;
>>>>  	struct page *pages[0];
>>>>  };
>>>>  
>>>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
>>>> index 0a21c8503974..936138b866e7 100644
>>>> --- a/arch/powerpc/include/asm/kvm_ppc.h
>>>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
>>>> @@ -163,6 +163,10 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
>>>>  extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
>>>>  			struct kvm_memory_slot *memslot, unsigned long porder);
>>>>  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
>>>> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
>>>> +		struct vfio_group *group, struct iommu_group *grp);
>>>> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
>>>> +		struct vfio_group *group);
>>>>  
>>>>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>>>  				struct kvm_create_spapr_tce_64 *args);
>>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>>>> index 810f74317987..4088da4a575f 100644
>>>> --- a/include/uapi/linux/kvm.h
>>>> +++ b/include/uapi/linux/kvm.h
>>>> @@ -1068,6 +1068,7 @@ struct kvm_device_attr {
>>>>  #define  KVM_DEV_VFIO_GROUP			1
>>>>  #define   KVM_DEV_VFIO_GROUP_ADD			1
>>>>  #define   KVM_DEV_VFIO_GROUP_DEL			2
>>>> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
>>>>  
>>>>  enum kvm_device_type {
>>>>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
>>>> @@ -1089,6 +1090,13 @@ enum kvm_device_type {
>>>>  	KVM_DEV_TYPE_MAX,
>>>>  };
>>>>  
>>>> +struct kvm_vfio_spapr_tce {
>>>> +	__u32	argsz;
>>>> +	__u32	flags;
>>>> +	__s32	groupfd;
>>>> +	__s32	tablefd;
>>>> +};
>>>> +
>>>>  /*
>>>>   * ioctls for VM fds
>>>>   */
>>>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
>>>> index 15df8ae627d9..008c4aee4df6 100644
>>>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>>>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>>>> @@ -27,6 +27,10 @@
>>>>  #include <linux/hugetlb.h>
>>>>  #include <linux/list.h>
>>>>  #include <linux/anon_inodes.h>
>>>> +#include <linux/iommu.h>
>>>> +#include <linux/file.h>
>>>> +#include <linux/vfio.h>
>>>> +#include <linux/module.h>
>>>>  
>>>>  #include <asm/tlbflush.h>
>>>>  #include <asm/kvm_ppc.h>
>>>> @@ -39,6 +43,20 @@
>>>>  #include <asm/udbg.h>
>>>>  #include <asm/iommu.h>
>>>>  #include <asm/tce.h>
>>>> +#include <asm/mmu_context.h>
>>>> +
>>>> +static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
>>>> +{
>>>> +	void (*fn)(struct vfio_group *);
>>>> +
>>>> +	fn = symbol_get(vfio_group_put_external_user);
>>>> +	if (!fn)
>>>
>>> I think this should have a WARN_ON().  If the vfio module is gone
>>> while you still have VFIO groups attached to a KVM table, something
>>> has gone horribly wrong.
>>>
>>>> +		return;
>>>> +
>>>> +	fn(vfio_group);
>>>> +
>>>> +	symbol_put(vfio_group_put_external_user);
>>>> +}
>>>>  
>>>>  static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
>>>>  {
>>>> @@ -90,6 +108,99 @@ static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
>>>>  	return ret;
>>>>  }
>>>>  
>>>> +static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
>>>> +{
>>>> +	struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
>>>> +			struct kvmppc_spapr_tce_iommu_table, rcu);
>>>> +
>>>> +	kfree(stit);
>>>> +}
>>>> +
>>>> +static void kvm_spapr_tce_liobn_release_iommu_group(
>>>> +		struct kvmppc_spapr_tce_table *stt,
>>>> +		struct vfio_group *group)
>>>> +{
>>>> +	struct kvmppc_spapr_tce_iommu_table *stit, *tmp;
>>>> +
>>>> +	list_for_each_entry_safe(stit, tmp, &stt->iommu_tables, next) {
>>>> +		if (group && (stit->group != group))
>>>> +			continue;
>>>> +
>>>> +		list_del_rcu(&stit->next);
>>>> +
>>>> +		iommu_table_put(stit->tbl);
>>>> +		kvm_vfio_group_put_external_user(stit->group);
>>>> +
>>>> +		call_rcu(&stit->rcu, kvm_spapr_tce_iommu_table_free);
>>>> +	}
>>>> +}
>>>> +
>>>> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
>>>> +		struct vfio_group *group)
>>>> +{
>>>> +	struct kvmppc_spapr_tce_table *stt;
>>>> +
>>>> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list)
>>>> +		kvm_spapr_tce_liobn_release_iommu_group(stt, group);
>>>> +}
>>>> +
>>>> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
>>>> +		struct vfio_group *group, struct iommu_group *grp)
>>>
>>> Isn't passing both the vfio_group and the iommu_group redundant?
>>
>> vfio_group struct is internal to the vfio driver and there is no API to get
>> the iommu_group pointer from it.
> 
> But in the caller you *do* derive the group from the vfio group by
> going via id (ugly, but workable I guess).  Why not fold that logic
> into this function.


I could; either way looks equally ugly-ish to me. Should I rework it?
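
A rough sketch of what folding the lookup in could look like (illustration only,
not what was posted; the inner helper name is made up, the symbol_get() dance
mirrors what the patch already does for the other vfio symbols):

/* Hypothetical variant: derive the iommu_group from the vfio_group inside
 * the attach function instead of passing both in. */
extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
		struct vfio_group *group)
{
	int (*fn)(struct vfio_group *);
	struct iommu_group *grp;
	int group_id;
	long ret;

	fn = symbol_get(vfio_external_user_iommu_id);
	if (!fn)
		return -ENOENT;
	group_id = fn(group);
	symbol_put(vfio_external_user_iommu_id);

	grp = iommu_group_get_by_id(group_id);
	if (!grp)
		return -ENODEV;

	/* kvm_spapr_tce_do_attach() is a made-up name for the current body */
	ret = kvm_spapr_tce_do_attach(kvm, tablefd, group, grp);
	iommu_group_put(grp);
	return ret;
}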


>>>> +{
>>>> +	struct kvmppc_spapr_tce_table *stt = NULL;
>>>> +	bool found = false;
>>>> +	struct iommu_table *tbl = NULL;
>>>> +	struct iommu_table_group *table_group;
>>>> +	long i;
>>>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>>> +	struct fd f;
>>>> +
>>>> +	f = fdget(tablefd);
>>>> +	if (!f.file)
>>>> +		return -EBADF;
>>>> +
>>>> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
>>>> +		if (stt == f.file->private_data) {
>>>> +			found = true;
>>>> +			break;
>>>> +		}
>>>> +	}
>>>> +
>>>> +	fdput(f);
>>>> +
>>>> +	if (!found)
>>>> +		return -ENODEV;
>>>
>>> Not entirely sure if ENODEV is the right error, but I can't
>>> immediately think of a better one.


btw I still do not have a better candidate, can I keep this as is?


>>>
>>>> +	table_group = iommu_group_get_iommudata(grp);
>>>> +	if (!table_group)
>>>> +		return -EFAULT;
>>>
>>> EFAULT is usually only returned when you pass a syscall a bad pointer,
>>> which doesn't look to be the case here.  What situation does this
>>> error path actually represent?
>>
>>
>> "something went terribly wrong".
> 
> In that case there should be a WARN_ON().  As long as it's something
> terribly wrong that can't be the user's fault.
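
A minimal sketch of the pattern being asked for (illustration only):

	table_group = iommu_group_get_iommudata(grp);
	if (WARN_ON(!table_group))	/* "can't happen" - make it loud */
		return -EFAULT;
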
> 
>>>> +
>>>> +	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
>>>> +		struct iommu_table *tbltmp = table_group->tables[i];
>>>> +
>>>> +		if (!tbltmp)
>>>> +			continue;
>>>> +
>>>> +		if ((tbltmp->it_page_shift == stt->page_shift) &&
>>>> +				(tbltmp->it_offset == stt->offset)) {
>>>> +			tbl = tbltmp;
>>>> +			break;
>>>> +		}
>>>> +	}
>>>> +	if (!tbl)
>>>> +		return -ENODEV;
>>>> +
>>>> +	iommu_table_get(tbl);
>>>> +
>>>> +	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
>>>> +	stit->tbl = tbl;
>>>> +	stit->group = group;
>>>> +
>>>> +	list_add_rcu(&stit->next, &stt->iommu_tables);
>>>
>>> Won't this add a separate stit entry for each group attached to the
>>> LIOBN, even if those groups share a single hardware iommu table -
>>> which is the likely case if those groups have all been put into the
>>> same container.
>>
>>
>> Correct. I am planning on optimizing this later.
> 
> Hmm, ok.
> 
>>>> +	return 0;
>>>> +}
>>>> +
>>>>  static void release_spapr_tce_table(struct rcu_head *head)
>>>>  {
>>>>  	struct kvmppc_spapr_tce_table *stt = container_of(head,
>>>> @@ -132,6 +243,8 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
>>>>  
>>>>  	list_del_rcu(&stt->list);
>>>>  
>>>> +	kvm_spapr_tce_liobn_release_iommu_group(stt, NULL /* release all */);
>>>> +
>>>>  	kvm_put_kvm(stt->kvm);
>>>>  
>>>>  	kvmppc_account_memlimit(
>>>> @@ -181,6 +294,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>>>  	stt->offset = args->offset;
>>>>  	stt->size = size;
>>>>  	stt->kvm = kvm;
>>>> +	INIT_LIST_HEAD_RCU(&stt->iommu_tables);
>>>>  
>>>>  	for (i = 0; i < npages; i++) {
>>>>  		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
>>>> @@ -209,11 +323,161 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>>>  	return ret;
>>>>  }
>>>>  
>>>> +static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
>>>> +		struct iommu_table *tbl, unsigned long entry)
>>>> +{
>>>> +	struct mm_iommu_table_group_mem_t *mem = NULL;
>>>> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
>>>> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>>>> +
>>>> +	if (!pua)
>>>> +		return H_HARDWARE;
>>>> +
>>>> +	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
>>>> +	if (!mem)
>>>> +		return H_HARDWARE;
>>>
>>> IIUC, this error represents trying to unmap a page from the vIOMMU,
>>> and discovering that it wasn't preregistered in the first place, which
>>> shouldn't happen.  So would a WARN_ON() make sense here as well as the
>>> H_HARDWARE.
>>>
>>>> +	mm_iommu_mapped_dec(mem);
>>>> +
>>>> +	*pua = 0;
>>>> +
>>>> +	return H_SUCCESS;
>>>> +}
>>>> +
>>>> +static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
>>>> +		struct iommu_table *tbl, unsigned long entry)
>>>> +{
>>>> +	enum dma_data_direction dir = DMA_NONE;
>>>> +	unsigned long hpa = 0;
>>>> +
>>>> +	if (iommu_tce_xchg(tbl, entry, &hpa, &dir))
>>>> +		return H_HARDWARE;
>>>> +
>>>> +	if (dir == DMA_NONE)
>>>> +		return H_SUCCESS;
>>>> +
>>>> +	return kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
>>>> +}
>>>> +
>>>> +long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
>>>> +		unsigned long entry, unsigned long gpa,
>>>> +		enum dma_data_direction dir)
>>>> +{
>>>> +	long ret;
>>>> +	unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>>>> +	struct mm_iommu_table_group_mem_t *mem;
>>>> +
>>>> +	if (!pua)
>>>> +		/* it_userspace allocation might be delayed */
>>>> +		return H_TOO_HARD;
>>>> +
>>>> +	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
>>>> +		return H_HARDWARE;
>>>
>>> This would represent the guest trying to map a mad GPA, yes?  In which
>>> case H_HARDWARE doesn't seem right.  H_PARAMETER or H_PERMISSION, maybe.
>>>
>>>> +	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
>>>> +	if (!mem)
>>>> +		return H_HARDWARE;
>>>
>>> Here H_HARDWARE seems right. IIUC this represents the guest trying to
>>> map an address which wasn't pre-registered.  That would indicate a bug
>>> in qemu, which is hardware as far as the guest is concerned.
>>>
>>>> +
>>>> +	if (mm_iommu_ua_to_hpa(mem, ua, &hpa))
>>>> +		return H_HARDWARE;
>>>
>>> Not sure what this case represents.
>>
>> Preregistered memory not being able to translate userspace address to a
>> host physical. In virtual mode it is a simple bounds checker, in real mode
>> it also includes vmalloc_to_phys() failure.
> 
> Ok.  This caller is virtual mode only, isn't it?  If we fail the
> bounds check, that sounds like a WARN_ON() + H_HARDWARE, since it
> means we've translated the GPA to an insane UA.
> 
> If the translation just fails, that sounds like an H_TOO_HARD.
> 
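
(Illustration only, a sketch of the split being suggested for this virtual mode caller:)

	/* A failed bounds check means the GPA->UA translation above produced
	 * a bogus UA - a host bug - so warn and fail hard; anything that can
	 * legitimately fail should fall back with H_TOO_HARD instead. */
	if (WARN_ON(mm_iommu_ua_to_hpa(mem, ua, &hpa)))
		return H_HARDWARE;
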
>>>> +	if (mm_iommu_mapped_inc(mem))
>>>> +		return H_HARDWARE;
>>>
>>> Or this.
>>
>> A preregistered memory area is in the process of disposal; no new mappings
>> are allowed.
> 
> Ok, again under control of qemu, so H_HARDWARE is reasonable.
> 
>>>> +	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
>>>> +	if (ret) {
>>>> +		mm_iommu_mapped_dec(mem);
>>>> +		return H_TOO_HARD;
>>>> +	}
>>>> +
>>>> +	if (dir != DMA_NONE)
>>>> +		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
>>>> +
>>>> +	*pua = ua;
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +long kvmppc_h_put_tce_iommu(struct kvm_vcpu *vcpu,
>>>> +		struct iommu_table *tbl,
>>>> +		unsigned long liobn, unsigned long ioba,
>>>> +		unsigned long tce)
>>>> +{
>>>> +	long idx, ret = H_HARDWARE;
>>>> +	const unsigned long entry = ioba >> tbl->it_page_shift;
>>>> +	const unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>>>> +	const enum dma_data_direction dir = iommu_tce_direction(tce);
>>>> +
>>>> +	/* Clear TCE */
>>>> +	if (dir == DMA_NONE) {
>>>> +		if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
>>>> +			return H_PARAMETER;
>>>> +
>>>> +		return kvmppc_tce_iommu_unmap(vcpu->kvm, tbl, entry);
>>>> +	}
>>>> +
>>>> +	/* Put TCE */
>>>> +	if (iommu_tce_put_param_check(tbl, ioba, gpa))
>>>> +		return H_PARAMETER;
>>>> +
>>>> +	idx = srcu_read_lock(&vcpu->kvm->srcu);
>>>> +	ret = kvmppc_tce_iommu_map(vcpu->kvm, tbl, entry, gpa, dir);
>>>> +	srcu_read_unlock(&vcpu->kvm->srcu, idx);
>>>> +
>>>> +	return ret;
>>>> +}
>>>> +
>>>> +static long kvmppc_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
>>>> +		struct iommu_table *tbl, unsigned long ioba,
>>>> +		u64 __user *tces, unsigned long npages)
>>>> +{
>>>> +	unsigned long i, ret, tce, gpa;
>>>> +	const unsigned long entry = ioba >> tbl->it_page_shift;
>>>> +
>>>> +	for (i = 0; i < npages; ++i) {
>>>> +		gpa = be64_to_cpu(tces[i]) & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>>>
>>> IIUC this is the virtual mode, not the real mode version.  In which
>>> case you shouldn't be accessing tces[i] (a userspace pointer) directly
>>> but should instead be using get_user().
>>>
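
(For illustration, a minimal sketch of the get_user()-based access the review asks
for; it mirrors the error handling of the emulated path further down:)

	u64 tce_raw;

	/* Fetch the guest TCE via get_user() instead of dereferencing the
	 * userspace pointer; if the access faults, fall back to user space. */
	if (get_user(tce_raw, tces + i))
		return H_TOO_HARD;

	gpa = be64_to_cpu(tce_raw) & ~(TCE_PCI_READ | TCE_PCI_WRITE);
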
>>>> +		if (iommu_tce_put_param_check(tbl, ioba +
>>>> +				(i << tbl->it_page_shift), gpa))
>>>> +			return H_PARAMETER;
>>>> +	}
>>>> +
>>>> +	for (i = 0; i < npages; ++i) {
>>>> +		tce = be64_to_cpu(tces[i]);
>>>> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>>>> +
>>>> +		ret = kvmppc_tce_iommu_map(vcpu->kvm, tbl, entry + i, gpa,
>>>> +				iommu_tce_direction(tce));
>>>> +		if (ret != H_SUCCESS)
>>>> +			return ret;
>>>> +	}
>>>> +
>>>> +	return H_SUCCESS;
>>>> +}
>>>> +
>>>> +long kvmppc_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
>>>> +		struct iommu_table *tbl,
>>>> +		unsigned long liobn, unsigned long ioba,
>>>> +		unsigned long tce_value, unsigned long npages)
>>>> +{
>>>> +	unsigned long i;
>>>> +	const unsigned long entry = ioba >> tbl->it_page_shift;
>>>> +
>>>> +	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
>>>> +		return H_PARAMETER;
>>>> +
>>>> +	for (i = 0; i < npages; ++i)
>>>> +		kvmppc_tce_iommu_unmap(vcpu->kvm, tbl, entry + i);
>>>> +
>>>> +	return H_SUCCESS;
>>>> +}
>>>> +
>>>>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>>>  		      unsigned long ioba, unsigned long tce)
>>>>  {
>>>>  	struct kvmppc_spapr_tce_table *stt;
>>>>  	long ret;
>>>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>>>  
>>>>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>>>>  	/* 	    liobn, ioba, tce); */
>>>> @@ -230,6 +494,12 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>>>  	if (ret != H_SUCCESS)
>>>>  		return ret;
>>>>  
>>>> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>>>
>>> As noted above, AFAICT there is one stit per group, rather than per
>>> backend IOMMU table, so if there are multiple groups in the same
>>> container (and therefore attached to the same LIOBN), won't this mean
>>> we duplicate this operation a bunch of times?
>>>
>>>> +		ret = kvmppc_h_put_tce_iommu(vcpu, stit->tbl, liobn, ioba, tce);
>>>> +		if (ret != H_SUCCESS)
>>>> +			return ret;
>>>> +	}
>>>> +
>>>>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
>>>>  
>>>>  	return H_SUCCESS;
>>>> @@ -245,6 +515,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>>  	unsigned long entry, ua = 0;
>>>>  	u64 __user *tces;
>>>>  	u64 tce;
>>>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>>>  
>>>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>>>>  	if (!stt)
>>>> @@ -272,6 +543,13 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>>  	}
>>>>  	tces = (u64 __user *) ua;
>>>>  
>>>> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>>>> +		ret = kvmppc_h_put_tce_indirect_iommu(vcpu,
>>>> +				stit->tbl, ioba, tces, npages);
>>>> +		if (ret != H_SUCCESS)
>>>> +			goto unlock_exit;
>>>
>>> Hmm, I don't suppose you could simplify things by not having a
>>> put_tce_indirect() version of the whole backend iommu mapping
>>> function, but just a single-TCE version, and instead looping across
>>> the backend IOMMU tables as you put each indirect entry in.
>>>
>>>> +	}
>>>> +
>>>>  	for (i = 0; i < npages; ++i) {
>>>>  		if (get_user(tce, tces + i)) {
>>>>  			ret = H_TOO_HARD;
>>>> @@ -299,6 +577,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>>>>  {
>>>>  	struct kvmppc_spapr_tce_table *stt;
>>>>  	long i, ret;
>>>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>>>  
>>>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>>>>  	if (!stt)
>>>> @@ -312,6 +591,13 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>>>>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>>>>  		return H_PARAMETER;
>>>>  
>>>> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>>>> +		ret = kvmppc_h_stuff_tce_iommu(vcpu, stit->tbl, liobn, ioba,
>>>> +				tce_value, npages);
>>>> +		if (ret != H_SUCCESS)
>>>> +			return ret;
>>>> +	}
>>>> +
>>>>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>>>>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>>>>  
>>>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>> index 8a6834e6e1c8..4d6f01712a6d 100644
>>>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>> @@ -190,11 +190,165 @@ static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
>>>>  	return mm_iommu_lookup_rm(vcpu->kvm->mm, ua, size);
>>>>  }
>>>>  
>>>> +static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm_vcpu *vcpu,
>>>> +		struct iommu_table *tbl, unsigned long entry)
>>>> +{
>>>> +	struct mm_iommu_table_group_mem_t *mem = NULL;
>>>> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
>>>> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>>>> +
>>>> +	if (!pua)
>>>> +		return H_SUCCESS;
>>>
>>> What case is this?  Not being able to find the userspace entry doesn't sound
>>> like a success.
>>>
>>>> +	pua = (void *) vmalloc_to_phys(pua);
>>>> +	if (!pua)
>>>> +		return H_SUCCESS;
>>>
>>> And again..
>>>
>>>> +	mem = kvmppc_rm_iommu_lookup(vcpu, *pua, pgsize);
>>>> +	if (!mem)
>>>> +		return H_HARDWARE;
>>>
>>> Should this have a WARN_ON?
>>>
>>>> +	mm_iommu_mapped_dec(mem);
>>>> +
>>>> +	*pua = 0;
>>>> +
>>>> +	return H_SUCCESS;
>>>> +}
>>>> +
>>>> +static long kvmppc_rm_tce_iommu_unmap(struct kvm_vcpu *vcpu,
>>>> +		struct iommu_table *tbl, unsigned long entry)
>>>> +{
>>>> +	enum dma_data_direction dir = DMA_NONE;
>>>> +	unsigned long hpa = 0;
>>>> +
>>>> +	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
>>>> +		return H_HARDWARE;
>>>> +
>>>> +	if (dir == DMA_NONE)
>>>> +		return H_SUCCESS;
>>>> +
>>>> +	return kvmppc_rm_tce_iommu_mapped_dec(vcpu, tbl, entry);
>>>> +}
>>>> +
>>>> +long kvmppc_rm_tce_iommu_map(struct kvm_vcpu *vcpu, struct iommu_table *tbl,
>>>> +		unsigned long entry, unsigned long gpa,
>>>> +		enum dma_data_direction dir)
>>>> +{
>>>> +	long ret;
>>>> +	unsigned long hpa = 0, ua;
>>>> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>>>> +	struct mm_iommu_table_group_mem_t *mem;
>>>> +
>>>> +	if (!pua)
>>>> +		/* it_userspace allocation might be delayed */
>>>> +		return H_TOO_HARD;
>>>> +
>>>> +	if (kvmppc_gpa_to_ua(vcpu->kvm, gpa, &ua, NULL))
>>>> +		return H_HARDWARE;
>>>
>>> Again H_HARDWARE doesn't seem right here.
>>>
>>>> +	mem = kvmppc_rm_iommu_lookup(vcpu, ua, 1ULL << tbl->it_page_shift);
>>>> +	if (!mem)
>>>> +		return H_HARDWARE;
>>>> +
>>>> +	if (mm_iommu_ua_to_hpa_rm(mem, ua, &hpa))
>>>> +		return H_HARDWARE;
>>>> +
>>>> +	pua = (void *) vmalloc_to_phys(pua);
>>>> +	if (!pua)
>>>> +		return H_HARDWARE;
>>>> +
>>>> +	if (mm_iommu_mapped_inc(mem))
>>>> +		return H_HARDWARE;
>>>> +
>>>> +	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
>>>> +	if (ret) {
>>>> +		mm_iommu_mapped_dec(mem);
>>>> +		return H_TOO_HARD;
>>>> +	}
>>>> +
>>>> +	if (dir != DMA_NONE)
>>>> +		kvmppc_rm_tce_iommu_mapped_dec(vcpu, tbl, entry);
>>>> +
>>>> +	*pua = ua;
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(kvmppc_rm_tce_iommu_map);
>>>> +
>>>> +static long kvmppc_rm_h_put_tce_iommu(struct kvm_vcpu *vcpu,
>>>> +		struct iommu_table *tbl, unsigned long liobn,
>>>> +		unsigned long ioba, unsigned long tce)
>>>> +{
>>>> +	const unsigned long entry = ioba >> tbl->it_page_shift;
>>>> +	const unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>>>> +	const enum dma_data_direction dir = iommu_tce_direction(tce);
>>>> +
>>>> +	/* Clear TCE */
>>>> +	if (dir == DMA_NONE) {
>>>> +		if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
>>>> +			return H_PARAMETER;
>>>> +
>>>> +		return kvmppc_rm_tce_iommu_unmap(vcpu, tbl, entry);
>>>> +	}
>>>> +
>>>> +	/* Put TCE */
>>>> +	if (iommu_tce_put_param_check(tbl, ioba, gpa))
>>>> +		return H_PARAMETER;
>>>> +
>>>> +	return kvmppc_rm_tce_iommu_map(vcpu, tbl, entry, gpa, dir);
>>>> +}
>>>> +
>>>> +static long kvmppc_rm_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
>>>> +		struct iommu_table *tbl, unsigned long ioba,
>>>> +		u64 *tces, unsigned long npages)
>>>> +{
>>>> +	unsigned long i, ret, tce, gpa;
>>>> +	const unsigned long entry = ioba >> tbl->it_page_shift;
>>>> +
>>>> +	for (i = 0; i < npages; ++i) {
>>>> +		gpa = be64_to_cpu(tces[i]) & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>>>> +
>>>> +		if (iommu_tce_put_param_check(tbl, ioba +
>>>> +				(i << tbl->it_page_shift), gpa))
>>>> +			return H_PARAMETER;
>>>> +	}
>>>> +
>>>> +	for (i = 0; i < npages; ++i) {
>>>> +		tce = be64_to_cpu(tces[i]);
>>>> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>>>> +
>>>> +		ret = kvmppc_rm_tce_iommu_map(vcpu, tbl, entry + i, gpa,
>>>> +				iommu_tce_direction(tce));
>>>> +		if (ret != H_SUCCESS)
>>>> +			return ret;
>>>> +	}
>>>> +
>>>> +	return H_SUCCESS;
>>>> +}
>>>> +
>>>> +static long kvmppc_rm_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
>>>> +		struct iommu_table *tbl,
>>>> +		unsigned long liobn, unsigned long ioba,
>>>> +		unsigned long tce_value, unsigned long npages)
>>>> +{
>>>> +	unsigned long i;
>>>> +	const unsigned long entry = ioba >> tbl->it_page_shift;
>>>> +
>>>> +	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
>>>> +		return H_PARAMETER;
>>>> +
>>>> +	for (i = 0; i < npages; ++i)
>>>> +		kvmppc_rm_tce_iommu_unmap(vcpu, tbl, entry + i);
>>>> +
>>>> +	return H_SUCCESS;
>>>> +}
>>>> +
>>>>  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>>>  		unsigned long ioba, unsigned long tce)
>>>>  {
>>>>  	struct kvmppc_spapr_tce_table *stt;
>>>>  	long ret;
>>>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>>>  
>>>>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>>>>  	/* 	    liobn, ioba, tce); */
>>>> @@ -211,6 +365,13 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>>>  	if (ret != H_SUCCESS)
>>>>  		return ret;
>>>>  
>>>> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>>>> +		ret = kvmppc_rm_h_put_tce_iommu(vcpu, stit->tbl,
>>>> +				liobn, ioba, tce);
>>>> +		if (ret != H_SUCCESS)
>>>> +			return ret;
>>>> +	}
>>>> +
>>>>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
>>>>  
>>>>  	return H_SUCCESS;
>>>> @@ -278,6 +439,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>>  		 * depend on hpt.
>>>>  		 */
>>>>  		struct mm_iommu_table_group_mem_t *mem;
>>>> +		struct kvmppc_spapr_tce_iommu_table *stit;
>>>>  
>>>>  		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
>>>>  			return H_TOO_HARD;
>>>> @@ -285,6 +447,13 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>>  		mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K);
>>>>  		if (!mem || mm_iommu_ua_to_hpa_rm(mem, ua, &tces))
>>>>  			return H_TOO_HARD;
>>>> +
>>>> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>>>> +			ret = kvmppc_rm_h_put_tce_indirect_iommu(vcpu,
>>>> +					stit->tbl, ioba, (u64 *)tces, npages);
>>>> +			if (ret != H_SUCCESS)
>>>> +				return ret;
>>>> +		}
>>>>  	} else {
>>>>  		/*
>>>>  		 * This is emulated devices case.
>>>> @@ -334,6 +503,8 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>>>>  {
>>>>  	struct kvmppc_spapr_tce_table *stt;
>>>>  	long i, ret;
>>>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>>> +
>>>>  
>>>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>>>>  	if (!stt)
>>>> @@ -347,6 +518,13 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>>>>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>>>>  		return H_PARAMETER;
>>>>  
>>>> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>>>> +		ret = kvmppc_rm_h_stuff_tce_iommu(vcpu, stit->tbl,
>>>> +				liobn, ioba, tce_value, npages);
>>>> +		if (ret != H_SUCCESS)
>>>> +			return ret;
>>>> +	}
>>>> +
>>>>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>>>>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>>>>  
>>>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>>>> index 70963c845e96..0e555ba998c0 100644
>>>> --- a/arch/powerpc/kvm/powerpc.c
>>>> +++ b/arch/powerpc/kvm/powerpc.c
>>>> @@ -536,6 +536,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>>>>  #ifdef CONFIG_PPC_BOOK3S_64
>>>>  	case KVM_CAP_SPAPR_TCE:
>>>>  	case KVM_CAP_SPAPR_TCE_64:
>>>> +		/* fallthrough */
>>>> +	case KVM_CAP_SPAPR_TCE_VFIO:
>>>>  	case KVM_CAP_PPC_ALLOC_HTAB:
>>>>  	case KVM_CAP_PPC_RTAS:
>>>>  	case KVM_CAP_PPC_FIXUP_HCALL:
>>>> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
>>>> index d32f239eb471..3181054c8ff7 100644
>>>> --- a/virt/kvm/vfio.c
>>>> +++ b/virt/kvm/vfio.c
>>>> @@ -20,6 +20,10 @@
>>>>  #include <linux/vfio.h>
>>>>  #include "vfio.h"
>>>>  
>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>> +#include <asm/kvm_ppc.h>
>>>> +#endif
>>>> +
>>>>  struct kvm_vfio_group {
>>>>  	struct list_head node;
>>>>  	struct vfio_group *vfio_group;
>>>> @@ -89,6 +93,22 @@ static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
>>>>  	return ret > 0;
>>>>  }
>>>>  
>>>> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
>>>> +{
>>>> +	int (*fn)(struct vfio_group *);
>>>> +	int ret = -1;
>>>> +
>>>> +	fn = symbol_get(vfio_external_user_iommu_id);
>>>> +	if (!fn)
>>>> +		return ret;
>>>> +
>>>> +	ret = fn(vfio_group);
>>>> +
>>>> +	symbol_put(vfio_external_user_iommu_id);
>>>> +
>>>> +	return ret;
>>>> +}
>>>> +
>>>>  /*
>>>>   * Groups can use the same or different IOMMU domains.  If the same then
>>>>   * adding a new group may change the coherency of groups we've previously
>>>> @@ -211,6 +231,9 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>>>>  
>>>>  		mutex_unlock(&kv->lock);
>>>>  
>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>> +		kvm_spapr_tce_release_iommu_group(dev->kvm, vfio_group);
>>>> +#endif
>>>>  		kvm_vfio_group_set_kvm(vfio_group, NULL);
>>>>  
>>>>  		kvm_vfio_group_put_external_user(vfio_group);
>>>> @@ -218,6 +241,65 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>>>>  		kvm_vfio_update_coherency(dev);
>>>>  
>>>>  		return ret;
>>>> +
>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
>>>> +		struct kvm_vfio_spapr_tce param;
>>>> +		unsigned long minsz;
>>>> +		struct kvm_vfio *kv = dev->private;
>>>> +		struct vfio_group *vfio_group;
>>>> +		struct kvm_vfio_group *kvg;
>>>> +		struct fd f;
>>>> +
>>>> +		minsz = offsetofend(struct kvm_vfio_spapr_tce, tablefd);
>>>> +
>>>> +		if (copy_from_user(&param, (void __user *)arg, minsz))
>>>> +			return -EFAULT;
>>>> +
>>>> +		if (param.argsz < minsz || param.flags)
>>>> +			return -EINVAL;
>>>> +
>>>> +		f = fdget(param.groupfd);
>>>> +		if (!f.file)
>>>> +			return -EBADF;
>>>> +
>>>> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
>>>> +		fdput(f);
>>>> +
>>>> +		if (IS_ERR(vfio_group))
>>>> +			return PTR_ERR(vfio_group);
>>>> +
>>>> +		ret = -ENOENT;
>>>> +
>>>> +		mutex_lock(&kv->lock);
>>>> +
>>>> +		list_for_each_entry(kvg, &kv->group_list, node) {
>>>> +			int group_id;
>>>> +			struct iommu_group *grp;
>>>> +
>>>> +			if (kvg->vfio_group != vfio_group)
>>>> +				continue;
>>>> +
>>>> +			group_id = kvm_vfio_external_user_iommu_id(
>>>> +					kvg->vfio_group);
>>>> +			grp = iommu_group_get_by_id(group_id);
>>>> +			if (!grp) {
>>>> +				ret = -EFAULT;
>>>> +				break;
>>>> +			}
>>>> +
>>>> +			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
>>>> +					param.tablefd, vfio_group, grp);
>>>> +
>>>> +			iommu_group_put(grp);
>>>> +			break;
>>>> +		}
>>>> +
>>>> +		mutex_unlock(&kv->lock);
>>>> +
>>>> +		return ret;
>>>> +	}
>>>> +#endif /* CONFIG_SPAPR_TCE_IOMMU */
>>>>  	}
>>>
>>> Don't you also need to add something to the KVM_DEV_VFIO_GROUP_DEL
>>> path to detach the group from all LIOBNs?  Or else just fail if
>>> there are LIOBNs attached.  I think it would be a qemu bug not to
>>> detach the LIOBNs before removing the group, but we still need to
>>> protect the host in that case.
>>
>>
>> Yeah, this bit is a bit tricky/ugly.
>>
>> kvm_spapr_tce_liobn_release_iommu_group() (called from
>> kvm_spapr_tce_fops::release()) drops references when a group is removed
>> from the VFIO KVM device so there is no KVM_DEV_VFIO_GROUP_UNSET_SPAPR_TCE
>> and no action from QEMU is required.
> 
> IF qemu simply closes the group fd.  Which it does now, but might not
> always.  You still need to deal with the case where userspace does a
> KVM_DEV_VFIO_GROUP_DEL instead of closing the group fd.


This patch adds kvm_spapr_tce_release_iommu_group() to the
KVM_DEV_VFIO_GROUP_DEL handler.
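
For reference, the ordering in the DEL path (paraphrased from the hunk quoted above):

#ifdef CONFIG_SPAPR_TCE_IOMMU
	kvm_spapr_tce_release_iommu_group(dev->kvm, vfio_group);
#endif
	kvm_vfio_group_set_kvm(vfio_group, NULL);
	kvm_vfio_group_put_external_user(vfio_group);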



> 
>>
>>
>>
>>>>  
>>>>  	return -ENXIO;
>>>> @@ -242,6 +324,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
>>>>  		switch (attr->attr) {
>>>>  		case KVM_DEV_VFIO_GROUP_ADD:
>>>>  		case KVM_DEV_VFIO_GROUP_DEL:
>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>> +		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
>>>> +#endif
>>>>  			return 0;
>>>>  		}
>>>>  
>>>> @@ -257,6 +342,9 @@ static void kvm_vfio_destroy(struct kvm_device *dev)
>>>>  	struct kvm_vfio_group *kvg, *tmp;
>>>>  
>>>>  	list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>> +		kvm_spapr_tce_release_iommu_group(dev->kvm, kvg->vfio_group);
>>>> +#endif
>>>>  		kvm_vfio_group_set_kvm(kvg->vfio_group, NULL);
>>>>  		kvm_vfio_group_put_external_user(kvg->vfio_group);
>>>>  		list_del(&kvg->node);
>>>
>>
>>
> 
> 
> 
> 


-- 
Alexey



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH kernel v3] KVM: PPC: Add in-kernel acceleration for VFIO
@ 2017-01-13  2:23             ` Alexey Kardashevskiy
  0 siblings, 0 replies; 56+ messages in thread
From: Alexey Kardashevskiy @ 2017-01-13  2:23 UTC (permalink / raw)
  To: David Gibson; +Cc: linuxppc-dev, Paul Mackerras, kvm-ppc, kvm, Alex Williamson



On 13/01/17 10:53, David Gibson wrote:
> On Thu, Jan 12, 2017 at 07:09:01PM +1100, Alexey Kardashevskiy wrote:
>> On 12/01/17 16:04, David Gibson wrote:
>>> On Tue, Dec 20, 2016 at 05:52:29PM +1100, Alexey Kardashevskiy wrote:
>>>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
>>>> and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
>>>> without passing them to user space which saves time on switching
>>>> to user space and back.
>>>>
>>>> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
>>>> KVM tries to handle a TCE request in real mode; if that fails,
>>>> it passes the request to the virtual mode handler to complete the operation.
>>>> If the virtual mode handler fails as well, the request is passed to
>>>> user space; this is not expected to happen though.
>>>>
>>>> To avoid dealing with page use counters (which is tricky in real mode),
>>>> this only accelerates SPAPR TCE IOMMU v2 clients which are required
>>>> to pre-register the userspace memory. The very first TCE request will
>>>> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
>>>> of the TCE table (iommu_table::it_userspace) is not allocated till
>>>> the very first mapping happens and we cannot call vmalloc in real mode.
>>>>
>>>> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
>>>> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
>>>> and associates a physical IOMMU table with the SPAPR TCE table (which
>>>> is a guest view of the hardware IOMMU table). The iommu_table object
>>>> is cached and referenced so we do not have to look up for it in real mode.
>>>>
>>>> This does not implement the UNSET counterpart as there is no use for it -
>>>> once the acceleration is enabled, the existing userspace won't
>>>> disable it unless a VFIO container is destroyed; this adds necessary
>>>> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
>>>>
>>>> As this creates a descriptor per IOMMU table-LIOBN couple (called
>>>> kvmppc_spapr_tce_iommu_table), it is possible to have several
>>>> descriptors with the same iommu_table (hardware IOMMU table) attached
>>>> to the same LIOBN, this is done to simplify the cleanup and can be
>>>> improved later.
>>>>
>>>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
>>>> space.
>>>>
>>>> This finally makes use of vfio_external_user_iommu_id() which was
>>>> introduced quite some time ago and was considered for removal.
>>>>
>>>> Tests show that this patch increases transmission speed from 220MB/s
>>>> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> ---
>>>> Changes:
>>>> v3:
>>>> * simplified not to use VFIO group notifiers
>>>> * reworked cleanup, should be cleaner/simpler now
>>>>
>>>> v2:
>>>> * reworked to use new VFIO notifiers
>>>> * now same iommu_table may appear in the list several times, to be fixed later
>>>> ---
>>>>
>>>> This obsoletes:
>>>>
>>>> [PATCH kernel v2 08/11] KVM: PPC: Pass kvm* to kvmppc_find_table()
>>>> [PATCH kernel v2 09/11] vfio iommu: Add helpers to (un)register blocking notifiers per group
>>>> [PATCH kernel v2 11/11] KVM: PPC: Add in-kernel acceleration for VFIO
>>>>
>>>>
>>>> So I have not reposted the whole thing, should I have?
>>>>
>>>>
>>>> btw "F:     virt/kvm/vfio.*"  is missing MAINTAINERS.
>>>>
>>>>
>>>> ---
>>>>  Documentation/virtual/kvm/devices/vfio.txt |  22 ++-
>>>>  arch/powerpc/include/asm/kvm_host.h        |   8 +
>>>>  arch/powerpc/include/asm/kvm_ppc.h         |   4 +
>>>>  include/uapi/linux/kvm.h                   |   8 +
>>>>  arch/powerpc/kvm/book3s_64_vio.c           | 286 +++++++++++++++++++++++++++++
>>>>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 178 ++++++++++++++++++
>>>>  arch/powerpc/kvm/powerpc.c                 |   2 +
>>>>  virt/kvm/vfio.c                            |  88 +++++++++
>>>>  8 files changed, 594 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
>>>> index ef51740c67ca..f95d867168ea 100644
>>>> --- a/Documentation/virtual/kvm/devices/vfio.txt
>>>> +++ b/Documentation/virtual/kvm/devices/vfio.txt
>>>> @@ -16,7 +16,25 @@ Groups:
>>>>  
>>>>  KVM_DEV_VFIO_GROUP attributes:
>>>>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
>>>> +	kvm_device_attr.addr points to an int32_t file descriptor
>>>> +	for the VFIO group.
>>>>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
>>>> +	kvm_device_attr.addr points to an int32_t file descriptor
>>>> +	for the VFIO group.
>>>> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
>>>> +	allocated by sPAPR KVM.
>>>> +	kvm_device_attr.addr points to a struct:
>>>>  
>>>> -For each, kvm_device_attr.addr points to an int32_t file descriptor
>>>> -for the VFIO group.
>>>> +	struct kvm_vfio_spapr_tce {
>>>> +		__u32	argsz;
>>>> +		__u32	flags;
>>>> +		__s32	groupfd;
>>>> +		__s32	tablefd;
>>>> +	};
>>>> +
>>>> +	where
>>>> +	@argsz is the size of kvm_vfio_spapr_tce_liobn;
>>>> +	@flags are not supported now, must be zero;
>>>> +	@groupfd is a file descriptor for a VFIO group;
>>>> +	@tablefd is a file descriptor for a TCE table allocated via
>>>> +		KVM_CREATE_SPAPR_TCE.
>>>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
>>>> index 28350a294b1e..3d281b7ea369 100644
>>>> --- a/arch/powerpc/include/asm/kvm_host.h
>>>> +++ b/arch/powerpc/include/asm/kvm_host.h
>>>> @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
>>>>  	atomic_t refcnt;
>>>>  };
>>>>  
>>>> +struct kvmppc_spapr_tce_iommu_table {
>>>> +	struct rcu_head rcu;
>>>> +	struct list_head next;
>>>> +	struct vfio_group *group;
>>>> +	struct iommu_table *tbl;
>>>> +};
>>>> +
>>>>  struct kvmppc_spapr_tce_table {
>>>>  	struct list_head list;
>>>>  	struct kvm *kvm;
>>>> @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
>>>>  	u32 page_shift;
>>>>  	u64 offset;		/* in pages */
>>>>  	u64 size;		/* window size in pages */
>>>> +	struct list_head iommu_tables;
>>>>  	struct page *pages[0];
>>>>  };
>>>>  
>>>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
>>>> index 0a21c8503974..936138b866e7 100644
>>>> --- a/arch/powerpc/include/asm/kvm_ppc.h
>>>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
>>>> @@ -163,6 +163,10 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
>>>>  extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
>>>>  			struct kvm_memory_slot *memslot, unsigned long porder);
>>>>  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
>>>> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
>>>> +		struct vfio_group *group, struct iommu_group *grp);
>>>> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
>>>> +		struct vfio_group *group);
>>>>  
>>>>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>>>  				struct kvm_create_spapr_tce_64 *args);
>>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>>>> index 810f74317987..4088da4a575f 100644
>>>> --- a/include/uapi/linux/kvm.h
>>>> +++ b/include/uapi/linux/kvm.h
>>>> @@ -1068,6 +1068,7 @@ struct kvm_device_attr {
>>>>  #define  KVM_DEV_VFIO_GROUP			1
>>>>  #define   KVM_DEV_VFIO_GROUP_ADD			1
>>>>  #define   KVM_DEV_VFIO_GROUP_DEL			2
>>>> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
>>>>  
>>>>  enum kvm_device_type {
>>>>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
>>>> @@ -1089,6 +1090,13 @@ enum kvm_device_type {
>>>>  	KVM_DEV_TYPE_MAX,
>>>>  };
>>>>  
>>>> +struct kvm_vfio_spapr_tce {
>>>> +	__u32	argsz;
>>>> +	__u32	flags;
>>>> +	__s32	groupfd;
>>>> +	__s32	tablefd;
>>>> +};
>>>> +
>>>>  /*
>>>>   * ioctls for VM fds
>>>>   */
>>>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
>>>> index 15df8ae627d9..008c4aee4df6 100644
>>>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>>>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>>>> @@ -27,6 +27,10 @@
>>>>  #include <linux/hugetlb.h>
>>>>  #include <linux/list.h>
>>>>  #include <linux/anon_inodes.h>
>>>> +#include <linux/iommu.h>
>>>> +#include <linux/file.h>
>>>> +#include <linux/vfio.h>
>>>> +#include <linux/module.h>
>>>>  
>>>>  #include <asm/tlbflush.h>
>>>>  #include <asm/kvm_ppc.h>
>>>> @@ -39,6 +43,20 @@
>>>>  #include <asm/udbg.h>
>>>>  #include <asm/iommu.h>
>>>>  #include <asm/tce.h>
>>>> +#include <asm/mmu_context.h>
>>>> +
>>>> +static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
>>>> +{
>>>> +	void (*fn)(struct vfio_group *);
>>>> +
>>>> +	fn = symbol_get(vfio_group_put_external_user);
>>>> +	if (!fn)
>>>
>>> I think this should have a WARN_ON().  If the vfio module is gone
>>> while you still have VFIO groups attached to a KVM table, something
>>> has gone horribly wrong.
>>>
>>>> +		return;
>>>> +
>>>> +	fn(vfio_group);
>>>> +
>>>> +	symbol_put(vfio_group_put_external_user);
>>>> +}
>>>>  
>>>>  static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
>>>>  {
>>>> @@ -90,6 +108,99 @@ static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
>>>>  	return ret;
>>>>  }
>>>>  
>>>> +static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
>>>> +{
>>>> +	struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
>>>> +			struct kvmppc_spapr_tce_iommu_table, rcu);
>>>> +
>>>> +	kfree(stit);
>>>> +}
>>>> +
>>>> +static void kvm_spapr_tce_liobn_release_iommu_group(
>>>> +		struct kvmppc_spapr_tce_table *stt,
>>>> +		struct vfio_group *group)
>>>> +{
>>>> +	struct kvmppc_spapr_tce_iommu_table *stit, *tmp;
>>>> +
>>>> +	list_for_each_entry_safe(stit, tmp, &stt->iommu_tables, next) {
>>>> +		if (group && (stit->group != group))
>>>> +			continue;
>>>> +
>>>> +		list_del_rcu(&stit->next);
>>>> +
>>>> +		iommu_table_put(stit->tbl);
>>>> +		kvm_vfio_group_put_external_user(stit->group);
>>>> +
>>>> +		call_rcu(&stit->rcu, kvm_spapr_tce_iommu_table_free);
>>>> +	}
>>>> +}
>>>> +
>>>> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
>>>> +		struct vfio_group *group)
>>>> +{
>>>> +	struct kvmppc_spapr_tce_table *stt;
>>>> +
>>>> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list)
>>>> +		kvm_spapr_tce_liobn_release_iommu_group(stt, group);
>>>> +}
>>>> +
>>>> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
>>>> +		struct vfio_group *group, struct iommu_group *grp)
>>>
>>> Isn't passing both the vfio_group and the iommu_group redundant?
>>
>> vfio_group struct is internal to the vfio driver and there is no API to get
>> the iommu_group pointer from it.
> 
> But in the caller you *do* derive the group from the vfio group by
> going via id (ugly, but workable I guess).  Why not fold that logic
> into this function.


I could; either way looks equally ugly-ish to me. Should I rework it?


>>>> +{
>>>> +	struct kvmppc_spapr_tce_table *stt = NULL;
>>>> +	bool found = false;
>>>> +	struct iommu_table *tbl = NULL;
>>>> +	struct iommu_table_group *table_group;
>>>> +	long i;
>>>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>>> +	struct fd f;
>>>> +
>>>> +	f = fdget(tablefd);
>>>> +	if (!f.file)
>>>> +		return -EBADF;
>>>> +
>>>> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
>>>> +		if (stt == f.file->private_data) {
>>>> +			found = true;
>>>> +			break;
>>>> +		}
>>>> +	}
>>>> +
>>>> +	fdput(f);
>>>> +
>>>> +	if (!found)
>>>> +		return -ENODEV;
>>>
>>> Not entirely sure if ENODEV is the right error, but I can't
>>> immediately think of a better one.


btw I still do not have a better candidate, can I keep this as is?


>>>
>>>> +	table_group = iommu_group_get_iommudata(grp);
>>>> +	if (!table_group)
>>>> +		return -EFAULT;
>>>
>>> EFAULT is usually only returned when you pass a syscall a bad pointer,
>>> which doesn't look to be the case here.  What situation does this
>>> error path actually represent?
>>
>>
>> "something went terribly wrong".
> 
> In that case there should be a WARN_ON().  As long as it's something
> terribly wrong that can't be the user's fault.
> 
>>>> +
>>>> +	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
>>>> +		struct iommu_table *tbltmp = table_group->tables[i];
>>>> +
>>>> +		if (!tbltmp)
>>>> +			continue;
>>>> +
>>>> +		if ((tbltmp->it_page_shift == stt->page_shift) &&
>>>> +				(tbltmp->it_offset == stt->offset)) {
>>>> +			tbl = tbltmp;
>>>> +			break;
>>>> +		}
>>>> +	}
>>>> +	if (!tbl)
>>>> +		return -ENODEV;
>>>> +
>>>> +	iommu_table_get(tbl);
>>>> +
>>>> +	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
>>>> +	stit->tbl = tbl;
>>>> +	stit->group = group;
>>>> +
>>>> +	list_add_rcu(&stit->next, &stt->iommu_tables);
>>>
>>> Won't this add a separate stit entry for each group attached to the
>>> LIOBN, even if those groups share a single hardware iommu table -
>>> which is the likely case if those groups have all been put into the
>>> same container.
>>
>>
>> Correct. I am planning on optimizing this later.
> 
> Hmm, ok.
> 
>>>> +	return 0;
>>>> +}
>>>> +
>>>>  static void release_spapr_tce_table(struct rcu_head *head)
>>>>  {
>>>>  	struct kvmppc_spapr_tce_table *stt = container_of(head,
>>>> @@ -132,6 +243,8 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
>>>>  
>>>>  	list_del_rcu(&stt->list);
>>>>  
>>>> +	kvm_spapr_tce_liobn_release_iommu_group(stt, NULL /* release all */);
>>>> +
>>>>  	kvm_put_kvm(stt->kvm);
>>>>  
>>>>  	kvmppc_account_memlimit(
>>>> @@ -181,6 +294,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>>>  	stt->offset = args->offset;
>>>>  	stt->size = size;
>>>>  	stt->kvm = kvm;
>>>> +	INIT_LIST_HEAD_RCU(&stt->iommu_tables);
>>>>  
>>>>  	for (i = 0; i < npages; i++) {
>>>>  		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
>>>> @@ -209,11 +323,161 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>>>  	return ret;
>>>>  }
>>>>  
>>>> +static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
>>>> +		struct iommu_table *tbl, unsigned long entry)
>>>> +{
>>>> +	struct mm_iommu_table_group_mem_t *mem = NULL;
>>>> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
>>>> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>>>> +
>>>> +	if (!pua)
>>>> +		return H_HARDWARE;
>>>> +
>>>> +	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
>>>> +	if (!mem)
>>>> +		return H_HARDWARE;
>>>
>>> IIUC, this error represents trying to unmap a page from the vIOMMU,
>>> and discovering that it wasn't preregistered in the first place, which
>>> shouldn't happen.  So would a WARN_ON() make sense here as well as the
>>> H_HARDWARE.
>>>
>>>> +	mm_iommu_mapped_dec(mem);
>>>> +
>>>> +	*pua = 0;
>>>> +
>>>> +	return H_SUCCESS;
>>>> +}
>>>> +
>>>> +static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
>>>> +		struct iommu_table *tbl, unsigned long entry)
>>>> +{
>>>> +	enum dma_data_direction dir = DMA_NONE;
>>>> +	unsigned long hpa = 0;
>>>> +
>>>> +	if (iommu_tce_xchg(tbl, entry, &hpa, &dir))
>>>> +		return H_HARDWARE;
>>>> +
>>>> +	if (dir == DMA_NONE)
>>>> +		return H_SUCCESS;
>>>> +
>>>> +	return kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
>>>> +}
>>>> +
>>>> +long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
>>>> +		unsigned long entry, unsigned long gpa,
>>>> +		enum dma_data_direction dir)
>>>> +{
>>>> +	long ret;
>>>> +	unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>>>> +	struct mm_iommu_table_group_mem_t *mem;
>>>> +
>>>> +	if (!pua)
>>>> +		/* it_userspace allocation might be delayed */
>>>> +		return H_TOO_HARD;
>>>> +
>>>> +	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
>>>> +		return H_HARDWARE;
>>>
>>> This would represent the guest trying to map a mad GPA, yes?  In which
>>> case H_HARDWARE doesn't seem right.  H_PARAMETER or H_PERMISSION, maybe.
>>>
>>>> +	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
>>>> +	if (!mem)
>>>> +		return H_HARDWARE;
>>>
>>> Here H_HARDWARE seems right. IIUC this represents the guest trying to
>>> map an address which wasn't pre-registered.  That would indicate a bug
>>> in qemu, which is hardware as far as the guest is concerned.
>>>
>>>> +
>>>> +	if (mm_iommu_ua_to_hpa(mem, ua, &hpa))
>>>> +		return H_HARDWARE;
>>>
>>> Not sure what this case represents.
>>
>> Preregistered memory not being able to translate userspace address to a
>> host physical. In virtual mode it is a simple bounds checker, in real mode
>> it also includes vmalloc_to_phys() failure.
> 
> Ok.  This caller is virtual mode only, isn't it?  If we fail the
> bounds check, that sounds like a WARN_ON() + H_HARDWARE, since it
> means we've translated the GPA to an insane UA.
> 
> If the translation just fails, that sounds like an H_TOO_HARD.
> 
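
For the virtual mode path that would look something like this (sketch only,
assuming mm_iommu_ua_to_hpa() really is just a bounds check there):

	/* Should not fail if the GPA->UA translation above was sane */
	if (WARN_ON_ONCE(mm_iommu_ua_to_hpa(mem, ua, &hpa)))
		return H_HARDWARE;
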
>>>> +	if (mm_iommu_mapped_inc(mem))
>>>> +		return H_HARDWARE;
>>>
>>> Or this.
>>
>> A preregistered memory area is in a process of disposal, no new mappings
>> are allowed.
> 
> Ok, again under control of qemu, so H_HARDWARE is reasonable.
> 
>>>> +	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
>>>> +	if (ret) {
>>>> +		mm_iommu_mapped_dec(mem);
>>>> +		return H_TOO_HARD;
>>>> +	}
>>>> +
>>>> +	if (dir != DMA_NONE)
>>>> +		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
>>>> +
>>>> +	*pua = ua;
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +long kvmppc_h_put_tce_iommu(struct kvm_vcpu *vcpu,
>>>> +		struct iommu_table *tbl,
>>>> +		unsigned long liobn, unsigned long ioba,
>>>> +		unsigned long tce)
>>>> +{
>>>> +	long idx, ret = H_HARDWARE;
>>>> +	const unsigned long entry = ioba >> tbl->it_page_shift;
>>>> +	const unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>>>> +	const enum dma_data_direction dir = iommu_tce_direction(tce);
>>>> +
>>>> +	/* Clear TCE */
>>>> +	if (dir == DMA_NONE) {
>>>> +		if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
>>>> +			return H_PARAMETER;
>>>> +
>>>> +		return kvmppc_tce_iommu_unmap(vcpu->kvm, tbl, entry);
>>>> +	}
>>>> +
>>>> +	/* Put TCE */
>>>> +	if (iommu_tce_put_param_check(tbl, ioba, gpa))
>>>> +		return H_PARAMETER;
>>>> +
>>>> +	idx = srcu_read_lock(&vcpu->kvm->srcu);
>>>> +	ret = kvmppc_tce_iommu_map(vcpu->kvm, tbl, entry, gpa, dir);
>>>> +	srcu_read_unlock(&vcpu->kvm->srcu, idx);
>>>> +
>>>> +	return ret;
>>>> +}
>>>> +
>>>> +static long kvmppc_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
>>>> +		struct iommu_table *tbl, unsigned long ioba,
>>>> +		u64 __user *tces, unsigned long npages)
>>>> +{
>>>> +	unsigned long i, ret, tce, gpa;
>>>> +	const unsigned long entry = ioba >> tbl->it_page_shift;
>>>> +
>>>> +	for (i = 0; i < npages; ++i) {
>>>> +		gpa = be64_to_cpu(tces[i]) & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>>>
>>> IIUC this is the virtual mode, not the real mode version.  In which
>>> case you shouldn't be accessing tces[i] (a userspace pointer) directly
>>> but should instead be using get_user().
>>>
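
A get_user() variant of that check loop could look like this (untested
sketch, mirroring what kvmppc_h_put_tce_indirect() below already does):

	u64 tce;

	for (i = 0; i < npages; ++i) {
		if (get_user(tce, tces + i))
			return H_TOO_HARD;

		gpa = be64_to_cpu(tce) & ~(TCE_PCI_READ | TCE_PCI_WRITE);
		if (iommu_tce_put_param_check(tbl, ioba +
				(i << tbl->it_page_shift), gpa))
			return H_PARAMETER;
	}
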
>>>> +		if (iommu_tce_put_param_check(tbl, ioba +
>>>> +				(i << tbl->it_page_shift), gpa))
>>>> +			return H_PARAMETER;
>>>> +	}
>>>> +
>>>> +	for (i = 0; i < npages; ++i) {
>>>> +		tce = be64_to_cpu(tces[i]);
>>>> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>>>> +
>>>> +		ret = kvmppc_tce_iommu_map(vcpu->kvm, tbl, entry + i, gpa,
>>>> +				iommu_tce_direction(tce));
>>>> +		if (ret != H_SUCCESS)
>>>> +			return ret;
>>>> +	}
>>>> +
>>>> +	return H_SUCCESS;
>>>> +}
>>>> +
>>>> +long kvmppc_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
>>>> +		struct iommu_table *tbl,
>>>> +		unsigned long liobn, unsigned long ioba,
>>>> +		unsigned long tce_value, unsigned long npages)
>>>> +{
>>>> +	unsigned long i;
>>>> +	const unsigned long entry = ioba >> tbl->it_page_shift;
>>>> +
>>>> +	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
>>>> +		return H_PARAMETER;
>>>> +
>>>> +	for (i = 0; i < npages; ++i)
>>>> +		kvmppc_tce_iommu_unmap(vcpu->kvm, tbl, entry + i);
>>>> +
>>>> +	return H_SUCCESS;
>>>> +}
>>>> +
>>>>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>>>  		      unsigned long ioba, unsigned long tce)
>>>>  {
>>>>  	struct kvmppc_spapr_tce_table *stt;
>>>>  	long ret;
>>>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>>>  
>>>>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>>>>  	/* 	    liobn, ioba, tce); */
>>>> @@ -230,6 +494,12 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>>>  	if (ret != H_SUCCESS)
>>>>  		return ret;
>>>>  
>>>> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>>>
>>> As noted above, AFAICT there is one stit per group, rather than per
>>> backend IOMMU table, so if there are multiple groups in the same
>>> container (and therefore attached to the same LIOBN), won't this mean
>>> we duplicate this operation a bunch of times?
>>>
>>>> +		ret = kvmppc_h_put_tce_iommu(vcpu, stit->tbl, liobn, ioba, tce);
>>>> +		if (ret != H_SUCCESS)
>>>> +			return ret;
>>>> +	}
>>>> +
>>>>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
>>>>  
>>>>  	return H_SUCCESS;
>>>> @@ -245,6 +515,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>>  	unsigned long entry, ua = 0;
>>>>  	u64 __user *tces;
>>>>  	u64 tce;
>>>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>>>  
>>>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>>>>  	if (!stt)
>>>> @@ -272,6 +543,13 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>>  	}
>>>>  	tces = (u64 __user *) ua;
>>>>  
>>>> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>>>> +		ret = kvmppc_h_put_tce_indirect_iommu(vcpu,
>>>> +				stit->tbl, ioba, tces, npages);
>>>> +		if (ret != H_SUCCESS)
>>>> +			goto unlock_exit;
>>>
>>> Hmm, I don't suppose you could simplify things by not having a
>>> put_tce_indirect() version of the whole backend iommu mapping
>>> function, but just a single-TCE version, and instead looping across
>>> the backend IOMMU tables as you put each indirect entry in .
>>>
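
Something along these lines, I guess (rough sketch only; gpa would be a new
local in kvmppc_h_put_tce_indirect() and the put_param_check would still
have to happen somewhere):

	for (i = 0; i < npages; ++i) {
		if (get_user(tce, tces + i)) {
			ret = H_TOO_HARD;
			goto unlock_exit;
		}
		tce = be64_to_cpu(tce);
		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);

		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
			ret = kvmppc_tce_iommu_map(vcpu->kvm, stit->tbl,
					entry + i, gpa,
					iommu_tce_direction(tce));
			if (ret != H_SUCCESS)
				goto unlock_exit;
		}

		kvmppc_tce_put(stt, entry + i, tce);
	}
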
>>>> +	}
>>>> +
>>>>  	for (i = 0; i < npages; ++i) {
>>>>  		if (get_user(tce, tces + i)) {
>>>>  			ret = H_TOO_HARD;
>>>> @@ -299,6 +577,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>>>>  {
>>>>  	struct kvmppc_spapr_tce_table *stt;
>>>>  	long i, ret;
>>>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>>>  
>>>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>>>>  	if (!stt)
>>>> @@ -312,6 +591,13 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>>>>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>>>>  		return H_PARAMETER;
>>>>  
>>>> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>>>> +		ret = kvmppc_h_stuff_tce_iommu(vcpu, stit->tbl, liobn, ioba,
>>>> +				tce_value, npages);
>>>> +		if (ret != H_SUCCESS)
>>>> +			return ret;
>>>> +	}
>>>> +
>>>>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>>>>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>>>>  
>>>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>> index 8a6834e6e1c8..4d6f01712a6d 100644
>>>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>> @@ -190,11 +190,165 @@ static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
>>>>  	return mm_iommu_lookup_rm(vcpu->kvm->mm, ua, size);
>>>>  }
>>>>  
>>>> +static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm_vcpu *vcpu,
>>>> +		struct iommu_table *tbl, unsigned long entry)
>>>> +{
>>>> +	struct mm_iommu_table_group_mem_t *mem = NULL;
>>>> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
>>>> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>>>> +
>>>> +	if (!pua)
>>>> +		return H_SUCCESS;
>>>
>>> What case is this?  Not being able to find the userspace address
>>> doesn't sound like a success.
>>>
>>>> +	pua = (void *) vmalloc_to_phys(pua);
>>>> +	if (!pua)
>>>> +		return H_SUCCESS;
>>>
>>> And again..
>>>
>>>> +	mem = kvmppc_rm_iommu_lookup(vcpu, *pua, pgsize);
>>>> +	if (!mem)
>>>> +		return H_HARDWARE;
>>>
>>> Should this have a WARN_ON?
>>>
>>>> +	mm_iommu_mapped_dec(mem);
>>>> +
>>>> +	*pua = 0;
>>>> +
>>>> +	return H_SUCCESS;
>>>> +}
>>>> +
>>>> +static long kvmppc_rm_tce_iommu_unmap(struct kvm_vcpu *vcpu,
>>>> +		struct iommu_table *tbl, unsigned long entry)
>>>> +{
>>>> +	enum dma_data_direction dir = DMA_NONE;
>>>> +	unsigned long hpa = 0;
>>>> +
>>>> +	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
>>>> +		return H_HARDWARE;
>>>> +
>>>> +	if (dir == DMA_NONE)
>>>> +		return H_SUCCESS;
>>>> +
>>>> +	return kvmppc_rm_tce_iommu_mapped_dec(vcpu, tbl, entry);
>>>> +}
>>>> +
>>>> +long kvmppc_rm_tce_iommu_map(struct kvm_vcpu *vcpu, struct iommu_table *tbl,
>>>> +		unsigned long entry, unsigned long gpa,
>>>> +		enum dma_data_direction dir)
>>>> +{
>>>> +	long ret;
>>>> +	unsigned long hpa = 0, ua;
>>>> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>>>> +	struct mm_iommu_table_group_mem_t *mem;
>>>> +
>>>> +	if (!pua)
>>>> +		/* it_userspace allocation might be delayed */
>>>> +		return H_TOO_HARD;
>>>> +
>>>> +	if (kvmppc_gpa_to_ua(vcpu->kvm, gpa, &ua, NULL))
>>>> +		return H_HARDWARE;
>>>
>>> Again H_HARDWARE doesn't seem right here.
>>>
>>>> +	mem = kvmppc_rm_iommu_lookup(vcpu, ua, 1ULL << tbl->it_page_shift);
>>>> +	if (!mem)
>>>> +		return H_HARDWARE;
>>>> +
>>>> +	if (mm_iommu_ua_to_hpa_rm(mem, ua, &hpa))
>>>> +		return H_HARDWARE;
>>>> +
>>>> +	pua = (void *) vmalloc_to_phys(pua);
>>>> +	if (!pua)
>>>> +		return H_HARDWARE;
>>>> +
>>>> +	if (mm_iommu_mapped_inc(mem))
>>>> +		return H_HARDWARE;
>>>> +
>>>> +	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
>>>> +	if (ret) {
>>>> +		mm_iommu_mapped_dec(mem);
>>>> +		return H_TOO_HARD;
>>>> +	}
>>>> +
>>>> +	if (dir != DMA_NONE)
>>>> +		kvmppc_rm_tce_iommu_mapped_dec(vcpu, tbl, entry);
>>>> +
>>>> +	*pua = ua;
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(kvmppc_rm_tce_iommu_map);
>>>> +
>>>> +static long kvmppc_rm_h_put_tce_iommu(struct kvm_vcpu *vcpu,
>>>> +		struct iommu_table *tbl, unsigned long liobn,
>>>> +		unsigned long ioba, unsigned long tce)
>>>> +{
>>>> +	const unsigned long entry = ioba >> tbl->it_page_shift;
>>>> +	const unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>>>> +	const enum dma_data_direction dir = iommu_tce_direction(tce);
>>>> +
>>>> +	/* Clear TCE */
>>>> +	if (dir == DMA_NONE) {
>>>> +		if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
>>>> +			return H_PARAMETER;
>>>> +
>>>> +		return kvmppc_rm_tce_iommu_unmap(vcpu, tbl, entry);
>>>> +	}
>>>> +
>>>> +	/* Put TCE */
>>>> +	if (iommu_tce_put_param_check(tbl, ioba, gpa))
>>>> +		return H_PARAMETER;
>>>> +
>>>> +	return kvmppc_rm_tce_iommu_map(vcpu, tbl, entry, gpa, dir);
>>>> +}
>>>> +
>>>> +static long kvmppc_rm_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
>>>> +		struct iommu_table *tbl, unsigned long ioba,
>>>> +		u64 *tces, unsigned long npages)
>>>> +{
>>>> +	unsigned long i, ret, tce, gpa;
>>>> +	const unsigned long entry = ioba >> tbl->it_page_shift;
>>>> +
>>>> +	for (i = 0; i < npages; ++i) {
>>>> +		gpa = be64_to_cpu(tces[i]) & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>>>> +
>>>> +		if (iommu_tce_put_param_check(tbl, ioba +
>>>> +				(i << tbl->it_page_shift), gpa))
>>>> +			return H_PARAMETER;
>>>> +	}
>>>> +
>>>> +	for (i = 0; i < npages; ++i) {
>>>> +		tce = be64_to_cpu(tces[i]);
>>>> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>>>> +
>>>> +		ret = kvmppc_rm_tce_iommu_map(vcpu, tbl, entry + i, gpa,
>>>> +				iommu_tce_direction(tce));
>>>> +		if (ret != H_SUCCESS)
>>>> +			return ret;
>>>> +	}
>>>> +
>>>> +	return H_SUCCESS;
>>>> +}
>>>> +
>>>> +static long kvmppc_rm_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
>>>> +		struct iommu_table *tbl,
>>>> +		unsigned long liobn, unsigned long ioba,
>>>> +		unsigned long tce_value, unsigned long npages)
>>>> +{
>>>> +	unsigned long i;
>>>> +	const unsigned long entry = ioba >> tbl->it_page_shift;
>>>> +
>>>> +	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
>>>> +		return H_PARAMETER;
>>>> +
>>>> +	for (i = 0; i < npages; ++i)
>>>> +		kvmppc_rm_tce_iommu_unmap(vcpu, tbl, entry + i);
>>>> +
>>>> +	return H_SUCCESS;
>>>> +}
>>>> +
>>>>  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>>>  		unsigned long ioba, unsigned long tce)
>>>>  {
>>>>  	struct kvmppc_spapr_tce_table *stt;
>>>>  	long ret;
>>>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>>>  
>>>>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>>>>  	/* 	    liobn, ioba, tce); */
>>>> @@ -211,6 +365,13 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>>>  	if (ret != H_SUCCESS)
>>>>  		return ret;
>>>>  
>>>> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>>>> +		ret = kvmppc_rm_h_put_tce_iommu(vcpu, stit->tbl,
>>>> +				liobn, ioba, tce);
>>>> +		if (ret != H_SUCCESS)
>>>> +			return ret;
>>>> +	}
>>>> +
>>>>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
>>>>  
>>>>  	return H_SUCCESS;
>>>> @@ -278,6 +439,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>>  		 * depend on hpt.
>>>>  		 */
>>>>  		struct mm_iommu_table_group_mem_t *mem;
>>>> +		struct kvmppc_spapr_tce_iommu_table *stit;
>>>>  
>>>>  		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
>>>>  			return H_TOO_HARD;
>>>> @@ -285,6 +447,13 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>>  		mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K);
>>>>  		if (!mem || mm_iommu_ua_to_hpa_rm(mem, ua, &tces))
>>>>  			return H_TOO_HARD;
>>>> +
>>>> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>>>> +			ret = kvmppc_rm_h_put_tce_indirect_iommu(vcpu,
>>>> +					stit->tbl, ioba, (u64 *)tces, npages);
>>>> +			if (ret != H_SUCCESS)
>>>> +				return ret;
>>>> +		}
>>>>  	} else {
>>>>  		/*
>>>>  		 * This is emulated devices case.
>>>> @@ -334,6 +503,8 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>>>>  {
>>>>  	struct kvmppc_spapr_tce_table *stt;
>>>>  	long i, ret;
>>>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>>> +
>>>>  
>>>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>>>>  	if (!stt)
>>>> @@ -347,6 +518,13 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>>>>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>>>>  		return H_PARAMETER;
>>>>  
>>>> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>>>> +		ret = kvmppc_rm_h_stuff_tce_iommu(vcpu, stit->tbl,
>>>> +				liobn, ioba, tce_value, npages);
>>>> +		if (ret != H_SUCCESS)
>>>> +			return ret;
>>>> +	}
>>>> +
>>>>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>>>>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>>>>  
>>>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>>>> index 70963c845e96..0e555ba998c0 100644
>>>> --- a/arch/powerpc/kvm/powerpc.c
>>>> +++ b/arch/powerpc/kvm/powerpc.c
>>>> @@ -536,6 +536,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>>>>  #ifdef CONFIG_PPC_BOOK3S_64
>>>>  	case KVM_CAP_SPAPR_TCE:
>>>>  	case KVM_CAP_SPAPR_TCE_64:
>>>> +		/* fallthrough */
>>>> +	case KVM_CAP_SPAPR_TCE_VFIO:
>>>>  	case KVM_CAP_PPC_ALLOC_HTAB:
>>>>  	case KVM_CAP_PPC_RTAS:
>>>>  	case KVM_CAP_PPC_FIXUP_HCALL:
>>>> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
>>>> index d32f239eb471..3181054c8ff7 100644
>>>> --- a/virt/kvm/vfio.c
>>>> +++ b/virt/kvm/vfio.c
>>>> @@ -20,6 +20,10 @@
>>>>  #include <linux/vfio.h>
>>>>  #include "vfio.h"
>>>>  
>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>> +#include <asm/kvm_ppc.h>
>>>> +#endif
>>>> +
>>>>  struct kvm_vfio_group {
>>>>  	struct list_head node;
>>>>  	struct vfio_group *vfio_group;
>>>> @@ -89,6 +93,22 @@ static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
>>>>  	return ret > 0;
>>>>  }
>>>>  
>>>> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
>>>> +{
>>>> +	int (*fn)(struct vfio_group *);
>>>> +	int ret = -1;
>>>> +
>>>> +	fn = symbol_get(vfio_external_user_iommu_id);
>>>> +	if (!fn)
>>>> +		return ret;
>>>> +
>>>> +	ret = fn(vfio_group);
>>>> +
>>>> +	symbol_put(vfio_external_user_iommu_id);
>>>> +
>>>> +	return ret;
>>>> +}
>>>> +
>>>>  /*
>>>>   * Groups can use the same or different IOMMU domains.  If the same then
>>>>   * adding a new group may change the coherency of groups we've previously
>>>> @@ -211,6 +231,9 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>>>>  
>>>>  		mutex_unlock(&kv->lock);
>>>>  
>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>> +		kvm_spapr_tce_release_iommu_group(dev->kvm, vfio_group);
>>>> +#endif
>>>>  		kvm_vfio_group_set_kvm(vfio_group, NULL);
>>>>  
>>>>  		kvm_vfio_group_put_external_user(vfio_group);
>>>> @@ -218,6 +241,65 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>>>>  		kvm_vfio_update_coherency(dev);
>>>>  
>>>>  		return ret;
>>>> +
>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
>>>> +		struct kvm_vfio_spapr_tce param;
>>>> +		unsigned long minsz;
>>>> +		struct kvm_vfio *kv = dev->private;
>>>> +		struct vfio_group *vfio_group;
>>>> +		struct kvm_vfio_group *kvg;
>>>> +		struct fd f;
>>>> +
>>>> +		minsz = offsetofend(struct kvm_vfio_spapr_tce, tablefd);
>>>> +
>>>> +		if (copy_from_user(&param, (void __user *)arg, minsz))
>>>> +			return -EFAULT;
>>>> +
>>>> +		if (param.argsz < minsz || param.flags)
>>>> +			return -EINVAL;
>>>> +
>>>> +		f = fdget(param.groupfd);
>>>> +		if (!f.file)
>>>> +			return -EBADF;
>>>> +
>>>> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
>>>> +		fdput(f);
>>>> +
>>>> +		if (IS_ERR(vfio_group))
>>>> +			return PTR_ERR(vfio_group);
>>>> +
>>>> +		ret = -ENOENT;
>>>> +
>>>> +		mutex_lock(&kv->lock);
>>>> +
>>>> +		list_for_each_entry(kvg, &kv->group_list, node) {
>>>> +			int group_id;
>>>> +			struct iommu_group *grp;
>>>> +
>>>> +			if (kvg->vfio_group != vfio_group)
>>>> +				continue;
>>>> +
>>>> +			group_id = kvm_vfio_external_user_iommu_id(
>>>> +					kvg->vfio_group);
>>>> +			grp = iommu_group_get_by_id(group_id);
>>>> +			if (!grp) {
>>>> +				ret = -EFAULT;
>>>> +				break;
>>>> +			}
>>>> +
>>>> +			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
>>>> +					param.tablefd, vfio_group, grp);
>>>> +
>>>> +			iommu_group_put(grp);
>>>> +			break;
>>>> +		}
>>>> +
>>>> +		mutex_unlock(&kv->lock);
>>>> +
>>>> +		return ret;
>>>> +	}
>>>> +#endif /* CONFIG_SPAPR_TCE_IOMMU */
>>>>  	}
>>>
>>> Don't you also need to add something to the KVM_DEV_VFIO_GROUP_DEL
>>> path to detach the group from all LIOBNs,  Or else just fail if if
>>> there are LIOBNs attached.  I think it would be a qemu bug not to
>>> detach the LIOBNs before removing the group, but we stil need to
>>> protect the host in that case.
>>
>>
>> Yeah, this bit is a bit tricky/ugly.
>>
>> kvm_spapr_tce_liobn_release_iommu_group() (called from
>> kvm_spapr_tce_fops::release()) drops references when a group is removed
>> from the VFIO KVM device so there is no KVM_DEV_VFIO_GROUP_UNSET_SPAPR_TCE
>> and no action from QEMU is required.
> 
> IF qemu simply closes the group fd.  Which it does now, but might not
> always.  You still need to deal with the case where userspace does a
> KVM_DEV_VFIO_GROUP_DEL instead of closing the group fd.


This patch adds kvm_spapr_tce_release_iommu_group() to the
KVM_DEV_VFIO_GROUP_DEL handler.



> 
>>
>>
>>
>>>>  
>>>>  	return -ENXIO;
>>>> @@ -242,6 +324,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
>>>>  		switch (attr->attr) {
>>>>  		case KVM_DEV_VFIO_GROUP_ADD:
>>>>  		case KVM_DEV_VFIO_GROUP_DEL:
>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>> +		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
>>>> +#endif
>>>>  			return 0;
>>>>  		}
>>>>  
>>>> @@ -257,6 +342,9 @@ static void kvm_vfio_destroy(struct kvm_device *dev)
>>>>  	struct kvm_vfio_group *kvg, *tmp;
>>>>  
>>>>  	list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>> +		kvm_spapr_tce_release_iommu_group(dev->kvm, kvg->vfio_group);
>>>> +#endif
>>>>  		kvm_vfio_group_set_kvm(kvg->vfio_group, NULL);
>>>>  		kvm_vfio_group_put_external_user(kvg->vfio_group);
>>>>  		list_del(&kvg->node);
>>>
>>
>>
> 
> 
> 
> 


-- 
Alexey


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 839 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH kernel v3] KVM: PPC: Add in-kernel acceleration for VFIO
  2017-01-13  2:23             ` Alexey Kardashevskiy
@ 2017-01-13  2:38               ` David Gibson
  -1 siblings, 0 replies; 56+ messages in thread
From: David Gibson @ 2017-01-13  2:38 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Paul Mackerras, kvm-ppc, kvm, Alex Williamson

[-- Attachment #1: Type: text/plain, Size: 39635 bytes --]

On Fri, Jan 13, 2017 at 01:23:46PM +1100, Alexey Kardashevskiy wrote:
> On 13/01/17 10:53, David Gibson wrote:
> > On Thu, Jan 12, 2017 at 07:09:01PM +1100, Alexey Kardashevskiy wrote:
> >> On 12/01/17 16:04, David Gibson wrote:
> >>> On Tue, Dec 20, 2016 at 05:52:29PM +1100, Alexey Kardashevskiy wrote:
> >>>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> >>>> and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
> >>>> without passing them to user space, which saves time on switching
> >>>> to user space and back.
> >>>>
> >>>> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> >>>> KVM tries to handle a TCE request in real mode; if that fails,
> >>>> it passes the request to virtual mode to complete the operation.
> >>>> If the virtual mode handler fails as well, the request is passed to
> >>>> user space; this is not expected to happen though.
> >>>>
> >>>> To avoid dealing with page use counters (which is tricky in real mode),
> >>>> this only accelerates SPAPR TCE IOMMU v2 clients which are required
> >>>> to pre-register the userspace memory. The very first TCE request will
> >>>> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> >>>> of the TCE table (iommu_table::it_userspace) is not allocated till
> >>>> the very first mapping happens and we cannot call vmalloc in real mode.
> >>>>
> >>>> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> >>>> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> >>>> and associates a physical IOMMU table with the SPAPR TCE table (which
> >>>> is a guest view of the hardware IOMMU table). The iommu_table object
> >>>> is cached and referenced so we do not have to look up for it in real mode.
> >>>>
> >>>> This does not implement the UNSET counterpart as there is no use for it -
> >>>> once the acceleration is enabled, the existing userspace won't
> >>>> disable it unless a VFIO container is destroyed; this adds necessary
> >>>> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> >>>>
> >>>> As this creates a descriptor per IOMMU table-LIOBN couple (called
> >>>> kvmppc_spapr_tce_iommu_table), it is possible to have several
> >>>> descriptors with the same iommu_table (hardware IOMMU table) attached
> >>>> to the same LIOBN, this is done to simplify the cleanup and can be
> >>>> improved later.
> >>>>
> >>>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> >>>> space.
> >>>>
> >>>> This finally makes use of vfio_external_user_iommu_id() which was
> >>>> introduced quite some time ago and was considered for removal.
> >>>>
> >>>> Tests show that this patch increases transmission speed from 220MB/s
> >>>> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
> >>>>
> >>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>> ---
> >>>> Changes:
> >>>> v3:
> >>>> * simplified not to use VFIO group notifiers
> >>>> * reworked cleanup, should be cleaner/simpler now
> >>>>
> >>>> v2:
> >>>> * reworked to use new VFIO notifiers
> >>>> * now same iommu_table may appear in the list several times, to be fixed later
> >>>> ---
> >>>>
> >>>> This obsoletes:
> >>>>
> >>>> [PATCH kernel v2 08/11] KVM: PPC: Pass kvm* to kvmppc_find_table()
> >>>> [PATCH kernel v2 09/11] vfio iommu: Add helpers to (un)register blocking notifiers per group
> >>>> [PATCH kernel v2 11/11] KVM: PPC: Add in-kernel acceleration for VFIO
> >>>>
> >>>>
> >>>> So I have not reposted the whole thing, should have I?
> >>>>
> >>>>
> >>>> btw "F:     virt/kvm/vfio.*"  is missing MAINTAINERS.
> >>>>
> >>>>
> >>>> ---
> >>>>  Documentation/virtual/kvm/devices/vfio.txt |  22 ++-
> >>>>  arch/powerpc/include/asm/kvm_host.h        |   8 +
> >>>>  arch/powerpc/include/asm/kvm_ppc.h         |   4 +
> >>>>  include/uapi/linux/kvm.h                   |   8 +
> >>>>  arch/powerpc/kvm/book3s_64_vio.c           | 286 +++++++++++++++++++++++++++++
> >>>>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 178 ++++++++++++++++++
> >>>>  arch/powerpc/kvm/powerpc.c                 |   2 +
> >>>>  virt/kvm/vfio.c                            |  88 +++++++++
> >>>>  8 files changed, 594 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> >>>> index ef51740c67ca..f95d867168ea 100644
> >>>> --- a/Documentation/virtual/kvm/devices/vfio.txt
> >>>> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> >>>> @@ -16,7 +16,25 @@ Groups:
> >>>>  
> >>>>  KVM_DEV_VFIO_GROUP attributes:
> >>>>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> >>>> +	kvm_device_attr.addr points to an int32_t file descriptor
> >>>> +	for the VFIO group.
> >>>>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
> >>>> +	kvm_device_attr.addr points to an int32_t file descriptor
> >>>> +	for the VFIO group.
> >>>> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
> >>>> +	allocated by sPAPR KVM.
> >>>> +	kvm_device_attr.addr points to a struct:
> >>>>  
> >>>> -For each, kvm_device_attr.addr points to an int32_t file descriptor
> >>>> -for the VFIO group.
> >>>> +	struct kvm_vfio_spapr_tce {
> >>>> +		__u32	argsz;
> >>>> +		__u32	flags;
> >>>> +		__s32	groupfd;
> >>>> +		__s32	tablefd;
> >>>> +	};
> >>>> +
> >>>> +	where
> >>>> +	@argsz is the size of struct kvm_vfio_spapr_tce;
> >>>> +	@flags are not supported now, must be zero;
> >>>> +	@groupfd is a file descriptor for a VFIO group;
> >>>> +	@tablefd is a file descriptor for a TCE table allocated via
> >>>> +		KVM_CREATE_SPAPR_TCE.
> >>>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> >>>> index 28350a294b1e..3d281b7ea369 100644
> >>>> --- a/arch/powerpc/include/asm/kvm_host.h
> >>>> +++ b/arch/powerpc/include/asm/kvm_host.h
> >>>> @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
> >>>>  	atomic_t refcnt;
> >>>>  };
> >>>>  
> >>>> +struct kvmppc_spapr_tce_iommu_table {
> >>>> +	struct rcu_head rcu;
> >>>> +	struct list_head next;
> >>>> +	struct vfio_group *group;
> >>>> +	struct iommu_table *tbl;
> >>>> +};
> >>>> +
> >>>>  struct kvmppc_spapr_tce_table {
> >>>>  	struct list_head list;
> >>>>  	struct kvm *kvm;
> >>>> @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
> >>>>  	u32 page_shift;
> >>>>  	u64 offset;		/* in pages */
> >>>>  	u64 size;		/* window size in pages */
> >>>> +	struct list_head iommu_tables;
> >>>>  	struct page *pages[0];
> >>>>  };
> >>>>  
> >>>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> >>>> index 0a21c8503974..936138b866e7 100644
> >>>> --- a/arch/powerpc/include/asm/kvm_ppc.h
> >>>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> >>>> @@ -163,6 +163,10 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
> >>>>  extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
> >>>>  			struct kvm_memory_slot *memslot, unsigned long porder);
> >>>>  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
> >>>> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
> >>>> +		struct vfio_group *group, struct iommu_group *grp);
> >>>> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
> >>>> +		struct vfio_group *group);
> >>>>  
> >>>>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> >>>>  				struct kvm_create_spapr_tce_64 *args);
> >>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >>>> index 810f74317987..4088da4a575f 100644
> >>>> --- a/include/uapi/linux/kvm.h
> >>>> +++ b/include/uapi/linux/kvm.h
> >>>> @@ -1068,6 +1068,7 @@ struct kvm_device_attr {
> >>>>  #define  KVM_DEV_VFIO_GROUP			1
> >>>>  #define   KVM_DEV_VFIO_GROUP_ADD			1
> >>>>  #define   KVM_DEV_VFIO_GROUP_DEL			2
> >>>> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
> >>>>  
> >>>>  enum kvm_device_type {
> >>>>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
> >>>> @@ -1089,6 +1090,13 @@ enum kvm_device_type {
> >>>>  	KVM_DEV_TYPE_MAX,
> >>>>  };
> >>>>  
> >>>> +struct kvm_vfio_spapr_tce {
> >>>> +	__u32	argsz;
> >>>> +	__u32	flags;
> >>>> +	__s32	groupfd;
> >>>> +	__s32	tablefd;
> >>>> +};
> >>>> +
> >>>>  /*
> >>>>   * ioctls for VM fds
> >>>>   */
> >>>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> >>>> index 15df8ae627d9..008c4aee4df6 100644
> >>>> --- a/arch/powerpc/kvm/book3s_64_vio.c
> >>>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> >>>> @@ -27,6 +27,10 @@
> >>>>  #include <linux/hugetlb.h>
> >>>>  #include <linux/list.h>
> >>>>  #include <linux/anon_inodes.h>
> >>>> +#include <linux/iommu.h>
> >>>> +#include <linux/file.h>
> >>>> +#include <linux/vfio.h>
> >>>> +#include <linux/module.h>
> >>>>  
> >>>>  #include <asm/tlbflush.h>
> >>>>  #include <asm/kvm_ppc.h>
> >>>> @@ -39,6 +43,20 @@
> >>>>  #include <asm/udbg.h>
> >>>>  #include <asm/iommu.h>
> >>>>  #include <asm/tce.h>
> >>>> +#include <asm/mmu_context.h>
> >>>> +
> >>>> +static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
> >>>> +{
> >>>> +	void (*fn)(struct vfio_group *);
> >>>> +
> >>>> +	fn = symbol_get(vfio_group_put_external_user);
> >>>> +	if (!fn)
> >>>
> >>> I think this should have a WARN_ON().  If the vfio module is gone
> >>> while you still have VFIO groups attached to a KVM table, something
> >>> has gone horribly wrong.
> >>>
> >>>> +		return;
> >>>> +
> >>>> +	fn(vfio_group);
> >>>> +
> >>>> +	symbol_put(vfio_group_put_external_user);
> >>>> +}
> >>>>  
> >>>>  static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
> >>>>  {
> >>>> @@ -90,6 +108,99 @@ static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
> >>>>  	return ret;
> >>>>  }
> >>>>  
> >>>> +static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
> >>>> +{
> >>>> +	struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
> >>>> +			struct kvmppc_spapr_tce_iommu_table, rcu);
> >>>> +
> >>>> +	kfree(stit);
> >>>> +}
> >>>> +
> >>>> +static void kvm_spapr_tce_liobn_release_iommu_group(
> >>>> +		struct kvmppc_spapr_tce_table *stt,
> >>>> +		struct vfio_group *group)
> >>>> +{
> >>>> +	struct kvmppc_spapr_tce_iommu_table *stit, *tmp;
> >>>> +
> >>>> +	list_for_each_entry_safe(stit, tmp, &stt->iommu_tables, next) {
> >>>> +		if (group && (stit->group != group))
> >>>> +			continue;
> >>>> +
> >>>> +		list_del_rcu(&stit->next);
> >>>> +
> >>>> +		iommu_table_put(stit->tbl);
> >>>> +		kvm_vfio_group_put_external_user(stit->group);
> >>>> +
> >>>> +		call_rcu(&stit->rcu, kvm_spapr_tce_iommu_table_free);
> >>>> +	}
> >>>> +}
> >>>> +
> >>>> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
> >>>> +		struct vfio_group *group)
> >>>> +{
> >>>> +	struct kvmppc_spapr_tce_table *stt;
> >>>> +
> >>>> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list)
> >>>> +		kvm_spapr_tce_liobn_release_iommu_group(stt, group);
> >>>> +}
> >>>> +
> >>>> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
> >>>> +		struct vfio_group *group, struct iommu_group *grp)
> >>>
> >>> Isn't passing both the vfio_group and the iommu_group redundant?
> >>
> >> vfio_group struct is internal to the vfio driver and there is no API to get
> >> the iommu_group pointer from it.
> > 
> > But in the caller you *do* derive the group from the vfio group by
> > going via id (ugly, but workable I guess).  Why not fold that logic
> > into this function.
> 
> 
> I could, either way looks equally ugly-ish to me, rework it?

Yes please, it makes more sense having the ugly lookup within the function.

> >>>> +{
> >>>> +	struct kvmppc_spapr_tce_table *stt = NULL;
> >>>> +	bool found = false;
> >>>> +	struct iommu_table *tbl = NULL;
> >>>> +	struct iommu_table_group *table_group;
> >>>> +	long i;
> >>>> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >>>> +	struct fd f;
> >>>> +
> >>>> +	f = fdget(tablefd);
> >>>> +	if (!f.file)
> >>>> +		return -EBADF;
> >>>> +
> >>>> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
> >>>> +		if (stt == f.file->private_data) {
> >>>> +			found = true;
> >>>> +			break;
> >>>> +		}
> >>>> +	}
> >>>> +
> >>>> +	fdput(f);
> >>>> +
> >>>> +	if (!found)
> >>>> +		return -ENODEV;
> >>>
> >>> Not entirely sure if ENODEV is the right error, but I can't
> >>> immediately think of a better one.
> 
> 
> btw I still do not have a better candidate, can I keep this as is?

Yes, that's ok.

> >>>
> >>>> +	table_group = iommu_group_get_iommudata(grp);
> >>>> +	if (!table_group)
> >>>> +		return -EFAULT;
> >>>
> >>> EFAULT is usually only returned when you pass a syscall a bad pointer,
> >>> which doesn't look to be the case here.  What situation does this
> >>> error path actually represent?
> >>
> >>
> >> "something went terribly wrong".
> > 
> > In that case there should be a WARN_ON().  As long as it's something
> > terribly wrong that can't be the user's fault.
> > 
> >>>> +
> >>>> +	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
> >>>> +		struct iommu_table *tbltmp = table_group->tables[i];
> >>>> +
> >>>> +		if (!tbltmp)
> >>>> +			continue;
> >>>> +
> >>>> +		if ((tbltmp->it_page_shift == stt->page_shift) &&
> >>>> +				(tbltmp->it_offset == stt->offset)) {
> >>>> +			tbl = tbltmp;
> >>>> +			break;
> >>>> +		}
> >>>> +	}
> >>>> +	if (!tbl)
> >>>> +		return -ENODEV;
> >>>> +
> >>>> +	iommu_table_get(tbl);
> >>>> +
> >>>> +	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
> >>>> +	stit->tbl = tbl;
> >>>> +	stit->group = group;
> >>>> +
> >>>> +	list_add_rcu(&stit->next, &stt->iommu_tables);
> >>>
> >>> Won't this add a separate stit entry for each group attached to the
> >>> LIOBN, even if those groups share a single hardware iommu table -
> >>> which is the likely case if those groups have all been put into the
> >>> same container.
> >>
> >>
> >> Correct. I am planning on optimizing this later.
> > 
> > Hmm, ok.
> > 
> >>>> +	return 0;
> >>>> +}
> >>>> +
> >>>>  static void release_spapr_tce_table(struct rcu_head *head)
> >>>>  {
> >>>>  	struct kvmppc_spapr_tce_table *stt = container_of(head,
> >>>> @@ -132,6 +243,8 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
> >>>>  
> >>>>  	list_del_rcu(&stt->list);
> >>>>  
> >>>> +	kvm_spapr_tce_liobn_release_iommu_group(stt, NULL /* release all */);
> >>>> +
> >>>>  	kvm_put_kvm(stt->kvm);
> >>>>  
> >>>>  	kvmppc_account_memlimit(
> >>>> @@ -181,6 +294,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> >>>>  	stt->offset = args->offset;
> >>>>  	stt->size = size;
> >>>>  	stt->kvm = kvm;
> >>>> +	INIT_LIST_HEAD_RCU(&stt->iommu_tables);
> >>>>  
> >>>>  	for (i = 0; i < npages; i++) {
> >>>>  		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
> >>>> @@ -209,11 +323,161 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> >>>>  	return ret;
> >>>>  }
> >>>>  
> >>>> +static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
> >>>> +		struct iommu_table *tbl, unsigned long entry)
> >>>> +{
> >>>> +	struct mm_iommu_table_group_mem_t *mem = NULL;
> >>>> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> >>>> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> >>>> +
> >>>> +	if (!pua)
> >>>> +		return H_HARDWARE;
> >>>> +
> >>>> +	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
> >>>> +	if (!mem)
> >>>> +		return H_HARDWARE;
> >>>
> >>> IIUC, this error represents trying to unmap a page from the vIOMMU,
> >>> and discovering that it wasn't preregistered in the first place, which
> >>> shouldn't happen.  So would a WARN_ON() make sense here as well as the
> >>> H_HARDWARE.
> >>>
> >>>> +	mm_iommu_mapped_dec(mem);
> >>>> +
> >>>> +	*pua = 0;
> >>>> +
> >>>> +	return H_SUCCESS;
> >>>> +}
> >>>> +
> >>>> +static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
> >>>> +		struct iommu_table *tbl, unsigned long entry)
> >>>> +{
> >>>> +	enum dma_data_direction dir = DMA_NONE;
> >>>> +	unsigned long hpa = 0;
> >>>> +
> >>>> +	if (iommu_tce_xchg(tbl, entry, &hpa, &dir))
> >>>> +		return H_HARDWARE;
> >>>> +
> >>>> +	if (dir == DMA_NONE)
> >>>> +		return H_SUCCESS;
> >>>> +
> >>>> +	return kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
> >>>> +}
> >>>> +
> >>>> +long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
> >>>> +		unsigned long entry, unsigned long gpa,
> >>>> +		enum dma_data_direction dir)
> >>>> +{
> >>>> +	long ret;
> >>>> +	unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> >>>> +	struct mm_iommu_table_group_mem_t *mem;
> >>>> +
> >>>> +	if (!pua)
> >>>> +		/* it_userspace allocation might be delayed */
> >>>> +		return H_TOO_HARD;
> >>>> +
> >>>> +	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
> >>>> +		return H_HARDWARE;
> >>>
> >>> This would represent the guest trying to map a mad GPA, yes?  In which
> >>> case H_HARDWARE doesn't seem right.  H_PARAMETER or H_PERMISSION, maybe.
> >>>
> >>>> +	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
> >>>> +	if (!mem)
> >>>> +		return H_HARDWARE;
> >>>
> >>> Here H_HARDWARE seems right. IIUC this represents the guest trying to
> >>> map an address which wasn't pre-registered.  That would indicate a bug
> >>> in qemu, which is hardware as far as the guest is concerned.
> >>>
> >>>> +
> >>>> +	if (mm_iommu_ua_to_hpa(mem, ua, &hpa))
> >>>> +		return H_HARDWARE;
> >>>
> >>> Not sure what this case represents.
> >>
> >> Preregistered memory not being able to translate userspace address to a
> >> host physical. In virtual mode it is a simple bounds checker, in real mode
> >> it also includes vmalloc_to_phys() failure.
> > 
> > Ok.  This caller is virtual mode only, isn't it?  If we fail the
> > bounds check, that sounds like a WARN_ON() + H_HARDWARE, since it
> > means we've translated the GPA to an insane UA.
> > 
> > If the translation just fails, that sounds like an H_TOO_HARD.
> > 
> >>>> +	if (mm_iommu_mapped_inc(mem))
> >>>> +		return H_HARDWARE;
> >>>
> >>> Or this.
> >>
> >> A preregistered memory area is in a process of disposal, no new mappings
> >> are allowed.
> > 
> > Ok, again under control of qemu, so H_HARDWARE is reasonable.
> > 
> >>>> +	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
> >>>> +	if (ret) {
> >>>> +		mm_iommu_mapped_dec(mem);
> >>>> +		return H_TOO_HARD;
> >>>> +	}
> >>>> +
> >>>> +	if (dir != DMA_NONE)
> >>>> +		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
> >>>> +
> >>>> +	*pua = ua;
> >>>> +
> >>>> +	return 0;
> >>>> +}
> >>>> +
> >>>> +long kvmppc_h_put_tce_iommu(struct kvm_vcpu *vcpu,
> >>>> +		struct iommu_table *tbl,
> >>>> +		unsigned long liobn, unsigned long ioba,
> >>>> +		unsigned long tce)
> >>>> +{
> >>>> +	long idx, ret = H_HARDWARE;
> >>>> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> >>>> +	const unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >>>> +	const enum dma_data_direction dir = iommu_tce_direction(tce);
> >>>> +
> >>>> +	/* Clear TCE */
> >>>> +	if (dir == DMA_NONE) {
> >>>> +		if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
> >>>> +			return H_PARAMETER;
> >>>> +
> >>>> +		return kvmppc_tce_iommu_unmap(vcpu->kvm, tbl, entry);
> >>>> +	}
> >>>> +
> >>>> +	/* Put TCE */
> >>>> +	if (iommu_tce_put_param_check(tbl, ioba, gpa))
> >>>> +		return H_PARAMETER;
> >>>> +
> >>>> +	idx = srcu_read_lock(&vcpu->kvm->srcu);
> >>>> +	ret = kvmppc_tce_iommu_map(vcpu->kvm, tbl, entry, gpa, dir);
> >>>> +	srcu_read_unlock(&vcpu->kvm->srcu, idx);
> >>>> +
> >>>> +	return ret;
> >>>> +}
> >>>> +
> >>>> +static long kvmppc_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
> >>>> +		struct iommu_table *tbl, unsigned long ioba,
> >>>> +		u64 __user *tces, unsigned long npages)
> >>>> +{
> >>>> +	unsigned long i, ret, tce, gpa;
> >>>> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> >>>> +
> >>>> +	for (i = 0; i < npages; ++i) {
> >>>> +		gpa = be64_to_cpu(tces[i]) & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >>>
> >>> IIUC this is the virtual mode, not the real mode version.  In which
> >>> case you shouldn't be accessing tces[i] (a userspace pointer) directly
> >>> but should instead be using get_user().
> >>>
> >>>> +		if (iommu_tce_put_param_check(tbl, ioba +
> >>>> +				(i << tbl->it_page_shift), gpa))
> >>>> +			return H_PARAMETER;
> >>>> +	}
> >>>> +
> >>>> +	for (i = 0; i < npages; ++i) {
> >>>> +		tce = be64_to_cpu(tces[i]);
> >>>> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >>>> +
> >>>> +		ret = kvmppc_tce_iommu_map(vcpu->kvm, tbl, entry + i, gpa,
> >>>> +				iommu_tce_direction(tce));
> >>>> +		if (ret != H_SUCCESS)
> >>>> +			return ret;
> >>>> +	}
> >>>> +
> >>>> +	return H_SUCCESS;
> >>>> +}
> >>>> +
> >>>> +long kvmppc_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
> >>>> +		struct iommu_table *tbl,
> >>>> +		unsigned long liobn, unsigned long ioba,
> >>>> +		unsigned long tce_value, unsigned long npages)
> >>>> +{
> >>>> +	unsigned long i;
> >>>> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> >>>> +
> >>>> +	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
> >>>> +		return H_PARAMETER;
> >>>> +
> >>>> +	for (i = 0; i < npages; ++i)
> >>>> +		kvmppc_tce_iommu_unmap(vcpu->kvm, tbl, entry + i);
> >>>> +
> >>>> +	return H_SUCCESS;
> >>>> +}
> >>>> +
> >>>>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> >>>>  		      unsigned long ioba, unsigned long tce)
> >>>>  {
> >>>>  	struct kvmppc_spapr_tce_table *stt;
> >>>>  	long ret;
> >>>> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >>>>  
> >>>>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
> >>>>  	/* 	    liobn, ioba, tce); */
> >>>> @@ -230,6 +494,12 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> >>>>  	if (ret != H_SUCCESS)
> >>>>  		return ret;
> >>>>  
> >>>> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> >>>
> >>> As noted above, AFAICT there is one stit per group, rather than per
> >>> backend IOMMU table, so if there are multiple groups in the same
> >>> container (and therefore attached to the same LIOBN), won't this mean
> >>> we duplicate this operation a bunch of times?
> >>>
> >>>> +		ret = kvmppc_h_put_tce_iommu(vcpu, stit->tbl, liobn, ioba, tce);
> >>>> +		if (ret != H_SUCCESS)
> >>>> +			return ret;
> >>>> +	}
> >>>> +
> >>>>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
> >>>>  
> >>>>  	return H_SUCCESS;
> >>>> @@ -245,6 +515,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>>>  	unsigned long entry, ua = 0;
> >>>>  	u64 __user *tces;
> >>>>  	u64 tce;
> >>>> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >>>>  
> >>>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >>>>  	if (!stt)
> >>>> @@ -272,6 +543,13 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>>>  	}
> >>>>  	tces = (u64 __user *) ua;
> >>>>  
> >>>> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> >>>> +		ret = kvmppc_h_put_tce_indirect_iommu(vcpu,
> >>>> +				stit->tbl, ioba, tces, npages);
> >>>> +		if (ret != H_SUCCESS)
> >>>> +			goto unlock_exit;
> >>>
> >>> Hmm, I don't suppose you could simplify things by not having a
> >>> put_tce_indirect() version of the whole backend iommu mapping
> >>> function, but just a single-TCE version, and instead looping across
> >>> the backend IOMMU tables as you put each indirect entry in .
> >>>
> >>>> +	}
> >>>> +
> >>>>  	for (i = 0; i < npages; ++i) {
> >>>>  		if (get_user(tce, tces + i)) {
> >>>>  			ret = H_TOO_HARD;
> >>>> @@ -299,6 +577,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
> >>>>  {
> >>>>  	struct kvmppc_spapr_tce_table *stt;
> >>>>  	long i, ret;
> >>>> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >>>>  
> >>>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >>>>  	if (!stt)
> >>>> @@ -312,6 +591,13 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
> >>>>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
> >>>>  		return H_PARAMETER;
> >>>>  
> >>>> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> >>>> +		ret = kvmppc_h_stuff_tce_iommu(vcpu, stit->tbl, liobn, ioba,
> >>>> +				tce_value, npages);
> >>>> +		if (ret != H_SUCCESS)
> >>>> +			return ret;
> >>>> +	}
> >>>> +
> >>>>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
> >>>>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
> >>>>  
> >>>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> >>>> index 8a6834e6e1c8..4d6f01712a6d 100644
> >>>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> >>>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> >>>> @@ -190,11 +190,165 @@ static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
> >>>>  	return mm_iommu_lookup_rm(vcpu->kvm->mm, ua, size);
> >>>>  }
> >>>>  
> >>>> +static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm_vcpu *vcpu,
> >>>> +		struct iommu_table *tbl, unsigned long entry)
> >>>> +{
> >>>> +	struct mm_iommu_table_group_mem_t *mem = NULL;
> >>>> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> >>>> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> >>>> +
> >>>> +	if (!pua)
> >>>> +		return H_SUCCESS;
> >>>
> >>> What case is this?  Not being able to find the userspace address
> >>> doesn't sound like a success.
> >>>
> >>>> +	pua = (void *) vmalloc_to_phys(pua);
> >>>> +	if (!pua)
> >>>> +		return H_SUCCESS;
> >>>
> >>> And again..
> >>>
> >>>> +	mem = kvmppc_rm_iommu_lookup(vcpu, *pua, pgsize);
> >>>> +	if (!mem)
> >>>> +		return H_HARDWARE;
> >>>
> >>> Should this have a WARN_ON?
> >>>
> >>>> +	mm_iommu_mapped_dec(mem);
> >>>> +
> >>>> +	*pua = 0;
> >>>> +
> >>>> +	return H_SUCCESS;
> >>>> +}
> >>>> +
> >>>> +static long kvmppc_rm_tce_iommu_unmap(struct kvm_vcpu *vcpu,
> >>>> +		struct iommu_table *tbl, unsigned long entry)
> >>>> +{
> >>>> +	enum dma_data_direction dir = DMA_NONE;
> >>>> +	unsigned long hpa = 0;
> >>>> +
> >>>> +	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
> >>>> +		return H_HARDWARE;
> >>>> +
> >>>> +	if (dir == DMA_NONE)
> >>>> +		return H_SUCCESS;
> >>>> +
> >>>> +	return kvmppc_rm_tce_iommu_mapped_dec(vcpu, tbl, entry);
> >>>> +}
> >>>> +
> >>>> +long kvmppc_rm_tce_iommu_map(struct kvm_vcpu *vcpu, struct iommu_table *tbl,
> >>>> +		unsigned long entry, unsigned long gpa,
> >>>> +		enum dma_data_direction dir)
> >>>> +{
> >>>> +	long ret;
> >>>> +	unsigned long hpa = 0, ua;
> >>>> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> >>>> +	struct mm_iommu_table_group_mem_t *mem;
> >>>> +
> >>>> +	if (!pua)
> >>>> +		/* it_userspace allocation might be delayed */
> >>>> +		return H_TOO_HARD;
> >>>> +
> >>>> +	if (kvmppc_gpa_to_ua(vcpu->kvm, gpa, &ua, NULL))
> >>>> +		return H_HARDWARE;
> >>>
> >>> Again H_HARDWARE doesn't seem right here.
> >>>
> >>>> +	mem = kvmppc_rm_iommu_lookup(vcpu, ua, 1ULL << tbl->it_page_shift);
> >>>> +	if (!mem)
> >>>> +		return H_HARDWARE;
> >>>> +
> >>>> +	if (mm_iommu_ua_to_hpa_rm(mem, ua, &hpa))
> >>>> +		return H_HARDWARE;
> >>>> +
> >>>> +	pua = (void *) vmalloc_to_phys(pua);
> >>>> +	if (!pua)
> >>>> +		return H_HARDWARE;
> >>>> +
> >>>> +	if (mm_iommu_mapped_inc(mem))
> >>>> +		return H_HARDWARE;
> >>>> +
> >>>> +	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> >>>> +	if (ret) {
> >>>> +		mm_iommu_mapped_dec(mem);
> >>>> +		return H_TOO_HARD;
> >>>> +	}
> >>>> +
> >>>> +	if (dir != DMA_NONE)
> >>>> +		kvmppc_rm_tce_iommu_mapped_dec(vcpu, tbl, entry);
> >>>> +
> >>>> +	*pua = ua;
> >>>> +
> >>>> +	return 0;
> >>>> +}
> >>>> +EXPORT_SYMBOL_GPL(kvmppc_rm_tce_iommu_map);
> >>>> +
> >>>> +static long kvmppc_rm_h_put_tce_iommu(struct kvm_vcpu *vcpu,
> >>>> +		struct iommu_table *tbl, unsigned long liobn,
> >>>> +		unsigned long ioba, unsigned long tce)
> >>>> +{
> >>>> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> >>>> +	const unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >>>> +	const enum dma_data_direction dir = iommu_tce_direction(tce);
> >>>> +
> >>>> +	/* Clear TCE */
> >>>> +	if (dir == DMA_NONE) {
> >>>> +		if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
> >>>> +			return H_PARAMETER;
> >>>> +
> >>>> +		return kvmppc_rm_tce_iommu_unmap(vcpu, tbl, entry);
> >>>> +	}
> >>>> +
> >>>> +	/* Put TCE */
> >>>> +	if (iommu_tce_put_param_check(tbl, ioba, gpa))
> >>>> +		return H_PARAMETER;
> >>>> +
> >>>> +	return kvmppc_rm_tce_iommu_map(vcpu, tbl, entry, gpa, dir);
> >>>> +}
> >>>> +
> >>>> +static long kvmppc_rm_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
> >>>> +		struct iommu_table *tbl, unsigned long ioba,
> >>>> +		u64 *tces, unsigned long npages)
> >>>> +{
> >>>> +	unsigned long i, ret, tce, gpa;
> >>>> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> >>>> +
> >>>> +	for (i = 0; i < npages; ++i) {
> >>>> +		gpa = be64_to_cpu(tces[i]) & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >>>> +
> >>>> +		if (iommu_tce_put_param_check(tbl, ioba +
> >>>> +				(i << tbl->it_page_shift), gpa))
> >>>> +			return H_PARAMETER;
> >>>> +	}
> >>>> +
> >>>> +	for (i = 0; i < npages; ++i) {
> >>>> +		tce = be64_to_cpu(tces[i]);
> >>>> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >>>> +
> >>>> +		ret = kvmppc_rm_tce_iommu_map(vcpu, tbl, entry + i, gpa,
> >>>> +				iommu_tce_direction(tce));
> >>>> +		if (ret != H_SUCCESS)
> >>>> +			return ret;
> >>>> +	}
> >>>> +
> >>>> +	return H_SUCCESS;
> >>>> +}
> >>>> +
> >>>> +static long kvmppc_rm_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
> >>>> +		struct iommu_table *tbl,
> >>>> +		unsigned long liobn, unsigned long ioba,
> >>>> +		unsigned long tce_value, unsigned long npages)
> >>>> +{
> >>>> +	unsigned long i;
> >>>> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> >>>> +
> >>>> +	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
> >>>> +		return H_PARAMETER;
> >>>> +
> >>>> +	for (i = 0; i < npages; ++i)
> >>>> +		kvmppc_rm_tce_iommu_unmap(vcpu, tbl, entry + i);
> >>>> +
> >>>> +	return H_SUCCESS;
> >>>> +}
> >>>> +
> >>>>  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> >>>>  		unsigned long ioba, unsigned long tce)
> >>>>  {
> >>>>  	struct kvmppc_spapr_tce_table *stt;
> >>>>  	long ret;
> >>>> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >>>>  
> >>>>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
> >>>>  	/* 	    liobn, ioba, tce); */
> >>>> @@ -211,6 +365,13 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> >>>>  	if (ret != H_SUCCESS)
> >>>>  		return ret;
> >>>>  
> >>>> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> >>>> +		ret = kvmppc_rm_h_put_tce_iommu(vcpu, stit->tbl,
> >>>> +				liobn, ioba, tce);
> >>>> +		if (ret != H_SUCCESS)
> >>>> +			return ret;
> >>>> +	}
> >>>> +
> >>>>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
> >>>>  
> >>>>  	return H_SUCCESS;
> >>>> @@ -278,6 +439,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>>>  		 * depend on hpt.
> >>>>  		 */
> >>>>  		struct mm_iommu_table_group_mem_t *mem;
> >>>> +		struct kvmppc_spapr_tce_iommu_table *stit;
> >>>>  
> >>>>  		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
> >>>>  			return H_TOO_HARD;
> >>>> @@ -285,6 +447,13 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>>>  		mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K);
> >>>>  		if (!mem || mm_iommu_ua_to_hpa_rm(mem, ua, &tces))
> >>>>  			return H_TOO_HARD;
> >>>> +
> >>>> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> >>>> +			ret = kvmppc_rm_h_put_tce_indirect_iommu(vcpu,
> >>>> +					stit->tbl, ioba, (u64 *)tces, npages);
> >>>> +			if (ret != H_SUCCESS)
> >>>> +				return ret;
> >>>> +		}
> >>>>  	} else {
> >>>>  		/*
> >>>>  		 * This is emulated devices case.
> >>>> @@ -334,6 +503,8 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
> >>>>  {
> >>>>  	struct kvmppc_spapr_tce_table *stt;
> >>>>  	long i, ret;
> >>>> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >>>> +
> >>>>  
> >>>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >>>>  	if (!stt)
> >>>> @@ -347,6 +518,13 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
> >>>>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
> >>>>  		return H_PARAMETER;
> >>>>  
> >>>> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> >>>> +		ret = kvmppc_rm_h_stuff_tce_iommu(vcpu, stit->tbl,
> >>>> +				liobn, ioba, tce_value, npages);
> >>>> +		if (ret != H_SUCCESS)
> >>>> +			return ret;
> >>>> +	}
> >>>> +
> >>>>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
> >>>>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
> >>>>  
> >>>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> >>>> index 70963c845e96..0e555ba998c0 100644
> >>>> --- a/arch/powerpc/kvm/powerpc.c
> >>>> +++ b/arch/powerpc/kvm/powerpc.c
> >>>> @@ -536,6 +536,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> >>>>  #ifdef CONFIG_PPC_BOOK3S_64
> >>>>  	case KVM_CAP_SPAPR_TCE:
> >>>>  	case KVM_CAP_SPAPR_TCE_64:
> >>>> +		/* fallthrough */
> >>>> +	case KVM_CAP_SPAPR_TCE_VFIO:
> >>>>  	case KVM_CAP_PPC_ALLOC_HTAB:
> >>>>  	case KVM_CAP_PPC_RTAS:
> >>>>  	case KVM_CAP_PPC_FIXUP_HCALL:
> >>>> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> >>>> index d32f239eb471..3181054c8ff7 100644
> >>>> --- a/virt/kvm/vfio.c
> >>>> +++ b/virt/kvm/vfio.c
> >>>> @@ -20,6 +20,10 @@
> >>>>  #include <linux/vfio.h>
> >>>>  #include "vfio.h"
> >>>>  
> >>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>> +#include <asm/kvm_ppc.h>
> >>>> +#endif
> >>>> +
> >>>>  struct kvm_vfio_group {
> >>>>  	struct list_head node;
> >>>>  	struct vfio_group *vfio_group;
> >>>> @@ -89,6 +93,22 @@ static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
> >>>>  	return ret > 0;
> >>>>  }
> >>>>  
> >>>> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> >>>> +{
> >>>> +	int (*fn)(struct vfio_group *);
> >>>> +	int ret = -1;
> >>>> +
> >>>> +	fn = symbol_get(vfio_external_user_iommu_id);
> >>>> +	if (!fn)
> >>>> +		return ret;
> >>>> +
> >>>> +	ret = fn(vfio_group);
> >>>> +
> >>>> +	symbol_put(vfio_external_user_iommu_id);
> >>>> +
> >>>> +	return ret;
> >>>> +}
> >>>> +
> >>>>  /*
> >>>>   * Groups can use the same or different IOMMU domains.  If the same then
> >>>>   * adding a new group may change the coherency of groups we've previously
> >>>> @@ -211,6 +231,9 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >>>>  
> >>>>  		mutex_unlock(&kv->lock);
> >>>>  
> >>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>> +		kvm_spapr_tce_release_iommu_group(dev->kvm, vfio_group);
> >>>> +#endif
> >>>>  		kvm_vfio_group_set_kvm(vfio_group, NULL);
> >>>>  
> >>>>  		kvm_vfio_group_put_external_user(vfio_group);
> >>>> @@ -218,6 +241,65 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >>>>  		kvm_vfio_update_coherency(dev);
> >>>>  
> >>>>  		return ret;
> >>>> +
> >>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
> >>>> +		struct kvm_vfio_spapr_tce param;
> >>>> +		unsigned long minsz;
> >>>> +		struct kvm_vfio *kv = dev->private;
> >>>> +		struct vfio_group *vfio_group;
> >>>> +		struct kvm_vfio_group *kvg;
> >>>> +		struct fd f;
> >>>> +
> >>>> +		minsz = offsetofend(struct kvm_vfio_spapr_tce, tablefd);
> >>>> +
> >>>> +		if (copy_from_user(&param, (void __user *)arg, minsz))
> >>>> +			return -EFAULT;
> >>>> +
> >>>> +		if (param.argsz < minsz || param.flags)
> >>>> +			return -EINVAL;
> >>>> +
> >>>> +		f = fdget(param.groupfd);
> >>>> +		if (!f.file)
> >>>> +			return -EBADF;
> >>>> +
> >>>> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
> >>>> +		fdput(f);
> >>>> +
> >>>> +		if (IS_ERR(vfio_group))
> >>>> +			return PTR_ERR(vfio_group);
> >>>> +
> >>>> +		ret = -ENOENT;
> >>>> +
> >>>> +		mutex_lock(&kv->lock);
> >>>> +
> >>>> +		list_for_each_entry(kvg, &kv->group_list, node) {
> >>>> +			int group_id;
> >>>> +			struct iommu_group *grp;
> >>>> +
> >>>> +			if (kvg->vfio_group != vfio_group)
> >>>> +				continue;
> >>>> +
> >>>> +			group_id = kvm_vfio_external_user_iommu_id(
> >>>> +					kvg->vfio_group);
> >>>> +			grp = iommu_group_get_by_id(group_id);
> >>>> +			if (!grp) {
> >>>> +				ret = -EFAULT;
> >>>> +				break;
> >>>> +			}
> >>>> +
> >>>> +			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
> >>>> +					param.tablefd, vfio_group, grp);
> >>>> +
> >>>> +			iommu_group_put(grp);
> >>>> +			break;
> >>>> +		}
> >>>> +
> >>>> +		mutex_unlock(&kv->lock);
> >>>> +
> >>>> +		return ret;
> >>>> +	}
> >>>> +#endif /* CONFIG_SPAPR_TCE_IOMMU */
> >>>>  	}
> >>>
> >>> Don't you also need to add something to the KVM_DEV_VFIO_GROUP_DEL
> >>> path to detach the group from all LIOBNs?  Or else just fail if
> >>> there are LIOBNs attached.  I think it would be a qemu bug not to
> >>> detach the LIOBNs before removing the group, but we still need to
> >>> protect the host in that case.
> >>
> >>
> >> Yeah, this bit is a bit tricky/ugly.
> >>
> >> kvm_spapr_tce_liobn_release_iommu_group() (called from
> >> kvm_spapr_tce_fops::release()) drops references when a group is removed
> >> from the VFIO KVM device so there is no KVM_DEV_VFIO_GROUP_UNSET_SPAPR_TCE
> >> and no action from QEMU is required.
> > 
> > IF qemu simply closes the group fd.  Which it does now, but might not
> > always.  You still need to deal with the case where userspace does a
> > KVM_DEV_VFIO_GROUP_DEL instead of closing the group fd.
> 
> 
> This patch adds kvm_spapr_tce_release_iommu_group() to the
> KVM_DEV_VFIO_GROUP_DEL handler.

It does? Oh.. I didn't spot that.

> 
> 
> 
> > 
> >>
> >>
> >>
> >>>>  
> >>>>  	return -ENXIO;
> >>>> @@ -242,6 +324,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
> >>>>  		switch (attr->attr) {
> >>>>  		case KVM_DEV_VFIO_GROUP_ADD:
> >>>>  		case KVM_DEV_VFIO_GROUP_DEL:
> >>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>> +		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
> >>>> +#endif
> >>>>  			return 0;
> >>>>  		}
> >>>>  
> >>>> @@ -257,6 +342,9 @@ static void kvm_vfio_destroy(struct kvm_device *dev)
> >>>>  	struct kvm_vfio_group *kvg, *tmp;
> >>>>  
> >>>>  	list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
> >>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>> +		kvm_spapr_tce_release_iommu_group(dev->kvm, kvg->vfio_group);
> >>>> +#endif
> >>>>  		kvm_vfio_group_set_kvm(kvg->vfio_group, NULL);
> >>>>  		kvm_vfio_group_put_external_user(kvg->vfio_group);
> >>>>  		list_del(&kvg->node);
> >>>
> >>
> >>
> > 
> > 
> > 
> > 
> 
> 




-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH kernel v3] KVM: PPC: Add in-kernel acceleration for VFIO
@ 2017-01-13  2:38               ` David Gibson
  0 siblings, 0 replies; 56+ messages in thread
From: David Gibson @ 2017-01-13  2:38 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Paul Mackerras, kvm-ppc, kvm, Alex Williamson

[-- Attachment #1: Type: text/plain, Size: 39635 bytes --]

On Fri, Jan 13, 2017 at 01:23:46PM +1100, Alexey Kardashevskiy wrote:
> On 13/01/17 10:53, David Gibson wrote:
> > On Thu, Jan 12, 2017 at 07:09:01PM +1100, Alexey Kardashevskiy wrote:
> >> On 12/01/17 16:04, David Gibson wrote:
> >>> On Tue, Dec 20, 2016 at 05:52:29PM +1100, Alexey Kardashevskiy wrote:
> >>>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> >>>> and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
> >>>> without passing them to user space, which saves time on switching
> >>>> to user space and back.
> >>>>
> >>>> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> >>>> KVM tries to handle a TCE request in real mode first; if that fails,
> >>>> it passes the request to the virtual mode handler to complete the operation.
> >>>> If the virtual mode handler fails as well, the request is passed to
> >>>> user space; this is not expected to happen though.
> >>>>
> >>>> To avoid dealing with page use counters (which is tricky in real mode),
> >>>> this only accelerates SPAPR TCE IOMMU v2 clients which are required
> >>>> to pre-register the userspace memory. The very first TCE request will
> >>>> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> >>>> of the TCE table (iommu_table::it_userspace) is not allocated till
> >>>> the very first mapping happens and we cannot call vmalloc in real mode.
> >>>>
> >>>> This adds a new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> >>>> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> >>>> and associates a physical IOMMU table with the SPAPR TCE table (which
> >>>> is a guest view of the hardware IOMMU table). The iommu_table object
> >>>> is cached and referenced so we do not have to look up for it in real mode.
> >>>>
> >>>> This does not implement the UNSET counterpart as there is no use for it -
> >>>> once the acceleration is enabled, the existing userspace won't
> >>>> disable it unless a VFIO container is destroyed; this adds necessary
> >>>> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> >>>>
> >>>> As this creates a descriptor per IOMMU table-LIOBN pair (called
> >>>> kvmppc_spapr_tce_iommu_table), it is possible to have several
> >>>> descriptors with the same iommu_table (hardware IOMMU table) attached
> >>>> to the same LIOBN; this is done to simplify the cleanup and can be
> >>>> improved later.
> >>>>
> >>>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> >>>> space.
> >>>>
> >>>> This finally makes use of vfio_external_user_iommu_id() which was
> >>>> introduced quite some time ago and was considered for removal.
> >>>>
> >>>> Tests show that this patch increases transmission speed from 220MB/s
> >>>> to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb ethernet card).
> >>>>
> >>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>> ---
> >>>> Changes:
> >>>> v3:
> >>>> * simplified not to use VFIO group notifiers
> >>>> * reworked cleanup, should be cleaner/simpler now
> >>>>
> >>>> v2:
> >>>> * reworked to use new VFIO notifiers
> >>>> * now same iommu_table may appear in the list several times, to be fixed later
> >>>> ---
> >>>>
> >>>> This obsoletes:
> >>>>
> >>>> [PATCH kernel v2 08/11] KVM: PPC: Pass kvm* to kvmppc_find_table()
> >>>> [PATCH kernel v2 09/11] vfio iommu: Add helpers to (un)register blocking notifiers per group
> >>>> [PATCH kernel v2 11/11] KVM: PPC: Add in-kernel acceleration for VFIO
> >>>>
> >>>>
> >>>> So I have not reposted the whole thing; should I have?
> >>>>
> >>>>
> >>>> btw "F:     virt/kvm/vfio.*" is missing from MAINTAINERS.
> >>>>
> >>>>
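(Editorial illustration, not part of the patch: a rough sketch of how QEMU-like
userspace might enable the acceleration via the new attribute; the fd variable
names are assumptions and error handling is trimmed.)

	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/* kvm_vfio_dev_fd: fd from KVM_CREATE_DEVICE(KVM_DEV_TYPE_VFIO),
	 * group_fd: VFIO group fd, table_fd: fd from KVM_CREATE_SPAPR_TCE. */
	static int enable_tce_acceleration(int kvm_vfio_dev_fd, int group_fd,
					   int table_fd)
	{
		struct kvm_vfio_spapr_tce param = {
			.argsz = sizeof(param),
			.flags = 0,
			.groupfd = group_fd,
			.tablefd = table_fd,
		};
		struct kvm_device_attr attr = {
			.group = KVM_DEV_VFIO_GROUP,
			.attr = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE,
			.addr = (uint64_t)(uintptr_t)&param,
		};

		return ioctl(kvm_vfio_dev_fd, KVM_SET_DEVICE_ATTR, &attr);
	}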
> >>>> ---
> >>>>  Documentation/virtual/kvm/devices/vfio.txt |  22 ++-
> >>>>  arch/powerpc/include/asm/kvm_host.h        |   8 +
> >>>>  arch/powerpc/include/asm/kvm_ppc.h         |   4 +
> >>>>  include/uapi/linux/kvm.h                   |   8 +
> >>>>  arch/powerpc/kvm/book3s_64_vio.c           | 286 +++++++++++++++++++++++++++++
> >>>>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 178 ++++++++++++++++++
> >>>>  arch/powerpc/kvm/powerpc.c                 |   2 +
> >>>>  virt/kvm/vfio.c                            |  88 +++++++++
> >>>>  8 files changed, 594 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> >>>> index ef51740c67ca..f95d867168ea 100644
> >>>> --- a/Documentation/virtual/kvm/devices/vfio.txt
> >>>> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> >>>> @@ -16,7 +16,25 @@ Groups:
> >>>>  
> >>>>  KVM_DEV_VFIO_GROUP attributes:
> >>>>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> >>>> +	kvm_device_attr.addr points to an int32_t file descriptor
> >>>> +	for the VFIO group.
> >>>>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
> >>>> +	kvm_device_attr.addr points to an int32_t file descriptor
> >>>> +	for the VFIO group.
> >>>> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
> >>>> +	allocated by sPAPR KVM.
> >>>> +	kvm_device_attr.addr points to a struct:
> >>>>  
> >>>> -For each, kvm_device_attr.addr points to an int32_t file descriptor
> >>>> -for the VFIO group.
> >>>> +	struct kvm_vfio_spapr_tce {
> >>>> +		__u32	argsz;
> >>>> +		__u32	flags;
> >>>> +		__s32	groupfd;
> >>>> +		__s32	tablefd;
> >>>> +	};
> >>>> +
> >>>> +	where
> >>>> +	@argsz is the size of struct kvm_vfio_spapr_tce;
> >>>> +	@flags are not supported now, must be zero;
> >>>> +	@groupfd is a file descriptor for a VFIO group;
> >>>> +	@tablefd is a file descriptor for a TCE table allocated via
> >>>> +		KVM_CREATE_SPAPR_TCE.
> >>>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> >>>> index 28350a294b1e..3d281b7ea369 100644
> >>>> --- a/arch/powerpc/include/asm/kvm_host.h
> >>>> +++ b/arch/powerpc/include/asm/kvm_host.h
> >>>> @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
> >>>>  	atomic_t refcnt;
> >>>>  };
> >>>>  
> >>>> +struct kvmppc_spapr_tce_iommu_table {
> >>>> +	struct rcu_head rcu;
> >>>> +	struct list_head next;
> >>>> +	struct vfio_group *group;
> >>>> +	struct iommu_table *tbl;
> >>>> +};
> >>>> +
> >>>>  struct kvmppc_spapr_tce_table {
> >>>>  	struct list_head list;
> >>>>  	struct kvm *kvm;
> >>>> @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
> >>>>  	u32 page_shift;
> >>>>  	u64 offset;		/* in pages */
> >>>>  	u64 size;		/* window size in pages */
> >>>> +	struct list_head iommu_tables;
> >>>>  	struct page *pages[0];
> >>>>  };
> >>>>  
> >>>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> >>>> index 0a21c8503974..936138b866e7 100644
> >>>> --- a/arch/powerpc/include/asm/kvm_ppc.h
> >>>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> >>>> @@ -163,6 +163,10 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
> >>>>  extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
> >>>>  			struct kvm_memory_slot *memslot, unsigned long porder);
> >>>>  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
> >>>> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
> >>>> +		struct vfio_group *group, struct iommu_group *grp);
> >>>> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
> >>>> +		struct vfio_group *group);
> >>>>  
> >>>>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> >>>>  				struct kvm_create_spapr_tce_64 *args);
> >>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >>>> index 810f74317987..4088da4a575f 100644
> >>>> --- a/include/uapi/linux/kvm.h
> >>>> +++ b/include/uapi/linux/kvm.h
> >>>> @@ -1068,6 +1068,7 @@ struct kvm_device_attr {
> >>>>  #define  KVM_DEV_VFIO_GROUP			1
> >>>>  #define   KVM_DEV_VFIO_GROUP_ADD			1
> >>>>  #define   KVM_DEV_VFIO_GROUP_DEL			2
> >>>> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
> >>>>  
> >>>>  enum kvm_device_type {
> >>>>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
> >>>> @@ -1089,6 +1090,13 @@ enum kvm_device_type {
> >>>>  	KVM_DEV_TYPE_MAX,
> >>>>  };
> >>>>  
> >>>> +struct kvm_vfio_spapr_tce {
> >>>> +	__u32	argsz;
> >>>> +	__u32	flags;
> >>>> +	__s32	groupfd;
> >>>> +	__s32	tablefd;
> >>>> +};
> >>>> +
> >>>>  /*
> >>>>   * ioctls for VM fds
> >>>>   */
> >>>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> >>>> index 15df8ae627d9..008c4aee4df6 100644
> >>>> --- a/arch/powerpc/kvm/book3s_64_vio.c
> >>>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> >>>> @@ -27,6 +27,10 @@
> >>>>  #include <linux/hugetlb.h>
> >>>>  #include <linux/list.h>
> >>>>  #include <linux/anon_inodes.h>
> >>>> +#include <linux/iommu.h>
> >>>> +#include <linux/file.h>
> >>>> +#include <linux/vfio.h>
> >>>> +#include <linux/module.h>
> >>>>  
> >>>>  #include <asm/tlbflush.h>
> >>>>  #include <asm/kvm_ppc.h>
> >>>> @@ -39,6 +43,20 @@
> >>>>  #include <asm/udbg.h>
> >>>>  #include <asm/iommu.h>
> >>>>  #include <asm/tce.h>
> >>>> +#include <asm/mmu_context.h>
> >>>> +
> >>>> +static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
> >>>> +{
> >>>> +	void (*fn)(struct vfio_group *);
> >>>> +
> >>>> +	fn = symbol_get(vfio_group_put_external_user);
> >>>> +	if (!fn)
> >>>
> >>> I think this should have a WARN_ON().  If the vfio module is gone
> >>> while you still have VFIO groups attached to a KVM table, something
> >>> has gone horribly wrong.
> >>>
> >>>> +		return;
> >>>> +
> >>>> +	fn(vfio_group);
> >>>> +
> >>>> +	symbol_put(vfio_group_put_external_user);
> >>>> +}
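(Editorial illustration, not part of the patch: a minimal sketch of the
WARN_ON() variant suggested above, changing only the symbol_get() failure path.)

	fn = symbol_get(vfio_group_put_external_user);
	if (WARN_ON(!fn))
		return;		/* vfio module gone with groups still attached */

	fn(vfio_group);
	symbol_put(vfio_group_put_external_user);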
> >>>>  
> >>>>  static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
> >>>>  {
> >>>> @@ -90,6 +108,99 @@ static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
> >>>>  	return ret;
> >>>>  }
> >>>>  
> >>>> +static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
> >>>> +{
> >>>> +	struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
> >>>> +			struct kvmppc_spapr_tce_iommu_table, rcu);
> >>>> +
> >>>> +	kfree(stit);
> >>>> +}
> >>>> +
> >>>> +static void kvm_spapr_tce_liobn_release_iommu_group(
> >>>> +		struct kvmppc_spapr_tce_table *stt,
> >>>> +		struct vfio_group *group)
> >>>> +{
> >>>> +	struct kvmppc_spapr_tce_iommu_table *stit, *tmp;
> >>>> +
> >>>> +	list_for_each_entry_safe(stit, tmp, &stt->iommu_tables, next) {
> >>>> +		if (group && (stit->group != group))
> >>>> +			continue;
> >>>> +
> >>>> +		list_del_rcu(&stit->next);
> >>>> +
> >>>> +		iommu_table_put(stit->tbl);
> >>>> +		kvm_vfio_group_put_external_user(stit->group);
> >>>> +
> >>>> +		call_rcu(&stit->rcu, kvm_spapr_tce_iommu_table_free);
> >>>> +	}
> >>>> +}
> >>>> +
> >>>> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
> >>>> +		struct vfio_group *group)
> >>>> +{
> >>>> +	struct kvmppc_spapr_tce_table *stt;
> >>>> +
> >>>> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list)
> >>>> +		kvm_spapr_tce_liobn_release_iommu_group(stt, group);
> >>>> +}
> >>>> +
> >>>> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
> >>>> +		struct vfio_group *group, struct iommu_group *grp)
> >>>
> >>> Isn't passing both the vfio_group and the iommu_group redundant?
> >>
> >> vfio_group struct is internal to the vfio driver and there is no API to get
> >> the iommu_group pointer from it.
> > 
> > But in the caller you *do* derive the group from the vfio group by
> > going via id (ugly, but workable I guess).  Why not fold that logic
> > into this function?
> 
> 
> I could; either way looks equally ugly-ish to me. Rework it?

Yes please, it makes more sense having the ugly lookup within the function.
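(Editorial illustration, not part of the patch: a rough sketch of folding the
lookup into the attach path so the caller only passes the vfio_group; the helper
name is made up, and it assumes a symbol_get() wrapper equivalent to
kvm_vfio_external_user_iommu_id() is visible on this side.)

	static struct iommu_group *kvm_spapr_tce_get_iommu_group(
			struct vfio_group *group)
	{
		int group_id = kvm_vfio_external_user_iommu_id(group);

		if (group_id < 0)
			return NULL;

		/* Caller does iommu_group_put() when done with it */
		return iommu_group_get_by_id(group_id);
	}

kvm_spapr_tce_attach_iommu_group() would then drop its iommu_group argument and
call this helper itself before looking up the iommu_table_group.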

> >>>> +{
> >>>> +	struct kvmppc_spapr_tce_table *stt = NULL;
> >>>> +	bool found = false;
> >>>> +	struct iommu_table *tbl = NULL;
> >>>> +	struct iommu_table_group *table_group;
> >>>> +	long i;
> >>>> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >>>> +	struct fd f;
> >>>> +
> >>>> +	f = fdget(tablefd);
> >>>> +	if (!f.file)
> >>>> +		return -EBADF;
> >>>> +
> >>>> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
> >>>> +		if (stt == f.file->private_data) {
> >>>> +			found = true;
> >>>> +			break;
> >>>> +		}
> >>>> +	}
> >>>> +
> >>>> +	fdput(f);
> >>>> +
> >>>> +	if (!found)
> >>>> +		return -ENODEV;
> >>>
> >>> Not entirely sure if ENODEV is the right error, but I can't
> >>> immediately think of a better one.
> 
> 
> btw I still do not have a better candidate, can I keep this as is?

Yes, that's ok.

> >>>
> >>>> +	table_group = iommu_group_get_iommudata(grp);
> >>>> +	if (!table_group)
> >>>> +		return -EFAULT;
> >>>
> >>> EFAULT is usually only returned when you pass a syscall a bad pointer,
> >>> which doesn't look to be the case here.  What situation does this
> >>> error path actually represent?
> >>
> >>
> >> "something went terribly wrong".
> > 
> > In that case there should be a WARN_ON().  As long as it's something
> > terribly wrong that can't be the user's fault.
> > 
> >>>> +
> >>>> +	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
> >>>> +		struct iommu_table *tbltmp = table_group->tables[i];
> >>>> +
> >>>> +		if (!tbltmp)
> >>>> +			continue;
> >>>> +
> >>>> +		if ((tbltmp->it_page_shift == stt->page_shift) &&
> >>>> +				(tbltmp->it_offset == stt->offset)) {
> >>>> +			tbl = tbltmp;
> >>>> +			break;
> >>>> +		}
> >>>> +	}
> >>>> +	if (!tbl)
> >>>> +		return -ENODEV;
> >>>> +
> >>>> +	iommu_table_get(tbl);
> >>>> +
> >>>> +	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
> >>>> +	stit->tbl = tbl;
> >>>> +	stit->group = group;
> >>>> +
> >>>> +	list_add_rcu(&stit->next, &stt->iommu_tables);
> >>>
> >>> Won't this add a separate stit entry for each group attached to the
> >>> LIOBN, even if those groups share a single hardware iommu table -
> >>> which is the likely case if those groups have all been put into the
> >>> same container?
> >>
> >>
> >> Correct. I am planning on optimizing this later.
> > 
> > Hmm, ok.
> > 
> >>>> +	return 0;
> >>>> +}
> >>>> +
> >>>>  static void release_spapr_tce_table(struct rcu_head *head)
> >>>>  {
> >>>>  	struct kvmppc_spapr_tce_table *stt = container_of(head,
> >>>> @@ -132,6 +243,8 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
> >>>>  
> >>>>  	list_del_rcu(&stt->list);
> >>>>  
> >>>> +	kvm_spapr_tce_liobn_release_iommu_group(stt, NULL /* release all */);
> >>>> +
> >>>>  	kvm_put_kvm(stt->kvm);
> >>>>  
> >>>>  	kvmppc_account_memlimit(
> >>>> @@ -181,6 +294,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> >>>>  	stt->offset = args->offset;
> >>>>  	stt->size = size;
> >>>>  	stt->kvm = kvm;
> >>>> +	INIT_LIST_HEAD_RCU(&stt->iommu_tables);
> >>>>  
> >>>>  	for (i = 0; i < npages; i++) {
> >>>>  		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
> >>>> @@ -209,11 +323,161 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> >>>>  	return ret;
> >>>>  }
> >>>>  
> >>>> +static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
> >>>> +		struct iommu_table *tbl, unsigned long entry)
> >>>> +{
> >>>> +	struct mm_iommu_table_group_mem_t *mem = NULL;
> >>>> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> >>>> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> >>>> +
> >>>> +	if (!pua)
> >>>> +		return H_HARDWARE;
> >>>> +
> >>>> +	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
> >>>> +	if (!mem)
> >>>> +		return H_HARDWARE;
> >>>
> >>> IIUC, this error represents trying to unmap a page from the vIOMMU,
> >>> and discovering that it wasn't preregistered in the first place, which
> >>> shouldn't happen.  So would a WARN_ON() make sense here as well as the
> >>> H_HARDWARE?
> >>>
> >>>> +	mm_iommu_mapped_dec(mem);
> >>>> +
> >>>> +	*pua = 0;
> >>>> +
> >>>> +	return H_SUCCESS;
> >>>> +}
> >>>> +
> >>>> +static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
> >>>> +		struct iommu_table *tbl, unsigned long entry)
> >>>> +{
> >>>> +	enum dma_data_direction dir = DMA_NONE;
> >>>> +	unsigned long hpa = 0;
> >>>> +
> >>>> +	if (iommu_tce_xchg(tbl, entry, &hpa, &dir))
> >>>> +		return H_HARDWARE;
> >>>> +
> >>>> +	if (dir == DMA_NONE)
> >>>> +		return H_SUCCESS;
> >>>> +
> >>>> +	return kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
> >>>> +}
> >>>> +
> >>>> +long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
> >>>> +		unsigned long entry, unsigned long gpa,
> >>>> +		enum dma_data_direction dir)
> >>>> +{
> >>>> +	long ret;
> >>>> +	unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> >>>> +	struct mm_iommu_table_group_mem_t *mem;
> >>>> +
> >>>> +	if (!pua)
> >>>> +		/* it_userspace allocation might be delayed */
> >>>> +		return H_TOO_HARD;
> >>>> +
> >>>> +	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
> >>>> +		return H_HARDWARE;
> >>>
> >>> This would represent the guest trying to map a bad GPA, yes?  In which
> >>> case H_HARDWARE doesn't seem right.  H_PARAMETER or H_PERMISSION, maybe.
> >>>
> >>>> +	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
> >>>> +	if (!mem)
> >>>> +		return H_HARDWARE;
> >>>
> >>> Here H_HARDWARE seems right. IIUC this represents the guest trying to
> >>> map an address which wasn't pre-registered.  That would indicate a bug
> >>> in qemu, which is hardware as far as the guest is concerned.
> >>>
> >>>> +
> >>>> +	if (mm_iommu_ua_to_hpa(mem, ua, &hpa))
> >>>> +		return H_HARDWARE;
> >>>
> >>> Not sure what this case represents.
> >>
> >> Preregistered memory not being able to translate a userspace address to a
> >> host physical address. In virtual mode it is a simple bounds check; in real
> >> mode it also includes vmalloc_to_phys() failure.
> > 
> > Ok.  This caller is virtual mode only, isn't it?  If we fail the
> > bounds check, that sounds like a WARN_ON() + H_HARDWARE, since it
> > means we've translated the GPA to an insane UA.
> > 
> > If the translation just fails, that sounds like an H_TOO_HARD.
> > 
> >>>> +	if (mm_iommu_mapped_inc(mem))
> >>>> +		return H_HARDWARE;
> >>>
> >>> Or this.
> >>
> >> A preregistered memory area is in the process of being disposed of, so no
> >> new mappings are allowed.
> > 
> > Ok, again under control of qemu, so H_HARDWARE is reasonable.
> > 
> >>>> +	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
> >>>> +	if (ret) {
> >>>> +		mm_iommu_mapped_dec(mem);
> >>>> +		return H_TOO_HARD;
> >>>> +	}
> >>>> +
> >>>> +	if (dir != DMA_NONE)
> >>>> +		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
> >>>> +
> >>>> +	*pua = ua;
> >>>> +
> >>>> +	return 0;
> >>>> +}
> >>>> +
> >>>> +long kvmppc_h_put_tce_iommu(struct kvm_vcpu *vcpu,
> >>>> +		struct iommu_table *tbl,
> >>>> +		unsigned long liobn, unsigned long ioba,
> >>>> +		unsigned long tce)
> >>>> +{
> >>>> +	long idx, ret = H_HARDWARE;
> >>>> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> >>>> +	const unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >>>> +	const enum dma_data_direction dir = iommu_tce_direction(tce);
> >>>> +
> >>>> +	/* Clear TCE */
> >>>> +	if (dir == DMA_NONE) {
> >>>> +		if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
> >>>> +			return H_PARAMETER;
> >>>> +
> >>>> +		return kvmppc_tce_iommu_unmap(vcpu->kvm, tbl, entry);
> >>>> +	}
> >>>> +
> >>>> +	/* Put TCE */
> >>>> +	if (iommu_tce_put_param_check(tbl, ioba, gpa))
> >>>> +		return H_PARAMETER;
> >>>> +
> >>>> +	idx = srcu_read_lock(&vcpu->kvm->srcu);
> >>>> +	ret = kvmppc_tce_iommu_map(vcpu->kvm, tbl, entry, gpa, dir);
> >>>> +	srcu_read_unlock(&vcpu->kvm->srcu, idx);
> >>>> +
> >>>> +	return ret;
> >>>> +}
> >>>> +
> >>>> +static long kvmppc_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
> >>>> +		struct iommu_table *tbl, unsigned long ioba,
> >>>> +		u64 __user *tces, unsigned long npages)
> >>>> +{
> >>>> +	unsigned long i, ret, tce, gpa;
> >>>> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> >>>> +
> >>>> +	for (i = 0; i < npages; ++i) {
> >>>> +		gpa = be64_to_cpu(tces[i]) & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >>>
> >>> IIUC this is the virtual mode version, not the real mode one.  In which
> >>> case you shouldn't be accessing tces[i] (a userspace pointer) directly
> >>> but should instead be using get_user().
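(Editorial illustration, not part of the patch: the parameter-check loop
reworked to read each TCE with get_user(), as suggested above; variable names
follow the surrounding function.)

	for (i = 0; i < npages; ++i) {
		if (get_user(tce, tces + i))
			return H_TOO_HARD;

		gpa = be64_to_cpu(tce) & ~(TCE_PCI_READ | TCE_PCI_WRITE);
		if (iommu_tce_put_param_check(tbl, ioba +
				(i << tbl->it_page_shift), gpa))
			return H_PARAMETER;
	}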
> >>>
> >>>> +		if (iommu_tce_put_param_check(tbl, ioba +
> >>>> +				(i << tbl->it_page_shift), gpa))
> >>>> +			return H_PARAMETER;
> >>>> +	}
> >>>> +
> >>>> +	for (i = 0; i < npages; ++i) {
> >>>> +		tce = be64_to_cpu(tces[i]);
> >>>> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >>>> +
> >>>> +		ret = kvmppc_tce_iommu_map(vcpu->kvm, tbl, entry + i, gpa,
> >>>> +				iommu_tce_direction(tce));
> >>>> +		if (ret != H_SUCCESS)
> >>>> +			return ret;
> >>>> +	}
> >>>> +
> >>>> +	return H_SUCCESS;
> >>>> +}
> >>>> +
> >>>> +long kvmppc_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
> >>>> +		struct iommu_table *tbl,
> >>>> +		unsigned long liobn, unsigned long ioba,
> >>>> +		unsigned long tce_value, unsigned long npages)
> >>>> +{
> >>>> +	unsigned long i;
> >>>> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> >>>> +
> >>>> +	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
> >>>> +		return H_PARAMETER;
> >>>> +
> >>>> +	for (i = 0; i < npages; ++i)
> >>>> +		kvmppc_tce_iommu_unmap(vcpu->kvm, tbl, entry + i);
> >>>> +
> >>>> +	return H_SUCCESS;
> >>>> +}
> >>>> +
> >>>>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> >>>>  		      unsigned long ioba, unsigned long tce)
> >>>>  {
> >>>>  	struct kvmppc_spapr_tce_table *stt;
> >>>>  	long ret;
> >>>> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >>>>  
> >>>>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
> >>>>  	/* 	    liobn, ioba, tce); */
> >>>> @@ -230,6 +494,12 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> >>>>  	if (ret != H_SUCCESS)
> >>>>  		return ret;
> >>>>  
> >>>> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> >>>
> >>> As noted above, AFAICT there is one stit per group, rather than per
> >>> backend IOMMU table, so if there are multiple groups in the same
> >>> container (and therefore attached to the same LIOBN), won't this mean
> >>> we duplicate this operation a bunch of times?
> >>>
> >>>> +		ret = kvmppc_h_put_tce_iommu(vcpu, stit->tbl, liobn, ioba, tce);
> >>>> +		if (ret != H_SUCCESS)
> >>>> +			return ret;
> >>>> +	}
> >>>> +
> >>>>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
> >>>>  
> >>>>  	return H_SUCCESS;
> >>>> @@ -245,6 +515,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>>>  	unsigned long entry, ua = 0;
> >>>>  	u64 __user *tces;
> >>>>  	u64 tce;
> >>>> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >>>>  
> >>>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >>>>  	if (!stt)
> >>>> @@ -272,6 +543,13 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>>>  	}
> >>>>  	tces = (u64 __user *) ua;
> >>>>  
> >>>> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> >>>> +		ret = kvmppc_h_put_tce_indirect_iommu(vcpu,
> >>>> +				stit->tbl, ioba, tces, npages);
> >>>> +		if (ret != H_SUCCESS)
> >>>> +			goto unlock_exit;
> >>>
> >>> Hmm, I don't suppose you could simplify things by not having a
> >>> put_tce_indirect() version of the whole backend iommu mapping
> >>> function, but just a single-TCE version, and instead looping across
> >>> the backend IOMMU tables as you put each indirect entry in?
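(Editorial illustration, not part of the patch: a rough sketch of the
restructuring suggested here, i.e. no _indirect_ backend helper at all, just the
single-TCE kvmppc_tce_iommu_map() applied to every backend table as each entry
is read; validation details are simplified.)

	for (i = 0; i < npages; ++i) {
		if (get_user(tce, tces + i)) {
			ret = H_TOO_HARD;
			goto unlock_exit;
		}
		tce = be64_to_cpu(tce);
		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);

		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
			ret = kvmppc_tce_iommu_map(vcpu->kvm, stit->tbl,
					entry + i, gpa, iommu_tce_direction(tce));
			if (ret != H_SUCCESS)
				goto unlock_exit;
		}

		kvmppc_tce_put(stt, entry + i, tce);
	}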
> >>>
> >>>> +	}
> >>>> +
> >>>>  	for (i = 0; i < npages; ++i) {
> >>>>  		if (get_user(tce, tces + i)) {
> >>>>  			ret = H_TOO_HARD;
> >>>> @@ -299,6 +577,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
> >>>>  {
> >>>>  	struct kvmppc_spapr_tce_table *stt;
> >>>>  	long i, ret;
> >>>> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >>>>  
> >>>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >>>>  	if (!stt)
> >>>> @@ -312,6 +591,13 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
> >>>>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
> >>>>  		return H_PARAMETER;
> >>>>  
> >>>> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> >>>> +		ret = kvmppc_h_stuff_tce_iommu(vcpu, stit->tbl, liobn, ioba,
> >>>> +				tce_value, npages);
> >>>> +		if (ret != H_SUCCESS)
> >>>> +			return ret;
> >>>> +	}
> >>>> +
> >>>>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
> >>>>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
> >>>>  
> >>>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> >>>> index 8a6834e6e1c8..4d6f01712a6d 100644
> >>>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> >>>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> >>>> @@ -190,11 +190,165 @@ static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
> >>>>  	return mm_iommu_lookup_rm(vcpu->kvm->mm, ua, size);
> >>>>  }
> >>>>  
> >>>> +static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm_vcpu *vcpu,
> >>>> +		struct iommu_table *tbl, unsigned long entry)
> >>>> +{
> >>>> +	struct mm_iommu_table_group_mem_t *mem = NULL;
> >>>> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> >>>> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> >>>> +
> >>>> +	if (!pua)
> >>>> +		return H_SUCCESS;
> >>>
> >>> What case is this?  Not being able to find the userspace entry doesn't
> >>> sound like a success.
> >>>
> >>>> +	pua = (void *) vmalloc_to_phys(pua);
> >>>> +	if (!pua)
> >>>> +		return H_SUCCESS;
> >>>
> >>> And again..
> >>>
> >>>> +	mem = kvmppc_rm_iommu_lookup(vcpu, *pua, pgsize);
> >>>> +	if (!mem)
> >>>> +		return H_HARDWARE;
> >>>
> >>> Should this have a WARN_ON?
> >>>
> >>>> +	mm_iommu_mapped_dec(mem);
> >>>> +
> >>>> +	*pua = 0;
> >>>> +
> >>>> +	return H_SUCCESS;
> >>>> +}
> >>>> +
> >>>> +static long kvmppc_rm_tce_iommu_unmap(struct kvm_vcpu *vcpu,
> >>>> +		struct iommu_table *tbl, unsigned long entry)
> >>>> +{
> >>>> +	enum dma_data_direction dir = DMA_NONE;
> >>>> +	unsigned long hpa = 0;
> >>>> +
> >>>> +	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
> >>>> +		return H_HARDWARE;
> >>>> +
> >>>> +	if (dir == DMA_NONE)
> >>>> +		return H_SUCCESS;
> >>>> +
> >>>> +	return kvmppc_rm_tce_iommu_mapped_dec(vcpu, tbl, entry);
> >>>> +}
> >>>> +
> >>>> +long kvmppc_rm_tce_iommu_map(struct kvm_vcpu *vcpu, struct iommu_table *tbl,
> >>>> +		unsigned long entry, unsigned long gpa,
> >>>> +		enum dma_data_direction dir)
> >>>> +{
> >>>> +	long ret;
> >>>> +	unsigned long hpa = 0, ua;
> >>>> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> >>>> +	struct mm_iommu_table_group_mem_t *mem;
> >>>> +
> >>>> +	if (!pua)
> >>>> +		/* it_userspace allocation might be delayed */
> >>>> +		return H_TOO_HARD;
> >>>> +
> >>>> +	if (kvmppc_gpa_to_ua(vcpu->kvm, gpa, &ua, NULL))
> >>>> +		return H_HARDWARE;
> >>>
> >>> Again H_HARDWARE doesn't seem right here.
> >>>
> >>>> +	mem = kvmppc_rm_iommu_lookup(vcpu, ua, 1ULL << tbl->it_page_shift);
> >>>> +	if (!mem)
> >>>> +		return H_HARDWARE;
> >>>> +
> >>>> +	if (mm_iommu_ua_to_hpa_rm(mem, ua, &hpa))
> >>>> +		return H_HARDWARE;
> >>>> +
> >>>> +	pua = (void *) vmalloc_to_phys(pua);
> >>>> +	if (!pua)
> >>>> +		return H_HARDWARE;
> >>>> +
> >>>> +	if (mm_iommu_mapped_inc(mem))
> >>>> +		return H_HARDWARE;
> >>>> +
> >>>> +	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> >>>> +	if (ret) {
> >>>> +		mm_iommu_mapped_dec(mem);
> >>>> +		return H_TOO_HARD;
> >>>> +	}
> >>>> +
> >>>> +	if (dir != DMA_NONE)
> >>>> +		kvmppc_rm_tce_iommu_mapped_dec(vcpu, tbl, entry);
> >>>> +
> >>>> +	*pua = ua;
> >>>> +
> >>>> +	return 0;
> >>>> +}
> >>>> +EXPORT_SYMBOL_GPL(kvmppc_rm_tce_iommu_map);
> >>>> +
> >>>> +static long kvmppc_rm_h_put_tce_iommu(struct kvm_vcpu *vcpu,
> >>>> +		struct iommu_table *tbl, unsigned long liobn,
> >>>> +		unsigned long ioba, unsigned long tce)
> >>>> +{
> >>>> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> >>>> +	const unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >>>> +	const enum dma_data_direction dir = iommu_tce_direction(tce);
> >>>> +
> >>>> +	/* Clear TCE */
> >>>> +	if (dir == DMA_NONE) {
> >>>> +		if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
> >>>> +			return H_PARAMETER;
> >>>> +
> >>>> +		return kvmppc_rm_tce_iommu_unmap(vcpu, tbl, entry);
> >>>> +	}
> >>>> +
> >>>> +	/* Put TCE */
> >>>> +	if (iommu_tce_put_param_check(tbl, ioba, gpa))
> >>>> +		return H_PARAMETER;
> >>>> +
> >>>> +	return kvmppc_rm_tce_iommu_map(vcpu, tbl, entry, gpa, dir);
> >>>> +}
> >>>> +
> >>>> +static long kvmppc_rm_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
> >>>> +		struct iommu_table *tbl, unsigned long ioba,
> >>>> +		u64 *tces, unsigned long npages)
> >>>> +{
> >>>> +	unsigned long i, ret, tce, gpa;
> >>>> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> >>>> +
> >>>> +	for (i = 0; i < npages; ++i) {
> >>>> +		gpa = be64_to_cpu(tces[i]) & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >>>> +
> >>>> +		if (iommu_tce_put_param_check(tbl, ioba +
> >>>> +				(i << tbl->it_page_shift), gpa))
> >>>> +			return H_PARAMETER;
> >>>> +	}
> >>>> +
> >>>> +	for (i = 0; i < npages; ++i) {
> >>>> +		tce = be64_to_cpu(tces[i]);
> >>>> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >>>> +
> >>>> +		ret = kvmppc_rm_tce_iommu_map(vcpu, tbl, entry + i, gpa,
> >>>> +				iommu_tce_direction(tce));
> >>>> +		if (ret != H_SUCCESS)
> >>>> +			return ret;
> >>>> +	}
> >>>> +
> >>>> +	return H_SUCCESS;
> >>>> +}
> >>>> +
> >>>> +static long kvmppc_rm_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
> >>>> +		struct iommu_table *tbl,
> >>>> +		unsigned long liobn, unsigned long ioba,
> >>>> +		unsigned long tce_value, unsigned long npages)
> >>>> +{
> >>>> +	unsigned long i;
> >>>> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> >>>> +
> >>>> +	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
> >>>> +		return H_PARAMETER;
> >>>> +
> >>>> +	for (i = 0; i < npages; ++i)
> >>>> +		kvmppc_rm_tce_iommu_unmap(vcpu, tbl, entry + i);
> >>>> +
> >>>> +	return H_SUCCESS;
> >>>> +}
> >>>> +
> >>>>  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> >>>>  		unsigned long ioba, unsigned long tce)
> >>>>  {
> >>>>  	struct kvmppc_spapr_tce_table *stt;
> >>>>  	long ret;
> >>>> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >>>>  
> >>>>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
> >>>>  	/* 	    liobn, ioba, tce); */
> >>>> @@ -211,6 +365,13 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> >>>>  	if (ret != H_SUCCESS)
> >>>>  		return ret;
> >>>>  
> >>>> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> >>>> +		ret = kvmppc_rm_h_put_tce_iommu(vcpu, stit->tbl,
> >>>> +				liobn, ioba, tce);
> >>>> +		if (ret != H_SUCCESS)
> >>>> +			return ret;
> >>>> +	}
> >>>> +
> >>>>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
> >>>>  
> >>>>  	return H_SUCCESS;
> >>>> @@ -278,6 +439,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>>>  		 * depend on hpt.
> >>>>  		 */
> >>>>  		struct mm_iommu_table_group_mem_t *mem;
> >>>> +		struct kvmppc_spapr_tce_iommu_table *stit;
> >>>>  
> >>>>  		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
> >>>>  			return H_TOO_HARD;
> >>>> @@ -285,6 +447,13 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>>>  		mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K);
> >>>>  		if (!mem || mm_iommu_ua_to_hpa_rm(mem, ua, &tces))
> >>>>  			return H_TOO_HARD;
> >>>> +
> >>>> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> >>>> +			ret = kvmppc_rm_h_put_tce_indirect_iommu(vcpu,
> >>>> +					stit->tbl, ioba, (u64 *)tces, npages);
> >>>> +			if (ret != H_SUCCESS)
> >>>> +				return ret;
> >>>> +		}
> >>>>  	} else {
> >>>>  		/*
> >>>>  		 * This is emulated devices case.
> >>>> @@ -334,6 +503,8 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
> >>>>  {
> >>>>  	struct kvmppc_spapr_tce_table *stt;
> >>>>  	long i, ret;
> >>>> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >>>> +
> >>>>  
> >>>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >>>>  	if (!stt)
> >>>> @@ -347,6 +518,13 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
> >>>>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
> >>>>  		return H_PARAMETER;
> >>>>  
> >>>> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> >>>> +		ret = kvmppc_rm_h_stuff_tce_iommu(vcpu, stit->tbl,
> >>>> +				liobn, ioba, tce_value, npages);
> >>>> +		if (ret != H_SUCCESS)
> >>>> +			return ret;
> >>>> +	}
> >>>> +
> >>>>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
> >>>>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
> >>>>  
> >>>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> >>>> index 70963c845e96..0e555ba998c0 100644
> >>>> --- a/arch/powerpc/kvm/powerpc.c
> >>>> +++ b/arch/powerpc/kvm/powerpc.c
> >>>> @@ -536,6 +536,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> >>>>  #ifdef CONFIG_PPC_BOOK3S_64
> >>>>  	case KVM_CAP_SPAPR_TCE:
> >>>>  	case KVM_CAP_SPAPR_TCE_64:
> >>>> +		/* fallthrough */
> >>>> +	case KVM_CAP_SPAPR_TCE_VFIO:
> >>>>  	case KVM_CAP_PPC_ALLOC_HTAB:
> >>>>  	case KVM_CAP_PPC_RTAS:
> >>>>  	case KVM_CAP_PPC_FIXUP_HCALL:
> >>>> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> >>>> index d32f239eb471..3181054c8ff7 100644
> >>>> --- a/virt/kvm/vfio.c
> >>>> +++ b/virt/kvm/vfio.c
> >>>> @@ -20,6 +20,10 @@
> >>>>  #include <linux/vfio.h>
> >>>>  #include "vfio.h"
> >>>>  
> >>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>> +#include <asm/kvm_ppc.h>
> >>>> +#endif
> >>>> +
> >>>>  struct kvm_vfio_group {
> >>>>  	struct list_head node;
> >>>>  	struct vfio_group *vfio_group;
> >>>> @@ -89,6 +93,22 @@ static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
> >>>>  	return ret > 0;
> >>>>  }
> >>>>  
> >>>> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> >>>> +{
> >>>> +	int (*fn)(struct vfio_group *);
> >>>> +	int ret = -1;
> >>>> +
> >>>> +	fn = symbol_get(vfio_external_user_iommu_id);
> >>>> +	if (!fn)
> >>>> +		return ret;
> >>>> +
> >>>> +	ret = fn(vfio_group);
> >>>> +
> >>>> +	symbol_put(vfio_external_user_iommu_id);
> >>>> +
> >>>> +	return ret;
> >>>> +}
> >>>> +
> >>>>  /*
> >>>>   * Groups can use the same or different IOMMU domains.  If the same then
> >>>>   * adding a new group may change the coherency of groups we've previously
> >>>> @@ -211,6 +231,9 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >>>>  
> >>>>  		mutex_unlock(&kv->lock);
> >>>>  
> >>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>> +		kvm_spapr_tce_release_iommu_group(dev->kvm, vfio_group);
> >>>> +#endif
> >>>>  		kvm_vfio_group_set_kvm(vfio_group, NULL);
> >>>>  
> >>>>  		kvm_vfio_group_put_external_user(vfio_group);
> >>>> @@ -218,6 +241,65 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >>>>  		kvm_vfio_update_coherency(dev);
> >>>>  
> >>>>  		return ret;
> >>>> +
> >>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
> >>>> +		struct kvm_vfio_spapr_tce param;
> >>>> +		unsigned long minsz;
> >>>> +		struct kvm_vfio *kv = dev->private;
> >>>> +		struct vfio_group *vfio_group;
> >>>> +		struct kvm_vfio_group *kvg;
> >>>> +		struct fd f;
> >>>> +
> >>>> +		minsz = offsetofend(struct kvm_vfio_spapr_tce, tablefd);
> >>>> +
> >>>> +		if (copy_from_user(&param, (void __user *)arg, minsz))
> >>>> +			return -EFAULT;
> >>>> +
> >>>> +		if (param.argsz < minsz || param.flags)
> >>>> +			return -EINVAL;
> >>>> +
> >>>> +		f = fdget(param.groupfd);
> >>>> +		if (!f.file)
> >>>> +			return -EBADF;
> >>>> +
> >>>> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
> >>>> +		fdput(f);
> >>>> +
> >>>> +		if (IS_ERR(vfio_group))
> >>>> +			return PTR_ERR(vfio_group);
> >>>> +
> >>>> +		ret = -ENOENT;
> >>>> +
> >>>> +		mutex_lock(&kv->lock);
> >>>> +
> >>>> +		list_for_each_entry(kvg, &kv->group_list, node) {
> >>>> +			int group_id;
> >>>> +			struct iommu_group *grp;
> >>>> +
> >>>> +			if (kvg->vfio_group != vfio_group)
> >>>> +				continue;
> >>>> +
> >>>> +			group_id = kvm_vfio_external_user_iommu_id(
> >>>> +					kvg->vfio_group);
> >>>> +			grp = iommu_group_get_by_id(group_id);
> >>>> +			if (!grp) {
> >>>> +				ret = -EFAULT;
> >>>> +				break;
> >>>> +			}
> >>>> +
> >>>> +			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
> >>>> +					param.tablefd, vfio_group, grp);
> >>>> +
> >>>> +			iommu_group_put(grp);
> >>>> +			break;
> >>>> +		}
> >>>> +
> >>>> +		mutex_unlock(&kv->lock);
> >>>> +
> >>>> +		return ret;
> >>>> +	}
> >>>> +#endif /* CONFIG_SPAPR_TCE_IOMMU */
> >>>>  	}
> >>>
> >>> Don't you also need to add something to the KVM_DEV_VFIO_GROUP_DEL
> >>> path to detach the group from all LIOBNs?  Or else just fail if
> >>> there are LIOBNs attached.  I think it would be a qemu bug not to
> >>> detach the LIOBNs before removing the group, but we still need to
> >>> protect the host in that case.
> >>
> >>
> >> Yeah, this bit is a bit tricky/ugly.
> >>
> >> kvm_spapr_tce_liobn_release_iommu_group() (called from
> >> kvm_spapr_tce_fops::release()) drops references when a group is removed
> >> from the VFIO KVM device so there is no KVM_DEV_VFIO_GROUP_UNSET_SPAPR_TCE
> >> and no action from QEMU is required.
> > 
> > IF qemu simply closes the group fd.  Which it does now, but might not
> > always.  You still need to deal with the case where userspace does a
> > KVM_DEV_VFIO_GROUP_DEL instead of closing the group fd.
> 
> 
> This patch adds kvm_spapr_tce_release_iommu_group() to the
> KVM_DEV_VFIO_GROUP_DEL handler.

It does? Oh.. I didn't spot that.

> 
> 
> 
> > 
> >>
> >>
> >>
> >>>>  
> >>>>  	return -ENXIO;
> >>>> @@ -242,6 +324,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
> >>>>  		switch (attr->attr) {
> >>>>  		case KVM_DEV_VFIO_GROUP_ADD:
> >>>>  		case KVM_DEV_VFIO_GROUP_DEL:
> >>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>> +		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
> >>>> +#endif
> >>>>  			return 0;
> >>>>  		}
> >>>>  
> >>>> @@ -257,6 +342,9 @@ static void kvm_vfio_destroy(struct kvm_device *dev)
> >>>>  	struct kvm_vfio_group *kvg, *tmp;
> >>>>  
> >>>>  	list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
> >>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>> +		kvm_spapr_tce_release_iommu_group(dev->kvm, kvg->vfio_group);
> >>>> +#endif
> >>>>  		kvm_vfio_group_set_kvm(kvg->vfio_group, NULL);
> >>>>  		kvm_vfio_group_put_external_user(kvg->vfio_group);
> >>>>  		list_del(&kvg->node);
> >>>
> >>
> >>
> > 
> > 
> > 
> > 
> 
> 




-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

end of thread, other threads:[~2017-01-13  2:39 UTC | newest]

Thread overview: 56+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-12-18  1:28 [PATCH kernel v2 00/11] powerpc/kvm/vfio: Enable in-kernel acceleration Alexey Kardashevskiy
2016-12-18  1:28 ` Alexey Kardashevskiy
2016-12-18  1:28 ` [PATCH kernel v2 01/11] KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number Alexey Kardashevskiy
2016-12-18  1:28   ` Alexey Kardashevskiy
2016-12-18  1:28 ` [PATCH kernel v2 02/11] powerpc/iommu: Cleanup iommu_table disposal Alexey Kardashevskiy
2016-12-18  1:28   ` Alexey Kardashevskiy
2016-12-18  1:28 ` [PATCH kernel v2 03/11] powerpc/vfio_spapr_tce: Add reference counting to iommu_table Alexey Kardashevskiy
2016-12-18  1:28   ` Alexey Kardashevskiy
2016-12-18  1:28 ` [PATCH kernel v2 04/11] powerpc/mmu: Add real mode support for IOMMU preregistered memory Alexey Kardashevskiy
2016-12-18  1:28   ` Alexey Kardashevskiy
2016-12-18  1:28 ` [PATCH kernel v2 05/11] KVM: PPC: Use preregistered memory API to access TCE list Alexey Kardashevskiy
2016-12-18  1:28   ` Alexey Kardashevskiy
2016-12-21  4:08   ` David Gibson
2016-12-21  4:08     ` David Gibson
2016-12-21  8:57     ` Alexey Kardashevskiy
2016-12-21  8:57       ` Alexey Kardashevskiy
2017-01-11  6:35       ` Alexey Kardashevskiy
2017-01-11  6:35         ` Alexey Kardashevskiy
2017-01-12  5:49         ` David Gibson
2017-01-12  5:49           ` David Gibson
2016-12-18  1:28 ` [PATCH kernel v2 06/11] powerpc/powernv/iommu: Add real mode version of iommu_table_ops::exchange() Alexey Kardashevskiy
2016-12-18  1:28   ` Alexey Kardashevskiy
2016-12-18  1:28 ` [PATCH kernel v2 07/11] KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently Alexey Kardashevskiy
2016-12-18  1:28   ` Alexey Kardashevskiy
2016-12-18  1:28 ` [PATCH kernel v2 08/11] KVM: PPC: Pass kvm* to kvmppc_find_table() Alexey Kardashevskiy
2016-12-18  1:28   ` Alexey Kardashevskiy
2016-12-18  1:28 ` [PATCH kernel v2 09/11] vfio iommu: Add helpers to (un)register blocking notifiers per group Alexey Kardashevskiy
2016-12-18  1:28   ` Alexey Kardashevskiy
2016-12-21  6:04   ` David Gibson
2016-12-21  6:04     ` David Gibson
2016-12-22  1:25     ` Alexey Kardashevskiy
2016-12-22  1:25       ` Alexey Kardashevskiy
2016-12-18  1:28 ` [PATCH kernel v2 10/11] vfio: Check for unregistered notifiers when group is actually released Alexey Kardashevskiy
2016-12-18  1:28   ` Alexey Kardashevskiy
2016-12-19 10:41   ` Jike Song
2016-12-19 10:41     ` Jike Song
2016-12-19 16:28     ` Alex Williamson
2016-12-19 16:28       ` Alex Williamson
2016-12-19 16:28       ` Alex Williamson
2016-12-19 22:41       ` Alexey Kardashevskiy
2016-12-19 22:41         ` Alexey Kardashevskiy
2016-12-18  1:29 ` [PATCH kernel v2 11/11] KVM: PPC: Add in-kernel acceleration for VFIO Alexey Kardashevskiy
2016-12-18  1:29   ` Alexey Kardashevskiy
2016-12-20  6:52   ` [PATCH kernel v3] " Alexey Kardashevskiy
2016-12-20  6:52     ` Alexey Kardashevskiy
2016-12-20  6:52     ` Alexey Kardashevskiy
2017-01-12  5:04     ` David Gibson
2017-01-12  5:04       ` David Gibson
2017-01-12  8:09       ` Alexey Kardashevskiy
2017-01-12  8:09         ` Alexey Kardashevskiy
2017-01-12 23:53         ` David Gibson
2017-01-12 23:53           ` David Gibson
2017-01-13  2:23           ` Alexey Kardashevskiy
2017-01-13  2:23             ` Alexey Kardashevskiy
2017-01-13  2:38             ` David Gibson
2017-01-13  2:38               ` David Gibson
