* [PATCH kernel 0/9] powerpc/kvm/vfio: Enable in-kernel acceleration
@ 2016-12-08  8:19 ` Alexey Kardashevskiy
  0 siblings, 0 replies; 47+ messages in thread
From: Alexey Kardashevskiy @ 2016-12-08  8:19 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

This is my current queue of patches to add acceleration of TCE
updates in KVM.

This is based on the "next" branch of
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git

I am not including a changelog here: it has been 4 months since the last
respin and I am sure everybody has lost the context anyway. I tried to be
as detailed as I could in the very last patch; the others are
pretty trivial.

Please comment. Thanks.


Alexey Kardashevskiy (9):
  KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number
  powerpc/iommu: Cleanup iommu_table disposal
  powerpc/vfio_spapr_tce: Add reference counting to iommu_table
  powerpc/mmu: Add real mode support for IOMMU preregistered memory
  KVM: PPC: Use preregistered memory API to access TCE list
  powerpc/powernv/iommu: Add real mode version of
    iommu_table_ops::exchange()
  KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently
  KVM: PPC: Pass kvm* to kvmppc_find_table()
  KVM: PPC: Add in-kernel acceleration for VFIO

 Documentation/virtual/kvm/devices/vfio.txt |  21 +-
 arch/powerpc/include/asm/iommu.h           |  12 +-
 arch/powerpc/include/asm/kvm_host.h        |   8 +
 arch/powerpc/include/asm/kvm_ppc.h         |   7 +-
 arch/powerpc/include/asm/mmu_context.h     |   4 +
 include/uapi/linux/kvm.h                   |   9 +
 arch/powerpc/kernel/iommu.c                |  49 ++++-
 arch/powerpc/kvm/book3s_64_vio.c           | 309 ++++++++++++++++++++++++++++-
 arch/powerpc/kvm/book3s_64_vio_hv.c        | 256 ++++++++++++++++++++++--
 arch/powerpc/kvm/powerpc.c                 |   2 +
 arch/powerpc/mm/mmu_context_iommu.c        |  39 ++++
 arch/powerpc/platforms/powernv/pci-ioda.c  |  42 +++-
 arch/powerpc/platforms/powernv/pci.c       |   1 +
 arch/powerpc/platforms/pseries/iommu.c     |   3 +-
 arch/powerpc/platforms/pseries/vio.c       |   2 +-
 drivers/vfio/vfio_iommu_spapr_tce.c        |   2 +-
 virt/kvm/vfio.c                            | 108 ++++++++++
 arch/powerpc/kvm/Kconfig                   |   1 +
 18 files changed, 828 insertions(+), 47 deletions(-)

-- 
2.11.0



* [PATCH kernel 1/9] KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number
  2016-12-08  8:19 ` Alexey Kardashevskiy
@ 2016-12-08  8:19   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 47+ messages in thread
From: Alexey Kardashevskiy @ 2016-12-08  8:19 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

This adds a capability number for in-kernel support for VFIO on
the SPAPR platform.

The capability tells user space whether the in-kernel handlers of
H_PUT_TCE can handle VFIO-targeted requests. If they cannot, user space
must not attempt to allocate a TCE table in the host kernel via
the KVM_CREATE_SPAPR_TCE KVM ioctl, because in that case TCE requests
would not be passed to user space, and passing them there is the desired
action in a situation like that.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 include/uapi/linux/kvm.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 300ef255d1e0..810f74317987 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -870,6 +870,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_S390_USER_INSTR0 130
 #define KVM_CAP_MSI_DEVID 131
 #define KVM_CAP_PPC_HTM 132
+#define KVM_CAP_SPAPR_TCE_VFIO 133
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.11.0



* [PATCH kernel 2/9] powerpc/iommu: Cleanup iommu_table disposal
  2016-12-08  8:19 ` Alexey Kardashevskiy
@ 2016-12-08  8:19   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 47+ messages in thread
From: Alexey Kardashevskiy @ 2016-12-08  8:19 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

At the moment an iommu_table can be disposed of either by calling
iommu_free_table() directly or via it_ops::free(); the only implementation
of free() is in IODA2 - pnv_ioda2_table_free() - and it calls
iommu_free_table() anyway.

As we are going to have reference counting on tables, we need a unified
way of disposing of tables.

This moves the it_ops::free() call into iommu_free_table() and makes
callers use the latter. The free() callback now handles only
platform-specific data.

This should cause no behavioral change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/kernel/iommu.c               | 4 ++++
 arch/powerpc/platforms/powernv/pci-ioda.c | 6 ++----
 drivers/vfio/vfio_iommu_spapr_tce.c       | 2 +-
 3 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 5f202a566ec5..6744a2771769 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -719,6 +719,9 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 	if (!tbl)
 		return;
 
+	if (tbl->it_ops->free)
+		tbl->it_ops->free(tbl);
+
 	if (!tbl->it_map) {
 		kfree(tbl);
 		return;
@@ -745,6 +748,7 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 	/* free table */
 	kfree(tbl);
 }
+EXPORT_SYMBOL_GPL(iommu_free_table);
 
 /* Creates TCEs for a user provided buffer.  The user buffer must be
  * contiguous real kernel storage (not vmalloc).  The address passed here
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 5fcae29107e1..c4f9e812ca6c 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1422,7 +1422,6 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe
 		iommu_group_put(pe->table_group.group);
 		BUG_ON(pe->table_group.group);
 	}
-	pnv_pci_ioda2_table_free_pages(tbl);
 	iommu_free_table(tbl, of_node_full_name(dev->dev.of_node));
 }
 
@@ -2013,7 +2012,6 @@ static void pnv_ioda2_tce_free(struct iommu_table *tbl, long index,
 static void pnv_ioda2_table_free(struct iommu_table *tbl)
 {
 	pnv_pci_ioda2_table_free_pages(tbl);
-	iommu_free_table(tbl, "pnv");
 }
 
 static struct iommu_table_ops pnv_ioda2_iommu_ops = {
@@ -2339,7 +2337,7 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
 	if (rc) {
 		pe_err(pe, "Failed to configure 32-bit TCE table, err %ld\n",
 				rc);
-		pnv_ioda2_table_free(tbl);
+		iommu_free_table(tbl, "");
 		return rc;
 	}
 
@@ -2425,7 +2423,7 @@ static void pnv_ioda2_take_ownership(struct iommu_table_group *table_group)
 
 	pnv_pci_ioda2_set_bypass(pe, false);
 	pnv_pci_ioda2_unset_window(&pe->table_group, 0);
-	pnv_ioda2_table_free(tbl);
+	iommu_free_table(tbl, "pnv");
 }
 
 static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group)
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index c8823578a1b2..cbac08af400e 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -677,7 +677,7 @@ static void tce_iommu_free_table(struct tce_container *container,
 	unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT;
 
 	tce_iommu_userspace_view_free(tbl, container->mm);
-	tbl->it_ops->free(tbl);
+	iommu_free_table(tbl, "");
 	decrement_locked_vm(container->mm, pages);
 }
 
-- 
2.11.0
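
The disposal scheme this patch moves to can be modelled in user space as
below. This is a simplified sketch with made-up names (struct table,
table_dispose(), platform_free()), not the kernel code: one generic entry
point runs the optional platform callback first and then frees the
generic part, so the callback only has to release what the platform
allocated.

```c
#include <assert.h>
#include <stdlib.h>

struct table;

/* Hypothetical stand-in for struct iommu_table_ops; only free() is modelled. */
struct table_ops {
	void (*free)(struct table *tbl);	/* optional, platform-specific */
};

/* Hypothetical stand-in for struct iommu_table. */
struct table {
	const struct table_ops *ops;
	void *platform_pages;			/* owned by the platform code */
};

static int platform_free_calls;

/* Platform callback: releases only what the platform allocated. */
static void platform_free(struct table *tbl)
{
	free(tbl->platform_pages);
	tbl->platform_pages = NULL;
	platform_free_calls++;
}

static const struct table_ops platform_ops = { .free = platform_free };
static const struct table_ops no_free_ops = { .free = NULL };

/* Allocates a table; pages_sz == 0 means no platform-owned data. */
static struct table *table_alloc(const struct table_ops *ops, size_t pages_sz)
{
	struct table *tbl = calloc(1, sizeof(*tbl));

	if (!tbl)
		return NULL;
	tbl->ops = ops;
	if (pages_sz) {
		tbl->platform_pages = malloc(pages_sz);
		if (!tbl->platform_pages) {
			free(tbl);
			return NULL;
		}
	}
	return tbl;
}

/* Single disposal path: optional callback first, then the generic part. */
static void table_dispose(struct table *tbl)
{
	if (!tbl)
		return;
	assert(tbl->ops);
	if (tbl->ops->free)
		tbl->ops->free(tbl);
	free(tbl);
}
```

With a single entry point, adding reference counting later only needs to
touch table_dispose() rather than every platform callback.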



* [PATCH kernel 3/9] powerpc/vfio_spapr_tce: Add reference counting to iommu_table
  2016-12-08  8:19 ` Alexey Kardashevskiy
@ 2016-12-08  8:19   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 47+ messages in thread
From: Alexey Kardashevskiy @ 2016-12-08  8:19 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

So far iommu_table objects were only used in virtual mode and had
a single owner. We are going to change this by implementing in-kernel
acceleration of DMA mapping requests. The proposed acceleration
will handle requests in real mode and KVM will keep references to tables.

This adds a kref to iommu_table and defines new helpers to update it.
This replaces iommu_free_table() with iommu_table_put() and turns
iommu_free_table() into the static kref release callback
iommu_table_free(). iommu_table_get() is not used in this patch
but it will be in the following patch.

Since this touches prototypes, this also removes the @node_name parameter:
it has never been really useful on powernv, and carrying it through
the pseries platform code to iommu_free_table() seems to be quite
useless as well.

This should cause no behavioral change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/iommu.h          |  5 +++--
 arch/powerpc/kernel/iommu.c               | 24 +++++++++++++++++++-----
 arch/powerpc/platforms/powernv/pci-ioda.c | 14 +++++++-------
 arch/powerpc/platforms/powernv/pci.c      |  1 +
 arch/powerpc/platforms/pseries/iommu.c    |  3 ++-
 arch/powerpc/platforms/pseries/vio.c      |  2 +-
 drivers/vfio/vfio_iommu_spapr_tce.c       |  2 +-
 7 files changed, 34 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 2c1d50792944..9de8bad1fdf9 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -114,6 +114,7 @@ struct iommu_table {
 	struct list_head it_group_list;/* List of iommu_table_group_link */
 	unsigned long *it_userspace; /* userspace view of the table */
 	struct iommu_table_ops *it_ops;
+	struct kref    it_kref;
 };
 
 #define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \
@@ -146,8 +147,8 @@ static inline void *get_iommu_table_base(struct device *dev)
 
 extern int dma_iommu_dma_supported(struct device *dev, u64 mask);
 
-/* Frees table for an individual device node */
-extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
+extern void iommu_table_get(struct iommu_table *tbl);
+extern void iommu_table_put(struct iommu_table *tbl);
 
 /* Initializes an iommu_table based in values set in the passed-in
  * structure
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 6744a2771769..d12496889ce9 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -711,13 +711,13 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
 	return tbl;
 }
 
-void iommu_free_table(struct iommu_table *tbl, const char *node_name)
+static void iommu_table_free(struct kref *kref)
 {
 	unsigned long bitmap_sz;
 	unsigned int order;
+	struct iommu_table *tbl;
 
-	if (!tbl)
-		return;
+	tbl = container_of(kref, struct iommu_table, it_kref);
 
 	if (tbl->it_ops->free)
 		tbl->it_ops->free(tbl);
@@ -736,7 +736,7 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 
 	/* verify that table contains no entries */
 	if (!bitmap_empty(tbl->it_map, tbl->it_size))
-		pr_warn("%s: Unexpected TCEs for %s\n", __func__, node_name);
+		pr_warn("%s: Unexpected TCEs\n", __func__);
 
 	/* calculate bitmap size in bytes */
 	bitmap_sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
@@ -748,7 +748,21 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 	/* free table */
 	kfree(tbl);
 }
-EXPORT_SYMBOL_GPL(iommu_free_table);
+
+void iommu_table_get(struct iommu_table *tbl)
+{
+	kref_get(&tbl->it_kref);
+}
+EXPORT_SYMBOL_GPL(iommu_table_get);
+
+void iommu_table_put(struct iommu_table *tbl)
+{
+	if (!tbl)
+		return;
+
+	kref_put(&tbl->it_kref, iommu_table_free);
+}
+EXPORT_SYMBOL_GPL(iommu_table_put);
 
 /* Creates TCEs for a user provided buffer.  The user buffer must be
  * contiguous real kernel storage (not vmalloc).  The address passed here
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index c4f9e812ca6c..ea181f02bebd 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1422,7 +1422,7 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe
 		iommu_group_put(pe->table_group.group);
 		BUG_ON(pe->table_group.group);
 	}
-	iommu_free_table(tbl, of_node_full_name(dev->dev.of_node));
+	iommu_table_put(tbl);
 }
 
 static void pnv_ioda_release_vf_PE(struct pci_dev *pdev)
@@ -2197,7 +2197,7 @@ static void pnv_pci_ioda1_setup_dma_pe(struct pnv_phb *phb,
 		__free_pages(tce_mem, get_order(tce32_segsz * segs));
 	if (tbl) {
 		pnv_pci_unlink_table_and_group(tbl, &pe->table_group);
-		iommu_free_table(tbl, "pnv");
+		iommu_table_put(tbl);
 	}
 }
 
@@ -2291,7 +2291,7 @@ static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
 			bus_offset, page_shift, window_size,
 			levels, tbl);
 	if (ret) {
-		iommu_free_table(tbl, "pnv");
+		iommu_table_put(tbl);
 		return ret;
 	}
 
@@ -2337,7 +2337,7 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
 	if (rc) {
 		pe_err(pe, "Failed to configure 32-bit TCE table, err %ld\n",
 				rc);
-		iommu_free_table(tbl, "");
+		iommu_table_put(tbl);
 		return rc;
 	}
 
@@ -2423,7 +2423,7 @@ static void pnv_ioda2_take_ownership(struct iommu_table_group *table_group)
 
 	pnv_pci_ioda2_set_bypass(pe, false);
 	pnv_pci_ioda2_unset_window(&pe->table_group, 0);
-	iommu_free_table(tbl, "pnv");
+	iommu_table_put(tbl);
 }
 
 static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group)
@@ -3393,7 +3393,7 @@ static void pnv_pci_ioda1_release_pe_dma(struct pnv_ioda_pe *pe)
 	}
 
 	free_pages(tbl->it_base, get_order(tbl->it_size << 3));
-	iommu_free_table(tbl, "pnv");
+	iommu_table_put(tbl);
 }
 
 static void pnv_pci_ioda2_release_pe_dma(struct pnv_ioda_pe *pe)
@@ -3420,7 +3420,7 @@ static void pnv_pci_ioda2_release_pe_dma(struct pnv_ioda_pe *pe)
 	}
 
 	pnv_pci_ioda2_table_free_pages(tbl);
-	iommu_free_table(tbl, "pnv");
+	iommu_table_put(tbl);
 }
 
 static void pnv_ioda_free_pe_seg(struct pnv_ioda_pe *pe,
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index c6d554fe585c..471210913e42 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -767,6 +767,7 @@ struct iommu_table *pnv_pci_table_alloc(int nid)
 
 	tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, nid);
 	INIT_LIST_HEAD_RCU(&tbl->it_group_list);
+	kref_init(&tbl->it_kref);
 
 	return tbl;
 }
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index dc2577fc5fbb..47f0501a94f9 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -74,6 +74,7 @@ static struct iommu_table_group *iommu_pseries_alloc_group(int node)
 		goto fail_exit;
 
 	INIT_LIST_HEAD_RCU(&tbl->it_group_list);
+	kref_init(&tbl->it_kref);
 	tgl->table_group = table_group;
 	list_add_rcu(&tgl->next, &tbl->it_group_list);
 
@@ -115,7 +116,7 @@ static void iommu_pseries_free_group(struct iommu_table_group *table_group,
 		BUG_ON(table_group->group);
 	}
 #endif
-	iommu_free_table(tbl, node_name);
+	iommu_table_put(tbl);
 
 	kfree(table_group);
 }
diff --git a/arch/powerpc/platforms/pseries/vio.c b/arch/powerpc/platforms/pseries/vio.c
index 2c8fb3ec989e..41e8aa5c0d6a 100644
--- a/arch/powerpc/platforms/pseries/vio.c
+++ b/arch/powerpc/platforms/pseries/vio.c
@@ -1318,7 +1318,7 @@ static void vio_dev_release(struct device *dev)
 	struct iommu_table *tbl = get_iommu_table_base(dev);
 
 	if (tbl)
-		iommu_free_table(tbl, of_node_full_name(dev->of_node));
+		iommu_table_put(tbl);
 	of_node_put(dev->of_node);
 	kfree(to_vio_dev(dev));
 }
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index cbac08af400e..be37905012f0 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -677,7 +677,7 @@ static void tce_iommu_free_table(struct tce_container *container,
 	unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT;
 
 	tce_iommu_userspace_view_free(tbl, container->mm);
-	iommu_free_table(tbl, "");
+	iommu_table_put(tbl);
 	decrement_locked_vm(container->mm, pages);
 }
 
-- 
2.11.0
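
The get/put pattern this patch introduces can be illustrated with a small
user-space model. This is only a sketch under stated assumptions: struct
ref is a plain, non-atomic counter standing in for the kernel's struct
kref, container_of_model() stands in for container_of(), and the table_*
names are invented for the example. It shows the table being freed only
when the last owner drops its reference.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Minimal single-threaded model of struct kref (the real one is atomic). */
struct ref {
	int count;
};

/* Stand-in for the kernel's container_of(). */
#define container_of_model(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-in for struct iommu_table. */
struct table {
	int id;
	struct ref it_ref;	/* plays the role of it_kref */
};

static int tables_released;

static void ref_init(struct ref *r)
{
	r->count = 1;		/* the creator holds the first reference */
}

static void ref_get(struct ref *r)
{
	assert(r->count > 0);
	r->count++;
}

/* Drops one reference; calls release() when it was the last one. */
static void ref_put(struct ref *r, void (*release)(struct ref *r))
{
	assert(r->count > 0);
	if (--r->count == 0)
		release(r);
}

/* Release callback: recovers the table from its embedded counter. */
static void table_release(struct ref *r)
{
	struct table *tbl = container_of_model(r, struct table, it_ref);

	tables_released++;
	free(tbl);
}

static struct table *table_alloc(int id)
{
	struct table *tbl = calloc(1, sizeof(*tbl));

	if (!tbl)
		return NULL;
	tbl->id = id;
	ref_init(&tbl->it_ref);
	return tbl;
}

static void table_get(struct table *tbl)
{
	ref_get(&tbl->it_ref);
}

static void table_put(struct table *tbl)
{
	if (!tbl)
		return;
	ref_put(&tbl->it_ref, table_release);
}
```

In this model a second owner (KVM, in the series' terms) would call
table_get() when it starts using a table and table_put() when it is
done, and neither owner needs to know whether the other is still around.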


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH kernel 3/9] powerpc/vfio_spapr_tce: Add reference counting to iommu_table
@ 2016-12-08  8:19   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 47+ messages in thread
From: Alexey Kardashevskiy @ 2016-12-08  8:19 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

So far iommu_table obejcts were only used in virtual mode and had
a single owner. We are going to change this by implementing in-kernel
acceleration of DMA mapping requests. The proposed acceleration
will handle requests in real mode and KVM will keep references to tables.

This adds a kref to iommu_table and defines new helpers to update it.
This replaces iommu_free_table() with iommu_table_put() and makes
iommu_free_table() static. iommu_table_get() is not used in this patch
but it will be in the following patch.

Since this touches prototypes, this also removes @node_name parameter as
it has never been really useful on powernv and carrying it for
the pseries platform code to iommu_free_table() seems to be quite
useless as well.

This should cause no behavioral change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/iommu.h          |  5 +++--
 arch/powerpc/kernel/iommu.c               | 24 +++++++++++++++++++-----
 arch/powerpc/platforms/powernv/pci-ioda.c | 14 +++++++-------
 arch/powerpc/platforms/powernv/pci.c      |  1 +
 arch/powerpc/platforms/pseries/iommu.c    |  3 ++-
 arch/powerpc/platforms/pseries/vio.c      |  2 +-
 drivers/vfio/vfio_iommu_spapr_tce.c       |  2 +-
 7 files changed, 34 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 2c1d50792944..9de8bad1fdf9 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -114,6 +114,7 @@ struct iommu_table {
 	struct list_head it_group_list;/* List of iommu_table_group_link */
 	unsigned long *it_userspace; /* userspace view of the table */
 	struct iommu_table_ops *it_ops;
+	struct kref    it_kref;
 };
 
 #define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \
@@ -146,8 +147,8 @@ static inline void *get_iommu_table_base(struct device *dev)
 
 extern int dma_iommu_dma_supported(struct device *dev, u64 mask);
 
-/* Frees table for an individual device node */
-extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
+extern void iommu_table_get(struct iommu_table *tbl);
+extern void iommu_table_put(struct iommu_table *tbl);
 
 /* Initializes an iommu_table based in values set in the passed-in
  * structure
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 6744a2771769..d12496889ce9 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -711,13 +711,13 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
 	return tbl;
 }
 
-void iommu_free_table(struct iommu_table *tbl, const char *node_name)
+static void iommu_table_free(struct kref *kref)
 {
 	unsigned long bitmap_sz;
 	unsigned int order;
+	struct iommu_table *tbl;
 
-	if (!tbl)
-		return;
+	tbl = container_of(kref, struct iommu_table, it_kref);
 
 	if (tbl->it_ops->free)
 		tbl->it_ops->free(tbl);
@@ -736,7 +736,7 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 
 	/* verify that table contains no entries */
 	if (!bitmap_empty(tbl->it_map, tbl->it_size))
-		pr_warn("%s: Unexpected TCEs for %s\n", __func__, node_name);
+		pr_warn("%s: Unexpected TCEs\n", __func__);
 
 	/* calculate bitmap size in bytes */
 	bitmap_sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
@@ -748,7 +748,21 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 	/* free table */
 	kfree(tbl);
 }
-EXPORT_SYMBOL_GPL(iommu_free_table);
+
+void iommu_table_get(struct iommu_table *tbl)
+{
+	kref_get(&tbl->it_kref);
+}
+EXPORT_SYMBOL_GPL(iommu_table_get);
+
+void iommu_table_put(struct iommu_table *tbl)
+{
+	if (!tbl)
+		return;
+
+	kref_put(&tbl->it_kref, iommu_table_free);
+}
+EXPORT_SYMBOL_GPL(iommu_table_put);
 
 /* Creates TCEs for a user provided buffer.  The user buffer must be
  * contiguous real kernel storage (not vmalloc).  The address passed here
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index c4f9e812ca6c..ea181f02bebd 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1422,7 +1422,7 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe
 		iommu_group_put(pe->table_group.group);
 		BUG_ON(pe->table_group.group);
 	}
-	iommu_free_table(tbl, of_node_full_name(dev->dev.of_node));
+	iommu_table_put(tbl);
 }
 
 static void pnv_ioda_release_vf_PE(struct pci_dev *pdev)
@@ -2197,7 +2197,7 @@ static void pnv_pci_ioda1_setup_dma_pe(struct pnv_phb *phb,
 		__free_pages(tce_mem, get_order(tce32_segsz * segs));
 	if (tbl) {
 		pnv_pci_unlink_table_and_group(tbl, &pe->table_group);
-		iommu_free_table(tbl, "pnv");
+		iommu_table_put(tbl);
 	}
 }
 
@@ -2291,7 +2291,7 @@ static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
 			bus_offset, page_shift, window_size,
 			levels, tbl);
 	if (ret) {
-		iommu_free_table(tbl, "pnv");
+		iommu_table_put(tbl);
 		return ret;
 	}
 
@@ -2337,7 +2337,7 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
 	if (rc) {
 		pe_err(pe, "Failed to configure 32-bit TCE table, err %ld\n",
 				rc);
-		iommu_free_table(tbl, "");
+		iommu_table_put(tbl);
 		return rc;
 	}
 
@@ -2423,7 +2423,7 @@ static void pnv_ioda2_take_ownership(struct iommu_table_group *table_group)
 
 	pnv_pci_ioda2_set_bypass(pe, false);
 	pnv_pci_ioda2_unset_window(&pe->table_group, 0);
-	iommu_free_table(tbl, "pnv");
+	iommu_table_put(tbl);
 }
 
 static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group)
@@ -3393,7 +3393,7 @@ static void pnv_pci_ioda1_release_pe_dma(struct pnv_ioda_pe *pe)
 	}
 
 	free_pages(tbl->it_base, get_order(tbl->it_size << 3));
-	iommu_free_table(tbl, "pnv");
+	iommu_table_put(tbl);
 }
 
 static void pnv_pci_ioda2_release_pe_dma(struct pnv_ioda_pe *pe)
@@ -3420,7 +3420,7 @@ static void pnv_pci_ioda2_release_pe_dma(struct pnv_ioda_pe *pe)
 	}
 
 	pnv_pci_ioda2_table_free_pages(tbl);
-	iommu_free_table(tbl, "pnv");
+	iommu_table_put(tbl);
 }
 
 static void pnv_ioda_free_pe_seg(struct pnv_ioda_pe *pe,
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index c6d554fe585c..471210913e42 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -767,6 +767,7 @@ struct iommu_table *pnv_pci_table_alloc(int nid)
 
 	tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, nid);
 	INIT_LIST_HEAD_RCU(&tbl->it_group_list);
+	kref_init(&tbl->it_kref);
 
 	return tbl;
 }
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index dc2577fc5fbb..47f0501a94f9 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -74,6 +74,7 @@ static struct iommu_table_group *iommu_pseries_alloc_group(int node)
 		goto fail_exit;
 
 	INIT_LIST_HEAD_RCU(&tbl->it_group_list);
+	kref_init(&tbl->it_kref);
 	tgl->table_group = table_group;
 	list_add_rcu(&tgl->next, &tbl->it_group_list);
 
@@ -115,7 +116,7 @@ static void iommu_pseries_free_group(struct iommu_table_group *table_group,
 		BUG_ON(table_group->group);
 	}
 #endif
-	iommu_free_table(tbl, node_name);
+	iommu_table_put(tbl);
 
 	kfree(table_group);
 }
diff --git a/arch/powerpc/platforms/pseries/vio.c b/arch/powerpc/platforms/pseries/vio.c
index 2c8fb3ec989e..41e8aa5c0d6a 100644
--- a/arch/powerpc/platforms/pseries/vio.c
+++ b/arch/powerpc/platforms/pseries/vio.c
@@ -1318,7 +1318,7 @@ static void vio_dev_release(struct device *dev)
 	struct iommu_table *tbl = get_iommu_table_base(dev);
 
 	if (tbl)
-		iommu_free_table(tbl, of_node_full_name(dev->of_node));
+		iommu_table_put(tbl);
 	of_node_put(dev->of_node);
 	kfree(to_vio_dev(dev));
 }
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index cbac08af400e..be37905012f0 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -677,7 +677,7 @@ static void tce_iommu_free_table(struct tce_container *container,
 	unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT;
 
 	tce_iommu_userspace_view_free(tbl, container->mm);
-	iommu_free_table(tbl, "");
+	iommu_table_put(tbl);
 	decrement_locked_vm(container->mm, pages);
 }
 
-- 
2.11.0



* [PATCH kernel 4/9] powerpc/mmu: Add real mode support for IOMMU preregistered memory
  2016-12-08  8:19 ` Alexey Kardashevskiy
@ 2016-12-08  8:19   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 47+ messages in thread
From: Alexey Kardashevskiy @ 2016-12-08  8:19 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

This makes mm_iommu_lookup() able to work in real mode by replacing
list_for_each_entry_rcu() (which can run debug checks that can fail in
real mode) with list_for_each_entry_lockless().

This adds a real mode version of mm_iommu_ua_to_hpa() which adds an
explicit vmalloc'd-to-linear address conversion.
Unlike mm_iommu_ua_to_hpa(), mm_iommu_ua_to_hpa_rm() can fail.

This changes mm_iommu_preregistered() to receive @mm as in real mode
@current does not always have a correct pointer.

This adds a real mode version of mm_iommu_lookup() which receives @mm
(for the same reason as for mm_iommu_preregistered()) and uses
the lockless version of list_for_each_entry_rcu().

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
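For readers following the thread, the two new helpers can be modelled in
plain userspace C. This is only a sketch under stated assumptions, not the
kernel code: the sketch_* names, the 64K page size and the array of regions
(standing in for the lockless list walk and for vmalloc_to_phys()) are
made up for illustration.

```c
#include <errno.h>
#include <stddef.h>

/* Assumption: 64K pages; the kernel uses PAGE_SHIFT/PAGE_MASK instead. */
#define SKETCH_PAGE_SHIFT	16UL
#define SKETCH_PAGE_SIZE	(1UL << SKETCH_PAGE_SHIFT)
#define SKETCH_PAGE_MASK	(~(SKETCH_PAGE_SIZE - 1))

/* Hypothetical stand-in for struct mm_iommu_table_group_mem_t. */
struct sketch_mem {
	unsigned long ua;		/* userspace address of the region */
	unsigned long entries;		/* number of pages in the region */
	const unsigned long *hpas;	/* host physical address per page */
};

/* Return the first region fully containing [ua, ua + size), or NULL. */
static const struct sketch_mem *sketch_lookup(const struct sketch_mem *regions,
		size_t nregions, unsigned long ua, unsigned long size)
{
	size_t i;

	for (i = 0; i < nregions; ++i) {
		const struct sketch_mem *mem = &regions[i];

		if (mem->ua <= ua &&
		    ua + size <= mem->ua + (mem->entries << SKETCH_PAGE_SHIFT))
			return mem;
	}

	return NULL;
}

/* Translate ua to a host physical address, keeping the in-page offset. */
static long sketch_ua_to_hpa(const struct sketch_mem *mem, unsigned long ua,
		unsigned long *hpa)
{
	const unsigned long entry = (ua - mem->ua) >> SKETCH_PAGE_SHIFT;

	if (entry >= mem->entries)
		return -EFAULT;

	*hpa = mem->hpas[entry] | (ua & ~SKETCH_PAGE_MASK);

	return 0;
}
```

With one region of two pages at ua 0x10000000, translating ua 0x10010008
selects page 1 and keeps the in-page offset 0x8. A range that crosses the
end of the region is rejected by the lookup before any translation is
attempted.
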
 arch/powerpc/include/asm/mmu_context.h |  4 ++++
 arch/powerpc/mm/mmu_context_iommu.c    | 39 ++++++++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index b9e3f0aca261..c70c8272523d 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -29,10 +29,14 @@ extern void mm_iommu_init(struct mm_struct *mm);
 extern void mm_iommu_cleanup(struct mm_struct *mm);
 extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
 		unsigned long ua, unsigned long size);
+extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(
+		struct mm_struct *mm, unsigned long ua, unsigned long size);
 extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
 		unsigned long ua, unsigned long entries);
 extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
 		unsigned long ua, unsigned long *hpa);
+extern long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
+		unsigned long ua, unsigned long *hpa);
 extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem);
 extern void mm_iommu_mapped_dec(struct mm_iommu_table_group_mem_t *mem);
 #endif
diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index 104bad029ce9..631d32f5937b 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -314,6 +314,25 @@ struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
 }
 EXPORT_SYMBOL_GPL(mm_iommu_lookup);
 
+struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(struct mm_struct *mm,
+		unsigned long ua, unsigned long size)
+{
+	struct mm_iommu_table_group_mem_t *mem, *ret = NULL;
+
+	list_for_each_entry_lockless(mem, &mm->context.iommu_group_mem_list,
+			next) {
+		if ((mem->ua <= ua) &&
+				(ua + size <= mem->ua +
+				 (mem->entries << PAGE_SHIFT))) {
+			ret = mem;
+			break;
+		}
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(mm_iommu_lookup_rm);
+
 struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
 		unsigned long ua, unsigned long entries)
 {
@@ -345,6 +364,26 @@ long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
 }
 EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa);
 
+long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
+		unsigned long ua, unsigned long *hpa)
+{
+	const long entry = (ua - mem->ua) >> PAGE_SHIFT;
+	void *va = &mem->hpas[entry];
+	unsigned long *pa;
+
+	if (entry >= mem->entries)
+		return -EFAULT;
+
+	pa = (void *) vmalloc_to_phys(va);
+	if (!pa)
+		return -EFAULT;
+
+	*hpa = *pa | (ua & ~PAGE_MASK);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa_rm);
+
 long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem)
 {
 	if (atomic64_inc_not_zero(&mem->mapped))
-- 
2.11.0



* [PATCH kernel 4/9] powerpc/mmu: Add real mode support for IOMMU preregistered memory
@ 2016-12-08  8:19   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 47+ messages in thread
From: Alexey Kardashevskiy @ 2016-12-08  8:19 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

This makes mm_iommu_lookup() able to work in real mode by replacing
list_for_each_entry_rcu() (which can run debug checks that can fail in
real mode) with list_for_each_entry_lockless().

This adds a real mode version of mm_iommu_ua_to_hpa() which adds an
explicit vmalloc'd-to-linear address conversion.
Unlike mm_iommu_ua_to_hpa(), mm_iommu_ua_to_hpa_rm() can fail.

This changes mm_iommu_preregistered() to receive @mm as in real mode
@current does not always have a correct pointer.

This adds a real mode version of mm_iommu_lookup() which receives @mm
(for the same reason as for mm_iommu_preregistered()) and uses
the lockless version of list_for_each_entry_rcu().

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/mmu_context.h |  4 ++++
 arch/powerpc/mm/mmu_context_iommu.c    | 39 ++++++++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index b9e3f0aca261..c70c8272523d 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -29,10 +29,14 @@ extern void mm_iommu_init(struct mm_struct *mm);
 extern void mm_iommu_cleanup(struct mm_struct *mm);
 extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
 		unsigned long ua, unsigned long size);
+extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(
+		struct mm_struct *mm, unsigned long ua, unsigned long size);
 extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
 		unsigned long ua, unsigned long entries);
 extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
 		unsigned long ua, unsigned long *hpa);
+extern long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
+		unsigned long ua, unsigned long *hpa);
 extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem);
 extern void mm_iommu_mapped_dec(struct mm_iommu_table_group_mem_t *mem);
 #endif
diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index 104bad029ce9..631d32f5937b 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -314,6 +314,25 @@ struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
 }
 EXPORT_SYMBOL_GPL(mm_iommu_lookup);
 
+struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(struct mm_struct *mm,
+		unsigned long ua, unsigned long size)
+{
+	struct mm_iommu_table_group_mem_t *mem, *ret = NULL;
+
+	list_for_each_entry_lockless(mem, &mm->context.iommu_group_mem_list,
+			next) {
+		if ((mem->ua <= ua) &&
+				(ua + size <= mem->ua +
+				 (mem->entries << PAGE_SHIFT))) {
+			ret = mem;
+			break;
+		}
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(mm_iommu_lookup_rm);
+
 struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
 		unsigned long ua, unsigned long entries)
 {
@@ -345,6 +364,26 @@ long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
 }
 EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa);
 
+long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
+		unsigned long ua, unsigned long *hpa)
+{
+	const long entry = (ua - mem->ua) >> PAGE_SHIFT;
+	void *va = &mem->hpas[entry];
+	unsigned long *pa;
+
+	if (entry >= mem->entries)
+		return -EFAULT;
+
+	pa = (void *) vmalloc_to_phys(va);
+	if (!pa)
+		return -EFAULT;
+
+	*hpa = *pa | (ua & ~PAGE_MASK);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa_rm);
+
 long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem)
 {
 	if (atomic64_inc_not_zero(&mem->mapped))
-- 
2.11.0



* [PATCH kernel 5/9] KVM: PPC: Use preregistered memory API to access TCE list
  2016-12-08  8:19 ` Alexey Kardashevskiy
@ 2016-12-08  8:19   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 47+ messages in thread
From: Alexey Kardashevskiy @ 2016-12-08  8:19 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

VFIO on sPAPR already implements guest memory pre-registration
when the entire guest RAM gets pinned. This can be used to translate
the physical address of a guest page containing the TCE list
from H_PUT_TCE_INDIRECT.

This makes use of the pre-registered memory API to access TCE list
pages in order to avoid unnecessary locking on the KVM memory
reverse map. As all of guest memory is pinned and we have a flat
array mapping GPA to HPA, it is simpler and quicker to index into
that array (even with looking up the kernel page tables in
vmalloc_to_phys) than it is to find the memslot, lock the rmap entry,
look up the user page tables, and unlock the rmap entry. Note that
the rmap pointer is initialized to NULL where declared (not in this
patch).

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v2:
* updated the commit log with Paul's comment
---
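The locking consequence of the two paths can be sketched in userspace C.
This is an illustration under assumptions rather than the code in this
patch: all sketch_* names are hypothetical, the translation step is reduced
to a boolean, and the lock is a plain counter. It shows why the unlock at
the exit label has to be conditional once the preregistered path skips the
rmap entirely.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical return codes standing in for H_SUCCESS / H_TOO_HARD. */
enum sketch_ret { SKETCH_SUCCESS = 0, SKETCH_TOO_HARD = 1 };

/* Hypothetical rmap entry: tracks lock state for the illustration. */
struct sketch_rmap {
	int locked;
	int lock_count;
};

static void sketch_lock(struct sketch_rmap *rmap)
{
	rmap->locked = 1;
	rmap->lock_count++;
}

static void sketch_unlock(struct sketch_rmap *rmap)
{
	rmap->locked = 0;
}

static int sketch_put_tce_indirect(bool preregistered, bool translate_ok,
		struct sketch_rmap *slot_rmap)
{
	struct sketch_rmap *rmap = NULL;	/* stays NULL on the fast path */
	int ret = SKETCH_SUCCESS;

	if (preregistered) {
		/* Pinned memory: flat array lookup, no rmap lock taken. */
		if (!translate_ok)
			return SKETCH_TOO_HARD;
	} else {
		/* Emulated devices: hold the rmap lock while translating. */
		rmap = slot_rmap;
		sketch_lock(rmap);
		if (!translate_ok) {
			ret = SKETCH_TOO_HARD;
			goto unlock_exit;
		}
	}

	/* ... the TCE list would be processed here ... */

unlock_exit:
	if (rmap)
		sketch_unlock(rmap);

	return ret;
}
```

Calling the sketch with preregistered set never touches the lock counter,
while the other path always leaves the entry unlocked on return, including
on the failure path.
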
 arch/powerpc/kvm/book3s_64_vio_hv.c | 65 ++++++++++++++++++++++++++++---------
 1 file changed, 49 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index d461c440889a..a3be4bd6188f 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -180,6 +180,17 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
 EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
 
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+static inline bool kvmppc_preregistered(struct kvm_vcpu *vcpu)
+{
+	return mm_iommu_preregistered(vcpu->kvm->mm);
+}
+
+static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
+		struct kvm_vcpu *vcpu, unsigned long ua, unsigned long size)
+{
+	return mm_iommu_lookup_rm(vcpu->kvm->mm, ua, size);
+}
+
 long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		unsigned long ioba, unsigned long tce)
 {
@@ -260,23 +271,44 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	if (ret != H_SUCCESS)
 		return ret;
 
-	if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
-		return H_TOO_HARD;
+	if (kvmppc_preregistered(vcpu)) {
+		/*
+		 * We get here if guest memory was pre-registered which
+		 * is normally VFIO case and gpa->hpa translation does not
+		 * depend on hpt.
+		 */
+		struct mm_iommu_table_group_mem_t *mem;
 
-	rmap = (void *) vmalloc_to_phys(rmap);
+		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
+			return H_TOO_HARD;
 
-	/*
-	 * Synchronize with the MMU notifier callbacks in
-	 * book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
-	 * While we have the rmap lock, code running on other CPUs
-	 * cannot finish unmapping the host real page that backs
-	 * this guest real page, so we are OK to access the host
-	 * real page.
-	 */
-	lock_rmap(rmap);
-	if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
-		ret = H_TOO_HARD;
-		goto unlock_exit;
+		mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K);
+		if (!mem || mm_iommu_ua_to_hpa_rm(mem, ua, &tces))
+			return H_TOO_HARD;
+	} else {
+		/*
+		 * This is emulated devices case.
+		 * We do not require memory to be preregistered in this case
+		 * so lock rmap and do __find_linux_pte_or_hugepte().
+		 */
+		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
+			return H_TOO_HARD;
+
+		rmap = (void *) vmalloc_to_phys(rmap);
+
+		/*
+		 * Synchronize with the MMU notifier callbacks in
+		 * book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
+		 * While we have the rmap lock, code running on other CPUs
+		 * cannot finish unmapping the host real page that backs
+		 * this guest real page, so we are OK to access the host
+		 * real page.
+		 */
+		lock_rmap(rmap);
+		if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
+			ret = H_TOO_HARD;
+			goto unlock_exit;
+		}
 	}
 
 	for (i = 0; i < npages; ++i) {
@@ -290,7 +322,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	}
 
 unlock_exit:
-	unlock_rmap(rmap);
+	if (rmap)
+		unlock_rmap(rmap);
 
 	return ret;
 }
-- 
2.11.0



* [PATCH kernel 5/9] KVM: PPC: Use preregistered memory API to access TCE list
@ 2016-12-08  8:19   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 47+ messages in thread
From: Alexey Kardashevskiy @ 2016-12-08  8:19 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

VFIO on sPAPR already implements guest memory pre-registration
when the entire guest RAM gets pinned. This can be used to translate
the physical address of a guest page containing the TCE list
from H_PUT_TCE_INDIRECT.

This makes use of the pre-registered memory API to access TCE list
pages in order to avoid unnecessary locking on the KVM memory
reverse map. As all of guest memory is pinned and we have a flat
array mapping GPA to HPA, it is simpler and quicker to index into
that array (even with looking up the kernel page tables in
vmalloc_to_phys) than it is to find the memslot, lock the rmap entry,
look up the user page tables, and unlock the rmap entry. Note that
the rmap pointer is initialized to NULL where declared (not in this
patch).

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v2:
* updated the commit log with Paul's comment
---
 arch/powerpc/kvm/book3s_64_vio_hv.c | 65 ++++++++++++++++++++++++++++---------
 1 file changed, 49 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index d461c440889a..a3be4bd6188f 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -180,6 +180,17 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
 EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
 
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+static inline bool kvmppc_preregistered(struct kvm_vcpu *vcpu)
+{
+	return mm_iommu_preregistered(vcpu->kvm->mm);
+}
+
+static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
+		struct kvm_vcpu *vcpu, unsigned long ua, unsigned long size)
+{
+	return mm_iommu_lookup_rm(vcpu->kvm->mm, ua, size);
+}
+
 long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		unsigned long ioba, unsigned long tce)
 {
@@ -260,23 +271,44 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	if (ret != H_SUCCESS)
 		return ret;
 
-	if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
-		return H_TOO_HARD;
+	if (kvmppc_preregistered(vcpu)) {
+		/*
+		 * We get here if guest memory was pre-registered which
+		 * is normally VFIO case and gpa->hpa translation does not
+		 * depend on hpt.
+		 */
+		struct mm_iommu_table_group_mem_t *mem;
 
-	rmap = (void *) vmalloc_to_phys(rmap);
+		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
+			return H_TOO_HARD;
 
-	/*
-	 * Synchronize with the MMU notifier callbacks in
-	 * book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
-	 * While we have the rmap lock, code running on other CPUs
-	 * cannot finish unmapping the host real page that backs
-	 * this guest real page, so we are OK to access the host
-	 * real page.
-	 */
-	lock_rmap(rmap);
-	if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
-		ret = H_TOO_HARD;
-		goto unlock_exit;
+		mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K);
+		if (!mem || mm_iommu_ua_to_hpa_rm(mem, ua, &tces))
+			return H_TOO_HARD;
+	} else {
+		/*
+		 * This is emulated devices case.
+		 * We do not require memory to be preregistered in this case
+		 * so lock rmap and do __find_linux_pte_or_hugepte().
+		 */
+		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
+			return H_TOO_HARD;
+
+		rmap = (void *) vmalloc_to_phys(rmap);
+
+		/*
+		 * Synchronize with the MMU notifier callbacks in
+		 * book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
+		 * While we have the rmap lock, code running on other CPUs
+		 * cannot finish unmapping the host real page that backs
+		 * this guest real page, so we are OK to access the host
+		 * real page.
+		 */
+		lock_rmap(rmap);
+		if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
+			ret = H_TOO_HARD;
+			goto unlock_exit;
+		}
 	}
 
 	for (i = 0; i < npages; ++i) {
@@ -290,7 +322,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	}
 
 unlock_exit:
-	unlock_rmap(rmap);
+	if (rmap)
+		unlock_rmap(rmap);
 
 	return ret;
 }
-- 
2.11.0



* [PATCH kernel 6/9] powerpc/powernv/iommu: Add real mode version of iommu_table_ops::exchange()
  2016-12-08  8:19 ` Alexey Kardashevskiy
@ 2016-12-08  8:19   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 47+ messages in thread
From: Alexey Kardashevskiy @ 2016-12-08  8:19 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

In real mode, TCE tables are invalidated using special
cache-inhibited store instructions which are not available in
virtual mode.

This defines and implements the exchange_rm() callback. This does not
define set_rm/clear_rm/flush_rm callbacks as there is no user for those -
exchange/exchange_rm are only to be used by KVM for VFIO.

The exchange_rm callback is defined for IODA1/IODA2 powernv platforms.

This replaces list_for_each_entry_rcu with its lockless version as
from now on pnv_pci_ioda2_tce_invalidate() can be called in
real mode too.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
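The exchange-then-roll-back behaviour of iommu_tce_xchg_rm() can be
modelled in userspace C. This is a sketch under assumptions, not the kernel
implementation: the sketch_* names are hypothetical, the table is a plain
array, a 64K page size is assumed, and the realmode_pfn_to_page() lookup is
reduced to a bounds check on the page frame number.

```c
#include <errno.h>
#include <stdbool.h>

/* Assumption: 64K pages. */
#define SKETCH_PAGE_SHIFT	16UL

/* Hypothetical directions, ordered like enum dma_data_direction. */
enum sketch_dir {
	SKETCH_BIDIRECTIONAL = 0,
	SKETCH_TO_DEVICE = 1,
	SKETCH_FROM_DEVICE = 2,
	SKETCH_NONE = 3,
};

struct sketch_tce {
	unsigned long hpa;
	enum sketch_dir dir;
};

struct sketch_table {
	struct sketch_tce *tces;
	unsigned long nr_pages;	/* pfns below this have a backing page */
	bool *dirty;		/* one dirty flag per backing page */
};

/* Swap the caller's hpa/dir with the table entry at idx. */
static void sketch_exchange(struct sketch_table *tbl, long idx,
		unsigned long *hpa, enum sketch_dir *dir)
{
	struct sketch_tce old = tbl->tces[idx];

	tbl->tces[idx].hpa = *hpa;
	tbl->tces[idx].dir = *dir;
	*hpa = old.hpa;
	*dir = old.dir;
}

static long sketch_xchg(struct sketch_table *tbl, long idx,
		unsigned long *hpa, enum sketch_dir *dir)
{
	sketch_exchange(tbl, idx, hpa, dir);

	/* hpa and dir now describe the old entry the device may have written. */
	if (*dir == SKETCH_FROM_DEVICE || *dir == SKETCH_BIDIRECTIONAL) {
		unsigned long pfn = *hpa >> SKETCH_PAGE_SHIFT;

		if (pfn < tbl->nr_pages) {
			tbl->dirty[pfn] = true;
		} else {
			/* No page found: put the old entry back and fail. */
			sketch_exchange(tbl, idx, hpa, dir);
			return -EFAULT;
		}
	}

	return 0;
}
```

On success the caller receives the previous entry and the page it pointed
to is marked dirty. On failure the second exchange restores the table and
hands the caller back the values it passed in.
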
 arch/powerpc/include/asm/iommu.h          |  7 +++++++
 arch/powerpc/kernel/iommu.c               | 23 +++++++++++++++++++++++
 arch/powerpc/platforms/powernv/pci-ioda.c | 26 +++++++++++++++++++++++++-
 3 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 9de8bad1fdf9..82e77ebf85f4 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -64,6 +64,11 @@ struct iommu_table_ops {
 			long index,
 			unsigned long *hpa,
 			enum dma_data_direction *direction);
+	/* Real mode */
+	int (*exchange_rm)(struct iommu_table *tbl,
+			long index,
+			unsigned long *hpa,
+			enum dma_data_direction *direction);
 #endif
 	void (*clear)(struct iommu_table *tbl,
 			long index, long npages);
@@ -209,6 +214,8 @@ extern void iommu_del_device(struct device *dev);
 extern int __init tce_iommu_bus_notifier_init(void);
 extern long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
 		unsigned long *hpa, enum dma_data_direction *direction);
+extern long iommu_tce_xchg_rm(struct iommu_table *tbl, unsigned long entry,
+		unsigned long *hpa, enum dma_data_direction *direction);
 #else
 static inline void iommu_register_group(struct iommu_table_group *table_group,
 					int pci_domain_number,
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index d12496889ce9..d02b8d22fb50 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1022,6 +1022,29 @@ long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
 }
 EXPORT_SYMBOL_GPL(iommu_tce_xchg);
 
+long iommu_tce_xchg_rm(struct iommu_table *tbl, unsigned long entry,
+		unsigned long *hpa, enum dma_data_direction *direction)
+{
+	long ret;
+
+	ret = tbl->it_ops->exchange_rm(tbl, entry, hpa, direction);
+
+	if (!ret && ((*direction == DMA_FROM_DEVICE) ||
+			(*direction == DMA_BIDIRECTIONAL))) {
+		struct page *pg = realmode_pfn_to_page(*hpa >> PAGE_SHIFT);
+
+		if (likely(pg)) {
+			SetPageDirty(pg);
+		} else {
+			tbl->it_ops->exchange_rm(tbl, entry, hpa, direction);
+			ret = -EFAULT;
+		}
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_tce_xchg_rm);
+
 int iommu_take_ownership(struct iommu_table *tbl)
 {
 	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index ea181f02bebd..f2c2ab8fbb3e 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1855,6 +1855,17 @@ static int pnv_ioda1_tce_xchg(struct iommu_table *tbl, long index,
 
 	return ret;
 }
+
+static int pnv_ioda1_tce_xchg_rm(struct iommu_table *tbl, long index,
+		unsigned long *hpa, enum dma_data_direction *direction)
+{
+	long ret = pnv_tce_xchg(tbl, index, hpa, direction);
+
+	if (!ret)
+		pnv_pci_p7ioc_tce_invalidate(tbl, index, 1, true);
+
+	return ret;
+}
 #endif
 
 static void pnv_ioda1_tce_free(struct iommu_table *tbl, long index,
@@ -1869,6 +1880,7 @@ static struct iommu_table_ops pnv_ioda1_iommu_ops = {
 	.set = pnv_ioda1_tce_build,
 #ifdef CONFIG_IOMMU_API
 	.exchange = pnv_ioda1_tce_xchg,
+	.exchange_rm = pnv_ioda1_tce_xchg_rm,
 #endif
 	.clear = pnv_ioda1_tce_free,
 	.get = pnv_tce_get,
@@ -1943,7 +1955,7 @@ static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
 {
 	struct iommu_table_group_link *tgl;
 
-	list_for_each_entry_rcu(tgl, &tbl->it_group_list, next) {
+	list_for_each_entry_lockless(tgl, &tbl->it_group_list, next) {
 		struct pnv_ioda_pe *pe = container_of(tgl->table_group,
 				struct pnv_ioda_pe, table_group);
 		struct pnv_phb *phb = pe->phb;
@@ -1999,6 +2011,17 @@ static int pnv_ioda2_tce_xchg(struct iommu_table *tbl, long index,
 
 	return ret;
 }
+
+static int pnv_ioda2_tce_xchg_rm(struct iommu_table *tbl, long index,
+		unsigned long *hpa, enum dma_data_direction *direction)
+{
+	long ret = pnv_tce_xchg(tbl, index, hpa, direction);
+
+	if (!ret)
+		pnv_pci_ioda2_tce_invalidate(tbl, index, 1, true);
+
+	return ret;
+}
 #endif
 
 static void pnv_ioda2_tce_free(struct iommu_table *tbl, long index,
@@ -2018,6 +2041,7 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = {
 	.set = pnv_ioda2_tce_build,
 #ifdef CONFIG_IOMMU_API
 	.exchange = pnv_ioda2_tce_xchg,
+	.exchange_rm = pnv_ioda2_tce_xchg_rm,
 #endif
 	.clear = pnv_ioda2_tce_free,
 	.get = pnv_tce_get,
-- 
2.11.0



* [PATCH kernel 6/9] powerpc/powernv/iommu: Add real mode version of iommu_table_ops::exchange()
@ 2016-12-08  8:19   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 47+ messages in thread
From: Alexey Kardashevskiy @ 2016-12-08  8:19 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

In real mode, TCE tables are invalidated using special
cache-inhibited store instructions which are not available in
virtual mode.

This defines and implements the exchange_rm() callback. This does not
define set_rm/clear_rm/flush_rm callbacks as there is no user for those -
exchange/exchange_rm are only to be used by KVM for VFIO.

The exchange_rm callback is defined for IODA1/IODA2 powernv platforms.

This replaces list_for_each_entry_rcu with its lockless version as
from now on pnv_pci_ioda2_tce_invalidate() can be called in
real mode too.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/iommu.h          |  7 +++++++
 arch/powerpc/kernel/iommu.c               | 23 +++++++++++++++++++++++
 arch/powerpc/platforms/powernv/pci-ioda.c | 26 +++++++++++++++++++++++++-
 3 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 9de8bad1fdf9..82e77ebf85f4 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -64,6 +64,11 @@ struct iommu_table_ops {
 			long index,
 			unsigned long *hpa,
 			enum dma_data_direction *direction);
+	/* Real mode */
+	int (*exchange_rm)(struct iommu_table *tbl,
+			long index,
+			unsigned long *hpa,
+			enum dma_data_direction *direction);
 #endif
 	void (*clear)(struct iommu_table *tbl,
 			long index, long npages);
@@ -209,6 +214,8 @@ extern void iommu_del_device(struct device *dev);
 extern int __init tce_iommu_bus_notifier_init(void);
 extern long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
 		unsigned long *hpa, enum dma_data_direction *direction);
+extern long iommu_tce_xchg_rm(struct iommu_table *tbl, unsigned long entry,
+		unsigned long *hpa, enum dma_data_direction *direction);
 #else
 static inline void iommu_register_group(struct iommu_table_group *table_group,
 					int pci_domain_number,
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index d12496889ce9..d02b8d22fb50 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1022,6 +1022,29 @@ long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
 }
 EXPORT_SYMBOL_GPL(iommu_tce_xchg);
 
+long iommu_tce_xchg_rm(struct iommu_table *tbl, unsigned long entry,
+		unsigned long *hpa, enum dma_data_direction *direction)
+{
+	long ret;
+
+	ret = tbl->it_ops->exchange_rm(tbl, entry, hpa, direction);
+
+	if (!ret && ((*direction == DMA_FROM_DEVICE) ||
+			(*direction == DMA_BIDIRECTIONAL))) {
+		struct page *pg = realmode_pfn_to_page(*hpa >> PAGE_SHIFT);
+
+		if (likely(pg)) {
+			SetPageDirty(pg);
+		} else {
+			tbl->it_ops->exchange_rm(tbl, entry, hpa, direction);
+			ret = -EFAULT;
+		}
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_tce_xchg_rm);
+
 int iommu_take_ownership(struct iommu_table *tbl)
 {
 	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index ea181f02bebd..f2c2ab8fbb3e 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1855,6 +1855,17 @@ static int pnv_ioda1_tce_xchg(struct iommu_table *tbl, long index,
 
 	return ret;
 }
+
+static int pnv_ioda1_tce_xchg_rm(struct iommu_table *tbl, long index,
+		unsigned long *hpa, enum dma_data_direction *direction)
+{
+	long ret = pnv_tce_xchg(tbl, index, hpa, direction);
+
+	if (!ret)
+		pnv_pci_p7ioc_tce_invalidate(tbl, index, 1, true);
+
+	return ret;
+}
 #endif
 
 static void pnv_ioda1_tce_free(struct iommu_table *tbl, long index,
@@ -1869,6 +1880,7 @@ static struct iommu_table_ops pnv_ioda1_iommu_ops = {
 	.set = pnv_ioda1_tce_build,
 #ifdef CONFIG_IOMMU_API
 	.exchange = pnv_ioda1_tce_xchg,
+	.exchange_rm = pnv_ioda1_tce_xchg_rm,
 #endif
 	.clear = pnv_ioda1_tce_free,
 	.get = pnv_tce_get,
@@ -1943,7 +1955,7 @@ static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
 {
 	struct iommu_table_group_link *tgl;
 
-	list_for_each_entry_rcu(tgl, &tbl->it_group_list, next) {
+	list_for_each_entry_lockless(tgl, &tbl->it_group_list, next) {
 		struct pnv_ioda_pe *pe = container_of(tgl->table_group,
 				struct pnv_ioda_pe, table_group);
 		struct pnv_phb *phb = pe->phb;
@@ -1999,6 +2011,17 @@ static int pnv_ioda2_tce_xchg(struct iommu_table *tbl, long index,
 
 	return ret;
 }
+
+static int pnv_ioda2_tce_xchg_rm(struct iommu_table *tbl, long index,
+		unsigned long *hpa, enum dma_data_direction *direction)
+{
+	long ret = pnv_tce_xchg(tbl, index, hpa, direction);
+
+	if (!ret)
+		pnv_pci_ioda2_tce_invalidate(tbl, index, 1, true);
+
+	return ret;
+}
 #endif
 
 static void pnv_ioda2_tce_free(struct iommu_table *tbl, long index,
@@ -2018,6 +2041,7 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = {
 	.set = pnv_ioda2_tce_build,
 #ifdef CONFIG_IOMMU_API
 	.exchange = pnv_ioda2_tce_xchg,
+	.exchange_rm = pnv_ioda2_tce_xchg_rm,
 #endif
 	.clear = pnv_ioda2_tce_free,
 	.get = pnv_tce_get,
-- 
2.11.0



* [PATCH kernel 7/9] KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently
  2016-12-08  8:19 ` Alexey Kardashevskiy
@ 2016-12-08  8:19   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 47+ messages in thread
From: Alexey Kardashevskiy @ 2016-12-08  8:19 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

It does not make much sense to have KVM in book3s-64 and
not to have IOMMU bits for PCI passthrough support as it costs little
and allows VFIO to function on book3s KVM.

Having IOMMU_API always enabled makes it unnecessary to have a lot of
"#ifdef IOMMU_API" in arch/powerpc/kvm/book3s_64_vio*. With those
ifdefs we could have only user space emulated devices accelerated
(but not VFIO), which does not seem to be very useful.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/kvm/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 029be26b5a17..65a471de96de 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -67,6 +67,7 @@ config KVM_BOOK3S_64
 	select KVM_BOOK3S_64_HANDLER
 	select KVM
 	select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
+	select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
 	---help---
 	  Support running unmodified book3s_64 and book3s_32 guest kernels
 	  in virtual machines on book3s_64 host processors.
-- 
2.11.0



* [PATCH kernel 7/9] KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently
@ 2016-12-08  8:19   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 47+ messages in thread
From: Alexey Kardashevskiy @ 2016-12-08  8:19 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

It does not make much sense to have KVM in book3s-64 and
not to have IOMMU bits for PCI passthrough support as it costs little
and allows VFIO to function on book3s KVM.

Having IOMMU_API always enabled makes it unnecessary to have a lot of
"#ifdef IOMMU_API" in arch/powerpc/kvm/book3s_64_vio*. With those
ifdefs we could have only user space emulated devices accelerated
(but not VFIO), which does not seem to be very useful.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/kvm/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 029be26b5a17..65a471de96de 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -67,6 +67,7 @@ config KVM_BOOK3S_64
 	select KVM_BOOK3S_64_HANDLER
 	select KVM
 	select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
+	select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
 	---help---
 	  Support running unmodified book3s_64 and book3s_32 guest kernels
 	  in virtual machines on book3s_64 host processors.
-- 
2.11.0



* [PATCH kernel 8/9] KVM: PPC: Pass kvm* to kvmppc_find_table()
  2016-12-08  8:19 ` Alexey Kardashevskiy
@ 2016-12-08  8:19   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 47+ messages in thread
From: Alexey Kardashevskiy @ 2016-12-08  8:19 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

The guest view TCE tables are per KVM anyway (not per VCPU), so pass kvm*
to kvmppc_find_table() instead of kvm_vcpu*. This will be used in the
following patches, where we will be attaching VFIO containers to LIOBNs
via ioctl() to KVM (rather than to VCPU).
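
As an illustration only, the signature change can be modelled outside
the kernel as below. The struct and function names (tce_table, vm, vcpu,
find_table, find_table_from_vcpu) are simplified, hypothetical stand-ins
and not the real kernel layouts; the point is just that the lookup needs
only the VM, and a hypercall path that holds a VCPU can derive it.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel structures (assumed, not real layouts). */
struct tce_table {
	unsigned long liobn;
	struct tce_table *next;
};

struct vm {
	struct tce_table *tables;	/* one list per VM, shared by all VCPUs */
};

struct vcpu {
	struct vm *vm;
};

/* Lookup keyed on the VM, mirroring the new kvmppc_find_table() prototype. */
struct tce_table *find_table(struct vm *vm, unsigned long liobn)
{
	struct tce_table *t;

	for (t = vm->tables; t; t = t->next)
		if (t->liobn == liobn)
			return t;

	return NULL;
}

/* A hypercall-style caller holds a VCPU and simply passes its VM. */
struct tce_table *find_table_from_vcpu(struct vcpu *vcpu, unsigned long liobn)
{
	return find_table(vcpu->vm, liobn);
}
```

A caller that has no VCPU at hand, such as an ioctl() handler on the VM
fd, can then call the lookup directly with the VM pointer.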

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/kvm_ppc.h  |  2 +-
 arch/powerpc/kvm/book3s_64_vio.c    |  7 ++++---
 arch/powerpc/kvm/book3s_64_vio_hv.c | 13 +++++++------
 3 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index f6e49640dbe1..0a21c8503974 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -167,7 +167,7 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce_64 *args);
 extern struct kvmppc_spapr_tce_table *kvmppc_find_table(
-		struct kvm_vcpu *vcpu, unsigned long liobn);
+		struct kvm *kvm, unsigned long liobn);
 extern long kvmppc_ioba_validate(struct kvmppc_spapr_tce_table *stt,
 		unsigned long ioba, unsigned long npages);
 extern long kvmppc_tce_validate(struct kvmppc_spapr_tce_table *tt,
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index c379ff5a4438..15df8ae627d9 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -212,12 +212,13 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba, unsigned long tce)
 {
-	struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
+	struct kvmppc_spapr_tce_table *stt;
 	long ret;
 
 	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
 	/* 	    liobn, ioba, tce); */
 
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
@@ -245,7 +246,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	u64 __user *tces;
 	u64 tce;
 
-	stt = kvmppc_find_table(vcpu, liobn);
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
@@ -299,7 +300,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret;
 
-	stt = kvmppc_find_table(vcpu, liobn);
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index a3be4bd6188f..8a6834e6e1c8 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -49,10 +49,9 @@
  * WARNING: This will be called in real or virtual mode on HV KVM and virtual
  *          mode on PR KVM
  */
-struct kvmppc_spapr_tce_table *kvmppc_find_table(struct kvm_vcpu *vcpu,
+struct kvmppc_spapr_tce_table *kvmppc_find_table(struct kvm *kvm,
 		unsigned long liobn)
 {
-	struct kvm *kvm = vcpu->kvm;
 	struct kvmppc_spapr_tce_table *stt;
 
 	list_for_each_entry_lockless(stt, &kvm->arch.spapr_tce_tables, list)
@@ -194,12 +193,13 @@ static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
 long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		unsigned long ioba, unsigned long tce)
 {
-	struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
+	struct kvmppc_spapr_tce_table *stt;
 	long ret;
 
 	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
 	/* 	    liobn, ioba, tce); */
 
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
@@ -252,7 +252,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	unsigned long tces, entry, ua = 0;
 	unsigned long *rmap = NULL;
 
-	stt = kvmppc_find_table(vcpu, liobn);
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
@@ -335,7 +335,7 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret;
 
-	stt = kvmppc_find_table(vcpu, liobn);
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
@@ -356,12 +356,13 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 long kvmppc_h_get_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba)
 {
-	struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
+	struct kvmppc_spapr_tce_table *stt;
 	long ret;
 	unsigned long idx;
 	struct page *page;
 	u64 *tbl;
 
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH kernel 9/9] KVM: PPC: Add in-kernel acceleration for VFIO
  2016-12-08  8:19 ` Alexey Kardashevskiy
@ 2016-12-08  8:19   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 47+ messages in thread
From: Alexey Kardashevskiy @ 2016-12-08  8:19 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
without passing them to user space, which saves the time spent switching
to user space and back.

This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
KVM first tries to handle a TCE request in real mode; if that fails,
it passes the request to virtual mode to complete the operation.
If the virtual mode handler fails too, the request is passed to
user space; this is not expected to happen though.
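
The fallback order described above can be sketched roughly as follows.
This is a user space model, not the kernel dispatch code: the handler
type, the sample handlers and put_tce_dispatch() are hypothetical names
made up for the illustration, and H_TOO_HARD is used here as the
"cannot handle at this level" return value, as the handlers in this
series do.

```c
/* Return codes named after the hcall ones; values are for illustration. */
enum { H_SUCCESS = 0, H_TOO_HARD = 9999 };

/* Which level ended up handling the request (hypothetical enum). */
enum handled_by { HANDLED_REAL_MODE, HANDLED_VIRT_MODE, HANDLED_USERSPACE };

typedef long (*tce_handler_t)(unsigned long ioba, unsigned long tce);

/* Sample handler that always succeeds. */
long handler_ok(unsigned long ioba, unsigned long tce)
{
	(void)ioba;
	(void)tce;
	return H_SUCCESS;
}

/* Sample handler that always gives up and asks for the next level. */
long handler_too_hard(unsigned long ioba, unsigned long tce)
{
	(void)ioba;
	(void)tce;
	return H_TOO_HARD;
}

/*
 * Try real mode first, then virtual mode; anything still "too hard"
 * is left to user space.
 */
enum handled_by put_tce_dispatch(tce_handler_t rm, tce_handler_t virt,
		unsigned long ioba, unsigned long tce)
{
	if (rm(ioba, tce) != H_TOO_HARD)
		return HANDLED_REAL_MODE;

	if (virt(ioba, tce) != H_TOO_HARD)
		return HANDLED_VIRT_MODE;

	return HANDLED_USERSPACE;
}
```

Note that any result other than H_TOO_HARD, including an error such as
H_PARAMETER, ends the cascade at that level in this model.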

To avoid dealing with page use counters (which is tricky in real mode),
this only accelerates SPAPR TCE IOMMU v2 clients which are required
to pre-register the userspace memory. The very first TCE request will
be handled in the VFIO SPAPR TCE driver anyway as the userspace view
of the TCE table (iommu_table::it_userspace) is not allocated till
the very first mapping happens and we cannot call vmalloc in real mode.

This adds a new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
the VFIO KVM device. It takes a VFIO group fd and a SPAPR TCE table fd
and associates a physical IOMMU table with the SPAPR TCE table (which
is a guest view of the hardware IOMMU table). The iommu_table object
is referenced so we do not have to retrieve it in real mode when
a hypercall happens.
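
For illustration, user space might prepare the argument for the new
attribute roughly as below. The struct is a local mirror of the
kvm_vfio_spapr_tce layout proposed in this patch (the _sketch suffix and
the helper name are hypothetical), so the snippet builds without the
patched uapi header.

```c
#include <stdint.h>
#include <string.h>

/* Local mirror of the struct proposed in this patch (assumed layout). */
struct kvm_vfio_spapr_tce_sketch {
	uint32_t argsz;
	int32_t groupfd;
	int32_t tablefd;
	uint8_t pad[4];
};

/*
 * Fill the argument: groupfd is an open VFIO group fd, tablefd is the fd
 * returned when the guest TCE table was created.
 */
void fill_spapr_tce_arg(struct kvm_vfio_spapr_tce_sketch *arg,
		int groupfd, int tablefd)
{
	memset(arg, 0, sizeof(*arg));	/* also zeroes the padding */
	arg->argsz = sizeof(*arg);
	arg->groupfd = groupfd;
	arg->tablefd = tablefd;
}
```

The caller would then point kvm_device_attr.addr at the filled struct,
set .group to KVM_DEV_VFIO_GROUP and .attr to
KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE, and issue KVM_SET_DEVICE_ATTR on the
VFIO KVM device fd.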

This does not implement the UNSET counterpart as there is no use for it -
once the acceleration is enabled, the existing userspace won't
disable it unless a VFIO container is destroyed, so this adds the
necessary cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
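
The attach/cleanup pairing relies on a per-association reference count.
A single-threaded model of that idea is sketched below; it leaves out
RCU and kvm->lock on purpose, and the names (assoc, assoc_attach,
assoc_put) are hypothetical, not the functions added by this patch.

```c
#include <stdlib.h>

/* One entry per (guest TCE table, hardware table) association. */
struct assoc {
	const void *tbl;	/* stands in for struct iommu_table * */
	int refs;
	struct assoc *next;
};

/* Attach: reuse an existing association or create a new one with refs = 1. */
int assoc_attach(struct assoc **head, const void *tbl)
{
	struct assoc *a;

	for (a = *head; a; a = a->next) {
		if (a->tbl == tbl) {
			a->refs++;
			return 0;
		}
	}

	a = calloc(1, sizeof(*a));
	if (!a)
		return -1;

	a->tbl = tbl;
	a->refs = 1;
	a->next = *head;
	*head = a;

	return 0;
}

/* Put: drop one reference; unlink and free the entry when it hits zero. */
void assoc_put(struct assoc **head, const void *tbl)
{
	struct assoc **pp, *a;

	for (pp = head; (a = *pp) != NULL; pp = &a->next) {
		if (a->tbl != tbl)
			continue;

		if (--a->refs == 0) {
			*pp = a->next;
			free(a);
		}
		return;
	}
}
```

In this model a group removal would call assoc_put() once per attach it
made, so the entry goes away only after the last group using the table
has been removed.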

This uses the kvm->lock mutex to protect against a race between
the VFIO KVM device's kvm_vfio_destroy() and SPAPR TCE table fd's
release() callback.

This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
space.

This finally makes use of vfio_external_user_iommu_id() which was
introduced quite some time ago and was considered for removal.

Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb ethernet card).

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 Documentation/virtual/kvm/devices/vfio.txt |  21 +-
 arch/powerpc/include/asm/kvm_host.h        |   8 +
 arch/powerpc/include/asm/kvm_ppc.h         |   5 +
 include/uapi/linux/kvm.h                   |   8 +
 arch/powerpc/kvm/book3s_64_vio.c           | 302 +++++++++++++++++++++++++++++
 arch/powerpc/kvm/book3s_64_vio_hv.c        | 178 +++++++++++++++++
 arch/powerpc/kvm/powerpc.c                 |   2 +
 virt/kvm/vfio.c                            | 108 +++++++++++
 8 files changed, 630 insertions(+), 2 deletions(-)

diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
index ef51740c67ca..ddb5a6512ab3 100644
--- a/Documentation/virtual/kvm/devices/vfio.txt
+++ b/Documentation/virtual/kvm/devices/vfio.txt
@@ -16,7 +16,24 @@ Groups:
 
 KVM_DEV_VFIO_GROUP attributes:
   KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
+	kvm_device_attr.addr points to an int32_t file descriptor
+	for the VFIO group.
   KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
+	kvm_device_attr.addr points to an int32_t file descriptor
+	for the VFIO group.
+  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
+	allocated by sPAPR KVM.
+	kvm_device_attr.addr points to a struct:
 
-For each, kvm_device_attr.addr points to an int32_t file descriptor
-for the VFIO group.
+	struct kvm_vfio_spapr_tce {
+		__u32	argsz;
+		__s32	groupfd;
+		__s32	tablefd;
+		__u8	pad[4];
+	};
+
+	where
+	@argsz is the size of kvm_vfio_spapr_tce;
+	@groupfd is a file descriptor for a VFIO group;
+	@tablefd is a file descriptor for a TCE table allocated via
+		KVM_CREATE_SPAPR_TCE.
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 28350a294b1e..94774503c70d 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -191,6 +191,13 @@ struct kvmppc_pginfo {
 	atomic_t refcnt;
 };
 
+struct kvmppc_spapr_tce_iommu_table {
+	struct rcu_head rcu;
+	struct list_head next;
+	struct iommu_table *tbl;
+	atomic_t refs;
+};
+
 struct kvmppc_spapr_tce_table {
 	struct list_head list;
 	struct kvm *kvm;
@@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
 	u32 page_shift;
 	u64 offset;		/* in pages */
 	u64 size;		/* window size in pages */
+	struct list_head iommu_tables;
 	struct page *pages[0];
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 0a21c8503974..17b947a0060d 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -163,6 +163,11 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
 extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
 			struct kvm_memory_slot *memslot, unsigned long porder);
 extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
+extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm,
+				int tablefd,
+				struct iommu_group *grp);
+extern void kvm_spapr_tce_detach_iommu_group(struct kvm *kvm,
+				struct iommu_group *grp);
 
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce_64 *args);
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 810f74317987..9e4025724e28 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1068,6 +1068,7 @@ struct kvm_device_attr {
 #define  KVM_DEV_VFIO_GROUP			1
 #define   KVM_DEV_VFIO_GROUP_ADD			1
 #define   KVM_DEV_VFIO_GROUP_DEL			2
+#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
 
 enum kvm_device_type {
 	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
@@ -1089,6 +1090,13 @@ enum kvm_device_type {
 	KVM_DEV_TYPE_MAX,
 };
 
+struct kvm_vfio_spapr_tce {
+	__u32	argsz;
+	__s32	groupfd;
+	__s32	tablefd;
+	__u8	pad[4];
+};
+
 /*
  * ioctls for VM fds
  */
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 15df8ae627d9..f86d07781ee9 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -27,6 +27,8 @@
 #include <linux/hugetlb.h>
 #include <linux/list.h>
 #include <linux/anon_inodes.h>
+#include <linux/iommu.h>
+#include <linux/file.h>
 
 #include <asm/tlbflush.h>
 #include <asm/kvm_ppc.h>
@@ -39,6 +41,7 @@
 #include <asm/udbg.h>
 #include <asm/iommu.h>
 #include <asm/tce.h>
+#include <asm/mmu_context.h>
 
 static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
 {
@@ -90,6 +93,25 @@ static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
 	return ret;
 }
 
+static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
+{
+	struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
+			struct kvmppc_spapr_tce_iommu_table, rcu);
+
+	kfree(stit);
+}
+
+static void kvm_spapr_tce_iommu_table_put(
+		struct kvmppc_spapr_tce_iommu_table *stit)
+{
+	iommu_table_put(stit->tbl);
+	if (atomic_dec_return(&stit->refs))
+		return;
+
+	list_del_rcu(&stit->next);
+	call_rcu(&stit->rcu, kvm_spapr_tce_iommu_table_free);
+}
+
 static void release_spapr_tce_table(struct rcu_head *head)
 {
 	struct kvmppc_spapr_tce_table *stt = container_of(head,
@@ -130,8 +152,23 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
 {
 	struct kvmppc_spapr_tce_table *stt = filp->private_data;
 
+	kick_all_cpus_sync();
 	list_del_rcu(&stt->list);
 
+	mutex_lock(&stt->kvm->lock);
+
+	while (!list_empty(&stt->iommu_tables)) {
+		struct kvmppc_spapr_tce_iommu_table *stit;
+
+		stit = list_first_entry(&stt->iommu_tables,
+				struct kvmppc_spapr_tce_iommu_table, next);
+
+		while (atomic_read(&stit->refs))
+			kvm_spapr_tce_iommu_table_put(stit);
+	}
+
+	mutex_unlock(&stt->kvm->lock);
+
 	kvm_put_kvm(stt->kvm);
 
 	kvmppc_account_memlimit(
@@ -146,6 +183,98 @@ static const struct file_operations kvm_spapr_tce_fops = {
 	.release	= kvm_spapr_tce_release,
 };
 
+extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
+		struct iommu_group *grp)
+{
+	struct kvmppc_spapr_tce_table *stt = NULL;
+	bool found = false;
+	struct iommu_table *tbl = NULL;
+	struct iommu_table_group *table_group;
+	long i;
+	struct kvmppc_spapr_tce_iommu_table *stit;
+	struct fd f;
+
+	f = fdget(tablefd);
+	if (!f.file)
+		return -EBADF;
+
+	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
+		if (stt == f.file->private_data) {
+			found = true;
+			break;
+		}
+	}
+
+	fdput(f);
+
+	if (!found)
+		return -ENODEV;
+
+	table_group = iommu_group_get_iommudata(grp);
+	if (!table_group)
+		return -EFAULT;
+
+	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
+		struct iommu_table *tbltmp = table_group->tables[i];
+
+		if (!tbltmp)
+			continue;
+
+		if ((tbltmp->it_page_shift == stt->page_shift) &&
+				(tbltmp->it_offset == stt->offset)) {
+			tbl = tbltmp;
+			break;
+		}
+	}
+	if (!tbl)
+		return -ENODEV;
+
+	iommu_table_get(tbl);
+
+	list_for_each_entry_rcu(stit, &stt->iommu_tables, next) {
+		if (stit->tbl == tbl) {
+			atomic_inc(&stit->refs);
+			return 0;
+		}
+	}
+
+	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
+	stit->tbl = tbl;
+	atomic_set(&stit->refs, 1);
+	list_add_rcu(&stit->next, &stt->iommu_tables);
+
+	return 0;
+}
+
+extern void kvm_spapr_tce_detach_iommu_group(struct kvm *kvm,
+		struct iommu_group *grp)
+{
+	struct kvmppc_spapr_tce_table *stt;
+	struct iommu_table_group *table_group;
+	struct kvmppc_spapr_tce_iommu_table *stit;
+
+	table_group = iommu_group_get_iommudata(grp);
+	if (!table_group)
+		return;
+
+	mutex_lock(&kvm->lock);
+
+	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
+		list_for_each_entry_rcu(stit, &stt->iommu_tables, next) {
+			long i;
+
+			for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
+				if (stit->tbl != table_group->tables[i])
+					continue;
+
+				kvm_spapr_tce_iommu_table_put(stit);
+			}
+		}
+	}
+
+	mutex_unlock(&kvm->lock);
+}
+
 long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				   struct kvm_create_spapr_tce_64 *args)
 {
@@ -181,6 +310,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 	stt->offset = args->offset;
 	stt->size = size;
 	stt->kvm = kvm;
+	INIT_LIST_HEAD_RCU(&stt->iommu_tables);
 
 	for (i = 0; i < npages; i++) {
 		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
@@ -209,11 +339,161 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 	return ret;
 }
 
+static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	struct mm_iommu_table_group_mem_t *mem = NULL;
+	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
+	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+
+	if (!pua)
+		return H_HARDWARE;
+
+	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
+	if (!mem)
+		return H_HARDWARE;
+
+	mm_iommu_mapped_dec(mem);
+
+	*pua = 0;
+
+	return H_SUCCESS;
+}
+
+static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	enum dma_data_direction dir = DMA_NONE;
+	unsigned long hpa = 0;
+
+	if (iommu_tce_xchg(tbl, entry, &hpa, &dir))
+		return H_HARDWARE;
+
+	if (dir == DMA_NONE)
+		return H_SUCCESS;
+
+	return kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
+}
+
+long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
+		unsigned long entry, unsigned long gpa,
+		enum dma_data_direction dir)
+{
+	long ret;
+	unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+	struct mm_iommu_table_group_mem_t *mem;
+
+	if (!pua)
+		/* it_userspace allocation might be delayed */
+		return H_TOO_HARD;
+
+	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
+		return H_HARDWARE;
+
+	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
+	if (!mem)
+		return H_HARDWARE;
+
+	if (mm_iommu_ua_to_hpa(mem, ua, &hpa))
+		return H_HARDWARE;
+
+	if (mm_iommu_mapped_inc(mem))
+		return H_HARDWARE;
+
+	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
+	if (ret) {
+		mm_iommu_mapped_dec(mem);
+		return H_TOO_HARD;
+	}
+
+	if (dir != DMA_NONE)
+		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
+
+	*pua = ua;
+
+	return 0;
+}
+
+long kvmppc_h_put_tce_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce)
+{
+	long idx, ret = H_HARDWARE;
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+	const unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+	const enum dma_data_direction dir = iommu_tce_direction(tce);
+
+	/* Clear TCE */
+	if (dir == DMA_NONE) {
+		if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
+			return H_PARAMETER;
+
+		return kvmppc_tce_iommu_unmap(vcpu->kvm, tbl, entry);
+	}
+
+	/* Put TCE */
+	if (iommu_tce_put_param_check(tbl, ioba, gpa))
+		return H_PARAMETER;
+
+	idx = srcu_read_lock(&vcpu->kvm->srcu);
+	ret = kvmppc_tce_iommu_map(vcpu->kvm, tbl, entry, gpa, dir);
+	srcu_read_unlock(&vcpu->kvm->srcu, idx);
+
+	return ret;
+}
+
+static long kvmppc_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long ioba,
+		u64 __user *tces, unsigned long npages)
+{
+	unsigned long i, ret, tce, gpa;
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+
+	for (i = 0; i < npages; ++i) {
+		gpa = be64_to_cpu(tces[i]) & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+		if (iommu_tce_put_param_check(tbl, ioba +
+				(i << tbl->it_page_shift), gpa))
+			return H_PARAMETER;
+	}
+
+	for (i = 0; i < npages; ++i) {
+		tce = be64_to_cpu(tces[i]);
+		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+		ret = kvmppc_tce_iommu_map(vcpu->kvm, tbl, entry + i, gpa,
+				iommu_tce_direction(tce));
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
+	return H_SUCCESS;
+}
+
+long kvmppc_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	unsigned long i;
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+
+	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i)
+		kvmppc_tce_iommu_unmap(vcpu->kvm, tbl, entry + i);
+
+	return H_SUCCESS;
+}
+
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba, unsigned long tce)
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long ret;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
 	/* 	    liobn, ioba, tce); */
@@ -230,6 +510,12 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (ret != H_SUCCESS)
 		return ret;
 
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		ret = kvmppc_h_put_tce_iommu(vcpu, stit->tbl, liobn, ioba, tce);
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
 	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
 
 	return H_SUCCESS;
@@ -245,6 +531,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	unsigned long entry, ua = 0;
 	u64 __user *tces;
 	u64 tce;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -272,6 +559,13 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	}
 	tces = (u64 __user *) ua;
 
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		ret = kvmppc_h_put_tce_indirect_iommu(vcpu,
+				stit->tbl, ioba, tces, npages);
+		if (ret != H_SUCCESS)
+			goto unlock_exit;
+	}
+
 	for (i = 0; i < npages; ++i) {
 		if (get_user(tce, tces + i)) {
 			ret = H_TOO_HARD;
@@ -299,6 +593,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -312,6 +607,13 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
 		return H_PARAMETER;
 
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		ret = kvmppc_h_stuff_tce_iommu(vcpu, stit->tbl, liobn, ioba,
+				tce_value, npages);
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
 	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
 		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
 
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 8a6834e6e1c8..4d6f01712a6d 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -190,11 +190,165 @@ static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
 	return mm_iommu_lookup_rm(vcpu->kvm->mm, ua, size);
 }
 
+static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	struct mm_iommu_table_group_mem_t *mem = NULL;
+	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
+	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+
+	if (!pua)
+		return H_SUCCESS;
+
+	pua = (void *) vmalloc_to_phys(pua);
+	if (!pua)
+		return H_SUCCESS;
+
+	mem = kvmppc_rm_iommu_lookup(vcpu, *pua, pgsize);
+	if (!mem)
+		return H_HARDWARE;
+
+	mm_iommu_mapped_dec(mem);
+
+	*pua = 0;
+
+	return H_SUCCESS;
+}
+
+static long kvmppc_rm_tce_iommu_unmap(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	enum dma_data_direction dir = DMA_NONE;
+	unsigned long hpa = 0;
+
+	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
+		return H_HARDWARE;
+
+	if (dir == DMA_NONE)
+		return H_SUCCESS;
+
+	return kvmppc_rm_tce_iommu_mapped_dec(vcpu, tbl, entry);
+}
+
+long kvmppc_rm_tce_iommu_map(struct kvm_vcpu *vcpu, struct iommu_table *tbl,
+		unsigned long entry, unsigned long gpa,
+		enum dma_data_direction dir)
+{
+	long ret;
+	unsigned long hpa = 0, ua;
+	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+	struct mm_iommu_table_group_mem_t *mem;
+
+	if (!pua)
+		/* it_userspace allocation might be delayed */
+		return H_TOO_HARD;
+
+	if (kvmppc_gpa_to_ua(vcpu->kvm, gpa, &ua, NULL))
+		return H_HARDWARE;
+
+	mem = kvmppc_rm_iommu_lookup(vcpu, ua, 1ULL << tbl->it_page_shift);
+	if (!mem)
+		return H_HARDWARE;
+
+	if (mm_iommu_ua_to_hpa_rm(mem, ua, &hpa))
+		return H_HARDWARE;
+
+	pua = (void *) vmalloc_to_phys(pua);
+	if (!pua)
+		return H_HARDWARE;
+
+	if (mm_iommu_mapped_inc(mem))
+		return H_HARDWARE;
+
+	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
+	if (ret) {
+		mm_iommu_mapped_dec(mem);
+		return H_TOO_HARD;
+	}
+
+	if (dir != DMA_NONE)
+		kvmppc_rm_tce_iommu_mapped_dec(vcpu, tbl, entry);
+
+	*pua = ua;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(kvmppc_rm_tce_iommu_map);
+
+static long kvmppc_rm_h_put_tce_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long liobn,
+		unsigned long ioba, unsigned long tce)
+{
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+	const unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+	const enum dma_data_direction dir = iommu_tce_direction(tce);
+
+	/* Clear TCE */
+	if (dir == DMA_NONE) {
+		if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
+			return H_PARAMETER;
+
+		return kvmppc_rm_tce_iommu_unmap(vcpu, tbl, entry);
+	}
+
+	/* Put TCE */
+	if (iommu_tce_put_param_check(tbl, ioba, gpa))
+		return H_PARAMETER;
+
+	return kvmppc_rm_tce_iommu_map(vcpu, tbl, entry, gpa, dir);
+}
+
+static long kvmppc_rm_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long ioba,
+		u64 *tces, unsigned long npages)
+{
+	unsigned long i, ret, tce, gpa;
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+
+	for (i = 0; i < npages; ++i) {
+		gpa = be64_to_cpu(tces[i]) & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+		if (iommu_tce_put_param_check(tbl, ioba +
+				(i << tbl->it_page_shift), gpa))
+			return H_PARAMETER;
+	}
+
+	for (i = 0; i < npages; ++i) {
+		tce = be64_to_cpu(tces[i]);
+		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+		ret = kvmppc_rm_tce_iommu_map(vcpu, tbl, entry + i, gpa,
+				iommu_tce_direction(tce));
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
+	return H_SUCCESS;
+}
+
+static long kvmppc_rm_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	unsigned long i;
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+
+	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i)
+		kvmppc_rm_tce_iommu_unmap(vcpu, tbl, entry + i);
+
+	return H_SUCCESS;
+}
+
 long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		unsigned long ioba, unsigned long tce)
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long ret;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
 	/* 	    liobn, ioba, tce); */
@@ -211,6 +365,13 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (ret != H_SUCCESS)
 		return ret;
 
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		ret = kvmppc_rm_h_put_tce_iommu(vcpu, stit->tbl,
+				liobn, ioba, tce);
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
 	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
 
 	return H_SUCCESS;
@@ -278,6 +439,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 		 * depend on hpt.
 		 */
 		struct mm_iommu_table_group_mem_t *mem;
+		struct kvmppc_spapr_tce_iommu_table *stit;
 
 		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
 			return H_TOO_HARD;
@@ -285,6 +447,13 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 		mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K);
 		if (!mem || mm_iommu_ua_to_hpa_rm(mem, ua, &tces))
 			return H_TOO_HARD;
+
+		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+			ret = kvmppc_rm_h_put_tce_indirect_iommu(vcpu,
+					stit->tbl, ioba, (u64 *)tces, npages);
+			if (ret != H_SUCCESS)
+				return ret;
+		}
 	} else {
 		/*
 		 * This is emulated devices case.
@@ -334,6 +503,8 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret;
+	struct kvmppc_spapr_tce_iommu_table *stit;
+
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -347,6 +518,13 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
 		return H_PARAMETER;
 
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		ret = kvmppc_rm_h_stuff_tce_iommu(vcpu, stit->tbl,
+				liobn, ioba, tce_value, npages);
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
 	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
 		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
 
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 70963c845e96..0e555ba998c0 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -536,6 +536,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 #ifdef CONFIG_PPC_BOOK3S_64
 	case KVM_CAP_SPAPR_TCE:
 	case KVM_CAP_SPAPR_TCE_64:
+		/* fallthrough */
+	case KVM_CAP_SPAPR_TCE_VFIO:
 	case KVM_CAP_PPC_ALLOC_HTAB:
 	case KVM_CAP_PPC_RTAS:
 	case KVM_CAP_PPC_FIXUP_HCALL:
diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
index 1dd087da6f31..e82182f9dea9 100644
--- a/virt/kvm/vfio.c
+++ b/virt/kvm/vfio.c
@@ -20,6 +20,10 @@
 #include <linux/vfio.h>
 #include "vfio.h"
 
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+#include <asm/kvm_ppc.h>
+#endif
+
 struct kvm_vfio_group {
 	struct list_head node;
 	struct vfio_group *vfio_group;
@@ -76,6 +80,22 @@ static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
 	return ret > 0;
 }
 
+static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
+{
+	int (*fn)(struct vfio_group *);
+	int ret = -1;
+
+	fn = symbol_get(vfio_external_user_iommu_id);
+	if (!fn)
+		return ret;
+
+	ret = fn(vfio_group);
+
+	symbol_put(vfio_external_user_iommu_id);
+
+	return ret;
+}
+
 /*
  * Groups can use the same or different IOMMU domains.  If the same then
  * adding a new group may change the coherency of groups we've previously
@@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct kvm_device *dev)
 	mutex_unlock(&kv->lock);
 }
 
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
+		struct vfio_group *vfio_group)
+{
+	int group_id;
+	struct iommu_group *grp;
+
+	group_id = kvm_vfio_external_user_iommu_id(vfio_group);
+	grp = iommu_group_get_by_id(group_id);
+	if (grp) {
+		kvm_spapr_tce_detach_iommu_group(kvm, grp);
+		iommu_group_put(grp);
+	}
+}
+#endif
+
 static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
 {
 	struct kvm_vfio *kv = dev->private;
@@ -185,6 +221,11 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
 			if (kvg->vfio_group != vfio_group)
 				continue;
 
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+			kvm_vfio_spapr_detach_iommu_group(dev->kvm,
+					kvg->vfio_group);
+#endif
+
 			list_del(&kvg->node);
 			kvm_vfio_group_put_external_user(kvg->vfio_group);
 			kfree(kvg);
@@ -201,6 +242,66 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
 		kvm_vfio_update_coherency(dev);
 
 		return ret;
+
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
+		struct kvm_vfio_spapr_tce param;
+		unsigned long minsz;
+		struct kvm_vfio *kv = dev->private;
+		struct vfio_group *vfio_group;
+		struct kvm_vfio_group *kvg;
+		struct fd f;
+
+		minsz = offsetofend(struct kvm_vfio_spapr_tce, pad);
+
+		if (copy_from_user(&param, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (param.argsz < minsz)
+			return -EINVAL;
+
+		f = fdget(param.groupfd);
+		if (!f.file)
+			return -EBADF;
+
+		vfio_group = kvm_vfio_group_get_external_user(f.file);
+		fdput(f);
+
+		if (IS_ERR(vfio_group))
+			return PTR_ERR(vfio_group);
+
+		ret = -ENOENT;
+
+		mutex_lock(&kv->lock);
+
+		list_for_each_entry(kvg, &kv->group_list, node) {
+			int group_id;
+			struct iommu_group *grp;
+
+			if (kvg->vfio_group != vfio_group)
+				continue;
+
+			group_id = kvm_vfio_external_user_iommu_id(
+					kvg->vfio_group);
+			grp = iommu_group_get_by_id(group_id);
+			if (!grp) {
+				ret = -EFAULT;
+				break;
+			}
+
+			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
+					param.tablefd, grp);
+			iommu_group_put(grp);
+			break;
+		}
+
+		mutex_unlock(&kv->lock);
+
+		kvm_vfio_group_put_external_user(vfio_group);
+
+		return ret;
+	}
+#endif /* CONFIG_SPAPR_TCE_IOMMU */
 	}
 
 	return -ENXIO;
@@ -225,6 +326,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
 		switch (attr->attr) {
 		case KVM_DEV_VFIO_GROUP_ADD:
 		case KVM_DEV_VFIO_GROUP_DEL:
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
+#endif
 			return 0;
 		}
 
@@ -240,6 +344,10 @@ static void kvm_vfio_destroy(struct kvm_device *dev)
 	struct kvm_vfio_group *kvg, *tmp;
 
 	list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+		kvm_vfio_spapr_detach_iommu_group(dev->kvm,
+				kvg->vfio_group);
+#endif
 		kvm_vfio_group_put_external_user(kvg->vfio_group);
 		list_del(&kvg->node);
 		kfree(kvg);
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH kernel 9/9] KVM: PPC: Add in-kernel acceleration for VFIO
@ 2016-12-08  8:19   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 47+ messages in thread
From: Alexey Kardashevskiy @ 2016-12-08  8:19 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
without passing them to user space, which saves the time spent on
switching to user space and back.

This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
KVM first tries to handle a TCE request in real mode; if that fails,
it passes the request to virtual mode to complete the operation.
If the virtual mode handler fails too, the request is passed to
user space; this is not expected to happen though.
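
The fallback order described above can be modelled in plain C. This is
only an illustrative sketch and not code from this patch: the handler
type, try_handlers() and the two stub handlers are hypothetical names.
The H_SUCCESS and H_TOO_HARD values are the ones defined in
arch/powerpc/include/asm/hvcall.h.

```c
#include <stddef.h>

/* Hcall return codes, values as in arch/powerpc/include/asm/hvcall.h */
#define H_SUCCESS	0
#define H_TOO_HARD	9999

/* Hypothetical handler signature used only by this sketch */
typedef long (*tce_handler_t)(unsigned long ioba, unsigned long tce);

/*
 * Call the handlers in order (real mode first, then virtual mode).
 * H_TOO_HARD means "cannot finish here, try the next stage"; any other
 * value is final. *handled_by receives the index of the stage that
 * produced the final value, or n if every stage gave up, which in the
 * real code corresponds to passing the request to user space.
 */
static long try_handlers(const tce_handler_t *handlers, size_t n,
		unsigned long ioba, unsigned long tce, size_t *handled_by)
{
	long ret = H_TOO_HARD;
	size_t i;

	for (i = 0; i < n; ++i) {
		ret = handlers[i](ioba, tce);
		if (ret != H_TOO_HARD) {
			*handled_by = i;
			return ret;
		}
	}

	*handled_by = n;
	return ret;
}

/* Stub: real mode gives up, e.g. because it_userspace is not allocated */
static long rm_stub(unsigned long ioba, unsigned long tce)
{
	(void)ioba;
	(void)tce;
	return H_TOO_HARD;
}

/* Stub: virtual mode completes the request */
static long virt_stub(unsigned long ioba, unsigned long tce)
{
	(void)ioba;
	(void)tce;
	return H_SUCCESS;
}
```

With { rm_stub, virt_stub } as the handler array, the request is
completed by the second stage, which is what is expected for the very
first TCE request.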

To avoid dealing with page use counters (which is tricky in real mode),
this only accelerates SPAPR TCE IOMMU v2 clients which are required
to pre-register the userspace memory. The very first TCE request will
be handled in the VFIO SPAPR TCE driver anyway as the userspace view
of the TCE table (iommu_table::it_userspace) is not allocated till
the very first mapping happens and we cannot call vmalloc in real mode.

This adds a new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
the VFIO KVM device. It takes a VFIO group fd and a SPAPR TCE table fd
and associates a physical IOMMU table with the SPAPR TCE table (which
is a guest view of the hardware IOMMU table). The iommu_table object
is referenced so we do not have to retrieve it in real mode when
a hypercall happens.
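
A minimal userspace-side sketch of preparing the parameter for this
attribute follows. It assumes the struct layout proposed in this patch;
struct spapr_tce_param, PARAM_OFFSETOFEND() and the two helpers are
hypothetical names local to the example, chosen so that it does not
depend on any particular version of <linux/kvm.h>.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the struct kvm_vfio_spapr_tce layout proposed here */
struct spapr_tce_param {
	uint32_t argsz;
	int32_t groupfd;
	int32_t tablefd;
	uint8_t pad[4];
};

/* Same idea as the kernel's offsetofend(): offset just past a member */
#define PARAM_OFFSETOFEND(type, member) \
	(offsetof(type, member) + sizeof(((type *)0)->member))

/* Fill the parameter block for a given VFIO group fd and TCE table fd */
static void spapr_tce_param_init(struct spapr_tce_param *p,
		int groupfd, int tablefd)
{
	memset(p, 0, sizeof(*p));	/* also zeroes pad[] */
	p->argsz = sizeof(*p);
	p->groupfd = groupfd;
	p->tablefd = tablefd;
}

/* Mirrors the handler's size check: argsz below minsz is rejected */
static int spapr_tce_param_argsz_ok(const struct spapr_tce_param *p)
{
	return p->argsz >= PARAM_OFFSETOFEND(struct spapr_tce_param, pad);
}
```

The address of the filled structure would then go into
kvm_device_attr.addr, with the group set to KVM_DEV_VFIO_GROUP and the
attribute set to KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE, and be passed to the
KVM_SET_DEVICE_ATTR ioctl on the VFIO KVM device fd.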

This does not implement the UNSET counterpart as there is no use for it -
once the acceleration is enabled, the existing userspace won't
disable it unless a VFIO container is destroyed, so this adds the
necessary cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.

This uses the kvm->lock mutex to protect against a race between
the VFIO KVM device's kvm_vfio_destroy() and SPAPR TCE table fd's
release() callback.

This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
space.

This finally makes use of vfio_external_user_iommu_id() which was
introduced quite some time ago and was considered for removal.

Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb ethernet card).

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 Documentation/virtual/kvm/devices/vfio.txt |  21 +-
 arch/powerpc/include/asm/kvm_host.h        |   8 +
 arch/powerpc/include/asm/kvm_ppc.h         |   5 +
 include/uapi/linux/kvm.h                   |   8 +
 arch/powerpc/kvm/book3s_64_vio.c           | 302 +++++++++++++++++++++++++++++
 arch/powerpc/kvm/book3s_64_vio_hv.c        | 178 +++++++++++++++++
 arch/powerpc/kvm/powerpc.c                 |   2 +
 virt/kvm/vfio.c                            | 108 +++++++++++
 8 files changed, 630 insertions(+), 2 deletions(-)

diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
index ef51740c67ca..ddb5a6512ab3 100644
--- a/Documentation/virtual/kvm/devices/vfio.txt
+++ b/Documentation/virtual/kvm/devices/vfio.txt
@@ -16,7 +16,24 @@ Groups:
 
 KVM_DEV_VFIO_GROUP attributes:
   KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
+	kvm_device_attr.addr points to an int32_t file descriptor
+	for the VFIO group.
   KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
+	kvm_device_attr.addr points to an int32_t file descriptor
+	for the VFIO group.
+  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
+	allocated by sPAPR KVM.
+	kvm_device_attr.addr points to a struct:
 
-For each, kvm_device_attr.addr points to an int32_t file descriptor
-for the VFIO group.
+	struct kvm_vfio_spapr_tce {
+		__u32	argsz;
+		__s32	groupfd;
+		__s32	tablefd;
+		__u8	pad[4];
+	};
+
+	where
+	@argsz is the size of struct kvm_vfio_spapr_tce;
+	@groupfd is a file descriptor for a VFIO group;
+	@tablefd is a file descriptor for a TCE table allocated via
+		KVM_CREATE_SPAPR_TCE.
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 28350a294b1e..94774503c70d 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -191,6 +191,13 @@ struct kvmppc_pginfo {
 	atomic_t refcnt;
 };
 
+struct kvmppc_spapr_tce_iommu_table {
+	struct rcu_head rcu;
+	struct list_head next;
+	struct iommu_table *tbl;
+	atomic_t refs;
+};
+
 struct kvmppc_spapr_tce_table {
 	struct list_head list;
 	struct kvm *kvm;
@@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
 	u32 page_shift;
 	u64 offset;		/* in pages */
 	u64 size;		/* window size in pages */
+	struct list_head iommu_tables;
 	struct page *pages[0];
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 0a21c8503974..17b947a0060d 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -163,6 +163,11 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
 extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
 			struct kvm_memory_slot *memslot, unsigned long porder);
 extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
+extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm,
+				int tablefd,
+				struct iommu_group *grp);
+extern void kvm_spapr_tce_detach_iommu_group(struct kvm *kvm,
+				struct iommu_group *grp);
 
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce_64 *args);
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 810f74317987..9e4025724e28 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1068,6 +1068,7 @@ struct kvm_device_attr {
 #define  KVM_DEV_VFIO_GROUP			1
 #define   KVM_DEV_VFIO_GROUP_ADD			1
 #define   KVM_DEV_VFIO_GROUP_DEL			2
+#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
 
 enum kvm_device_type {
 	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
@@ -1089,6 +1090,13 @@ enum kvm_device_type {
 	KVM_DEV_TYPE_MAX,
 };
 
+struct kvm_vfio_spapr_tce {
+	__u32	argsz;
+	__s32	groupfd;
+	__s32	tablefd;
+	__u8	pad[4];
+};
+
 /*
  * ioctls for VM fds
  */
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 15df8ae627d9..f86d07781ee9 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -27,6 +27,8 @@
 #include <linux/hugetlb.h>
 #include <linux/list.h>
 #include <linux/anon_inodes.h>
+#include <linux/iommu.h>
+#include <linux/file.h>
 
 #include <asm/tlbflush.h>
 #include <asm/kvm_ppc.h>
@@ -39,6 +41,7 @@
 #include <asm/udbg.h>
 #include <asm/iommu.h>
 #include <asm/tce.h>
+#include <asm/mmu_context.h>
 
 static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
 {
@@ -90,6 +93,25 @@ static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
 	return ret;
 }
 
+static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
+{
+	struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
+			struct kvmppc_spapr_tce_iommu_table, rcu);
+
+	kfree(stit);
+}
+
+static void kvm_spapr_tce_iommu_table_put(
+		struct kvmppc_spapr_tce_iommu_table *stit)
+{
+	iommu_table_put(stit->tbl);
+	if (atomic_dec_return(&stit->refs))
+		return;
+
+	list_del_rcu(&stit->next);
+	call_rcu(&stit->rcu, kvm_spapr_tce_iommu_table_free);
+}
+
 static void release_spapr_tce_table(struct rcu_head *head)
 {
 	struct kvmppc_spapr_tce_table *stt = container_of(head,
@@ -130,8 +152,23 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
 {
 	struct kvmppc_spapr_tce_table *stt = filp->private_data;
 
+	kick_all_cpus_sync();
 	list_del_rcu(&stt->list);
 
+	mutex_lock(&stt->kvm->lock);
+
+	while (!list_empty(&stt->iommu_tables)) {
+		struct kvmppc_spapr_tce_iommu_table *stit;
+
+		stit = list_first_entry(&stt->iommu_tables,
+				struct kvmppc_spapr_tce_iommu_table, next);
+
+		while (atomic_read(&stit->refs))
+			kvm_spapr_tce_iommu_table_put(stit);
+	}
+
+	mutex_unlock(&stt->kvm->lock);
+
 	kvm_put_kvm(stt->kvm);
 
 	kvmppc_account_memlimit(
@@ -146,6 +183,98 @@ static const struct file_operations kvm_spapr_tce_fops = {
 	.release	= kvm_spapr_tce_release,
 };
 
+extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
+		struct iommu_group *grp)
+{
+	struct kvmppc_spapr_tce_table *stt = NULL;
+	bool found = false;
+	struct iommu_table *tbl = NULL;
+	struct iommu_table_group *table_group;
+	long i;
+	struct kvmppc_spapr_tce_iommu_table *stit;
+	struct fd f;
+
+	f = fdget(tablefd);
+	if (!f.file)
+		return -EBADF;
+
+	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
+		if (stt == f.file->private_data) {
+			found = true;
+			break;
+		}
+	}
+
+	fdput(f);
+
+	if (!found)
+		return -ENODEV;
+
+	table_group = iommu_group_get_iommudata(grp);
+	if (!table_group)
+		return -EFAULT;
+
+	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
+		struct iommu_table *tbltmp = table_group->tables[i];
+
+		if (!tbltmp)
+			continue;
+
+		if ((tbltmp->it_page_shift == stt->page_shift) &&
+				(tbltmp->it_offset == stt->offset)) {
+			tbl = tbltmp;
+			break;
+		}
+	}
+	if (!tbl)
+		return -ENODEV;
+
+	iommu_table_get(tbl);
+
+	list_for_each_entry_rcu(stit, &stt->iommu_tables, next) {
+		if (stit->tbl == tbl) {
+			atomic_inc(&stit->refs);
+			return 0;
+		}
+	}
+
+	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
+	stit->tbl = tbl;
+	atomic_set(&stit->refs, 1);
+	list_add_rcu(&stit->next, &stt->iommu_tables);
+
+	return 0;
+}
+
+extern void kvm_spapr_tce_detach_iommu_group(struct kvm *kvm,
+		struct iommu_group *grp)
+{
+	struct kvmppc_spapr_tce_table *stt;
+	struct iommu_table_group *table_group;
+	struct kvmppc_spapr_tce_iommu_table *stit;
+
+	table_group = iommu_group_get_iommudata(grp);
+	if (!table_group)
+		return;
+
+	mutex_lock(&kvm->lock);
+
+	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
+		list_for_each_entry_rcu(stit, &stt->iommu_tables, next) {
+			long i;
+
+			for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
+				if (stit->tbl != table_group->tables[i])
+					continue;
+
+				kvm_spapr_tce_iommu_table_put(stit);
+			}
+		}
+	}
+
+	mutex_unlock(&kvm->lock);
+}
+
 long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				   struct kvm_create_spapr_tce_64 *args)
 {
@@ -181,6 +310,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 	stt->offset = args->offset;
 	stt->size = size;
 	stt->kvm = kvm;
+	INIT_LIST_HEAD_RCU(&stt->iommu_tables);
 
 	for (i = 0; i < npages; i++) {
 		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
@@ -209,11 +339,161 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 	return ret;
 }
 
+static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	struct mm_iommu_table_group_mem_t *mem = NULL;
+	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
+	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+
+	if (!pua)
+		return H_HARDWARE;
+
+	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
+	if (!mem)
+		return H_HARDWARE;
+
+	mm_iommu_mapped_dec(mem);
+
+	*pua = 0;
+
+	return H_SUCCESS;
+}
+
+static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	enum dma_data_direction dir = DMA_NONE;
+	unsigned long hpa = 0;
+
+	if (iommu_tce_xchg(tbl, entry, &hpa, &dir))
+		return H_HARDWARE;
+
+	if (dir == DMA_NONE)
+		return H_SUCCESS;
+
+	return kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
+}
+
+long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
+		unsigned long entry, unsigned long gpa,
+		enum dma_data_direction dir)
+{
+	long ret;
+	unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+	struct mm_iommu_table_group_mem_t *mem;
+
+	if (!pua)
+		/* it_userspace allocation might be delayed */
+		return H_TOO_HARD;
+
+	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
+		return H_HARDWARE;
+
+	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
+	if (!mem)
+		return H_HARDWARE;
+
+	if (mm_iommu_ua_to_hpa(mem, ua, &hpa))
+		return H_HARDWARE;
+
+	if (mm_iommu_mapped_inc(mem))
+		return H_HARDWARE;
+
+	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
+	if (ret) {
+		mm_iommu_mapped_dec(mem);
+		return H_TOO_HARD;
+	}
+
+	if (dir != DMA_NONE)
+		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
+
+	*pua = ua;
+
+	return 0;
+}
+
+long kvmppc_h_put_tce_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce)
+{
+	long idx, ret = H_HARDWARE;
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+	const unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+	const enum dma_data_direction dir = iommu_tce_direction(tce);
+
+	/* Clear TCE */
+	if (dir == DMA_NONE) {
+		if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
+			return H_PARAMETER;
+
+		return kvmppc_tce_iommu_unmap(vcpu->kvm, tbl, entry);
+	}
+
+	/* Put TCE */
+	if (iommu_tce_put_param_check(tbl, ioba, gpa))
+		return H_PARAMETER;
+
+	idx = srcu_read_lock(&vcpu->kvm->srcu);
+	ret = kvmppc_tce_iommu_map(vcpu->kvm, tbl, entry, gpa, dir);
+	srcu_read_unlock(&vcpu->kvm->srcu, idx);
+
+	return ret;
+}
+
+static long kvmppc_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long ioba,
+		u64 __user *tces, unsigned long npages)
+{
+	unsigned long i, ret, tce, gpa;
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+
+	for (i = 0; i < npages; ++i) {
+		gpa = be64_to_cpu(tces[i]) & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+		if (iommu_tce_put_param_check(tbl, ioba +
+				(i << tbl->it_page_shift), gpa))
+			return H_PARAMETER;
+	}
+
+	for (i = 0; i < npages; ++i) {
+		tce = be64_to_cpu(tces[i]);
+		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+		ret = kvmppc_tce_iommu_map(vcpu->kvm, tbl, entry + i, gpa,
+				iommu_tce_direction(tce));
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
+	return H_SUCCESS;
+}
+
+long kvmppc_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	unsigned long i;
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+
+	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i)
+		kvmppc_tce_iommu_unmap(vcpu->kvm, tbl, entry + i);
+
+	return H_SUCCESS;
+}
+
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba, unsigned long tce)
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long ret;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
 	/* 	    liobn, ioba, tce); */
@@ -230,6 +510,12 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (ret != H_SUCCESS)
 		return ret;
 
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		ret = kvmppc_h_put_tce_iommu(vcpu, stit->tbl, liobn, ioba, tce);
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
 	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
 
 	return H_SUCCESS;
@@ -245,6 +531,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	unsigned long entry, ua = 0;
 	u64 __user *tces;
 	u64 tce;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -272,6 +559,13 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	}
 	tces = (u64 __user *) ua;
 
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		ret = kvmppc_h_put_tce_indirect_iommu(vcpu,
+				stit->tbl, ioba, tces, npages);
+		if (ret != H_SUCCESS)
+			goto unlock_exit;
+	}
+
 	for (i = 0; i < npages; ++i) {
 		if (get_user(tce, tces + i)) {
 			ret = H_TOO_HARD;
@@ -299,6 +593,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -312,6 +607,13 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
 		return H_PARAMETER;
 
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		ret = kvmppc_h_stuff_tce_iommu(vcpu, stit->tbl, liobn, ioba,
+				tce_value, npages);
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
 	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
 		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
 
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 8a6834e6e1c8..4d6f01712a6d 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -190,11 +190,165 @@ static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
 	return mm_iommu_lookup_rm(vcpu->kvm->mm, ua, size);
 }
 
+static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	struct mm_iommu_table_group_mem_t *mem = NULL;
+	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
+	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+
+	if (!pua)
+		return H_SUCCESS;
+
+	pua = (void *) vmalloc_to_phys(pua);
+	if (!pua)
+		return H_SUCCESS;
+
+	mem = kvmppc_rm_iommu_lookup(vcpu, *pua, pgsize);
+	if (!mem)
+		return H_HARDWARE;
+
+	mm_iommu_mapped_dec(mem);
+
+	*pua = 0;
+
+	return H_SUCCESS;
+}
+
+static long kvmppc_rm_tce_iommu_unmap(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	enum dma_data_direction dir = DMA_NONE;
+	unsigned long hpa = 0;
+
+	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
+		return H_HARDWARE;
+
+	if (dir == DMA_NONE)
+		return H_SUCCESS;
+
+	return kvmppc_rm_tce_iommu_mapped_dec(vcpu, tbl, entry);
+}
+
+long kvmppc_rm_tce_iommu_map(struct kvm_vcpu *vcpu, struct iommu_table *tbl,
+		unsigned long entry, unsigned long gpa,
+		enum dma_data_direction dir)
+{
+	long ret;
+	unsigned long hpa = 0, ua;
+	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+	struct mm_iommu_table_group_mem_t *mem;
+
+	if (!pua)
+		/* it_userspace allocation might be delayed */
+		return H_TOO_HARD;
+
+	if (kvmppc_gpa_to_ua(vcpu->kvm, gpa, &ua, NULL))
+		return H_HARDWARE;
+
+	mem = kvmppc_rm_iommu_lookup(vcpu, ua, 1ULL << tbl->it_page_shift);
+	if (!mem)
+		return H_HARDWARE;
+
+	if (mm_iommu_ua_to_hpa_rm(mem, ua, &hpa))
+		return H_HARDWARE;
+
+	pua = (void *) vmalloc_to_phys(pua);
+	if (!pua)
+		return H_HARDWARE;
+
+	if (mm_iommu_mapped_inc(mem))
+		return H_HARDWARE;
+
+	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
+	if (ret) {
+		mm_iommu_mapped_dec(mem);
+		return H_TOO_HARD;
+	}
+
+	if (dir != DMA_NONE)
+		kvmppc_rm_tce_iommu_mapped_dec(vcpu, tbl, entry);
+
+	*pua = ua;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(kvmppc_rm_tce_iommu_map);
+
+static long kvmppc_rm_h_put_tce_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long liobn,
+		unsigned long ioba, unsigned long tce)
+{
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+	const unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+	const enum dma_data_direction dir = iommu_tce_direction(tce);
+
+	/* Clear TCE */
+	if (dir == DMA_NONE) {
+		if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
+			return H_PARAMETER;
+
+		return kvmppc_rm_tce_iommu_unmap(vcpu, tbl, entry);
+	}
+
+	/* Put TCE */
+	if (iommu_tce_put_param_check(tbl, ioba, gpa))
+		return H_PARAMETER;
+
+	return kvmppc_rm_tce_iommu_map(vcpu, tbl, entry, gpa, dir);
+}
+
+static long kvmppc_rm_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long ioba,
+		u64 *tces, unsigned long npages)
+{
+	unsigned long i, ret, tce, gpa;
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+
+	for (i = 0; i < npages; ++i) {
+		gpa = be64_to_cpu(tces[i]) & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+		if (iommu_tce_put_param_check(tbl, ioba +
+				(i << tbl->it_page_shift), gpa))
+			return H_PARAMETER;
+	}
+
+	for (i = 0; i < npages; ++i) {
+		tce = be64_to_cpu(tces[i]);
+		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+		ret = kvmppc_rm_tce_iommu_map(vcpu, tbl, entry + i, gpa,
+				iommu_tce_direction(tce));
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
+	return H_SUCCESS;
+}
+
+static long kvmppc_rm_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	unsigned long i;
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+
+	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i)
+		kvmppc_rm_tce_iommu_unmap(vcpu, tbl, entry + i);
+
+	return H_SUCCESS;
+}
+
 long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		unsigned long ioba, unsigned long tce)
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long ret;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
 	/* 	    liobn, ioba, tce); */
@@ -211,6 +365,13 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (ret != H_SUCCESS)
 		return ret;
 
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		ret = kvmppc_rm_h_put_tce_iommu(vcpu, stit->tbl,
+				liobn, ioba, tce);
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
 	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
 
 	return H_SUCCESS;
@@ -278,6 +439,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 		 * depend on hpt.
 		 */
 		struct mm_iommu_table_group_mem_t *mem;
+		struct kvmppc_spapr_tce_iommu_table *stit;
 
 		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
 			return H_TOO_HARD;
@@ -285,6 +447,13 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 		mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K);
 		if (!mem || mm_iommu_ua_to_hpa_rm(mem, ua, &tces))
 			return H_TOO_HARD;
+
+		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+			ret = kvmppc_rm_h_put_tce_indirect_iommu(vcpu,
+					stit->tbl, ioba, (u64 *)tces, npages);
+			if (ret != H_SUCCESS)
+				return ret;
+		}
 	} else {
 		/*
 		 * This is emulated devices case.
@@ -334,6 +503,8 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret;
+	struct kvmppc_spapr_tce_iommu_table *stit;
+
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -347,6 +518,13 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
 		return H_PARAMETER;
 
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		ret = kvmppc_rm_h_stuff_tce_iommu(vcpu, stit->tbl,
+				liobn, ioba, tce_value, npages);
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
 	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
 		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
 
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 70963c845e96..0e555ba998c0 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -536,6 +536,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 #ifdef CONFIG_PPC_BOOK3S_64
 	case KVM_CAP_SPAPR_TCE:
 	case KVM_CAP_SPAPR_TCE_64:
+		/* fallthrough */
+	case KVM_CAP_SPAPR_TCE_VFIO:
 	case KVM_CAP_PPC_ALLOC_HTAB:
 	case KVM_CAP_PPC_RTAS:
 	case KVM_CAP_PPC_FIXUP_HCALL:
diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
index 1dd087da6f31..e82182f9dea9 100644
--- a/virt/kvm/vfio.c
+++ b/virt/kvm/vfio.c
@@ -20,6 +20,10 @@
 #include <linux/vfio.h>
 #include "vfio.h"
 
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+#include <asm/kvm_ppc.h>
+#endif
+
 struct kvm_vfio_group {
 	struct list_head node;
 	struct vfio_group *vfio_group;
@@ -76,6 +80,22 @@ static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
 	return ret > 0;
 }
 
+static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
+{
+	int (*fn)(struct vfio_group *);
+	int ret = -1;
+
+	fn = symbol_get(vfio_external_user_iommu_id);
+	if (!fn)
+		return ret;
+
+	ret = fn(vfio_group);
+
+	symbol_put(vfio_external_user_iommu_id);
+
+	return ret;
+}
+
 /*
  * Groups can use the same or different IOMMU domains.  If the same then
  * adding a new group may change the coherency of groups we've previously
@@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct kvm_device *dev)
 	mutex_unlock(&kv->lock);
 }
 
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
+		struct vfio_group *vfio_group)
+{
+	int group_id;
+	struct iommu_group *grp;
+
+	group_id = kvm_vfio_external_user_iommu_id(vfio_group);
+	grp = iommu_group_get_by_id(group_id);
+	if (grp) {
+		kvm_spapr_tce_detach_iommu_group(kvm, grp);
+		iommu_group_put(grp);
+	}
+}
+#endif
+
 static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
 {
 	struct kvm_vfio *kv = dev->private;
@@ -185,6 +221,11 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
 			if (kvg->vfio_group != vfio_group)
 				continue;
 
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+			kvm_vfio_spapr_detach_iommu_group(dev->kvm,
+					kvg->vfio_group);
+#endif
+
 			list_del(&kvg->node);
 			kvm_vfio_group_put_external_user(kvg->vfio_group);
 			kfree(kvg);
@@ -201,6 +242,66 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
 		kvm_vfio_update_coherency(dev);
 
 		return ret;
+
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
+		struct kvm_vfio_spapr_tce param;
+		unsigned long minsz;
+		struct kvm_vfio *kv = dev->private;
+		struct vfio_group *vfio_group;
+		struct kvm_vfio_group *kvg;
+		struct fd f;
+
+		minsz = offsetofend(struct kvm_vfio_spapr_tce, pad);
+
+		if (copy_from_user(&param, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (param.argsz < minsz)
+			return -EINVAL;
+
+		f = fdget(param.groupfd);
+		if (!f.file)
+			return -EBADF;
+
+		vfio_group = kvm_vfio_group_get_external_user(f.file);
+		fdput(f);
+
+		if (IS_ERR(vfio_group))
+			return PTR_ERR(vfio_group);
+
+		ret = -ENOENT;
+
+		mutex_lock(&kv->lock);
+
+		list_for_each_entry(kvg, &kv->group_list, node) {
+			int group_id;
+			struct iommu_group *grp;
+
+			if (kvg->vfio_group != vfio_group)
+				continue;
+
+			group_id = kvm_vfio_external_user_iommu_id(
+					kvg->vfio_group);
+			grp = iommu_group_get_by_id(group_id);
+			if (!grp) {
+				ret = -EFAULT;
+				break;
+			}
+
+			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
+					param.tablefd, grp);
+			iommu_group_put(grp);
+			break;
+		}
+
+		mutex_unlock(&kv->lock);
+
+		kvm_vfio_group_put_external_user(vfio_group);
+
+		return ret;
+	}
+#endif /* CONFIG_SPAPR_TCE_IOMMU */
 	}
 
 	return -ENXIO;
@@ -225,6 +326,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
 		switch (attr->attr) {
 		case KVM_DEV_VFIO_GROUP_ADD:
 		case KVM_DEV_VFIO_GROUP_DEL:
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
+#endif
 			return 0;
 		}
 
@@ -240,6 +344,10 @@ static void kvm_vfio_destroy(struct kvm_device *dev)
 	struct kvm_vfio_group *kvg, *tmp;
 
 	list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+		kvm_vfio_spapr_detach_iommu_group(dev->kvm,
+				kvg->vfio_group);
+#endif
 		kvm_vfio_group_put_external_user(kvg->vfio_group);
 		list_del(&kvg->node);
 		kfree(kvg);
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH kernel 9/9] KVM: PPC: Add in-kernel acceleration for VFIO
  2016-12-08  8:19   ` Alexey Kardashevskiy
@ 2016-12-08 17:55     ` Alex Williamson
  -1 siblings, 0 replies; 47+ messages in thread
From: Alex Williamson @ 2016-12-08 17:55 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, David Gibson, Paul Mackerras, kvm-ppc, kvm

On Thu,  8 Dec 2016 19:19:56 +1100
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
> without passing them to user space, which saves time on switching
> to user space and back.
> 
> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> KVM tries to handle a TCE request in real mode; if that fails,
> it passes the request to virtual mode to complete the operation.
> If the virtual mode handler fails, the request is passed to
> user space; this is not expected to happen though.
> 
> To avoid dealing with page use counters (which is tricky in real mode),
> this only accelerates SPAPR TCE IOMMU v2 clients which are required
> to pre-register the userspace memory. The very first TCE request will
> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> of the TCE table (iommu_table::it_userspace) is not allocated till
> the very first mapping happens and we cannot call vmalloc in real mode.
> 
> This adds a new attribute, KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE, to
> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> and associates a physical IOMMU table with the SPAPR TCE table (which
> is a guest view of the hardware IOMMU table). The iommu_table object
> is referenced so we do not have to retrieve it in real mode when a
> hypercall happens.
> 
> This does not implement the UNSET counterpart as there is no use for it -
> once the acceleration is enabled, the existing userspace won't
> disable it unless a VFIO container is destroyed, so this adds the necessary
> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> 
> This uses the kvm->lock mutex to protect against a race between
> the VFIO KVM device's kvm_vfio_destroy() and SPAPR TCE table fd's
> release() callback.
> 
> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> space.
> 
> This finally makes use of vfio_external_user_iommu_id() which was
> introduced quite some time ago and was considered for removal.
> 
> Tests show that this patch increases transmission speed from 220MB/s
> to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb ethernet card).
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  Documentation/virtual/kvm/devices/vfio.txt |  21 +-
>  arch/powerpc/include/asm/kvm_host.h        |   8 +
>  arch/powerpc/include/asm/kvm_ppc.h         |   5 +
>  include/uapi/linux/kvm.h                   |   8 +
>  arch/powerpc/kvm/book3s_64_vio.c           | 302 +++++++++++++++++++++++++++++
>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 178 +++++++++++++++++
>  arch/powerpc/kvm/powerpc.c                 |   2 +
>  virt/kvm/vfio.c                            | 108 +++++++++++
>  8 files changed, 630 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> index ef51740c67ca..ddb5a6512ab3 100644
> --- a/Documentation/virtual/kvm/devices/vfio.txt
> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> @@ -16,7 +16,24 @@ Groups:
>  
>  KVM_DEV_VFIO_GROUP attributes:
>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> +	kvm_device_attr.addr points to an int32_t file descriptor
> +	for the VFIO group.
>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
> +	kvm_device_attr.addr points to an int32_t file descriptor
> +	for the VFIO group.
> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
> +	allocated by sPAPR KVM.
> +	kvm_device_attr.addr points to a struct:
>  
> -For each, kvm_device_attr.addr points to an int32_t file descriptor
> -for the VFIO group.
> +	struct kvm_vfio_spapr_tce {
> +		__u32	argsz;
> +		__s32	groupfd;
> +		__s32	tablefd;
> +		__u8	pad[4];
> +	};
> +
> +	where
> +	@argsz is the size of struct kvm_vfio_spapr_tce;
> +	@groupfd is a file descriptor for a VFIO group;
> +	@tablefd is a file descriptor for a TCE table allocated via
> +		KVM_CREATE_SPAPR_TCE.
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index 28350a294b1e..94774503c70d 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
>  	atomic_t refcnt;
>  };
>  
> +struct kvmppc_spapr_tce_iommu_table {
> +	struct rcu_head rcu;
> +	struct list_head next;
> +	struct iommu_table *tbl;
> +	atomic_t refs;
> +};
> +
>  struct kvmppc_spapr_tce_table {
>  	struct list_head list;
>  	struct kvm *kvm;
> @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
>  	u32 page_shift;
>  	u64 offset;		/* in pages */
>  	u64 size;		/* window size in pages */
> +	struct list_head iommu_tables;
>  	struct page *pages[0];
>  };
>  
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> index 0a21c8503974..17b947a0060d 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -163,6 +163,11 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
>  extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
>  			struct kvm_memory_slot *memslot, unsigned long porder);
>  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm,
> +				int tablefd,
> +				struct iommu_group *grp);
> +extern void kvm_spapr_tce_detach_iommu_group(struct kvm *kvm,
> +				struct iommu_group *grp);
>  
>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  				struct kvm_create_spapr_tce_64 *args);
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 810f74317987..9e4025724e28 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1068,6 +1068,7 @@ struct kvm_device_attr {
>  #define  KVM_DEV_VFIO_GROUP			1
>  #define   KVM_DEV_VFIO_GROUP_ADD			1
>  #define   KVM_DEV_VFIO_GROUP_DEL			2
> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
>  
>  enum kvm_device_type {
>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
> @@ -1089,6 +1090,13 @@ enum kvm_device_type {
>  	KVM_DEV_TYPE_MAX,
>  };
>  
> +struct kvm_vfio_spapr_tce {
> +	__u32	argsz;
> +	__s32	groupfd;
> +	__s32	tablefd;
> +	__u8	pad[4];
> +};

I assume you're implementing argsz and padding for future expansion,
but it doesn't really work.  Presumably argsz would be set to 16, so
the only way that the kernel will ever know something has changed would
be to make it bigger, so the padding bytes are really reserved, and
then it's not clear why we have padding at all.  If you replaced the
padding with a __u32 flags, then we could actually have some room to
architect expansion, but as it is we might as well drop both argsz and
padding.
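
A minimal userspace sketch of the argsz-plus-flags convention suggested
above, assuming the 4 padding bytes simply become a __u32 flags word. The
struct name, the SPAPR_TCE_KNOWN_FLAGS mask, the OFFSETOFEND macro and the
checking helper are all hypothetical stand-ins and not part of the posted
patch:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: the posted struct with pad[4] replaced by flags. */
struct spapr_tce_args {
	uint32_t argsz;		/* size of the struct as seen by the caller */
	int32_t  groupfd;	/* VFIO group fd */
	int32_t  tablefd;	/* TCE table fd from KVM_CREATE_SPAPR_TCE */
	uint32_t flags;		/* must be zero until a flag is defined */
};

/* The size stays at 16 bytes, as with the padded version. */
static_assert(sizeof(struct spapr_tce_args) == 16, "unexpected ABI size");

/* Userspace stand-in for the kernel's offsetofend() helper. */
#define OFFSETOFEND(type, member) \
	(offsetof(type, member) + sizeof(((type *)0)->member))

/* Assumption: no flag bits are defined yet, so every set bit is unknown. */
#define SPAPR_TCE_KNOWN_FLAGS 0u

/* Returns 0 if the arguments are acceptable, -EINVAL otherwise. */
static int spapr_tce_check_args(const struct spapr_tce_args *a)
{
	size_t minsz = OFFSETOFEND(struct spapr_tce_args, flags);

	/* A caller built against a shorter struct cannot be trusted. */
	if (a->argsz < minsz)
		return -EINVAL;

	/* Unknown flag bits mean the caller expects a feature we lack. */
	if (a->flags & ~SPAPR_TCE_KNOWN_FLAGS)
		return -EINVAL;

	return 0;
}
```

With this shape a larger argsz from a newer caller is still accepted, while
any future extension has to announce itself through a flag bit, so the
kernel can reject what it does not understand.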

> +
>  /*
>   * ioctls for VM fds
>   */
...
> --- a/virt/kvm/vfio.c
> +++ b/virt/kvm/vfio.c
> @@ -20,6 +20,10 @@
>  #include <linux/vfio.h>
>  #include "vfio.h"
>  
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +#include <asm/kvm_ppc.h>
> +#endif
> +
>  struct kvm_vfio_group {
>  	struct list_head node;
>  	struct vfio_group *vfio_group;
> @@ -76,6 +80,22 @@ static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
>  	return ret > 0;
>  }
>  
> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> +{
> +	int (*fn)(struct vfio_group *);
> +	int ret = -1;
> +
> +	fn = symbol_get(vfio_external_user_iommu_id);
> +	if (!fn)
> +		return ret;
> +
> +	ret = fn(vfio_group);
> +
> +	symbol_put(vfio_external_user_iommu_id);
> +
> +	return ret;
> +}
> +
>  /*
>   * Groups can use the same or different IOMMU domains.  If the same then
>   * adding a new group may change the coherency of groups we've previously
> @@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct kvm_device *dev)
>  	mutex_unlock(&kv->lock);
>  }
>  
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
> +		struct vfio_group *vfio_group)
> +{
> +	int group_id;
> +	struct iommu_group *grp;
> +
> +	group_id = kvm_vfio_external_user_iommu_id(vfio_group);
> +	grp = iommu_group_get_by_id(group_id);
> +	if (grp) {
> +		kvm_spapr_tce_detach_iommu_group(kvm, grp);
> +		iommu_group_put(grp);
> +	}
> +}
> +#endif


I wonder if you could use the new vfio group notifier to avoid tainting
this code with SPAPR_TCE #ifdefs.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH kernel 9/9] KVM: PPC: Add in-kernel acceleration for VFIO
  2016-12-08 17:55     ` Alex Williamson
@ 2016-12-09  7:53       ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 47+ messages in thread
From: Alexey Kardashevskiy @ 2016-12-09  7:53 UTC (permalink / raw)
  To: Alex Williamson; +Cc: linuxppc-dev, David Gibson, Paul Mackerras, kvm-ppc, kvm

On 09/12/16 04:55, Alex Williamson wrote:
> On Thu,  8 Dec 2016 19:19:56 +1100
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
>> and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
>> without passing them to user space, which saves time on switching
>> to user space and back.
>>
>> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
>> KVM tries to handle a TCE request in real mode; if that fails,
>> it passes the request to virtual mode to complete the operation.
>> If the virtual mode handler fails, the request is passed to
>> user space; this is not expected to happen though.
>>
>> To avoid dealing with page use counters (which is tricky in real mode),
>> this only accelerates SPAPR TCE IOMMU v2 clients which are required
>> to pre-register the userspace memory. The very first TCE request will
>> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
>> of the TCE table (iommu_table::it_userspace) is not allocated till
>> the very first mapping happens and we cannot call vmalloc in real mode.
>>
>> This adds a new attribute, KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE, to
>> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
>> and associates a physical IOMMU table with the SPAPR TCE table (which
>> is a guest view of the hardware IOMMU table). The iommu_table object
>> is referenced so we do not have to retrieve it in real mode when a
>> hypercall happens.
>>
>> This does not implement the UNSET counterpart as there is no use for it -
>> once the acceleration is enabled, the existing userspace won't
>> disable it unless a VFIO container is destroyed, so this adds the necessary
>> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
>>
>> This uses the kvm->lock mutex to protect against a race between
>> the VFIO KVM device's kvm_vfio_destroy() and SPAPR TCE table fd's
>> release() callback.
>>
>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
>> space.
>>
>> This finally makes use of vfio_external_user_iommu_id() which was
>> introduced quite some time ago and was considered for removal.
>>
>> Tests show that this patch increases transmission speed from 220MB/s
>> to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb ethernet card).
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>  Documentation/virtual/kvm/devices/vfio.txt |  21 +-
>>  arch/powerpc/include/asm/kvm_host.h        |   8 +
>>  arch/powerpc/include/asm/kvm_ppc.h         |   5 +
>>  include/uapi/linux/kvm.h                   |   8 +
>>  arch/powerpc/kvm/book3s_64_vio.c           | 302 +++++++++++++++++++++++++++++
>>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 178 +++++++++++++++++
>>  arch/powerpc/kvm/powerpc.c                 |   2 +
>>  virt/kvm/vfio.c                            | 108 +++++++++++
>>  8 files changed, 630 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
>> index ef51740c67ca..ddb5a6512ab3 100644
>> --- a/Documentation/virtual/kvm/devices/vfio.txt
>> +++ b/Documentation/virtual/kvm/devices/vfio.txt
>> @@ -16,7 +16,24 @@ Groups:
>>  
>>  KVM_DEV_VFIO_GROUP attributes:
>>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
>> +	kvm_device_attr.addr points to an int32_t file descriptor
>> +	for the VFIO group.
>>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
>> +	kvm_device_attr.addr points to an int32_t file descriptor
>> +	for the VFIO group.
>> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
>> +	allocated by sPAPR KVM.
>> +	kvm_device_attr.addr points to a struct:
>>  
>> -For each, kvm_device_attr.addr points to an int32_t file descriptor
>> -for the VFIO group.
>> +	struct kvm_vfio_spapr_tce {
>> +		__u32	argsz;
>> +		__s32	groupfd;
>> +		__s32	tablefd;
>> +		__u8	pad[4];
>> +	};
>> +
>> +	where
>> +	@argsz is the size of struct kvm_vfio_spapr_tce;
>> +	@groupfd is a file descriptor for a VFIO group;
>> +	@tablefd is a file descriptor for a TCE table allocated via
>> +		KVM_CREATE_SPAPR_TCE.
>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
>> index 28350a294b1e..94774503c70d 100644
>> --- a/arch/powerpc/include/asm/kvm_host.h
>> +++ b/arch/powerpc/include/asm/kvm_host.h
>> @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
>>  	atomic_t refcnt;
>>  };
>>  
>> +struct kvmppc_spapr_tce_iommu_table {
>> +	struct rcu_head rcu;
>> +	struct list_head next;
>> +	struct iommu_table *tbl;
>> +	atomic_t refs;
>> +};
>> +
>>  struct kvmppc_spapr_tce_table {
>>  	struct list_head list;
>>  	struct kvm *kvm;
>> @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
>>  	u32 page_shift;
>>  	u64 offset;		/* in pages */
>>  	u64 size;		/* window size in pages */
>> +	struct list_head iommu_tables;
>>  	struct page *pages[0];
>>  };
>>  
>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
>> index 0a21c8503974..17b947a0060d 100644
>> --- a/arch/powerpc/include/asm/kvm_ppc.h
>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
>> @@ -163,6 +163,11 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
>>  extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
>>  			struct kvm_memory_slot *memslot, unsigned long porder);
>>  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
>> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm,
>> +				int tablefd,
>> +				struct iommu_group *grp);
>> +extern void kvm_spapr_tce_detach_iommu_group(struct kvm *kvm,
>> +				struct iommu_group *grp);
>>  
>>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>  				struct kvm_create_spapr_tce_64 *args);
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index 810f74317987..9e4025724e28 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -1068,6 +1068,7 @@ struct kvm_device_attr {
>>  #define  KVM_DEV_VFIO_GROUP			1
>>  #define   KVM_DEV_VFIO_GROUP_ADD			1
>>  #define   KVM_DEV_VFIO_GROUP_DEL			2
>> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
>>  
>>  enum kvm_device_type {
>>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
>> @@ -1089,6 +1090,13 @@ enum kvm_device_type {
>>  	KVM_DEV_TYPE_MAX,
>>  };
>>  
>> +struct kvm_vfio_spapr_tce {
>> +	__u32	argsz;
>> +	__s32	groupfd;
>> +	__s32	tablefd;
>> +	__u8	pad[4];
>> +};
> 
> I assume you're implementing argsz and padding for future expansion,
> but it doesn't really work.  Presumably argsz would be set to 16, so
> the only way that the kernel will ever know something has changed would
> be to make it bigger, so the padding bytes are really reserved, and
> then it's not clear why we have padding at all.  If you replaced the
> padding with a __u32 flags, then we could actually have some room to
> architect expansion, but as it is we might as well drop both argsz and
> padding.

Ah, right, sorry, did not pay attention to this bit this time. I'll replace
pad with flags and move to argsz.


> 
>> +
>>  /*
>>   * ioctls for VM fds
>>   */
> ...
>> --- a/virt/kvm/vfio.c
>> +++ b/virt/kvm/vfio.c
>> @@ -20,6 +20,10 @@
>>  #include <linux/vfio.h>
>>  #include "vfio.h"
>>  
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +#include <asm/kvm_ppc.h>
>> +#endif
>> +
>>  struct kvm_vfio_group {
>>  	struct list_head node;
>>  	struct vfio_group *vfio_group;
>> @@ -76,6 +80,22 @@ static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
>>  	return ret > 0;
>>  }
>>  
>> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
>> +{
>> +	int (*fn)(struct vfio_group *);
>> +	int ret = -1;
>> +
>> +	fn = symbol_get(vfio_external_user_iommu_id);
>> +	if (!fn)
>> +		return ret;
>> +
>> +	ret = fn(vfio_group);
>> +
>> +	symbol_put(vfio_external_user_iommu_id);
>> +
>> +	return ret;
>> +}
>> +
>>  /*
>>   * Groups can use the same or different IOMMU domains.  If the same then
>>   * adding a new group may change the coherency of groups we've previously
>> @@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct kvm_device *dev)
>>  	mutex_unlock(&kv->lock);
>>  }
>>  
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
>> +		struct vfio_group *vfio_group)
>> +{
>> +	int group_id;
>> +	struct iommu_group *grp;
>> +
>> +	group_id = kvm_vfio_external_user_iommu_id(vfio_group);
>> +	grp = iommu_group_get_by_id(group_id);
>> +	if (grp) {
>> +		kvm_spapr_tce_detach_iommu_group(kvm, grp);
>> +		iommu_group_put(grp);
>> +	}
>> +}
>> +#endif
> 
> 
> I wonder if you could use the new vfio group notifier to avoid tainting
> this code with SPAPR_TCE #ifdefs.  Thanks,

I cannot see how... The new notifier sets a kvm for a group; I need the
opposite: to attach a group to KVM, and not just to KVM but to a specific KVM
SPAPR TCE fd (which is a child object of KVM and which owns a LIOBN bus id).
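
To make the direction of that attachment concrete, here is a minimal sketch
of the userspace side, assuming the attribute and struct layout exactly as
posted in this version of the patch (the layout is under review, so the
final ABI may differ). The struct is redeclared under a hypothetical local
name, spapr_tce_attach_args, and attach_tce_table() is an illustrative
helper, not an existing API:

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Value proposed by this patch; defined here if the headers lack it. */
#ifndef KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE
#define KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE 3
#endif

/* Hypothetical local copy of the layout posted in this patch version. */
struct spapr_tce_attach_args {
	uint32_t argsz;
	int32_t  groupfd;	/* VFIO group fd */
	int32_t  tablefd;	/* fd returned by KVM_CREATE_SPAPR_TCE */
	uint8_t  pad[4];
};

/*
 * Ask the VFIO KVM device to attach the IOMMU group behind groupfd to the
 * one TCE table (and so the one LIOBN) behind tablefd.
 * Returns 0 on success or a negative errno value.
 */
static int attach_tce_table(int vfio_kvm_dev_fd, int groupfd, int tablefd)
{
	struct spapr_tce_attach_args param = {
		.argsz = sizeof(param),
		.groupfd = groupfd,
		.tablefd = tablefd,
	};
	struct kvm_device_attr attr = {
		.group = KVM_DEV_VFIO_GROUP,
		.attr = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE,
		.addr = (uint64_t)(uintptr_t)&param,
	};

	if (ioctl(vfio_kvm_dev_fd, KVM_SET_DEVICE_ATTR, &attr) < 0)
		return -errno;

	return 0;
}
```

The call names both the group and the table, which is the information a
notifier that only hands a kvm pointer to a group does not carry.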

The only way I can see to avoid tainting this code is to add another
ioctl() to PPC KVM (or its child object, the SPAPR TCE object). I tried
that a few years ago and was told to add a KVM device or even reuse the VFIO
KVM device.

What am I missing here?


-- 
Alexey

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH kernel 9/9] KVM: PPC: Add in-kernel acceleration for VFIO
@ 2016-12-09  7:53       ` Alexey Kardashevskiy
  0 siblings, 0 replies; 47+ messages in thread
From: Alexey Kardashevskiy @ 2016-12-09  7:53 UTC (permalink / raw)
  To: Alex Williamson; +Cc: linuxppc-dev, David Gibson, Paul Mackerras, kvm-ppc, kvm

On 09/12/16 04:55, Alex Williamson wrote:
> On Thu,  8 Dec 2016 19:19:56 +1100
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
>> and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
>> without passing them to user space, which saves time on switching
>> to user space and back.
>>
>> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
>> KVM tries to handle a TCE request in real mode; if that fails,
>> it passes the request to virtual mode to complete the operation.
>> If the virtual mode handler fails, the request is passed to
>> user space; this is not expected to happen though.
>>
>> To avoid dealing with page use counters (which is tricky in real mode),
>> this only accelerates SPAPR TCE IOMMU v2 clients which are required
>> to pre-register the userspace memory. The very first TCE request will
>> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
>> of the TCE table (iommu_table::it_userspace) is not allocated till
>> the very first mapping happens and we cannot call vmalloc in real mode.
>>
>> This adds a new attribute, KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE, to
>> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
>> and associates a physical IOMMU table with the SPAPR TCE table (which
>> is a guest view of the hardware IOMMU table). The iommu_table object
>> is referenced so we do not have to retrieve it in real mode when a
>> hypercall happens.
>>
>> This does not implement the UNSET counterpart as there is no use for it -
>> once the acceleration is enabled, the existing userspace won't
>> disable it unless a VFIO container is destroyed, so this adds the necessary
>> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
>>
>> This uses the kvm->lock mutex to protect against a race between
>> the VFIO KVM device's kvm_vfio_destroy() and SPAPR TCE table fd's
>> release() callback.
>>
>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
>> space.
>>
>> This finally makes use of vfio_external_user_iommu_id() which was
>> introduced quite some time ago and was considered for removal.
>>
>> Tests show that this patch increases transmission speed from 220MB/s
>> to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb ethernet card).
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>  Documentation/virtual/kvm/devices/vfio.txt |  21 +-
>>  arch/powerpc/include/asm/kvm_host.h        |   8 +
>>  arch/powerpc/include/asm/kvm_ppc.h         |   5 +
>>  include/uapi/linux/kvm.h                   |   8 +
>>  arch/powerpc/kvm/book3s_64_vio.c           | 302 +++++++++++++++++++++++++++++
>>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 178 +++++++++++++++++
>>  arch/powerpc/kvm/powerpc.c                 |   2 +
>>  virt/kvm/vfio.c                            | 108 +++++++++++
>>  8 files changed, 630 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
>> index ef51740c67ca..ddb5a6512ab3 100644
>> --- a/Documentation/virtual/kvm/devices/vfio.txt
>> +++ b/Documentation/virtual/kvm/devices/vfio.txt
>> @@ -16,7 +16,24 @@ Groups:
>>  
>>  KVM_DEV_VFIO_GROUP attributes:
>>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
>> +	kvm_device_attr.addr points to an int32_t file descriptor
>> +	for the VFIO group.
>>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
>> +	kvm_device_attr.addr points to an int32_t file descriptor
>> +	for the VFIO group.
>> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
>> +	allocated by sPAPR KVM.
>> +	kvm_device_attr.addr points to a struct:
>>  
>> -For each, kvm_device_attr.addr points to an int32_t file descriptor
>> -for the VFIO group.
>> +	struct kvm_vfio_spapr_tce {
>> +		__u32	argsz;
>> +		__s32	groupfd;
>> +		__s32	tablefd;
>> +		__u8	pad[4];
>> +	};
>> +
>> +	where
>> +	@argsz is the size of struct kvm_vfio_spapr_tce;
>> +	@groupfd is a file descriptor for a VFIO group;
>> +	@tablefd is a file descriptor for a TCE table allocated via
>> +		KVM_CREATE_SPAPR_TCE.
>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
>> index 28350a294b1e..94774503c70d 100644
>> --- a/arch/powerpc/include/asm/kvm_host.h
>> +++ b/arch/powerpc/include/asm/kvm_host.h
>> @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
>>  	atomic_t refcnt;
>>  };
>>  
>> +struct kvmppc_spapr_tce_iommu_table {
>> +	struct rcu_head rcu;
>> +	struct list_head next;
>> +	struct iommu_table *tbl;
>> +	atomic_t refs;
>> +};
>> +
>>  struct kvmppc_spapr_tce_table {
>>  	struct list_head list;
>>  	struct kvm *kvm;
>> @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
>>  	u32 page_shift;
>>  	u64 offset;		/* in pages */
>>  	u64 size;		/* window size in pages */
>> +	struct list_head iommu_tables;
>>  	struct page *pages[0];
>>  };
>>  
>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
>> index 0a21c8503974..17b947a0060d 100644
>> --- a/arch/powerpc/include/asm/kvm_ppc.h
>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
>> @@ -163,6 +163,11 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
>>  extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
>>  			struct kvm_memory_slot *memslot, unsigned long porder);
>>  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
>> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm,
>> +				int tablefd,
>> +				struct iommu_group *grp);
>> +extern void kvm_spapr_tce_detach_iommu_group(struct kvm *kvm,
>> +				struct iommu_group *grp);
>>  
>>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>  				struct kvm_create_spapr_tce_64 *args);
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index 810f74317987..9e4025724e28 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -1068,6 +1068,7 @@ struct kvm_device_attr {
>>  #define  KVM_DEV_VFIO_GROUP			1
>>  #define   KVM_DEV_VFIO_GROUP_ADD			1
>>  #define   KVM_DEV_VFIO_GROUP_DEL			2
>> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
>>  
>>  enum kvm_device_type {
>>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
>> @@ -1089,6 +1090,13 @@ enum kvm_device_type {
>>  	KVM_DEV_TYPE_MAX,
>>  };
>>  
>> +struct kvm_vfio_spapr_tce {
>> +	__u32	argsz;
>> +	__s32	groupfd;
>> +	__s32	tablefd;
>> +	__u8	pad[4];
>> +};
> 
> I assume you're implementing argsz and padding for future expansion,
> but it doesn't really work.  Presumably argsz would be set to 16, so
> the only way that the kernel will ever know something has changed would
> be to make it bigger, so the padding bytes are really reserved, and
> then it's not clear why we have padding at all.  If you replaced the
> padding with a __u32 flags, then we could actually have some room to
> architect expansion, but as it is we might as well drop both argsz and
> padding.

Ah, right, sorry, did not pay attention to this bit this time. I'll replace
pad with flags and move to argsz.
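
To illustrate the argsz/flags convention discussed above, here is a minimal
userspace sketch. It assumes "pad" is simply replaced by a __u32 "flags" in
the same position; the struct name, the macros and example_validate() are
hypothetical and are not taken from the patch.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: "pad[4]" replaced by "flags", as suggested in review. */
struct example_vfio_spapr_tce {
	uint32_t argsz;
	int32_t  groupfd;
	int32_t  tablefd;
	uint32_t flags;
};

/* No flags are defined yet, so every bit is still reserved. */
#define EXAMPLE_KNOWN_FLAGS 0u

/* Smallest size this handler understands: everything up to and including flags. */
#define EXAMPLE_MINSZ \
	(offsetof(struct example_vfio_spapr_tce, flags) + sizeof(uint32_t))

/* Returns 0 if the argument is acceptable, -EINVAL otherwise. */
static int example_validate(const struct example_vfio_spapr_tce *arg)
{
	if (arg->argsz < EXAMPLE_MINSZ)
		return -EINVAL;	/* caller's struct is too small */
	if (arg->flags & ~EXAMPLE_KNOWN_FLAGS)
		return -EINVAL;	/* unknown flag: reject rather than ignore */
	return 0;
}
```

Rejecting unknown flag bits is what leaves room for expansion: a later
kernel can define a bit that announces a new trailing field, and older
kernels will refuse it instead of silently ignoring it.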


> 
>> +
>>  /*
>>   * ioctls for VM fds
>>   */
> ...
>> --- a/virt/kvm/vfio.c
>> +++ b/virt/kvm/vfio.c
>> @@ -20,6 +20,10 @@
>>  #include <linux/vfio.h>
>>  #include "vfio.h"
>>  
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +#include <asm/kvm_ppc.h>
>> +#endif
>> +
>>  struct kvm_vfio_group {
>>  	struct list_head node;
>>  	struct vfio_group *vfio_group;
>> @@ -76,6 +80,22 @@ static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
>>  	return ret > 0;
>>  }
>>  
>> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
>> +{
>> +	int (*fn)(struct vfio_group *);
>> +	int ret = -1;
>> +
>> +	fn = symbol_get(vfio_external_user_iommu_id);
>> +	if (!fn)
>> +		return ret;
>> +
>> +	ret = fn(vfio_group);
>> +
>> +	symbol_put(vfio_external_user_iommu_id);
>> +
>> +	return ret;
>> +}
>> +
>>  /*
>>   * Groups can use the same or different IOMMU domains.  If the same then
>>   * adding a new group may change the coherency of groups we've previously
>> @@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct kvm_device *dev)
>>  	mutex_unlock(&kv->lock);
>>  }
>>  
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
>> +		struct vfio_group *vfio_group)
>> +{
>> +	int group_id;
>> +	struct iommu_group *grp;
>> +
>> +	group_id = kvm_vfio_external_user_iommu_id(vfio_group);
>> +	grp = iommu_group_get_by_id(group_id);
>> +	if (grp) {
>> +		kvm_spapr_tce_detach_iommu_group(kvm, grp);
>> +		iommu_group_put(grp);
>> +	}
>> +}
>> +#endif
> 
> 
> I wonder if you could use the new vfio group notifier to avoid tainting
> this code with SPAPR_TCE #ifdefs.  Thanks,

I cannot see how... The new notifier sets kvm to a group; I need the
opposite - to attach a group to kvm, and not just to KVM but to a specific KVM
SPAPR TCE fd (which is a child object of KVM and which owns a LIOBN bus id).

The only way I can see to avoid tainting this code is to add another
ioctl() to PPC KVM (or its child object - the SPAPR TCE object), and I tried
that a few years ago and was told to add a KVM device or even reuse the VFIO
KVM device.

What am I missing here?


-- 
Alexey

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH kernel 9/9] KVM: PPC: Add in-kernel acceleration for VFIO
  2016-12-09  7:53       ` Alexey Kardashevskiy
@ 2016-12-09 15:35         ` Alex Williamson
  -1 siblings, 0 replies; 47+ messages in thread
From: Alex Williamson @ 2016-12-09 15:35 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, David Gibson, Paul Mackerras, kvm-ppc, kvm

On Fri, 9 Dec 2016 18:53:43 +1100
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 09/12/16 04:55, Alex Williamson wrote:
> > On Thu,  8 Dec 2016 19:19:56 +1100
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >   
> >> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> >> and H_STUFF_TCE requests targeting an IOMMU TCE table used for VFIO
> >> without passing them to user space which saves time on switching
> >> to user space and back.
> >>
> >> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> >> KVM tries to handle a TCE request in real mode; if that fails,
> >> it passes the request to virtual mode to complete the operation.
> >> If the virtual mode handler fails, the request is passed to
> >> user space; this is not expected to happen though.
> >>
> >> To avoid dealing with page use counters (which is tricky in real mode),
> >> this only accelerates SPAPR TCE IOMMU v2 clients which are required
> >> to pre-register the userspace memory. The very first TCE request will
> >> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> >> of the TCE table (iommu_table::it_userspace) is not allocated till
> >> the very first mapping happens and we cannot call vmalloc in real mode.
> >>
> >> This adds a new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> >> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> >> and associates a physical IOMMU table with the SPAPR TCE table (which
> >> is a guest view of the hardware IOMMU table). The iommu_table object
> >> is referenced so we do not have to retrieve it in real mode when a
> >> hypercall happens.
> >>
> >> This does not implement the UNSET counterpart as there is no use for it -
> >> once the acceleration is enabled, the existing userspace won't
> >> disable it unless a VFIO container is destroyed so this adds necessary
> >> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> >>
> >> This uses the kvm->lock mutex to protect against a race between
> >> the VFIO KVM device's kvm_vfio_destroy() and SPAPR TCE table fd's
> >> release() callback.
> >>
> >> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> >> space.
> >>
> >> This finally makes use of vfio_external_user_iommu_id() which was
> >> introduced quite some time ago and was considered for removal.
> >>
> >> Tests show that this patch increases transmission speed from 220MB/s
> >> to 750..1020MB/s on 10Gb network (Chelsio CXGB3 10Gb ethernet card).
> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> ---
> >>  Documentation/virtual/kvm/devices/vfio.txt |  21 +-
> >>  arch/powerpc/include/asm/kvm_host.h        |   8 +
> >>  arch/powerpc/include/asm/kvm_ppc.h         |   5 +
> >>  include/uapi/linux/kvm.h                   |   8 +
> >>  arch/powerpc/kvm/book3s_64_vio.c           | 302 +++++++++++++++++++++++++++++
> >>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 178 +++++++++++++++++
> >>  arch/powerpc/kvm/powerpc.c                 |   2 +
> >>  virt/kvm/vfio.c                            | 108 +++++++++++
> >>  8 files changed, 630 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> >> index ef51740c67ca..ddb5a6512ab3 100644
> >> --- a/Documentation/virtual/kvm/devices/vfio.txt
> >> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> >> @@ -16,7 +16,24 @@ Groups:
> >>  
> >>  KVM_DEV_VFIO_GROUP attributes:
> >>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> >> +	kvm_device_attr.addr points to an int32_t file descriptor
> >> +	for the VFIO group.
> >>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
> >> +	kvm_device_attr.addr points to an int32_t file descriptor
> >> +	for the VFIO group.
> >> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
> >> +	allocated by sPAPR KVM.
> >> +	kvm_device_attr.addr points to a struct:
> >>  
> >> -For each, kvm_device_attr.addr points to an int32_t file descriptor
> >> -for the VFIO group.
> >> +	struct kvm_vfio_spapr_tce {
> >> +		__u32	argsz;
> >> +		__s32	groupfd;
> >> +		__s32	tablefd;
> >> +		__u8	pad[4];
> >> +	};
> >> +
> >> +	where
> >> +	@argsz is the size of kvm_vfio_spapr_tce;
> >> +	@groupfd is a file descriptor for a VFIO group;
> >> +	@tablefd is a file descriptor for a TCE table allocated via
> >> +		KVM_CREATE_SPAPR_TCE.
> >> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> >> index 28350a294b1e..94774503c70d 100644
> >> --- a/arch/powerpc/include/asm/kvm_host.h
> >> +++ b/arch/powerpc/include/asm/kvm_host.h
> >> @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
> >>  	atomic_t refcnt;
> >>  };
> >>  
> >> +struct kvmppc_spapr_tce_iommu_table {
> >> +	struct rcu_head rcu;
> >> +	struct list_head next;
> >> +	struct iommu_table *tbl;
> >> +	atomic_t refs;
> >> +};
> >> +
> >>  struct kvmppc_spapr_tce_table {
> >>  	struct list_head list;
> >>  	struct kvm *kvm;
> >> @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
> >>  	u32 page_shift;
> >>  	u64 offset;		/* in pages */
> >>  	u64 size;		/* window size in pages */
> >> +	struct list_head iommu_tables;
> >>  	struct page *pages[0];
> >>  };
> >>  
> >> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> >> index 0a21c8503974..17b947a0060d 100644
> >> --- a/arch/powerpc/include/asm/kvm_ppc.h
> >> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> >> @@ -163,6 +163,11 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
> >>  extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
> >>  			struct kvm_memory_slot *memslot, unsigned long porder);
> >>  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
> >> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm,
> >> +				int tablefd,
> >> +				struct iommu_group *grp);
> >> +extern void kvm_spapr_tce_detach_iommu_group(struct kvm *kvm,
> >> +				struct iommu_group *grp);
> >>  
> >>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> >>  				struct kvm_create_spapr_tce_64 *args);
> >> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >> index 810f74317987..9e4025724e28 100644
> >> --- a/include/uapi/linux/kvm.h
> >> +++ b/include/uapi/linux/kvm.h
> >> @@ -1068,6 +1068,7 @@ struct kvm_device_attr {
> >>  #define  KVM_DEV_VFIO_GROUP			1
> >>  #define   KVM_DEV_VFIO_GROUP_ADD			1
> >>  #define   KVM_DEV_VFIO_GROUP_DEL			2
> >> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
> >>  
> >>  enum kvm_device_type {
> >>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
> >> @@ -1089,6 +1090,13 @@ enum kvm_device_type {
> >>  	KVM_DEV_TYPE_MAX,
> >>  };
> >>  
> >> +struct kvm_vfio_spapr_tce {
> >> +	__u32	argsz;
> >> +	__s32	groupfd;
> >> +	__s32	tablefd;
> >> +	__u8	pad[4];
> >> +};  
> > 
> > I assume you're implementing argsz and padding for future expansion,
> > but it doesn't really work.  Presumably argsz would be set to 16, so
> > the only way that the kernel will ever know something has changed would
> > be to make it bigger, so the padding bytes are really reserved, and
> > then it's not clear why we have padding at all.  If you replaced the
> > padding with a __u32 flags, then we could actually have some room to
> > architect expansion, but as it is we might as well drop both argsz and
> > padding.  
> 
> Ah, right, sorry, did not pay attention to this bit this time. I'll replace
> pad with flags and move to argsz.
> 
> 
> >   
> >> +
> >>  /*
> >>   * ioctls for VM fds
> >>   */  
> > ...  
> >> --- a/virt/kvm/vfio.c
> >> +++ b/virt/kvm/vfio.c
> >> @@ -20,6 +20,10 @@
> >>  #include <linux/vfio.h>
> >>  #include "vfio.h"
> >>  
> >> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >> +#include <asm/kvm_ppc.h>
> >> +#endif
> >> +
> >>  struct kvm_vfio_group {
> >>  	struct list_head node;
> >>  	struct vfio_group *vfio_group;
> >> @@ -76,6 +80,22 @@ static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
> >>  	return ret > 0;
> >>  }
> >>  
> >> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> >> +{
> >> +	int (*fn)(struct vfio_group *);
> >> +	int ret = -1;
> >> +
> >> +	fn = symbol_get(vfio_external_user_iommu_id);
> >> +	if (!fn)
> >> +		return ret;
> >> +
> >> +	ret = fn(vfio_group);
> >> +
> >> +	symbol_put(vfio_external_user_iommu_id);
> >> +
> >> +	return ret;
> >> +}
> >> +
> >>  /*
> >>   * Groups can use the same or different IOMMU domains.  If the same then
> >>   * adding a new group may change the coherency of groups we've previously
> >> @@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct kvm_device *dev)
> >>  	mutex_unlock(&kv->lock);
> >>  }
> >>  
> >> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >> +static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
> >> +		struct vfio_group *vfio_group)
> >> +{
> >> +	int group_id;
> >> +	struct iommu_group *grp;
> >> +
> >> +	group_id = kvm_vfio_external_user_iommu_id(vfio_group);
> >> +	grp = iommu_group_get_by_id(group_id);
> >> +	if (grp) {
> >> +		kvm_spapr_tce_detach_iommu_group(kvm, grp);
> >> +		iommu_group_put(grp);
> >> +	}
> >> +}
> >> +#endif  
> > 
> > 
> > I wonder if you could use the new vfio group notifier to avoid tainting
> > this code with SPAPR_TCE #ifdefs.  Thanks,  
> 
> I cannot see how... The new notifier sets kvm to a group; I need the
> opposite - to attach a group to kvm, and not just to KVM but to a specific KVM
> SPAPR TCE fd (which is a child object of KVM and which owns a LIOBN bus id).
> 
> The only way I can see to avoid tainting this code is to add another
> ioctl() to PPC KVM (or its child object - the SPAPR TCE object), and I tried
> that a few years ago and was told to add a KVM device or even reuse the VFIO
> KVM device.
> 
> What am I missing here?

You would still need a KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE, but the ugly
part of encompassing this all in #ifdefs is that we call
kvm_spapr_tce_{at,de}tach_iommu_group() directly.  The notifier would
sort of make it like an arch callback: vfio core would set these
attributes and broadcast them to notifier callbacks, and on non-spapr-tce
platforms nobody would be listening for those notifications.
Ultimately I don't know how much cleaner it makes things, but it maybe
avoids spapr-tce #ifdefs leaking into every layer of the stack.  Thanks,
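
The broadcast idea above can be sketched in plain userspace C. This is
only a model of the pattern, not the kernel's notifier API or the vfio
group notifier; every name below (example_event, example_register(),
example_broadcast(), example_spapr_listener()) is hypothetical.

```c
#include <stddef.h>

#define EXAMPLE_MAX_LISTENERS 4

/* Hypothetical event carrying the group/table pair from the attribute. */
struct example_event {
	int group_id;
	int tablefd;
};

typedef int (*example_listener_fn)(const struct example_event *ev);

static example_listener_fn example_listeners[EXAMPLE_MAX_LISTENERS];
static size_t example_nr_listeners;

/* Arch code registers here; the generic code never names the arch. */
static int example_register(example_listener_fn fn)
{
	if (example_nr_listeners == EXAMPLE_MAX_LISTENERS)
		return -1;
	example_listeners[example_nr_listeners++] = fn;
	return 0;
}

/* Generic code broadcasts; returns how many listeners accepted the event. */
static int example_broadcast(const struct example_event *ev)
{
	int handled = 0;

	for (size_t i = 0; i < example_nr_listeners; i++)
		if (example_listeners[i](ev) == 0)
			handled++;
	return handled;
}

/* Stand-in for an sPAPR-specific handler; accepts any valid table fd. */
static int example_spapr_listener(const struct example_event *ev)
{
	return ev->tablefd >= 0 ? 0 : -1;
}
```

With no listener registered, example_broadcast() simply reports zero
handlers, which models the non-spapr-tce case: the generic path needs no
#ifdef because nobody is listening.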

Alex

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH kernel 1/9] KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number
  2016-12-08  8:19   ` Alexey Kardashevskiy
@ 2016-12-12  4:08     ` David Gibson
  -1 siblings, 0 replies; 47+ messages in thread
From: David Gibson @ 2016-12-12  4:08 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Alex Williamson, Paul Mackerras, kvm-ppc, kvm

[-- Attachment #1: Type: text/plain, Size: 1497 bytes --]

On Thu, Dec 08, 2016 at 07:19:48PM +1100, Alexey Kardashevskiy wrote:
> This adds a capability number for in-kernel support for VFIO on
> the SPAPR platform.
> 
> The capability will tell user space whether in-kernel handlers of
> H_PUT_TCE can handle VFIO-targeted requests or not. If not, user space
> must not attempt to allocate a TCE table in the host kernel via
> the KVM_CREATE_SPAPR_TCE KVM ioctl because in that case TCE requests
> will not be passed to user space, which is the desired action in
> a situation like that.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

Fine as far as it goes, although I wonder if you actually need a new
CAP - couldn't you just add a new cap value to KVM_CAP_SPAPR_TCE?
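
As a sketch of the "new cap value" alternative: KVM_CHECK_EXTENSION
returns 0 for an unsupported capability and a positive integer otherwise,
so a single capability could report levels. The enum values and
example_may_create_table() below are hypothetical and only model the
userspace decision; they are not existing KVM definitions.

```c
#include <stdbool.h>

/* Hypothetical values a single capability query could return. */
enum example_tce_cap {
	EXAMPLE_TCE_NONE     = 0,	/* no in-kernel TCE tables */
	EXAMPLE_TCE_EMULATED = 1,	/* in-kernel tables, emulated devices only */
	EXAMPLE_TCE_VFIO     = 2,	/* in-kernel handlers also cover VFIO */
};

/*
 * Decide whether userspace may allocate an in-kernel TCE table.
 * With a VFIO device behind the table, the kernel must be able to
 * handle VFIO-targeted requests itself, because they would no longer
 * be passed to user space.
 */
static bool example_may_create_table(int cap_value, bool has_vfio_device)
{
	if (cap_value <= EXAMPLE_TCE_NONE)
		return false;
	if (has_vfio_device)
		return cap_value >= EXAMPLE_TCE_VFIO;
	return true;
}
```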

> ---
>  include/uapi/linux/kvm.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 300ef255d1e0..810f74317987 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -870,6 +870,7 @@ struct kvm_ppc_smmu_info {
>  #define KVM_CAP_S390_USER_INSTR0 130
>  #define KVM_CAP_MSI_DEVID 131
>  #define KVM_CAP_PPC_HTM 132
> +#define KVM_CAP_SPAPR_TCE_VFIO 133
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH kernel 2/9] powerpc/iommu: Cleanup iommu_table disposal
  2016-12-08  8:19   ` Alexey Kardashevskiy
@ 2016-12-12  4:15     ` David Gibson
  -1 siblings, 0 replies; 47+ messages in thread
From: David Gibson @ 2016-12-12  4:15 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Alex Williamson, Paul Mackerras, kvm-ppc, kvm

[-- Attachment #1: Type: text/plain, Size: 4010 bytes --]

On Thu, Dec 08, 2016 at 07:19:49PM +1100, Alexey Kardashevskiy wrote:
> At the moment iommu_table can be disposed of by either calling
> iommu_free_table() directly or it_ops::free(); the only implementation
> of free() is in IODA2 - pnv_ioda2_table_free() - and it calls
> iommu_free_table() anyway.
> 
> As we are going to have reference counting on tables, we need a unified
> way of disposing of tables.
> 
> This moves it_ops::free() call into iommu_free_table() and makes use
> of the latter. The free() callback now handles only platform-specific
> data.
> 
> This should cause no behavioral change.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
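
The disposal scheme described in the commit message can be modelled in
userspace roughly as follows. This is a sketch with hypothetical names
(example_table, example_free_table() and so on), not the kernel code: the
optional callback releases only platform-specific data, and one function
owns freeing the table itself.

```c
#include <stdlib.h>

struct example_table;

struct example_table_ops {
	/* Optional: releases platform-specific data only, not the table. */
	void (*free)(struct example_table *tbl);
};

struct example_table {
	const struct example_table_ops *ops;
	void *platform_pages;
};

/* Counts callback invocations so the behaviour can be observed. */
static int example_platform_frees;

static void example_platform_free(struct example_table *tbl)
{
	free(tbl->platform_pages);
	tbl->platform_pages = NULL;
	example_platform_frees++;
}

static const struct example_table_ops example_ops = {
	.free = example_platform_free,
};

/* Single disposal path: platform cleanup first, then the table itself. */
static void example_free_table(struct example_table *tbl)
{
	if (!tbl)
		return;
	if (tbl->ops && tbl->ops->free)
		tbl->ops->free(tbl);
	free(tbl);
}
```

Having one entry point is what makes later reference counting
straightforward: the final put only has to call a single function.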

> ---
>  arch/powerpc/kernel/iommu.c               | 4 ++++
>  arch/powerpc/platforms/powernv/pci-ioda.c | 6 ++----
>  drivers/vfio/vfio_iommu_spapr_tce.c       | 2 +-
>  3 files changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index 5f202a566ec5..6744a2771769 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -719,6 +719,9 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
>  	if (!tbl)
>  		return;
>  
> +	if (tbl->it_ops->free)
> +		tbl->it_ops->free(tbl);
> +
>  	if (!tbl->it_map) {
>  		kfree(tbl);
>  		return;
> @@ -745,6 +748,7 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
>  	/* free table */
>  	kfree(tbl);
>  }
> +EXPORT_SYMBOL_GPL(iommu_free_table);
>  
>  /* Creates TCEs for a user provided buffer.  The user buffer must be
>   * contiguous real kernel storage (not vmalloc).  The address passed here
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 5fcae29107e1..c4f9e812ca6c 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1422,7 +1422,6 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe
>  		iommu_group_put(pe->table_group.group);
>  		BUG_ON(pe->table_group.group);
>  	}
> -	pnv_pci_ioda2_table_free_pages(tbl);
>  	iommu_free_table(tbl, of_node_full_name(dev->dev.of_node));
>  }
>  
> @@ -2013,7 +2012,6 @@ static void pnv_ioda2_tce_free(struct iommu_table *tbl, long index,
>  static void pnv_ioda2_table_free(struct iommu_table *tbl)
>  {
>  	pnv_pci_ioda2_table_free_pages(tbl);
> -	iommu_free_table(tbl, "pnv");
>  }
>  
>  static struct iommu_table_ops pnv_ioda2_iommu_ops = {
> @@ -2339,7 +2337,7 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
>  	if (rc) {
>  		pe_err(pe, "Failed to configure 32-bit TCE table, err %ld\n",
>  				rc);
> -		pnv_ioda2_table_free(tbl);
> +		iommu_free_table(tbl, "");
>  		return rc;
>  	}
>  
> @@ -2425,7 +2423,7 @@ static void pnv_ioda2_take_ownership(struct iommu_table_group *table_group)
>  
>  	pnv_pci_ioda2_set_bypass(pe, false);
>  	pnv_pci_ioda2_unset_window(&pe->table_group, 0);
> -	pnv_ioda2_table_free(tbl);
> +	iommu_free_table(tbl, "pnv");
>  }
>  
>  static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group)
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index c8823578a1b2..cbac08af400e 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -677,7 +677,7 @@ static void tce_iommu_free_table(struct tce_container *container,
>  	unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT;
>  
>  	tce_iommu_userspace_view_free(tbl, container->mm);
> -	tbl->it_ops->free(tbl);
> +	iommu_free_table(tbl, "");
>  	decrement_locked_vm(container->mm, pages);
>  }
>  

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH kernel 3/9] powerpc/vfio_spapr_tce: Add reference counting to iommu_table
  2016-12-08  8:19   ` Alexey Kardashevskiy
@ 2016-12-12  4:18     ` David Gibson
  -1 siblings, 0 replies; 47+ messages in thread
From: David Gibson @ 2016-12-12  4:18 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Alex Williamson, Paul Mackerras, kvm-ppc, kvm

[-- Attachment #1: Type: text/plain, Size: 9274 bytes --]

On Thu, Dec 08, 2016 at 07:19:50PM +1100, Alexey Kardashevskiy wrote:
> So far iommu_table objects were only used in virtual mode and had
> a single owner. We are going to change this by implementing in-kernel
> acceleration of DMA mapping requests. The proposed acceleration
> will handle requests in real mode and KVM will keep references to tables.
> 
> This adds a kref to iommu_table and defines new helpers to update it.
> This replaces iommu_free_table() with iommu_table_put() and turns the
> former into a static kref release callback, iommu_table_free().
> iommu_table_get() is not used in this patch but it will be in the
> following patch.
> 
> Since this touches prototypes, this also removes the @node_name parameter
> as it has never been really useful on powernv and carrying it for
> the pseries platform code to iommu_free_table() seems to be quite
> useless as well.
> 
> This should cause no behavioral change.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
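
The get/put pattern introduced here can be tried out in userspace with a minimal sketch. It assumes nothing beyond what the patch shows: a counter embedded in the table, initialised to one at allocation, and a release function that recovers the table with container_of(). `struct ref`, `ref_put()`, `table_get()` and the rest are hypothetical names for illustration, and the counter is a plain int rather than the atomic refcount the kernel's kref uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace stand-in for struct kref; NOT atomic, illustration only. */
struct ref {
	int count;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static void ref_init(struct ref *r)
{
	r->count = 1;		/* the allocator holds the first reference */
}

static void ref_get(struct ref *r)
{
	assert(r->count > 0);	/* taking a ref on a dead object is a bug */
	r->count++;
}

/* Returns 1 if this put dropped the last reference and ran release(). */
static int ref_put(struct ref *r, void (*release)(struct ref *r))
{
	assert(r->count > 0);
	if (--r->count == 0) {
		release(r);
		return 1;
	}
	return 0;
}

/* Hypothetical stand-in for iommu_table with an embedded counter. */
struct table {
	unsigned long size;
	struct ref it_ref;	/* plays the role of it_kref */
};

static int tables_freed;

/* Release callback: map the embedded counter back to its table. */
static void table_release(struct ref *r)
{
	struct table *tbl = container_of(r, struct table, it_ref);

	free(tbl);
	tables_freed++;
}

static struct table *table_alloc(void)
{
	struct table *tbl = calloc(1, sizeof(*tbl));

	if (tbl)
		ref_init(&tbl->it_ref);
	return tbl;
}

static void table_get(struct table *tbl)
{
	ref_get(&tbl->it_ref);
}

static void table_put(struct table *tbl)
{
	if (!tbl)
		return;		/* tolerate NULL, as iommu_table_put() does */
	ref_put(&tbl->it_ref, table_release);
}
```

A second owner (KVM, in this series) would call `table_get()` when it starts using a table and `table_put()` when done; the table is freed only when the last holder drops its reference.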

> ---
>  arch/powerpc/include/asm/iommu.h          |  5 +++--
>  arch/powerpc/kernel/iommu.c               | 24 +++++++++++++++++++-----
>  arch/powerpc/platforms/powernv/pci-ioda.c | 14 +++++++-------
>  arch/powerpc/platforms/powernv/pci.c      |  1 +
>  arch/powerpc/platforms/pseries/iommu.c    |  3 ++-
>  arch/powerpc/platforms/pseries/vio.c      |  2 +-
>  drivers/vfio/vfio_iommu_spapr_tce.c       |  2 +-
>  7 files changed, 34 insertions(+), 17 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> index 2c1d50792944..9de8bad1fdf9 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -114,6 +114,7 @@ struct iommu_table {
>  	struct list_head it_group_list;/* List of iommu_table_group_link */
>  	unsigned long *it_userspace; /* userspace view of the table */
>  	struct iommu_table_ops *it_ops;
> +	struct kref    it_kref;
>  };
>  
>  #define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \
> @@ -146,8 +147,8 @@ static inline void *get_iommu_table_base(struct device *dev)
>  
>  extern int dma_iommu_dma_supported(struct device *dev, u64 mask);
>  
> -/* Frees table for an individual device node */
> -extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
> +extern void iommu_table_get(struct iommu_table *tbl);
> +extern void iommu_table_put(struct iommu_table *tbl);
>  
>  /* Initializes an iommu_table based in values set in the passed-in
>   * structure
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index 6744a2771769..d12496889ce9 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -711,13 +711,13 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
>  	return tbl;
>  }
>  
> -void iommu_free_table(struct iommu_table *tbl, const char *node_name)
> +static void iommu_table_free(struct kref *kref)
>  {
>  	unsigned long bitmap_sz;
>  	unsigned int order;
> +	struct iommu_table *tbl;
>  
> -	if (!tbl)
> -		return;
> +	tbl = container_of(kref, struct iommu_table, it_kref);
>  
>  	if (tbl->it_ops->free)
>  		tbl->it_ops->free(tbl);
> @@ -736,7 +736,7 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
>  
>  	/* verify that table contains no entries */
>  	if (!bitmap_empty(tbl->it_map, tbl->it_size))
> -		pr_warn("%s: Unexpected TCEs for %s\n", __func__, node_name);
> +		pr_warn("%s: Unexpected TCEs\n", __func__);
>  
>  	/* calculate bitmap size in bytes */
>  	bitmap_sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
> @@ -748,7 +748,21 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
>  	/* free table */
>  	kfree(tbl);
>  }
> -EXPORT_SYMBOL_GPL(iommu_free_table);
> +
> +void iommu_table_get(struct iommu_table *tbl)
> +{
> +	kref_get(&tbl->it_kref);
> +}
> +EXPORT_SYMBOL_GPL(iommu_table_get);
> +
> +void iommu_table_put(struct iommu_table *tbl)
> +{
> +	if (!tbl)
> +		return;
> +
> +	kref_put(&tbl->it_kref, iommu_table_free);
> +}
> +EXPORT_SYMBOL_GPL(iommu_table_put);
>  
>  /* Creates TCEs for a user provided buffer.  The user buffer must be
>   * contiguous real kernel storage (not vmalloc).  The address passed here
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index c4f9e812ca6c..ea181f02bebd 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1422,7 +1422,7 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe
>  		iommu_group_put(pe->table_group.group);
>  		BUG_ON(pe->table_group.group);
>  	}
> -	iommu_free_table(tbl, of_node_full_name(dev->dev.of_node));
> +	iommu_table_put(tbl);
>  }
>  
>  static void pnv_ioda_release_vf_PE(struct pci_dev *pdev)
> @@ -2197,7 +2197,7 @@ static void pnv_pci_ioda1_setup_dma_pe(struct pnv_phb *phb,
>  		__free_pages(tce_mem, get_order(tce32_segsz * segs));
>  	if (tbl) {
>  		pnv_pci_unlink_table_and_group(tbl, &pe->table_group);
> -		iommu_free_table(tbl, "pnv");
> +		iommu_table_put(tbl);
>  	}
>  }
>  
> @@ -2291,7 +2291,7 @@ static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
>  			bus_offset, page_shift, window_size,
>  			levels, tbl);
>  	if (ret) {
> -		iommu_free_table(tbl, "pnv");
> +		iommu_table_put(tbl);
>  		return ret;
>  	}
>  
> @@ -2337,7 +2337,7 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
>  	if (rc) {
>  		pe_err(pe, "Failed to configure 32-bit TCE table, err %ld\n",
>  				rc);
> -		iommu_free_table(tbl, "");
> +		iommu_table_put(tbl);
>  		return rc;
>  	}
>  
> @@ -2423,7 +2423,7 @@ static void pnv_ioda2_take_ownership(struct iommu_table_group *table_group)
>  
>  	pnv_pci_ioda2_set_bypass(pe, false);
>  	pnv_pci_ioda2_unset_window(&pe->table_group, 0);
> -	iommu_free_table(tbl, "pnv");
> +	iommu_table_put(tbl);
>  }
>  
>  static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group)
> @@ -3393,7 +3393,7 @@ static void pnv_pci_ioda1_release_pe_dma(struct pnv_ioda_pe *pe)
>  	}
>  
>  	free_pages(tbl->it_base, get_order(tbl->it_size << 3));
> -	iommu_free_table(tbl, "pnv");
> +	iommu_table_put(tbl);
>  }
>  
>  static void pnv_pci_ioda2_release_pe_dma(struct pnv_ioda_pe *pe)
> @@ -3420,7 +3420,7 @@ static void pnv_pci_ioda2_release_pe_dma(struct pnv_ioda_pe *pe)
>  	}
>  
>  	pnv_pci_ioda2_table_free_pages(tbl);
> -	iommu_free_table(tbl, "pnv");
> +	iommu_table_put(tbl);
>  }
>  
>  static void pnv_ioda_free_pe_seg(struct pnv_ioda_pe *pe,
> diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
> index c6d554fe585c..471210913e42 100644
> --- a/arch/powerpc/platforms/powernv/pci.c
> +++ b/arch/powerpc/platforms/powernv/pci.c
> @@ -767,6 +767,7 @@ struct iommu_table *pnv_pci_table_alloc(int nid)
>  
>  	tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, nid);
>  	INIT_LIST_HEAD_RCU(&tbl->it_group_list);
> +	kref_init(&tbl->it_kref);
>  
>  	return tbl;
>  }
> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> index dc2577fc5fbb..47f0501a94f9 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -74,6 +74,7 @@ static struct iommu_table_group *iommu_pseries_alloc_group(int node)
>  		goto fail_exit;
>  
>  	INIT_LIST_HEAD_RCU(&tbl->it_group_list);
> +	kref_init(&tbl->it_kref);
>  	tgl->table_group = table_group;
>  	list_add_rcu(&tgl->next, &tbl->it_group_list);
>  
> @@ -115,7 +116,7 @@ static void iommu_pseries_free_group(struct iommu_table_group *table_group,
>  		BUG_ON(table_group->group);
>  	}
>  #endif
> -	iommu_free_table(tbl, node_name);
> +	iommu_table_put(tbl);
>  
>  	kfree(table_group);
>  }
> diff --git a/arch/powerpc/platforms/pseries/vio.c b/arch/powerpc/platforms/pseries/vio.c
> index 2c8fb3ec989e..41e8aa5c0d6a 100644
> --- a/arch/powerpc/platforms/pseries/vio.c
> +++ b/arch/powerpc/platforms/pseries/vio.c
> @@ -1318,7 +1318,7 @@ static void vio_dev_release(struct device *dev)
>  	struct iommu_table *tbl = get_iommu_table_base(dev);
>  
>  	if (tbl)
> -		iommu_free_table(tbl, of_node_full_name(dev->of_node));
> +		iommu_table_put(tbl);
>  	of_node_put(dev->of_node);
>  	kfree(to_vio_dev(dev));
>  }
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index cbac08af400e..be37905012f0 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -677,7 +677,7 @@ static void tce_iommu_free_table(struct tce_container *container,
>  	unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT;
>  
>  	tce_iommu_userspace_view_free(tbl, container->mm);
> -	iommu_free_table(tbl, "");
> +	iommu_table_put(tbl);
>  	decrement_locked_vm(container->mm, pages);
>  }
>  

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH kernel 9/9] KVM: PPC: Add in-kernel acceleration for VFIO
  2016-12-09 15:35         ` Alex Williamson
@ 2016-12-14  3:53           ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 47+ messages in thread
From: Alexey Kardashevskiy @ 2016-12-14  3:53 UTC (permalink / raw)
  To: Alex Williamson; +Cc: linuxppc-dev, David Gibson, Paul Mackerras, kvm-ppc, kvm

On 10/12/16 02:35, Alex Williamson wrote:
> On Fri, 9 Dec 2016 18:53:43 +1100
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> On 09/12/16 04:55, Alex Williamson wrote:
>>> On Thu,  8 Dec 2016 19:19:56 +1100
>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>   
>>>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
>>>> and H_STUFF_TCE requests targeting an IOMMU TCE table used for VFIO
>>>> without passing them to user space, which saves the time spent
>>>> switching to user space and back.
>>>>
>>>> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
>>>> KVM tries to handle a TCE request in real mode; if that fails,
>>>> it passes the request to virtual mode to complete the operation.
>>>> If the virtual mode handler fails, the request is passed to
>>>> user space; this is not expected to happen though.
>>>>
>>>> To avoid dealing with page use counters (which is tricky in real mode),
>>>> this only accelerates SPAPR TCE IOMMU v2 clients which are required
>>>> to pre-register the userspace memory. The very first TCE request will
>>>> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
>>>> of the TCE table (iommu_table::it_userspace) is not allocated till
>>>> the very first mapping happens and we cannot call vmalloc in real mode.
>>>>
>>>> This adds a new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
>>>> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
>>>> and associates a physical IOMMU table with the SPAPR TCE table (which
>>>> is a guest view of the hardware IOMMU table). The iommu_table object
>>>> is referenced so we do not have to retrieve it in real mode when
>>>> a hypercall happens.
>>>>
>>>> This does not implement the UNSET counterpart as there is no use for it -
>>>> once the acceleration is enabled, the existing userspace won't
>>>> disable it unless a VFIO container is destroyed so this adds necessary
>>>> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
>>>>
>>>> This uses the kvm->lock mutex to protect against a race between
>>>> the VFIO KVM device's kvm_vfio_destroy() and SPAPR TCE table fd's
>>>> release() callback.
>>>>
>>>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
>>>> space.
>>>>
>>>> This finally makes use of vfio_external_user_iommu_id() which was
>>>> introduced quite some time ago and was considered for removal.
>>>>
>>>> Tests show that this patch increases transmission speed from 220MB/s
>>>> to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb ethernet card).
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> ---
>>>>  Documentation/virtual/kvm/devices/vfio.txt |  21 +-
>>>>  arch/powerpc/include/asm/kvm_host.h        |   8 +
>>>>  arch/powerpc/include/asm/kvm_ppc.h         |   5 +
>>>>  include/uapi/linux/kvm.h                   |   8 +
>>>>  arch/powerpc/kvm/book3s_64_vio.c           | 302 +++++++++++++++++++++++++++++
>>>>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 178 +++++++++++++++++
>>>>  arch/powerpc/kvm/powerpc.c                 |   2 +
>>>>  virt/kvm/vfio.c                            | 108 +++++++++++
>>>>  8 files changed, 630 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
>>>> index ef51740c67ca..ddb5a6512ab3 100644
>>>> --- a/Documentation/virtual/kvm/devices/vfio.txt
>>>> +++ b/Documentation/virtual/kvm/devices/vfio.txt
>>>> @@ -16,7 +16,24 @@ Groups:
>>>>  
>>>>  KVM_DEV_VFIO_GROUP attributes:
>>>>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
>>>> +	kvm_device_attr.addr points to an int32_t file descriptor
>>>> +	for the VFIO group.
>>>>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
>>>> +	kvm_device_attr.addr points to an int32_t file descriptor
>>>> +	for the VFIO group.
>>>> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
>>>> +	allocated by sPAPR KVM.
>>>> +	kvm_device_attr.addr points to a struct:
>>>>  
>>>> -For each, kvm_device_attr.addr points to an int32_t file descriptor
>>>> -for the VFIO group.
>>>> +	struct kvm_vfio_spapr_tce {
>>>> +		__u32	argsz;
>>>> +		__s32	groupfd;
>>>> +		__s32	tablefd;
>>>> +		__u8	pad[4];
>>>> +	};
>>>> +
>>>> +	where
>>>> +	@argsz is the size of kvm_vfio_spapr_tce_liobn;
>>>> +	@groupfd is a file descriptor for a VFIO group;
>>>> +	@tablefd is a file descriptor for a TCE table allocated via
>>>> +		KVM_CREATE_SPAPR_TCE.
>>>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
>>>> index 28350a294b1e..94774503c70d 100644
>>>> --- a/arch/powerpc/include/asm/kvm_host.h
>>>> +++ b/arch/powerpc/include/asm/kvm_host.h
>>>> @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
>>>>  	atomic_t refcnt;
>>>>  };
>>>>  
>>>> +struct kvmppc_spapr_tce_iommu_table {
>>>> +	struct rcu_head rcu;
>>>> +	struct list_head next;
>>>> +	struct iommu_table *tbl;
>>>> +	atomic_t refs;
>>>> +};
>>>> +
>>>>  struct kvmppc_spapr_tce_table {
>>>>  	struct list_head list;
>>>>  	struct kvm *kvm;
>>>> @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
>>>>  	u32 page_shift;
>>>>  	u64 offset;		/* in pages */
>>>>  	u64 size;		/* window size in pages */
>>>> +	struct list_head iommu_tables;
>>>>  	struct page *pages[0];
>>>>  };
>>>>  
>>>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
>>>> index 0a21c8503974..17b947a0060d 100644
>>>> --- a/arch/powerpc/include/asm/kvm_ppc.h
>>>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
>>>> @@ -163,6 +163,11 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
>>>>  extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
>>>>  			struct kvm_memory_slot *memslot, unsigned long porder);
>>>>  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
>>>> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm,
>>>> +				int tablefd,
>>>> +				struct iommu_group *grp);
>>>> +extern void kvm_spapr_tce_detach_iommu_group(struct kvm *kvm,
>>>> +				struct iommu_group *grp);
>>>>  
>>>>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>>>  				struct kvm_create_spapr_tce_64 *args);
>>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>>>> index 810f74317987..9e4025724e28 100644
>>>> --- a/include/uapi/linux/kvm.h
>>>> +++ b/include/uapi/linux/kvm.h
>>>> @@ -1068,6 +1068,7 @@ struct kvm_device_attr {
>>>>  #define  KVM_DEV_VFIO_GROUP			1
>>>>  #define   KVM_DEV_VFIO_GROUP_ADD			1
>>>>  #define   KVM_DEV_VFIO_GROUP_DEL			2
>>>> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
>>>>  
>>>>  enum kvm_device_type {
>>>>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
>>>> @@ -1089,6 +1090,13 @@ enum kvm_device_type {
>>>>  	KVM_DEV_TYPE_MAX,
>>>>  };
>>>>  
>>>> +struct kvm_vfio_spapr_tce {
>>>> +	__u32	argsz;
>>>> +	__s32	groupfd;
>>>> +	__s32	tablefd;
>>>> +	__u8	pad[4];
>>>> +};  
>>>
>>> I assume you're implementing argsz and padding for future expansion,
>>> but it doesn't really work.  Presumably argsz would be set to 16, so
>>> the only way that the kernel will ever know something has changed would
>>> be to make it bigger, so the padding bytes are really reserved, and
>>> then it's not clear why we have padding at all.  If you replaced the
>>> padding with a __u32 flags, then we could actually have some room to
>>> architect expansion, but as it is we might as well drop both argsz and
>>> padding.  
>>
>> Ah, right, sorry, did not pay attention to this bit this time. I'll replace
>> pad with flags and move to argsz.
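
To illustrate the flags variant discussed above, here is a minimal
userspace sketch. The struct name, the flags field, the known-flags mask
and check_spapr_tce_args() are all hypothetical and are not part of the
posted patch; the error codes are only an assumption about what a handler
might return.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: the posted struct with pad[4] replaced by flags. */
struct kvm_vfio_spapr_tce_sketch {
	uint32_t argsz;   /* size of the struct as filled in by the caller */
	uint32_t flags;   /* no bits defined yet, so it must be zero */
	int32_t  groupfd; /* file descriptor of the VFIO group */
	int32_t  tablefd; /* file descriptor from KVM_CREATE_SPAPR_TCE */
};

/* Assumption: no flags are defined yet, so every set bit is rejected. */
#define SPAPR_TCE_SKETCH_KNOWN_FLAGS 0u

static int check_spapr_tce_args(const struct kvm_vfio_spapr_tce_sketch *p)
{
	/* Smallest size that still covers every field known today. */
	size_t minsz = offsetof(struct kvm_vfio_spapr_tce_sketch, tablefd) +
		       sizeof(p->tablefd);

	if (p->argsz < minsz)
		return -EINVAL;
	/* Unknown flag bits mean the caller wants something we cannot do. */
	if (p->flags & ~SPAPR_TCE_SKETCH_KNOWN_FLAGS)
		return -EINVAL;
	if (p->groupfd < 0 || p->tablefd < 0)
		return -EBADF;
	return 0;
}
```

With such a layout a larger argsz from a newer caller is still accepted,
and an extension is announced through a flag bit instead of being inferred
from a size change.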
>>
>>
>>>   
>>>> +
>>>>  /*
>>>>   * ioctls for VM fds
>>>>   */  
>>> ...  
>>>> --- a/virt/kvm/vfio.c
>>>> +++ b/virt/kvm/vfio.c
>>>> @@ -20,6 +20,10 @@
>>>>  #include <linux/vfio.h>
>>>>  #include "vfio.h"
>>>>  
>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>> +#include <asm/kvm_ppc.h>
>>>> +#endif
>>>> +
>>>>  struct kvm_vfio_group {
>>>>  	struct list_head node;
>>>>  	struct vfio_group *vfio_group;
>>>> @@ -76,6 +80,22 @@ static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
>>>>  	return ret > 0;
>>>>  }
>>>>  
>>>> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
>>>> +{
>>>> +	int (*fn)(struct vfio_group *);
>>>> +	int ret = -1;
>>>> +
>>>> +	fn = symbol_get(vfio_external_user_iommu_id);
>>>> +	if (!fn)
>>>> +		return ret;
>>>> +
>>>> +	ret = fn(vfio_group);
>>>> +
>>>> +	symbol_put(vfio_external_user_iommu_id);
>>>> +
>>>> +	return ret;
>>>> +}
>>>> +
>>>>  /*
>>>>   * Groups can use the same or different IOMMU domains.  If the same then
>>>>   * adding a new group may change the coherency of groups we've previously
>>>> @@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct kvm_device *dev)
>>>>  	mutex_unlock(&kv->lock);
>>>>  }
>>>>  
>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>> +static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
>>>> +		struct vfio_group *vfio_group)
>>>> +{
>>>> +	int group_id;
>>>> +	struct iommu_group *grp;
>>>> +
>>>> +	group_id = kvm_vfio_external_user_iommu_id(vfio_group);
>>>> +	grp = iommu_group_get_by_id(group_id);
>>>> +	if (grp) {
>>>> +		kvm_spapr_tce_detach_iommu_group(kvm, grp);
>>>> +		iommu_group_put(grp);
>>>> +	}
>>>> +}
>>>> +#endif  
>>>
>>>
>>> I wonder if you could use the new vfio group notifier to avoid tainting
>>> this code with SPAPR_TCE #ifdefs.  Thanks,  
>>
>> I cannot see how... The new notifier sets kvm on a group; I need the
>> opposite: attach a group to KVM, and not just to KVM but to a specific KVM
>> SPAPR TCE fd (which is a child object of KVM and which owns a LIOBN bus id).
>>
>> The only way I see to avoid tainting this code is to add another
>> ioctl() to PPC KVM (or its child object, the SPAPR TCE object). I tried
>> that a few years ago and was told to add a KVM device or even reuse the
>> VFIO KVM device.
>>
>> What am I missing here?
> 
> You would still need a KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE, but the ugly
> part of encompassing this all in #ifdefs is that we call
> kvm_spapr_tce_{at,de}tach_iommu_group() directly. The notifier would
> sort of make it like an arch callback: vfio core would set these
> attributes and broadcast them to notifier callbacks, and on non-spapr-tce
> platforms nobody would be listening for those notifications.
> Ultimately I don't know how much cleaner it makes things, but it maybe
> avoids spapr-tce #ifdefs leaking into every layer of the stack.  Thanks,

I am failing here.

The normal workflow:
- create a SPAPR TCE object in KVM, which represents 1 LIOBN aka DMA window;
- attach an IOMMU group to the SPAPR TCE object; in this step the hardware
tables (1 or 2 iommu_table objects) are put on the SPAPR TCE object's list
of attached tables, and the tables are referenced.

When a reboot happens, the SPAPR TCE object is destroyed and the new guest
starts from the very beginning.


The task I am solving: drop the references to iommu_table (hardware table)
in 2 cases:
1) guest reboot - the SPAPR TCE table is destroyed but the VFIO KVM device
is still there with all attached groups;
2) VFIO PCI hot unplug - the SPAPR TCE table is still there but groups need
to be detached from the VFIO KVM device.


Tried fixing 2) with the new callbacks:

- they do not take iommu_group, they take devices - fixed by duplicating
the vfio_(un)register_notifier API:
  * vfio_iommu_group_register_notifier()
  * vfio_iommu_group_unregister_notifier()
plus kvm wrappers with symbol_get/symbol_put in
arch/powerpc/kvm/book3s_64_vio.c.

- the callbacks are registered per IOMMU group; the only place in this
code which knows about groups is the SPAPR TCE driver, but its
attach_group() callback is called when the IOMMU driver is not yet set, so
vfio_register_notifier() fails. KVM does not know about groups until
KVM_DEV_VFIO_GROUP_ADD/KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE, so I register the
callback in KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE;

- the callback does not pass a vfio_group pointer, only kvm, so my notifier
needs to be wrapped into a struct with a group pointer, ok, done;

- the notifier needs to be unregistered and the wrapper struct from the
previous step needs to be freed. There is no nice mechanism for that - I
cannot unregister a notifier from the notifier itself. I fixed it by calling
rcu_sched() from the notifier when KVM==NULL, and the RCU-scheduled callback
calls vfio_iommu_group_unregister_notifier().

Looks quite ugly already but ok.
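
For readers following along, the "wrap the notifier into a struct with a
group pointer" step can be sketched in plain userspace C as below. This is
only a model under stated assumptions: notifier_block_sketch stands in for
the kernel's struct notifier_block, group_sketch for vfio_group, and the
pending_free flag stands in for the deferred unregister and free; none of
these names come from the actual patches.

```c
#include <stddef.h>

/* Simplified stand-in for the kernel's struct notifier_block. */
struct notifier_block_sketch {
	int (*notifier_call)(struct notifier_block_sketch *nb,
			     unsigned long action, void *data);
};

/* Userspace equivalent of the kernel's container_of(). */
#define container_of_sketch(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-in for struct vfio_group. */
struct group_sketch {
	int id;
	int attached;
};

/* The wrapper carries the group pointer which the callback is not given. */
struct group_notifier_sketch {
	struct notifier_block_sketch nb;
	struct group_sketch *grp;
	int pending_free; /* set instead of unregistering from the callback */
};

static int group_kvm_notify(struct notifier_block_sketch *nb,
			    unsigned long action, void *data)
{
	/* Recover the wrapper, and with it the group, from the embedded nb. */
	struct group_notifier_sketch *w =
		container_of_sketch(nb, struct group_notifier_sketch, nb);

	(void)action;
	if (data == NULL) {
		/* KVM == NULL: the group is going away from KVM. */
		w->grp->attached = 0;
		w->pending_free = 1; /* real code would defer the cleanup */
	} else {
		w->grp->attached = 1;
	}
	return 0;
}
```

The callback only marks the wrapper for cleanup because, as described
above, the notifier cannot be unregistered from within itself.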


Now I am trying to solve 1). I can drop the iommu_table references, but
the registered notifiers remain in memory and are never freed, so each guest
reboot increases the list length. I do not have a way to access vfio_group
structs from KVM: there I only have a list of iommu_table structs, each of
which has a list of iommu_group structs (it is done this way to make real
mode handlers possible), but there is no way to get to a vfio_group struct
from an iommu_group.

I can duplicate the group list once again, this time it will be a vfio_group
list attached to the SPAPR TCE object, but this all seems to be way too much,
doesn't it?...




-- 
Alexey

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH kernel 4/9] powerpc/mmu: Add real mode support for IOMMU preregistered memory
  2016-12-08  8:19   ` Alexey Kardashevskiy
@ 2016-12-16  0:40     ` David Gibson
  -1 siblings, 0 replies; 47+ messages in thread
From: David Gibson @ 2016-12-16  0:40 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Alex Williamson, Paul Mackerras, kvm-ppc, kvm

[-- Attachment #1: Type: text/plain, Size: 4197 bytes --]

On Thu, Dec 08, 2016 at 07:19:51PM +1100, Alexey Kardashevskiy wrote:
> This makes mm_iommu_lookup() able to work in realmode by replacing
> list_for_each_entry_rcu() (which can do debug stuff which can fail in
> real mode) with list_for_each_entry_lockless().
> 
> This adds realmode version of mm_iommu_ua_to_hpa() which adds
> explicit vmalloc'd-to-linear address conversion.
> Unlike mm_iommu_ua_to_hpa(), mm_iommu_ua_to_hpa_rm() can fail.
> 
> This changes mm_iommu_preregistered() to receive @mm as in real mode
> @current does not always have a correct pointer.
> 
> This adds realmode version of mm_iommu_lookup() which receives @mm
> (for the same reason as for mm_iommu_preregistered()) and uses
> lockless version of list_for_each_entry_rcu().
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  arch/powerpc/include/asm/mmu_context.h |  4 ++++
>  arch/powerpc/mm/mmu_context_iommu.c    | 39 ++++++++++++++++++++++++++++++++++
>  2 files changed, 43 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
> index b9e3f0aca261..c70c8272523d 100644
> --- a/arch/powerpc/include/asm/mmu_context.h
> +++ b/arch/powerpc/include/asm/mmu_context.h
> @@ -29,10 +29,14 @@ extern void mm_iommu_init(struct mm_struct *mm);
>  extern void mm_iommu_cleanup(struct mm_struct *mm);
>  extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
>  		unsigned long ua, unsigned long size);
> +extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(
> +		struct mm_struct *mm, unsigned long ua, unsigned long size);
>  extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
>  		unsigned long ua, unsigned long entries);
>  extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
>  		unsigned long ua, unsigned long *hpa);
> +extern long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
> +		unsigned long ua, unsigned long *hpa);
>  extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem);
>  extern void mm_iommu_mapped_dec(struct mm_iommu_table_group_mem_t *mem);
>  #endif
> diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
> index 104bad029ce9..631d32f5937b 100644
> --- a/arch/powerpc/mm/mmu_context_iommu.c
> +++ b/arch/powerpc/mm/mmu_context_iommu.c
> @@ -314,6 +314,25 @@ struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
>  }
>  EXPORT_SYMBOL_GPL(mm_iommu_lookup);
>  
> +struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(struct mm_struct *mm,
> +		unsigned long ua, unsigned long size)
> +{
> +	struct mm_iommu_table_group_mem_t *mem, *ret = NULL;
> +
> +	list_for_each_entry_lockless(mem, &mm->context.iommu_group_mem_list,
> +			next) {
> +		if ((mem->ua <= ua) &&
> +				(ua + size <= mem->ua +
> +				 (mem->entries << PAGE_SHIFT))) {
> +			ret = mem;
> +			break;
> +		}
> +	}
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(mm_iommu_lookup_rm);
> +
>  struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
>  		unsigned long ua, unsigned long entries)
>  {
> @@ -345,6 +364,26 @@ long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
>  }
>  EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa);
>  
> +long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
> +		unsigned long ua, unsigned long *hpa)
> +{
> +	const long entry = (ua - mem->ua) >> PAGE_SHIFT;
> +	void *va = &mem->hpas[entry];
> +	unsigned long *pa;
> +
> +	if (entry >= mem->entries)
> +		return -EFAULT;
> +
> +	pa = (void *) vmalloc_to_phys(va);
> +	if (!pa)
> +		return -EFAULT;
> +
> +	*hpa = *pa | (ua & ~PAGE_MASK);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa_rm);
> +
>  long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem)
>  {
>  	if (atomic64_inc_not_zero(&mem->mapped))
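
As a reading aid for the two helpers quoted above, here is a minimal
userspace model of the range check and the address arithmetic. It assumes
64K pages (SK_PAGE_SHIFT is a stand-in for PAGE_SHIFT), uses invented
names throughout, and reads the hpas array directly, so it leaves out the
vmalloc_to_phys() step which the real mode version needs.

```c
#include <stddef.h>

#define SK_PAGE_SHIFT 16UL                 /* assumption: 64K pages */
#define SK_PAGE_SIZE  (1UL << SK_PAGE_SHIFT)
#define SK_PAGE_MASK  (~(SK_PAGE_SIZE - 1))

/* Hypothetical model of one preregistered memory region. */
struct mem_sketch {
	unsigned long ua;       /* userspace address of the region start */
	unsigned long entries;  /* number of pages in the region */
	unsigned long *hpas;    /* host physical address of each page */
};

/* Same containment test as the lookup: [ua, ua + size) inside the region. */
static int region_contains(const struct mem_sketch *mem,
			   unsigned long ua, unsigned long size)
{
	return mem->ua <= ua &&
	       ua + size <= mem->ua + (mem->entries << SK_PAGE_SHIFT);
}

/* Page index selects the hpas slot; the in-page offset is OR-ed back in. */
static long ua_to_hpa(const struct mem_sketch *mem,
		      unsigned long ua, unsigned long *hpa)
{
	unsigned long entry;

	if (ua < mem->ua)
		return -1;
	entry = (ua - mem->ua) >> SK_PAGE_SHIFT;
	if (entry >= mem->entries)
		return -1;

	*hpa = mem->hpas[entry] | (ua & ~SK_PAGE_MASK);
	return 0;
}
```

For example, with a two-page region at ua 0x40000 whose second page sits
at hpa 0x230000, the address 0x51234 falls in entry 1 and translates to
0x231234.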

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH kernel 5/9] KVM: PPC: Use preregistered memory API to access TCE list
  2016-12-08  8:19   ` Alexey Kardashevskiy
@ 2016-12-16  0:57     ` David Gibson
  -1 siblings, 0 replies; 47+ messages in thread
From: David Gibson @ 2016-12-16  0:57 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Alex Williamson, Paul Mackerras, kvm-ppc, kvm

[-- Attachment #1: Type: text/plain, Size: 5008 bytes --]

On Thu, Dec 08, 2016 at 07:19:52PM +1100, Alexey Kardashevskiy wrote:
> VFIO on sPAPR already implements guest memory pre-registration
> when the entire guest RAM gets pinned. This can be used to translate
> the physical address of a guest page containing the TCE list
> from H_PUT_TCE_INDIRECT.
> 
> This makes use of the pre-registered memory API to access TCE list
> pages in order to avoid unnecessary locking on the KVM memory
> reverse map as we know that all of guest memory is pinned and
> we have a flat array mapping GPA to HPA which makes it simpler and
> quicker to index into that array (even with looking up the
> kernel page tables in vmalloc_to_phys) than it is to find the memslot,
> lock the rmap entry, look up the user page tables, and unlock the rmap
> entry. Note that the rmap pointer is initialized to NULL where declared
> (not in this patch).
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>


Hrm.  So, pinning all of guest memory is the usual case, but nothing
in the pre-registration APIs actually guarantees that.  Now I think
this patch is still correct because..

> ---
> Changes:
> v2:
> * updated the commit log with Paul's comment
> ---
>  arch/powerpc/kvm/book3s_64_vio_hv.c | 65 ++++++++++++++++++++++++++++---------
>  1 file changed, 49 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> index d461c440889a..a3be4bd6188f 100644
> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -180,6 +180,17 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
>  EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
>  
>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> +static inline bool kvmppc_preregistered(struct kvm_vcpu *vcpu)
> +{
> +	return mm_iommu_preregistered(vcpu->kvm->mm);
> +}
> +
> +static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
> +		struct kvm_vcpu *vcpu, unsigned long ua, unsigned long size)
> +{
> +	return mm_iommu_lookup_rm(vcpu->kvm->mm, ua, size);
> +}
> +
>  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		unsigned long ioba, unsigned long tce)
>  {
> @@ -260,23 +271,44 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	if (ret != H_SUCCESS)
>  		return ret;
>  
> -	if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
> -		return H_TOO_HARD;
> +	if (kvmppc_preregistered(vcpu)) {
> +		/*
> +		 * We get here if guest memory was pre-registered which
> +		 * is normally VFIO case and gpa->hpa translation does not
> +		 * depend on hpt.
> +		 */
> +		struct mm_iommu_table_group_mem_t *mem;
>  
> -	rmap = (void *) vmalloc_to_phys(rmap);
> +		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
> +			return H_TOO_HARD;

..this will fail if the relevant chunk of memory has not been
pre-registered and you'll fall back to the virtual mode version.  The
commit message doesn't make that terribly clear though.

> -	/*
> -	 * Synchronize with the MMU notifier callbacks in
> -	 * book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
> -	 * While we have the rmap lock, code running on other CPUs
> -	 * cannot finish unmapping the host real page that backs
> -	 * this guest real page, so we are OK to access the host
> -	 * real page.
> -	 */
> -	lock_rmap(rmap);
> -	if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
> -		ret = H_TOO_HARD;
> -		goto unlock_exit;
> +		mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K);
> +		if (!mem || mm_iommu_ua_to_hpa_rm(mem, ua, &tces))
> +			return H_TOO_HARD;
> +	} else {
> +		/*
> +		 * This is emulated devices case.
> +		 * We do not require memory to be preregistered in this case
> +		 * so lock rmap and do __find_linux_pte_or_hugepte().
> +		 */
> +		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
> +			return H_TOO_HARD;

If I follow correctly you could also fall back to this path in the
failing case, but I guess there's probably no advantage to doing so.

> +		rmap = (void *) vmalloc_to_phys(rmap);
> +
> +		/*
> +		 * Synchronize with the MMU notifier callbacks in
> +		 * book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
> +		 * While we have the rmap lock, code running on other CPUs
> +		 * cannot finish unmapping the host real page that backs
> +		 * this guest real page, so we are OK to access the host
> +		 * real page.
> +		 */
> +		lock_rmap(rmap);
> +		if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
> +			ret = H_TOO_HARD;
> +			goto unlock_exit;
> +		}
>  	}
>  
>  	for (i = 0; i < npages; ++i) {
> @@ -290,7 +322,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	}
>  
>  unlock_exit:
> -	unlock_rmap(rmap);
> +	if (rmap)
> +		unlock_rmap(rmap);
>  
>  	return ret;
>  }
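
For readers following the thread, the fallback discussed in the review
comments can be illustrated with a small user-space sketch. It is not the
kernel implementation: every sketch_ name is hypothetical, a 4K page size
and page-aligned regions are assumed, and SKETCH_H_TOO_HARD only stands in
for the return code that causes the hypercall to be retried in virtual
mode. The point is that a pre-registered region is a flat array of host
physical addresses, so translation is an index computation, and an address
outside every registered region fails cleanly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  ((uint64_t)1 << SKETCH_PAGE_SHIFT)
#define SKETCH_H_SUCCESS  0L
#define SKETCH_H_TOO_HARD 9999L /* placeholder for "retry in virtual mode" */

/* Hypothetical stand-in for one pre-registered (pinned) memory region. */
struct sketch_prereg_region {
	uint64_t ua;          /* page-aligned userspace start address */
	size_t npages;        /* number of pinned pages */
	const uint64_t *hpas; /* hpas[i] is the host address backing page i */
};

/*
 * Translate a userspace address to a host physical address using only
 * the flat per-region arrays: no locks are taken and no page tables are
 * walked. An address that no region covers is reported as "too hard".
 */
static long sketch_ua_to_hpa(const struct sketch_prereg_region *regions,
			     size_t nregions, uint64_t ua, uint64_t *hpa)
{
	size_t i;

	assert(hpa != NULL);

	for (i = 0; i < nregions; ++i) {
		const struct sketch_prereg_region *r = &regions[i];
		uint64_t idx;

		if (ua < r->ua)
			continue;
		idx = (ua - r->ua) >> SKETCH_PAGE_SHIFT;
		if (idx >= r->npages)
			continue;

		/* Page base from the array, offset within the page from ua. */
		*hpa = r->hpas[idx] | (ua & (SKETCH_PAGE_SIZE - 1));
		return SKETCH_H_SUCCESS;
	}

	/* Not pre-registered: let the caller fall back to the slow path. */
	return SKETCH_H_TOO_HARD;
}
```

In this sketch the failure case never touches guest memory, which mirrors
why returning H_TOO_HARD for a chunk that was not pre-registered is safe.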

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH kernel 6/9] powerpc/powernv/iommu: Add real mode version of iommu_table_ops::exchange()
  2016-12-08  8:19   ` Alexey Kardashevskiy
@ 2016-12-16  1:06     ` David Gibson
  -1 siblings, 0 replies; 47+ messages in thread
From: David Gibson @ 2016-12-16  1:06 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Alex Williamson, Paul Mackerras, kvm-ppc, kvm

[-- Attachment #1: Type: text/plain, Size: 5753 bytes --]

On Thu, Dec 08, 2016 at 07:19:53PM +1100, Alexey Kardashevskiy wrote:
> In real mode, TCE tables are invalidated using special
> cache-inhibited store instructions which are not available in
> virtual mode.
> 
> This defines and implements exchange_rm() callback. This does not
> define set_rm/clear_rm/flush_rm callbacks as there is no user for those -
> exchange/exchange_rm are only to be used by KVM for VFIO.
> 
> The exchange_rm callback is defined for IODA1/IODA2 powernv platforms.
> 
> This replaces list_for_each_entry_rcu with its lockless version as
> from now on pnv_pci_ioda2_tce_invalidate() can be called in
> the real mode too.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  arch/powerpc/include/asm/iommu.h          |  7 +++++++
>  arch/powerpc/kernel/iommu.c               | 23 +++++++++++++++++++++++
>  arch/powerpc/platforms/powernv/pci-ioda.c | 26 +++++++++++++++++++++++++-
>  3 files changed, 55 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> index 9de8bad1fdf9..82e77ebf85f4 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -64,6 +64,11 @@ struct iommu_table_ops {
>  			long index,
>  			unsigned long *hpa,
>  			enum dma_data_direction *direction);
> +	/* Real mode */
> +	int (*exchange_rm)(struct iommu_table *tbl,
> +			long index,
> +			unsigned long *hpa,
> +			enum dma_data_direction *direction);
>  #endif
>  	void (*clear)(struct iommu_table *tbl,
>  			long index, long npages);
> @@ -209,6 +214,8 @@ extern void iommu_del_device(struct device *dev);
>  extern int __init tce_iommu_bus_notifier_init(void);
>  extern long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
>  		unsigned long *hpa, enum dma_data_direction *direction);
> +extern long iommu_tce_xchg_rm(struct iommu_table *tbl, unsigned long entry,
> +		unsigned long *hpa, enum dma_data_direction *direction);
>  #else
>  static inline void iommu_register_group(struct iommu_table_group *table_group,
>  					int pci_domain_number,
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index d12496889ce9..d02b8d22fb50 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -1022,6 +1022,29 @@ long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
>  }
>  EXPORT_SYMBOL_GPL(iommu_tce_xchg);
>  
> +long iommu_tce_xchg_rm(struct iommu_table *tbl, unsigned long entry,
> +		unsigned long *hpa, enum dma_data_direction *direction)
> +{
> +	long ret;
> +
> +	ret = tbl->it_ops->exchange_rm(tbl, entry, hpa, direction);
> +
> +	if (!ret && ((*direction == DMA_FROM_DEVICE) ||
> +			(*direction == DMA_BIDIRECTIONAL))) {
> +		struct page *pg = realmode_pfn_to_page(*hpa >> PAGE_SHIFT);
> +
> +		if (likely(pg)) {
> +			SetPageDirty(pg);
> +		} else {
> +			tbl->it_ops->exchange_rm(tbl, entry, hpa, direction);
> +			ret = -EFAULT;
> +		}
> +	}
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_tce_xchg_rm);
> +
>  int iommu_take_ownership(struct iommu_table *tbl)
>  {
>  	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index ea181f02bebd..f2c2ab8fbb3e 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1855,6 +1855,17 @@ static int pnv_ioda1_tce_xchg(struct iommu_table *tbl, long index,
>  
>  	return ret;
>  }
> +
> +static int pnv_ioda1_tce_xchg_rm(struct iommu_table *tbl, long index,
> +		unsigned long *hpa, enum dma_data_direction *direction)
> +{
> +	long ret = pnv_tce_xchg(tbl, index, hpa, direction);
> +
> +	if (!ret)
> +		pnv_pci_p7ioc_tce_invalidate(tbl, index, 1, true);
> +
> +	return ret;
> +}
>  #endif
>  
>  static void pnv_ioda1_tce_free(struct iommu_table *tbl, long index,
> @@ -1869,6 +1880,7 @@ static struct iommu_table_ops pnv_ioda1_iommu_ops = {
>  	.set = pnv_ioda1_tce_build,
>  #ifdef CONFIG_IOMMU_API
>  	.exchange = pnv_ioda1_tce_xchg,
> +	.exchange_rm = pnv_ioda1_tce_xchg_rm,
>  #endif
>  	.clear = pnv_ioda1_tce_free,
>  	.get = pnv_tce_get,
> @@ -1943,7 +1955,7 @@ static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
>  {
>  	struct iommu_table_group_link *tgl;
>  
> -	list_for_each_entry_rcu(tgl, &tbl->it_group_list, next) {
> +	list_for_each_entry_lockless(tgl, &tbl->it_group_list, next) {
>  		struct pnv_ioda_pe *pe = container_of(tgl->table_group,
>  				struct pnv_ioda_pe, table_group);
>  		struct pnv_phb *phb = pe->phb;
> @@ -1999,6 +2011,17 @@ static int pnv_ioda2_tce_xchg(struct iommu_table *tbl, long index,
>  
>  	return ret;
>  }
> +
> +static int pnv_ioda2_tce_xchg_rm(struct iommu_table *tbl, long index,
> +		unsigned long *hpa, enum dma_data_direction *direction)
> +{
> +	long ret = pnv_tce_xchg(tbl, index, hpa, direction);
> +
> +	if (!ret)
> +		pnv_pci_ioda2_tce_invalidate(tbl, index, 1, true);
> +
> +	return ret;
> +}
>  #endif
>  
>  static void pnv_ioda2_tce_free(struct iommu_table *tbl, long index,
> @@ -2018,6 +2041,7 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = {
>  	.set = pnv_ioda2_tce_build,
>  #ifdef CONFIG_IOMMU_API
>  	.exchange = pnv_ioda2_tce_xchg,
> +	.exchange_rm = pnv_ioda2_tce_xchg_rm,
>  #endif
>  	.clear = pnv_ioda2_tce_free,
>  	.get = pnv_tce_get,
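
The exchange-then-roll-back pattern in iommu_tce_xchg_rm() above may be
easier to see in isolation. The following is a user-space sketch under
stated assumptions, not kernel code: all sketch_ names are hypothetical, a
plain array of flags stands in for struct page and SetPageDirty(), and a
4K page size is assumed. It shows that exchanging twice restores both the
table entry and the caller's values when the old page cannot be found.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum sketch_dir {
	SKETCH_BIDIRECTIONAL,
	SKETCH_TO_DEVICE,
	SKETCH_FROM_DEVICE,
	SKETCH_NONE
};

struct sketch_tce {
	uint64_t hpa;
	enum sketch_dir dir;
};

/* Hypothetical table: entries plus one dirty flag per known host page. */
struct sketch_table {
	struct sketch_tce *entries;
	size_t size;
	bool *dirty;
	size_t npages;
};

/* Swap the caller's (hpa, dir) with the entry; the old values come back. */
static int sketch_exchange(struct sketch_table *tbl, size_t idx,
			   uint64_t *hpa, enum sketch_dir *dir)
{
	struct sketch_tce old;

	if (idx >= tbl->size)
		return -EINVAL;

	old = tbl->entries[idx];
	tbl->entries[idx].hpa = *hpa;
	tbl->entries[idx].dir = *dir;
	*hpa = old.hpa;
	*dir = old.dir;
	return 0;
}

/*
 * Exchange an entry; if the device could have written to the old page,
 * mark that page dirty. If the old page is unknown, exchange again to
 * undo the update and report -EFAULT.
 */
static int sketch_xchg_rm(struct sketch_table *tbl, size_t idx,
			  uint64_t *hpa, enum sketch_dir *dir)
{
	int ret;

	assert(tbl && hpa && dir);

	ret = sketch_exchange(tbl, idx, hpa, dir);
	if (!ret && (*dir == SKETCH_FROM_DEVICE ||
		     *dir == SKETCH_BIDIRECTIONAL)) {
		size_t pfn = (size_t)(*hpa >> 12);

		if (pfn < tbl->npages) {
			tbl->dirty[pfn] = true;
		} else {
			/* Roll back: the second swap restores everything. */
			sketch_exchange(tbl, idx, hpa, dir);
			ret = -EFAULT;
		}
	}
	return ret;
}
```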

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH kernel 7/9] KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently
  2016-12-08  8:19   ` Alexey Kardashevskiy
@ 2016-12-16  1:11     ` David Gibson
  -1 siblings, 0 replies; 47+ messages in thread
From: David Gibson @ 2016-12-16  1:11 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Alex Williamson, Paul Mackerras, kvm-ppc, kvm

[-- Attachment #1: Type: text/plain, Size: 1442 bytes --]

On Thu, Dec 08, 2016 at 07:19:54PM +1100, Alexey Kardashevskiy wrote:
> It does not make much sense to have KVM in book3s-64 and
> not to have IOMMU bits for PCI pass through support as it costs little
> and allows VFIO to function on book3s KVM.
> 
> Having IOMMU_API always enabled makes it unnecessary to have a lot of
> "#ifdef IOMMU_API" in arch/powerpc/kvm/book3s_64_vio*. With those
> ifdef's we could have only user space emulated devices accelerated
> (but not VFIO) which does not seem to be very useful.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  arch/powerpc/kvm/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
> index 029be26b5a17..65a471de96de 100644
> --- a/arch/powerpc/kvm/Kconfig
> +++ b/arch/powerpc/kvm/Kconfig
> @@ -67,6 +67,7 @@ config KVM_BOOK3S_64
>  	select KVM_BOOK3S_64_HANDLER
>  	select KVM
>  	select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
> +	select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
>  	---help---
>  	  Support running unmodified book3s_64 and book3s_32 guest kernels
>  	  in virtual machines on book3s_64 host processors.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH kernel 8/9] KVM: PPC: Pass kvm* to kvmppc_find_table()
  2016-12-08  8:19   ` Alexey Kardashevskiy
@ 2016-12-16  1:32     ` David Gibson
  -1 siblings, 0 replies; 47+ messages in thread
From: David Gibson @ 2016-12-16  1:32 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Alex Williamson, Paul Mackerras, kvm-ppc, kvm

[-- Attachment #1: Type: text/plain, Size: 5185 bytes --]

On Thu, Dec 08, 2016 at 07:19:55PM +1100, Alexey Kardashevskiy wrote:
> The guest view TCE tables are per KVM anyway (not per VCPU) so pass kvm*
> there. This will be used in the following patches where we will be
> attaching VFIO containers to LIOBNs via ioctl() to KVM (rather than
> to VCPU).
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  arch/powerpc/include/asm/kvm_ppc.h  |  2 +-
>  arch/powerpc/kvm/book3s_64_vio.c    |  7 ++++---
>  arch/powerpc/kvm/book3s_64_vio_hv.c | 13 +++++++------
>  3 files changed, 12 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> index f6e49640dbe1..0a21c8503974 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -167,7 +167,7 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  				struct kvm_create_spapr_tce_64 *args);
>  extern struct kvmppc_spapr_tce_table *kvmppc_find_table(
> -		struct kvm_vcpu *vcpu, unsigned long liobn);
> +		struct kvm *kvm, unsigned long liobn);
>  extern long kvmppc_ioba_validate(struct kvmppc_spapr_tce_table *stt,
>  		unsigned long ioba, unsigned long npages);
>  extern long kvmppc_tce_validate(struct kvmppc_spapr_tce_table *tt,
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> index c379ff5a4438..15df8ae627d9 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -212,12 +212,13 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		      unsigned long ioba, unsigned long tce)
>  {
> -	struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
> +	struct kvmppc_spapr_tce_table *stt;
>  	long ret;
>  
>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>  	/* 	    liobn, ioba, tce); */
>  
> +	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
>  		return H_TOO_HARD;
>  
> @@ -245,7 +246,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	u64 __user *tces;
>  	u64 tce;
>  
> -	stt = kvmppc_find_table(vcpu, liobn);
> +	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
>  		return H_TOO_HARD;
>  
> @@ -299,7 +300,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret;
>  
> -	stt = kvmppc_find_table(vcpu, liobn);
> +	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
>  		return H_TOO_HARD;
>  
> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> index a3be4bd6188f..8a6834e6e1c8 100644
> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -49,10 +49,9 @@
>   * WARNING: This will be called in real or virtual mode on HV KVM and virtual
>   *          mode on PR KVM
>   */
> -struct kvmppc_spapr_tce_table *kvmppc_find_table(struct kvm_vcpu *vcpu,
> +struct kvmppc_spapr_tce_table *kvmppc_find_table(struct kvm *kvm,
>  		unsigned long liobn)
>  {
> -	struct kvm *kvm = vcpu->kvm;
>  	struct kvmppc_spapr_tce_table *stt;
>  
>  	list_for_each_entry_lockless(stt, &kvm->arch.spapr_tce_tables, list)
> @@ -194,12 +193,13 @@ static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
>  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		unsigned long ioba, unsigned long tce)
>  {
> -	struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
> +	struct kvmppc_spapr_tce_table *stt;
>  	long ret;
>  
>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>  	/* 	    liobn, ioba, tce); */
>  
> +	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
>  		return H_TOO_HARD;
>  
> @@ -252,7 +252,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	unsigned long tces, entry, ua = 0;
>  	unsigned long *rmap = NULL;
>  
> -	stt = kvmppc_find_table(vcpu, liobn);
> +	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
>  		return H_TOO_HARD;
>  
> @@ -335,7 +335,7 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret;
>  
> -	stt = kvmppc_find_table(vcpu, liobn);
> +	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
>  		return H_TOO_HARD;
>  
> @@ -356,12 +356,13 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>  long kvmppc_h_get_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		      unsigned long ioba)
>  {
> -	struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
> +	struct kvmppc_spapr_tce_table *stt;
>  	long ret;
>  	unsigned long idx;
>  	struct page *page;
>  	u64 *tbl;
>  
> +	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
>  		return H_TOO_HARD;
>  
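
The signature change above can be pictured with a minimal user-space
sketch, assuming a simple singly linked list instead of the kernel's list
helpers; all sketch_ names are hypothetical. Because the tables hang off
the VM rather than a VCPU, the lookup needs only the VM pointer, and
callers that start from a VCPU simply pass its VM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical guest-view TCE table, keyed by its LIOBN. */
struct sketch_tce_table {
	uint64_t liobn;
	struct sketch_tce_table *next;
};

/* The tables are owned by the VM, not by any single VCPU. */
struct sketch_vm {
	struct sketch_tce_table *tables;
};

struct sketch_vcpu {
	struct sketch_vm *vm;
};

/* Find the table for a LIOBN; returns NULL when none is registered. */
static struct sketch_tce_table *sketch_find_table(struct sketch_vm *vm,
						  uint64_t liobn)
{
	struct sketch_tce_table *t;

	assert(vm != NULL);

	for (t = vm->tables; t; t = t->next)
		if (t->liobn == liobn)
			return t;

	return NULL;
}
```

A caller without any VCPU at hand, such as a VM-level ioctl() handler, can
use the same lookup, which is the motivation given in the commit message.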

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH kernel 9/9] KVM: PPC: Add in-kernel acceleration for VFIO
  2016-12-14  3:53           ` Alexey Kardashevskiy
  (?)
@ 2016-12-19 17:27             ` Alex Williamson
  -1 siblings, 0 replies; 47+ messages in thread
From: Alex Williamson @ 2016-12-19 17:27 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Paul Mackerras, linuxppc-dev, kvm, kvm-ppc, David Gibson

On Wed, 14 Dec 2016 14:53:13 +1100
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 10/12/16 02:35, Alex Williamson wrote:
> > On Fri, 9 Dec 2016 18:53:43 +1100
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >   
> >> On 09/12/16 04:55, Alex Williamson wrote:  
> >>> On Thu,  8 Dec 2016 19:19:56 +1100
> >>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >>>     
> >>>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> >>>> and H_STUFF_TCE requests targeting an IOMMU TCE table used for VFIO
> >>>> without passing them to user space, which saves time on switching
> >>>> to user space and back.
> >>>>
> >>>> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> >>>> KVM tries to handle a TCE request in real mode; if that fails,
> >>>> it passes the request to virtual mode to complete the operation.
> >>>> If the virtual mode handler fails, the request is passed to
> >>>> user space; this is not expected to happen though.
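
A minimal sketch of the fallback order described in the paragraph above,
as plain user-space C. Everything here is an assumption made for
illustration: the struct, the function names and the numeric return codes
are stand-ins, not the handlers from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in return codes for this sketch only. */
#define SKETCH_H_SUCCESS	0L
#define SKETCH_H_TOO_HARD	9999L

enum handled_by { BY_REAL_MODE, BY_VIRT_MODE, BY_USERSPACE };

/* Hypothetical request: records which in-kernel stage is able to finish it. */
struct tce_request {
	bool rm_can_handle;	/* real mode handler can complete it */
	bool vm_can_handle;	/* virtual mode handler can complete it */
};

static long rm_put_tce(const struct tce_request *req)
{
	return req->rm_can_handle ? SKETCH_H_SUCCESS : SKETCH_H_TOO_HARD;
}

static long vm_put_tce(const struct tce_request *req)
{
	return req->vm_can_handle ? SKETCH_H_SUCCESS : SKETCH_H_TOO_HARD;
}

/* Try real mode first, then virtual mode, and only then user space. */
static enum handled_by dispatch_put_tce(const struct tce_request *req)
{
	assert(req != NULL);
	if (rm_put_tce(req) != SKETCH_H_TOO_HARD)
		return BY_REAL_MODE;
	if (vm_put_tce(req) != SKETCH_H_TOO_HARD)
		return BY_VIRT_MODE;
	return BY_USERSPACE;
}
```

Each stage signals "too hard" instead of failing the hypercall, so the
next, slower stage gets a chance to complete the same request.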
> >>>>
> >>>> To avoid dealing with page use counters (which is tricky in real mode),
> >>>> this only accelerates SPAPR TCE IOMMU v2 clients which are required
> >>>> to pre-register the userspace memory. The very first TCE request will
> >>>> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> >>>> of the TCE table (iommu_table::it_userspace) is not allocated till
> >>>> the very first mapping happens and we cannot call vmalloc in real mode.
> >>>>
> >>>> This adds a new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> >>>> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> >>>> and associates a physical IOMMU table with the SPAPR TCE table (which
> >>>> is a guest view of the hardware IOMMU table). The iommu_table object
> >>>> is referenced so we do not have to retrieve it in real mode when a
> >>>> hypercall happens.
> >>>>
> >>>> This does not implement the UNSET counterpart as there is no use for it -
> >>>> once the acceleration is enabled, the existing userspace won't
> >>>> disable it unless a VFIO container is destroyed, so this adds the
> >>>> necessary cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> >>>>
> >>>> This uses the kvm->lock mutex to protect against a race between
> >>>> the VFIO KVM device's kvm_vfio_destroy() and SPAPR TCE table fd's
> >>>> release() callback.
> >>>>
> >>>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> >>>> space.
> >>>>
> >>>> This finally makes use of vfio_external_user_iommu_id() which was
> >>>> introduced quite some time ago and was considered for removal.
> >>>>
> >>>> Tests show that this patch increases transmission speed from 220MB/s
> >>>> to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb ethernet card).
> >>>>
> >>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>> ---
> >>>>  Documentation/virtual/kvm/devices/vfio.txt |  21 +-
> >>>>  arch/powerpc/include/asm/kvm_host.h        |   8 +
> >>>>  arch/powerpc/include/asm/kvm_ppc.h         |   5 +
> >>>>  include/uapi/linux/kvm.h                   |   8 +
> >>>>  arch/powerpc/kvm/book3s_64_vio.c           | 302 +++++++++++++++++++++++++++++
> >>>>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 178 +++++++++++++++++
> >>>>  arch/powerpc/kvm/powerpc.c                 |   2 +
> >>>>  virt/kvm/vfio.c                            | 108 +++++++++++
> >>>>  8 files changed, 630 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> >>>> index ef51740c67ca..ddb5a6512ab3 100644
> >>>> --- a/Documentation/virtual/kvm/devices/vfio.txt
> >>>> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> >>>> @@ -16,7 +16,24 @@ Groups:
> >>>>  
> >>>>  KVM_DEV_VFIO_GROUP attributes:
> >>>>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> >>>> +	kvm_device_attr.addr points to an int32_t file descriptor
> >>>> +	for the VFIO group.
> >>>>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
> >>>> +	kvm_device_attr.addr points to an int32_t file descriptor
> >>>> +	for the VFIO group.
> >>>> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
> >>>> +	allocated by sPAPR KVM.
> >>>> +	kvm_device_attr.addr points to a struct:
> >>>>  
> >>>> -For each, kvm_device_attr.addr points to an int32_t file descriptor
> >>>> -for the VFIO group.
> >>>> +	struct kvm_vfio_spapr_tce {
> >>>> +		__u32	argsz;
> >>>> +		__s32	groupfd;
> >>>> +		__s32	tablefd;
> >>>> +		__u8	pad[4];
> >>>> +	};
> >>>> +
> >>>> +	where
> >>>> +	@argsz is the size of kvm_vfio_spapr_tce_liobn;
> >>>> +	@groupfd is a file descriptor for a VFIO group;
> >>>> +	@tablefd is a file descriptor for a TCE table allocated via
> >>>> +		KVM_CREATE_SPAPR_TCE.
> >>>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> >>>> index 28350a294b1e..94774503c70d 100644
> >>>> --- a/arch/powerpc/include/asm/kvm_host.h
> >>>> +++ b/arch/powerpc/include/asm/kvm_host.h
> >>>> @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
> >>>>  	atomic_t refcnt;
> >>>>  };
> >>>>  
> >>>> +struct kvmppc_spapr_tce_iommu_table {
> >>>> +	struct rcu_head rcu;
> >>>> +	struct list_head next;
> >>>> +	struct iommu_table *tbl;
> >>>> +	atomic_t refs;
> >>>> +};
> >>>> +
> >>>>  struct kvmppc_spapr_tce_table {
> >>>>  	struct list_head list;
> >>>>  	struct kvm *kvm;
> >>>> @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
> >>>>  	u32 page_shift;
> >>>>  	u64 offset;		/* in pages */
> >>>>  	u64 size;		/* window size in pages */
> >>>> +	struct list_head iommu_tables;
> >>>>  	struct page *pages[0];
> >>>>  };
> >>>>  
> >>>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> >>>> index 0a21c8503974..17b947a0060d 100644
> >>>> --- a/arch/powerpc/include/asm/kvm_ppc.h
> >>>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> >>>> @@ -163,6 +163,11 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
> >>>>  extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
> >>>>  			struct kvm_memory_slot *memslot, unsigned long porder);
> >>>>  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
> >>>> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm,
> >>>> +				int tablefd,
> >>>> +				struct iommu_group *grp);
> >>>> +extern void kvm_spapr_tce_detach_iommu_group(struct kvm *kvm,
> >>>> +				struct iommu_group *grp);
> >>>>  
> >>>>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> >>>>  				struct kvm_create_spapr_tce_64 *args);
> >>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >>>> index 810f74317987..9e4025724e28 100644
> >>>> --- a/include/uapi/linux/kvm.h
> >>>> +++ b/include/uapi/linux/kvm.h
> >>>> @@ -1068,6 +1068,7 @@ struct kvm_device_attr {
> >>>>  #define  KVM_DEV_VFIO_GROUP			1
> >>>>  #define   KVM_DEV_VFIO_GROUP_ADD			1
> >>>>  #define   KVM_DEV_VFIO_GROUP_DEL			2
> >>>> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
> >>>>  
> >>>>  enum kvm_device_type {
> >>>>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
> >>>> @@ -1089,6 +1090,13 @@ enum kvm_device_type {
> >>>>  	KVM_DEV_TYPE_MAX,
> >>>>  };
> >>>>  
> >>>> +struct kvm_vfio_spapr_tce {
> >>>> +	__u32	argsz;
> >>>> +	__s32	groupfd;
> >>>> +	__s32	tablefd;
> >>>> +	__u8	pad[4];
> >>>> +};    
> >>>
> >>> I assume you're implementing argsz and padding for future expansion,
> >>> but it doesn't really work.  Presumably argsz would be set to 16, so
> >>> the only way that the kernel will ever know something has changed would
> >>> be to make it bigger, so the padding bytes are really reserved, and
> >>> then it's not clear why we have padding at all.  If you replaced the
> >>> padding with a __u32 flags, then we could actually have some room to
> >>> architect expansion, but as it is we might as well drop both argsz and
> >>> padding.    
> >>
> >> Ah, right, sorry, did not pay attention to this bit this time. I'll replace
> >> pad with flags and move to argsz.
> >>
> >>  
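
A minimal sketch of the argsz/flags point discussed above, as plain
user-space C. The struct name, the field order and the "no flags defined
yet" mask are assumptions for illustration; they are not taken from any
later revision of the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: the reviewed struct with pad[4] replaced by flags. */
struct vfio_spapr_tce_sketch {
	uint32_t argsz;		/* size the caller believes the struct has */
	uint32_t flags;		/* must be zero until a bit is defined */
	int32_t  groupfd;
	int32_t  tablefd;
};

#define SKETCH_KNOWN_FLAGS 0u	/* no flag bits defined in this sketch */

/*
 * Accept callers whose struct is at least as large as ours and reject
 * flag bits we do not understand. A later extension could define a flag
 * bit to announce new trailing fields, which reserved padding cannot do.
 */
static int check_spapr_tce_args(const struct vfio_spapr_tce_sketch *arg)
{
	assert(arg != NULL);
	if (arg->argsz < sizeof(*arg))
		return -EINVAL;
	if (arg->flags & ~SKETCH_KNOWN_FLAGS)
		return -EINVAL;
	return 0;
}
```

With only padding, the sole signal of a changed layout would be a larger
argsz; a flags word lets old kernels reject requests they cannot honour.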
> >>>     
> >>>> +
> >>>>  /*
> >>>>   * ioctls for VM fds
> >>>>   */    
> >>> ...    
> >>>> --- a/virt/kvm/vfio.c
> >>>> +++ b/virt/kvm/vfio.c
> >>>> @@ -20,6 +20,10 @@
> >>>>  #include <linux/vfio.h>
> >>>>  #include "vfio.h"
> >>>>  
> >>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>> +#include <asm/kvm_ppc.h>
> >>>> +#endif
> >>>> +
> >>>>  struct kvm_vfio_group {
> >>>>  	struct list_head node;
> >>>>  	struct vfio_group *vfio_group;
> >>>> @@ -76,6 +80,22 @@ static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
> >>>>  	return ret > 0;
> >>>>  }
> >>>>  
> >>>> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> >>>> +{
> >>>> +	int (*fn)(struct vfio_group *);
> >>>> +	int ret = -1;
> >>>> +
> >>>> +	fn = symbol_get(vfio_external_user_iommu_id);
> >>>> +	if (!fn)
> >>>> +		return ret;
> >>>> +
> >>>> +	ret = fn(vfio_group);
> >>>> +
> >>>> +	symbol_put(vfio_external_user_iommu_id);
> >>>> +
> >>>> +	return ret;
> >>>> +}
> >>>> +
> >>>>  /*
> >>>>   * Groups can use the same or different IOMMU domains.  If the same then
> >>>>   * adding a new group may change the coherency of groups we've previously
> >>>> @@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct kvm_device *dev)
> >>>>  	mutex_unlock(&kv->lock);
> >>>>  }
> >>>>  
> >>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>> +static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
> >>>> +		struct vfio_group *vfio_group)
> >>>> +{
> >>>> +	int group_id;
> >>>> +	struct iommu_group *grp;
> >>>> +
> >>>> +	group_id = kvm_vfio_external_user_iommu_id(vfio_group);
> >>>> +	grp = iommu_group_get_by_id(group_id);
> >>>> +	if (grp) {
> >>>> +		kvm_spapr_tce_detach_iommu_group(kvm, grp);
> >>>> +		iommu_group_put(grp);
> >>>> +	}
> >>>> +}
> >>>> +#endif    
> >>>
> >>>
> >>> I wonder if you could use the new vfio group notifier to avoid tainting
> >>> this code with SPAPR_TCE #ifdefs.  Thanks,    
> >>
> >> I cannot see how... The new notifier sets kvm to a group, I need the
> >> opposite - attach a group to kvm, and not just to KVM but to a specific KVM
> >> SPAPR TCE fd (which is a child object of KVM and which owns a LIOBN bus id).
> >>
> >> The only way I see to avoid tainting this code is adding another
> >> ioctl() to PPC KVM (or its child object - SPAPR TCE object), and I tried
> >> that a few years ago and was told to add a KVM device or even reuse the
> >> VFIO KVM device.
> >>
> >> What am I missing here?  
> > 
> > You would still need a KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE, but the ugly
> > part of encompassing this all in #ifdefs is that we call
> > kvm_spapr_tce_{at,de}tach_iommu_group() directly. The notifier would
> > sort of make it like an arch callback, vfio core would set these
> > attributes and broadcast them to notifier callbacks, on non-spapr-tce
> > platforms nobody would be listening for those notifications.
> > Ultimately I don't know how much cleaner it makes things, but it maybe
> > avoids spapr-tce #ifdefs leaking into every layer of the stack.  Thanks,  
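
A minimal sketch of the "broadcast to whoever is listening" idea from the
paragraph above, as plain user-space C. It is not the VFIO group notifier
API; every name below is hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LISTENERS 4

typedef void (*group_event_fn)(void *listener_data, int tablefd);

struct listener {
	group_event_fn fn;
	void *data;
};

struct notifier_chain {
	struct listener slots[MAX_LISTENERS];
	size_t count;
};

/* Arch code that cares about the event registers a callback. */
static int chain_register(struct notifier_chain *c, group_event_fn fn,
			  void *data)
{
	assert(c != NULL && fn != NULL);
	if (c->count == MAX_LISTENERS)
		return -1;
	c->slots[c->count].fn = fn;
	c->slots[c->count].data = data;
	c->count++;
	return 0;
}

/* Generic code broadcasts; returns how many listeners were called. */
static size_t chain_broadcast(const struct notifier_chain *c, int tablefd)
{
	size_t i;

	for (i = 0; i < c->count; i++)
		c->slots[i].fn(c->slots[i].data, tablefd);
	return c->count;
}

/* Example listener: remembers the last table fd it was told about. */
static void spapr_listener(void *data, int tablefd)
{
	*(int *)data = tablefd;
}
```

The generic side contains no platform #ifdef: on a platform where nothing
registers, the broadcast simply reaches zero listeners.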
> 
> I am failing here.
> 
> The normal workflow:
> - create SPAPR TCE object in KVM, represents 1 LIOBN aka DMA window;
> - attach IOMMU group to SPAPR TCE object, in this step the hardware tables
> (1 or 2 iommu_table objects) are put to the SPAPR TCE's list of attached
> tables; the tables are referenced.
> 
> When reboot happens, the SPAPR TCE object is destroyed and new guest starts
> from the very beginning.
> 
> 
> The task I am solving: dereference iommu_table (hardware table) in 2 cases:
> 1) guest reboot - SPAPR TCE table is destroyed but VFIO KVM device is still
> there with all attached groups;
> 2) VFIO PCI hot unplug - SPAPR TCE table is there but groups need to be
> detached from the VFIO KVM device.
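
A minimal sketch of the two release paths listed above, as plain
user-space C. The types and function names are hypothetical; the sketch
only shows that a reference taken at attach time has to be dropped exactly
once, whichever of the two events comes first.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ATTACHED 4

struct hw_table {
	int refs;			/* references held on the hardware table */
};

struct attachment {
	struct hw_table *tbl;		/* NULL when the slot is free */
	int group_id;
};

struct tce_object {
	struct attachment att[MAX_ATTACHED];
};

/* Attach: remember the table for this group and take a reference. */
static int tce_attach(struct tce_object *obj, struct hw_table *tbl,
		      int group_id)
{
	size_t i;

	assert(obj != NULL && tbl != NULL);
	for (i = 0; i < MAX_ATTACHED; i++) {
		if (!obj->att[i].tbl) {
			obj->att[i].tbl = tbl;
			obj->att[i].group_id = group_id;
			tbl->refs++;
			return 0;
		}
	}
	return -1;
}

/* Case 2 (hot unplug): drop only the references held for one group. */
static void tce_detach_group(struct tce_object *obj, int group_id)
{
	size_t i;

	for (i = 0; i < MAX_ATTACHED; i++) {
		if (obj->att[i].tbl && obj->att[i].group_id == group_id) {
			obj->att[i].tbl->refs--;
			obj->att[i].tbl = NULL;
		}
	}
}

/* Case 1 (guest reboot): the object goes away, drop everything left. */
static void tce_destroy(struct tce_object *obj)
{
	size_t i;

	for (i = 0; i < MAX_ATTACHED; i++) {
		if (obj->att[i].tbl) {
			obj->att[i].tbl->refs--;
			obj->att[i].tbl = NULL;
		}
	}
}
```

Clearing the slot on release makes both paths idempotent, so running the
hot-unplug path and then the reboot path cannot drop a reference twice.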
> 
> 
> Tried fixing 2) with the new callbacks:
> 
> - they do not take iommu_group, they take devices - fixed by duplicating
> the vfio_(un)register_notifier API:
>   * vfio_iommu_group_register_notifier()
>   * vfio_iommu_group_unregister_notifier()
> plus kvm wrappers with symbol_get/symbol_put in
> arch/powerpc/kvm/book3s_64_vio.c.
> 
> - the callbacks are registered per IOMMU group; the only place in this
> code which knows about groups is the SPAPR TCE driver, but its attach_group()
> callback is called when the IOMMU driver is not yet set ->
> vfio_register_notifier() fails. KVM does not know about groups until
> KVM_DEV_VFIO_GROUP_ADD/KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE so I register
> the callback in KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE;
> 
> - the callback does not pass vfio_group pointer, only kvm; so my notifier
> needs to be wrapped into a struct with a group pointer, ok, done;
> 
> - the notifier needs to be unregistered and the wrapper struct from the
> previous step needs to be freed. No nice mechanisms for that - I cannot
> unregister a notifier from a notifier itself. I fixed it by calling
> rcu_sched() from the notifier when KVM==NULL and RCU-scheduled callback
> calls vfio_iommu_group_unregister_notifier().
> 
> Looks quite ugly already but ok.
> 
> 
> Now I am trying to solve 1). I can dereference iommu_table objects but
> registered notifiers remain in memory and they are not freed so each guest
> reboot increases the list length. I do not have a way to access vfio_group
> structs from KVM, there I only have a list of iommu_table structs, each of
> which has a list of iommu_group structs (this is done this way to make real
> mode handlers possible) but there is no way to get to vfio_group struct
> from iommu_group.
> 
> I can duplicate the group list once again, this time it will be a vfio_group
> list attached to the SPAPR TCE object, but this all seems to be way too much,
> doesn't it?...

Ok, thanks for trying.  It does seem like it gets pretty complicated,
too bad.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH kernel 9/9] KVM: PPC: Add in-kernel acceleration for VFIO
@ 2016-12-19 17:27             ` Alex Williamson
  0 siblings, 0 replies; 47+ messages in thread
From: Alex Williamson @ 2016-12-19 17:27 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, David Gibson, Paul Mackerras, kvm-ppc, kvm

On Wed, 14 Dec 2016 14:53:13 +1100
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 10/12/16 02:35, Alex Williamson wrote:
> > On Fri, 9 Dec 2016 18:53:43 +1100
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >   
> >> On 09/12/16 04:55, Alex Williamson wrote:  
> >>> On Thu,  8 Dec 2016 19:19:56 +1100
> >>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >>>     
> >>>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> >>>> and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO
> >>>> without passing them to user space which saves time on switching
> >>>> to user space and back.
> >>>>
> >>>> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> >>>> KVM tries to handle a TCE request in the real mode, if failed
> >>>> it passes the request to the virtual mode to complete the operation.
> >>>> If it a virtual mode handler fails, the request is passed to
> >>>> the user space; this is not expected to happen though.
> >>>>
> >>>> To avoid dealing with page use counters (which is tricky in real mode),
> >>>> this only accelerates SPAPR TCE IOMMU v2 clients which are required
> >>>> to pre-register the userspace memory. The very first TCE request will
> >>>> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> >>>> of the TCE table (iommu_table::it_userspace) is not allocated till
> >>>> the very first mapping happens and we cannot call vmalloc in real mode.
> >>>>
> >>>> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> >>>> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> >>>> and associates a physical IOMMU table with the SPAPR TCE table (which
> >>>> is a guest view of the hardware IOMMU table). The iommu_table object
> >>>> is referenced so we do not have to retrieve in real mode when hypercall
> >>>> happens.
> >>>>
> >>>> This does not implement the UNSET counterpart as there is no use for it -
> >>>> once the acceleration is enabled, the existing userspace won't
> >>>> disable it unless a VFIO container is detroyed so this adds necessary
> >>>> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> >>>>
> >>>> This uses the kvm->lock mutex to protect against a race between
> >>>> the VFIO KVM device's kvm_vfio_destroy() and SPAPR TCE table fd's
> >>>> release() callback.
> >>>>
> >>>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> >>>> space.
> >>>>
> >>>> This finally makes use of vfio_external_user_iommu_id() which was
> >>>> introduced quite some time ago and was considered for removal.
> >>>>
> >>>> Tests show that this patch increases transmission speed from 220MB/s
> >>>> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
> >>>>
> >>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>> ---
> >>>>  Documentation/virtual/kvm/devices/vfio.txt |  21 +-
> >>>>  arch/powerpc/include/asm/kvm_host.h        |   8 +
> >>>>  arch/powerpc/include/asm/kvm_ppc.h         |   5 +
> >>>>  include/uapi/linux/kvm.h                   |   8 +
> >>>>  arch/powerpc/kvm/book3s_64_vio.c           | 302 +++++++++++++++++++++++++++++
> >>>>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 178 +++++++++++++++++
> >>>>  arch/powerpc/kvm/powerpc.c                 |   2 +
> >>>>  virt/kvm/vfio.c                            | 108 +++++++++++
> >>>>  8 files changed, 630 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> >>>> index ef51740c67ca..ddb5a6512ab3 100644
> >>>> --- a/Documentation/virtual/kvm/devices/vfio.txt
> >>>> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> >>>> @@ -16,7 +16,24 @@ Groups:
> >>>>  
> >>>>  KVM_DEV_VFIO_GROUP attributes:
> >>>>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> >>>> +	kvm_device_attr.addr points to an int32_t file descriptor
> >>>> +	for the VFIO group.
> >>>>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
> >>>> +	kvm_device_attr.addr points to an int32_t file descriptor
> >>>> +	for the VFIO group.
> >>>> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
> >>>> +	allocated by sPAPR KVM.
> >>>> +	kvm_device_attr.addr points to a struct:
> >>>>  
> >>>> -For each, kvm_device_attr.addr points to an int32_t file descriptor
> >>>> -for the VFIO group.
> >>>> +	struct kvm_vfio_spapr_tce {
> >>>> +		__u32	argsz;
> >>>> +		__s32	groupfd;
> >>>> +		__s32	tablefd;
> >>>> +		__u8	pad[4];
> >>>> +	};
> >>>> +
> >>>> +	where
> >>>> +	@argsz is the size of kvm_vfio_spapr_tce_liobn;
> >>>> +	@groupfd is a file descriptor for a VFIO group;
> >>>> +	@tablefd is a file descriptor for a TCE table allocated via
> >>>> +		KVM_CREATE_SPAPR_TCE.
> >>>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> >>>> index 28350a294b1e..94774503c70d 100644
> >>>> --- a/arch/powerpc/include/asm/kvm_host.h
> >>>> +++ b/arch/powerpc/include/asm/kvm_host.h
> >>>> @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
> >>>>  	atomic_t refcnt;
> >>>>  };
> >>>>  
> >>>> +struct kvmppc_spapr_tce_iommu_table {
> >>>> +	struct rcu_head rcu;
> >>>> +	struct list_head next;
> >>>> +	struct iommu_table *tbl;
> >>>> +	atomic_t refs;
> >>>> +};
> >>>> +
> >>>>  struct kvmppc_spapr_tce_table {
> >>>>  	struct list_head list;
> >>>>  	struct kvm *kvm;
> >>>> @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
> >>>>  	u32 page_shift;
> >>>>  	u64 offset;		/* in pages */
> >>>>  	u64 size;		/* window size in pages */
> >>>> +	struct list_head iommu_tables;
> >>>>  	struct page *pages[0];
> >>>>  };
> >>>>  
> >>>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> >>>> index 0a21c8503974..17b947a0060d 100644
> >>>> --- a/arch/powerpc/include/asm/kvm_ppc.h
> >>>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> >>>> @@ -163,6 +163,11 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
> >>>>  extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
> >>>>  			struct kvm_memory_slot *memslot, unsigned long porder);
> >>>>  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
> >>>> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm,
> >>>> +				int tablefd,
> >>>> +				struct iommu_group *grp);
> >>>> +extern void kvm_spapr_tce_detach_iommu_group(struct kvm *kvm,
> >>>> +				struct iommu_group *grp);
> >>>>  
> >>>>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> >>>>  				struct kvm_create_spapr_tce_64 *args);
> >>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >>>> index 810f74317987..9e4025724e28 100644
> >>>> --- a/include/uapi/linux/kvm.h
> >>>> +++ b/include/uapi/linux/kvm.h
> >>>> @@ -1068,6 +1068,7 @@ struct kvm_device_attr {
> >>>>  #define  KVM_DEV_VFIO_GROUP			1
> >>>>  #define   KVM_DEV_VFIO_GROUP_ADD			1
> >>>>  #define   KVM_DEV_VFIO_GROUP_DEL			2
> >>>> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
> >>>>  
> >>>>  enum kvm_device_type {
> >>>>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
> >>>> @@ -1089,6 +1090,13 @@ enum kvm_device_type {
> >>>>  	KVM_DEV_TYPE_MAX,
> >>>>  };
> >>>>  
> >>>> +struct kvm_vfio_spapr_tce {
> >>>> +	__u32	argsz;
> >>>> +	__s32	groupfd;
> >>>> +	__s32	tablefd;
> >>>> +	__u8	pad[4];
> >>>> +};    
> >>>
> >>> I assume you're implementing argsz and padding for future expansion,
> >>> but it doesn't really work.  Presumably argsz would be set to 16, so
> >>> the only way that the kernel will ever know something has changed would
> >>> be to make it bigger, so the padding bytes are really reserved, and
> >>> then it's not clear why we have padding at all.  If you replaced the
> >>> padding with a __u32 flags, then we could actually have some room to
> >>> architect expansion, but as it is we might as well drop both argsz and
> >>> padding.    
> >>
> >> Ah, right, sorry, did not pay attention to this bit this time. I'll replace
> >> pad with flags and move to argsz.
> >>
> >>  
> >>>     
> >>>> +
> >>>>  /*
> >>>>   * ioctls for VM fds
> >>>>   */    
> >>> ...    
> >>>> --- a/virt/kvm/vfio.c
> >>>> +++ b/virt/kvm/vfio.c
> >>>> @@ -20,6 +20,10 @@
> >>>>  #include <linux/vfio.h>
> >>>>  #include "vfio.h"
> >>>>  
> >>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>> +#include <asm/kvm_ppc.h>
> >>>> +#endif
> >>>> +
> >>>>  struct kvm_vfio_group {
> >>>>  	struct list_head node;
> >>>>  	struct vfio_group *vfio_group;
> >>>> @@ -76,6 +80,22 @@ static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
> >>>>  	return ret > 0;
> >>>>  }
> >>>>  
> >>>> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> >>>> +{
> >>>> +	int (*fn)(struct vfio_group *);
> >>>> +	int ret = -1;
> >>>> +
> >>>> +	fn = symbol_get(vfio_external_user_iommu_id);
> >>>> +	if (!fn)
> >>>> +		return ret;
> >>>> +
> >>>> +	ret = fn(vfio_group);
> >>>> +
> >>>> +	symbol_put(vfio_external_user_iommu_id);
> >>>> +
> >>>> +	return ret;
> >>>> +}
> >>>> +
> >>>>  /*
> >>>>   * Groups can use the same or different IOMMU domains.  If the same then
> >>>>   * adding a new group may change the coherency of groups we've previously
> >>>> @@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct kvm_device *dev)
> >>>>  	mutex_unlock(&kv->lock);
> >>>>  }
> >>>>  
> >>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>> +static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
> >>>> +		struct vfio_group *vfio_group)
> >>>> +{
> >>>> +	int group_id;
> >>>> +	struct iommu_group *grp;
> >>>> +
> >>>> +	group_id = kvm_vfio_external_user_iommu_id(vfio_group);
> >>>> +	grp = iommu_group_get_by_id(group_id);
> >>>> +	if (grp) {
> >>>> +		kvm_spapr_tce_detach_iommu_group(kvm, grp);
> >>>> +		iommu_group_put(grp);
> >>>> +	}
> >>>> +}
> >>>> +#endif    
> >>>
> >>>
> >>> I wonder if you could use the new vfio group notifier to avoid tainting
> >>> this code with SPAPR_TCE #ifdefs.  Thanks,    
> >>
> >> I cannot see how... The new notifier sets kvm to a group, I need the
> >> opposite - attach a group to kvm and not just to KVM but to a specific KVM
> >> SPAPR TCE fd (which is a child object of KVM and which owns a LIOBN bus id).
> >>
> >> The only way I see how I can avoid tainting this code is adding another
> >> ioctl() to PPC KVM (or its child object - SPAPR TCE object), and I tried
> >> that few years ago and I was told to add a KVM device or even reuse VFIO
> >> KVM device.
> >>
> >> What am I missing here?  
> > 
> > You would still need a KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE, but the ugly
> > part of encompassing this all in #ifdefs is that we call
> > kvm_spapr_tce_{at,de}tach_iommu_group() directly. The notifier would
> > sort of make it like an arch callback, vfio core would set these
> > attributes and broadcast them to notifier callbacks, on non-spapr-tce
> > platforms nobody would be listening for those notifications.
> > Ultimately I don't know how much cleaner it makes things, but it maybe
> > avoids spapr-tce #ifdefs leaking into every layer of the stack.  Thanks,  
> 
> I am failing here.
> 
> The normal workflow:
> - create SPAPR TCE object in KVM, represents 1 LIOBN aka DMA window;
> - attach IOMMU group to SPAPR TCE object, in this step the hardware tables
> (1 or 2 iommu_table objects) are put to the SPAPR TCE's list of attached
> tables; the tables are referenced.
> 
> When reboot happens, the SPAPR TCE object is destroyed and new guest starts
> from the very beginning.
> 
> 
> The task I am solving: dereference iommu_table (hardware table) in 2 cases:
> 1) guest reboot - SPAPR TCE table is destroyed but VFIO KVM device is still
> there with all attached groups;
> 2) VFIO PCI hot unplug - SPAPR TCE table is there but groups need to be
> detached from the VFIO KVM device.
> 
> 
> Tried fixing 2) with the new callbacks:
> 
> - they do not take iommu_group, they take devices - fixed by duplicating
> the vfio_(un)register_notifier API:
>   * vfio_iommu_group_register_notifier()
>   * vfio_iommu_group_unregister_notifier()
> plus kvm wrappers with symbol_get/symbol_put in
> arch/powerpc/kvm/book3s_64_vio.c.
> 
> - the callbacks are registered per a IOMMU group, the only place in this
> code which knows about groups is SPAPR TCE driver but attach_group()
> callback is called when IOMMU driver is not yet set ->
> vfio_register_notifier() fails. KVM does not know about groups until
> KVM_DEV_VFIO_GROUP_ADD/KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE so I register
> callback in KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE;
> 
> - the callback does not pass vfio_group pointer, only kvm; so my notifier
> needs to be wrapped into a struct with a group pointer, ok, done;
> 
> - the notifier needs to be unregistered and the wrapper struct from the
> previous step needs to be freed. No nice mechanisms for that - I cannot
> unregister a notifier from a notifier itself. I fixed it by calling
> rcu_sched() from the notifier when KVM==NULL and RCU-scheduled callback
> calls vfio_iommu_group_unregister_notifier().
> 
> Looks quite ugly already but ok.
> 
> 
> Now I am trying to solve 1). I can dereference iommu_table objects but
> registered notifiers remain in memory and they are not freed so each guest
> reboot increases the list length. I do not have a way to access vfio_group
> structs from KVM, there I only have a list of iommu_table structs, each of
> which has a list of iommu_group structs (this is done this way to make real
> mode handlers possible) but there is no way to get to vfio_group struct
> from iommu_group.
> 
> I can duplicate the group list once again, this time it will be a vfio_group
> list attached to the SPAPR TCE object, but this all seems to be way too much,
> does it not?...

Ok, thanks for trying.  It does seem like it gets pretty complicated,
too bad.  Thanks,

Alex


* Re: [PATCH kernel 9/9] KVM: PPC: Add in-kernel acceleration for VFIO
@ 2016-12-19 17:27             ` Alex Williamson
  0 siblings, 0 replies; 47+ messages in thread
From: Alex Williamson @ 2016-12-19 17:27 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Paul Mackerras, linuxppc-dev, kvm, kvm-ppc, David Gibson

On Wed, 14 Dec 2016 14:53:13 +1100
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 10/12/16 02:35, Alex Williamson wrote:
> > On Fri, 9 Dec 2016 18:53:43 +1100
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >   
> >> On 09/12/16 04:55, Alex Williamson wrote:  
> >>> On Thu,  8 Dec 2016 19:19:56 +1100
> >>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >>>     
> >>>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> >>>> and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
> >>>> without passing them to user space which saves time on switching
> >>>> to user space and back.
> >>>>
> >>>> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> >>>> KVM tries to handle a TCE request in real mode; if that fails,
> >>>> it passes the request to virtual mode to complete the operation.
> >>>> If a virtual mode handler fails, the request is passed to
> >>>> user space; this is not expected to happen though.
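
The three-level fallback described in the paragraph above can be sketched in plain user-space C. This is only an illustration under stated assumptions: the struct tce_request fields, the rm_put_tce()/vm_put_tce() helpers and dispatch_put_tce() are hypothetical names, not functions from the patch, and H_TOO_HARD is used here simply as the "this level cannot finish the request" sentinel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define H_SUCCESS   0
#define H_TOO_HARD  9999   /* assumed sentinel: pass the request to the next level */

/* Which level ended up completing the request. */
enum handled_by { BY_REAL_MODE, BY_VIRT_MODE, BY_USERSPACE };

/* Hypothetical request: the flags only model whether a level can finish it. */
struct tce_request {
	bool real_mode_can_finish;
	bool virt_mode_can_finish;
};

/* Stand-in for a real mode handler: fast, but restricted in what it may do. */
static long rm_put_tce(const struct tce_request *r)
{
	return r->real_mode_can_finish ? H_SUCCESS : H_TOO_HARD;
}

/* Stand-in for a virtual mode handler: slower, fewer restrictions. */
static long vm_put_tce(const struct tce_request *r)
{
	return r->virt_mode_can_finish ? H_SUCCESS : H_TOO_HARD;
}

/* Try real mode first, then virtual mode, and only then give up to user space. */
static enum handled_by dispatch_put_tce(const struct tce_request *r)
{
	assert(r != NULL);
	if (rm_put_tce(r) != H_TOO_HARD)
		return BY_REAL_MODE;
	if (vm_put_tce(r) != H_TOO_HARD)
		return BY_VIRT_MODE;
	return BY_USERSPACE;
}
```

In this sketch a request only reaches user space when both in-kernel levels decline it, which matches the "not expected to happen" remark for the common case.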
> >>>>
> >>>> To avoid dealing with page use counters (which is tricky in real mode),
> >>>> this only accelerates SPAPR TCE IOMMU v2 clients which are required
> >>>> to pre-register the userspace memory. The very first TCE request will
> >>>> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> >>>> of the TCE table (iommu_table::it_userspace) is not allocated till
> >>>> the very first mapping happens and we cannot call vmalloc in real mode.
> >>>>
> >>>> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> >>>> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> >>>> and associates a physical IOMMU table with the SPAPR TCE table (which
> >>>> is a guest view of the hardware IOMMU table). The iommu_table object
> >>>> is referenced so we do not have to retrieve it in real mode when a hypercall
> >>>> happens.
> >>>>
> >>>> This does not implement the UNSET counterpart as there is no use for it -
> >>>> once the acceleration is enabled, the existing userspace won't
> >>>> disable it unless a VFIO container is destroyed so this adds the necessary
> >>>> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> >>>>
> >>>> This uses the kvm->lock mutex to protect against a race between
> >>>> the VFIO KVM device's kvm_vfio_destroy() and SPAPR TCE table fd's
> >>>> release() callback.
> >>>>
> >>>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> >>>> space.
> >>>>
> >>>> This finally makes use of vfio_external_user_iommu_id() which was
> >>>> introduced quite some time ago and was considered for removal.
> >>>>
> >>>> Tests show that this patch increases transmission speed from 220MB/s
> >>>> to 750..1020MB/s on 10Gb network (Chelsio CXGB3 10Gb ethernet card).
> >>>>
> >>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>> ---
> >>>>  Documentation/virtual/kvm/devices/vfio.txt |  21 +-
> >>>>  arch/powerpc/include/asm/kvm_host.h        |   8 +
> >>>>  arch/powerpc/include/asm/kvm_ppc.h         |   5 +
> >>>>  include/uapi/linux/kvm.h                   |   8 +
> >>>>  arch/powerpc/kvm/book3s_64_vio.c           | 302 +++++++++++++++++++++++++++++
> >>>>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 178 +++++++++++++++++
> >>>>  arch/powerpc/kvm/powerpc.c                 |   2 +
> >>>>  virt/kvm/vfio.c                            | 108 +++++++++++
> >>>>  8 files changed, 630 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> >>>> index ef51740c67ca..ddb5a6512ab3 100644
> >>>> --- a/Documentation/virtual/kvm/devices/vfio.txt
> >>>> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> >>>> @@ -16,7 +16,24 @@ Groups:
> >>>>  
> >>>>  KVM_DEV_VFIO_GROUP attributes:
> >>>>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> >>>> +	kvm_device_attr.addr points to an int32_t file descriptor
> >>>> +	for the VFIO group.
> >>>>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
> >>>> +	kvm_device_attr.addr points to an int32_t file descriptor
> >>>> +	for the VFIO group.
> >>>> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
> >>>> +	allocated by sPAPR KVM.
> >>>> +	kvm_device_attr.addr points to a struct:
> >>>>  
> >>>> -For each, kvm_device_attr.addr points to an int32_t file descriptor
> >>>> -for the VFIO group.
> >>>> +	struct kvm_vfio_spapr_tce {
> >>>> +		__u32	argsz;
> >>>> +		__s32	groupfd;
> >>>> +		__s32	tablefd;
> >>>> +		__u8	pad[4];
> >>>> +	};
> >>>> +
> >>>> +	where
> >>>> +	@argsz is the size of kvm_vfio_spapr_tce;
> >>>> +	@groupfd is a file descriptor for a VFIO group;
> >>>> +	@tablefd is a file descriptor for a TCE table allocated via
> >>>> +		KVM_CREATE_SPAPR_TCE.
> >>>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> >>>> index 28350a294b1e..94774503c70d 100644
> >>>> --- a/arch/powerpc/include/asm/kvm_host.h
> >>>> +++ b/arch/powerpc/include/asm/kvm_host.h
> >>>> @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
> >>>>  	atomic_t refcnt;
> >>>>  };
> >>>>  
> >>>> +struct kvmppc_spapr_tce_iommu_table {
> >>>> +	struct rcu_head rcu;
> >>>> +	struct list_head next;
> >>>> +	struct iommu_table *tbl;
> >>>> +	atomic_t refs;
> >>>> +};
> >>>> +
> >>>>  struct kvmppc_spapr_tce_table {
> >>>>  	struct list_head list;
> >>>>  	struct kvm *kvm;
> >>>> @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
> >>>>  	u32 page_shift;
> >>>>  	u64 offset;		/* in pages */
> >>>>  	u64 size;		/* window size in pages */
> >>>> +	struct list_head iommu_tables;
> >>>>  	struct page *pages[0];
> >>>>  };
> >>>>  
> >>>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> >>>> index 0a21c8503974..17b947a0060d 100644
> >>>> --- a/arch/powerpc/include/asm/kvm_ppc.h
> >>>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> >>>> @@ -163,6 +163,11 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
> >>>>  extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
> >>>>  			struct kvm_memory_slot *memslot, unsigned long porder);
> >>>>  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
> >>>> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm,
> >>>> +				int tablefd,
> >>>> +				struct iommu_group *grp);
> >>>> +extern void kvm_spapr_tce_detach_iommu_group(struct kvm *kvm,
> >>>> +				struct iommu_group *grp);
> >>>>  
> >>>>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> >>>>  				struct kvm_create_spapr_tce_64 *args);
> >>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >>>> index 810f74317987..9e4025724e28 100644
> >>>> --- a/include/uapi/linux/kvm.h
> >>>> +++ b/include/uapi/linux/kvm.h
> >>>> @@ -1068,6 +1068,7 @@ struct kvm_device_attr {
> >>>>  #define  KVM_DEV_VFIO_GROUP			1
> >>>>  #define   KVM_DEV_VFIO_GROUP_ADD			1
> >>>>  #define   KVM_DEV_VFIO_GROUP_DEL			2
> >>>> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
> >>>>  
> >>>>  enum kvm_device_type {
> >>>>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
> >>>> @@ -1089,6 +1090,13 @@ enum kvm_device_type {
> >>>>  	KVM_DEV_TYPE_MAX,
> >>>>  };
> >>>>  
> >>>> +struct kvm_vfio_spapr_tce {
> >>>> +	__u32	argsz;
> >>>> +	__s32	groupfd;
> >>>> +	__s32	tablefd;
> >>>> +	__u8	pad[4];
> >>>> +};    
> >>>
> >>> I assume you're implementing argsz and padding for future expansion,
> >>> but it doesn't really work.  Presumably argsz would be set to 16, so
> >>> the only way that the kernel will ever know something has changed would
> >>> be to make it bigger, so the padding bytes are really reserved, and
> >>> then it's not clear why we have padding at all.  If you replaced the
> >>> padding with a __u32 flags, then we could actually have some room to
> >>> architect expansion, but as it is we might as well drop both argsz and
> >>> padding.    
> >>
> >> Ah, right, sorry, did not pay attention to this bit this time. I'll replace
> >> pad with flags and move to argsz.
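
The argsz-plus-flags convention discussed above can be sketched as follows. This is a minimal user-space illustration, assuming a layout in which the pad bytes are replaced by a flags word; struct spapr_tce_args, ARGS_END(), SPAPR_TCE_KNOWN_FLAGS and check_args() are hypothetical names and do not come from the posted patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: pad[4] of the posted struct replaced by a flags word. */
struct spapr_tce_args {
	uint32_t argsz;    /* size the caller claims to have filled in */
	uint32_t flags;    /* must be zero until a flag is defined */
	int32_t  groupfd;
	int32_t  tablefd;
};

/* Size of the struct up to and including the given member. */
#define ARGS_END(member) \
	(offsetof(struct spapr_tce_args, member) + \
	 sizeof(((struct spapr_tce_args *)0)->member))

#define SPAPR_TCE_KNOWN_FLAGS 0u   /* no flags are defined in this sketch */

/* Returns 0 if the arguments are acceptable, -EINVAL otherwise. */
static int check_args(const struct spapr_tce_args *a)
{
	size_t minsz = ARGS_END(tablefd);

	assert(a != NULL);
	if (a->argsz < minsz)
		return -EINVAL;   /* caller's struct is too short */
	if (a->flags & ~SPAPR_TCE_KNOWN_FLAGS)
		return -EINVAL;   /* unknown flag: refuse rather than silently ignore */
	return 0;
}
```

A larger argsz is accepted so that a newer caller can append fields, while any new meaning for existing bytes has to be announced through a flag that an older kernel would reject.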
> >>
> >>  
> >>>     
> >>>> +
> >>>>  /*
> >>>>   * ioctls for VM fds
> >>>>   */    
> >>> ...    
> >>>> --- a/virt/kvm/vfio.c
> >>>> +++ b/virt/kvm/vfio.c
> >>>> @@ -20,6 +20,10 @@
> >>>>  #include <linux/vfio.h>
> >>>>  #include "vfio.h"
> >>>>  
> >>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>> +#include <asm/kvm_ppc.h>
> >>>> +#endif
> >>>> +
> >>>>  struct kvm_vfio_group {
> >>>>  	struct list_head node;
> >>>>  	struct vfio_group *vfio_group;
> >>>> @@ -76,6 +80,22 @@ static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
> >>>>  	return ret > 0;
> >>>>  }
> >>>>  
> >>>> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> >>>> +{
> >>>> +	int (*fn)(struct vfio_group *);
> >>>> +	int ret = -1;
> >>>> +
> >>>> +	fn = symbol_get(vfio_external_user_iommu_id);
> >>>> +	if (!fn)
> >>>> +		return ret;
> >>>> +
> >>>> +	ret = fn(vfio_group);
> >>>> +
> >>>> +	symbol_put(vfio_external_user_iommu_id);
> >>>> +
> >>>> +	return ret;
> >>>> +}
> >>>> +
> >>>>  /*
> >>>>   * Groups can use the same or different IOMMU domains.  If the same then
> >>>>   * adding a new group may change the coherency of groups we've previously
> >>>> @@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct kvm_device *dev)
> >>>>  	mutex_unlock(&kv->lock);
> >>>>  }
> >>>>  
> >>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>> +static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
> >>>> +		struct vfio_group *vfio_group)
> >>>> +{
> >>>> +	int group_id;
> >>>> +	struct iommu_group *grp;
> >>>> +
> >>>> +	group_id = kvm_vfio_external_user_iommu_id(vfio_group);
> >>>> +	grp = iommu_group_get_by_id(group_id);
> >>>> +	if (grp) {
> >>>> +		kvm_spapr_tce_detach_iommu_group(kvm, grp);
> >>>> +		iommu_group_put(grp);
> >>>> +	}
> >>>> +}
> >>>> +#endif    
> >>>
> >>>
> >>> I wonder if you could use the new vfio group notifier to avoid tainting
> >>> this code with SPAPR_TCE #ifdefs.  Thanks,    
> >>
> >> I cannot see how... The new notifier sets kvm to a group, I need the
> >> opposite - attach a group to kvm and not just to KVM but to a specific KVM
> >> SPAPR TCE fd (which is a child object of KVM and which owns a LIOBN bus id).
> >>
> >> The only way I see how I can avoid tainting this code is adding another
> >> ioctl() to PPC KVM (or its child object - SPAPR TCE object), and I tried
> >> that a few years ago and I was told to add a KVM device or even reuse VFIO
> >> KVM device.
> >>
> >> What am I missing here?  
> > 
> > You would still need a KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE, but the ugly
> > part of encompassing this all in #ifdefs is that we call
> > kvm_spapr_tce_{at,de}tach_iommu_group() directly. The notifier would
> > sort of make it like an arch callback, vfio core would set these
> > attributes and broadcast them to notifier callbacks, on non-spapr-tce
> > platforms nobody would be listening for those notifications.
> > Ultimately I don't know how much cleaner it makes things, but it maybe
> > avoids spapr-tce #ifdefs leaking into every layer of the stack.  Thanks,  
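
The "broadcast to whoever is listening" idea in the paragraph above can be sketched with a tiny user-space notifier chain. This is not the kernel notifier API; struct notifier, notifier_register(), notifier_broadcast() and EVENT_SET_SPAPR_TCE are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct notifier {
	int (*call)(struct notifier *nb, unsigned long event, void *data);
	struct notifier *next;
};

struct notifier_head {
	struct notifier *first;
};

#define EVENT_SET_SPAPR_TCE 1UL   /* hypothetical event number */

/* Add a listener at the head of the chain. */
static void notifier_register(struct notifier_head *h, struct notifier *nb)
{
	assert(h != NULL && nb != NULL);
	nb->next = h->first;
	h->first = nb;
}

/* Call every registered listener; returns how many were called. */
static int notifier_broadcast(struct notifier_head *h, unsigned long event,
			      void *data)
{
	int called = 0;

	for (struct notifier *nb = h->first; nb; nb = nb->next) {
		nb->call(nb, event, data);
		called++;
	}
	return called;
}

/* An arch-specific listener; generic code never references it directly. */
static int spapr_seen;

static int spapr_listener(struct notifier *nb, unsigned long event, void *data)
{
	(void)nb;
	(void)data;
	if (event == EVENT_SET_SPAPR_TCE)
		spapr_seen++;
	return 0;
}
```

The generic side only calls notifier_broadcast(); on a platform where nothing registered a listener the chain is empty and the call does nothing, which is how the #ifdefs would stay out of the generic code.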
> 
> I am failing here.
> 
> The normal workflow:
> - create SPAPR TCE object in KVM, represents 1 LIOBN aka DMA window;
> - attach IOMMU group to SPAPR TCE object, in this step the hardware tables
> (1 or 2 iommu_table objects) are put to the SPAPR TCE's list of attached
> tables; the tables are referenced.
> 
> When reboot happens, the SPAPR TCE object is destroyed and new guest starts
> from the very beginning.
> 
> 
> The task I am solving: dereference iommu_table (hardware table) in 2 cases:
> 1) guest reboot - SPAPR TCE table is destroyed but VFIO KVM device is still
> there with all attached groups;
> 2) VFIO PCI hot unplug - SPAPR TCE table is there but groups need to be
> detached from the VFIO KVM device.
> 
> 
> Tried fixing 2) with the new callbacks:
> 
> - they do not take iommu_group, they take devices - fixed by duplicating
> the vfio_(un)register_notifier API:
>   * vfio_iommu_group_register_notifier()
>   * vfio_iommu_group_unregister_notifier()
> plus kvm wrappers with symbol_get/symbol_put in
> arch/powerpc/kvm/book3s_64_vio.c.
> 
> - the callbacks are registered per IOMMU group; the only place in this
> code which knows about groups is SPAPR TCE driver but attach_group()
> callback is called when IOMMU driver is not yet set ->
> vfio_register_notifier() fails. KVM does not know about groups until
> KVM_DEV_VFIO_GROUP_ADD/KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE so I register
> callback in KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE;
> 
> - the callback does not pass vfio_group pointer, only kvm; so my notifier
> needs to be wrapped into a struct with a group pointer, ok, done;
> 
> - the notifier needs to be unregistered and the wrapper struct from the
> previous step needs to be freed. No nice mechanisms for that - I cannot
> unregister a notifier from a notifier itself. I fixed it by calling
> rcu_sched() from the notifier when KVM==NULL and RCU-scheduled callback
> calls vfio_iommu_group_unregister_notifier().
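
The last two points (wrapping the notifier in a struct that carries the group, and deferring teardown because a notifier cannot unregister itself) can be sketched in user-space C. This is only an illustration: struct group_notifier, group_notify() and group_notifier_reap() are hypothetical names, an integer group id stands in for the group pointer, and a plain "dead" flag stands in for the RCU-scheduled callback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct notifier {
	int (*call)(struct notifier *nb, void *kvm);
};

/* Wrapper: the callback only gets the notifier, so the group rides along. */
struct group_notifier {
	struct notifier nb;
	int group_id;   /* stands in for the group pointer */
	bool dead;      /* marked for deferred teardown */
};

static int last_attached_group = -1;

static int group_notify(struct notifier *nb, void *kvm)
{
	struct group_notifier *gn = container_of(nb, struct group_notifier, nb);

	if (kvm == NULL) {
		/* Cannot unregister or free from inside the callback: defer. */
		gn->dead = true;
		return 0;
	}
	last_attached_group = gn->group_id;
	return 0;
}

static struct group_notifier *group_notifier_new(int group_id)
{
	struct group_notifier *gn = calloc(1, sizeof(*gn));

	if (!gn)
		return NULL;
	gn->nb.call = group_notify;
	gn->group_id = group_id;
	return gn;
}

/* Deferred step, run outside the callback; returns 1 if the wrapper was freed. */
static int group_notifier_reap(struct group_notifier **gnp)
{
	if (!*gnp || !(*gnp)->dead)
		return 0;
	free(*gnp);
	*gnp = NULL;
	return 1;
}
```

The sketch also shows why the reboot case is awkward: something still has to hold a pointer to every wrapper in order to reap it later.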
> 
> Looks quite ugly already but ok.
> 
> 
> Now I am trying to solve 1). I can dereference iommu_table objects but
> registered notifiers remain in memory and they are not freed so each guest
> reboot increases the list length. I do not have a way to access vfio_group
> structs from KVM, there I only have a list of iommu_table structs, each of
> which has a list of iommu_group structs (this is done this way to make real
> mode handlers possible) but there is no way to get to vfio_group struct
> from iommu_group.
> 
> I can duplicate the group list once again, this time it will be a vfio_group
> list attached to the SPAPR TCE object, but this all seems to be way too much,
> does it not?...

Ok, thanks for trying.  It does seem like it gets pretty complicated,
too bad.  Thanks,

Alex

