[PATCH kernel v8 00/10] powerpc/kvm/vfio: Enable in-kernel acceleration
From: Alexey Kardashevskiy @ 2017-03-10  3:53 UTC
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

This is my current queue of patches to add acceleration of TCE
updates in KVM.

This is based on Linus's tree, sha1 c1aa905a304e.

Please comment. Thanks.

Changes:
v8:
* kept fixing oddities with error handling in 10/10

v7:
* added a realmode WARN_ON_ONCE_RM in arch/powerpc/kvm/book3s_64_vio_hv.c

v6:
* reworked the last patch in terms of error handling and parameter checking

v5:
* replaced "KVM: PPC: Separate TCE validation from update" with
"KVM: PPC: iommu: Unify TCE checking"
* changed already reviewed "powerpc/iommu/vfio_spapr_tce: Cleanup iommu_table disposal"
* reworked "KVM: PPC: VFIO: Add in-kernel acceleration for VFIO"
* more details in individual commit logs

v4:
* addressed comments from v3
* updated subject lines with correct component names
* regrouped the patchset in order:
	- powerpc fixes;
	- vfio_spapr_tce driver fixes;
	- KVM/PPC fixes;
	- KVM+PPC+VFIO;
* everything except the last 2 patches has "Reviewed-by: David"

v3:
* there was no full repost; only the last patch was posted

v2:
* 11/11 reworked to use new notifiers; it is rather an RFC as it still has
an issue;
* got 09/11, 10/11 to use notifiers in 11/11;
* added rb: David to most of the patches and added a comment in 05/11.

Alexey Kardashevskiy (10):
  powerpc/mmu: Add real mode support for IOMMU preregistered memory
  powerpc/powernv/iommu: Add real mode version of
    iommu_table_ops::exchange()
  powerpc/iommu/vfio_spapr_tce: Cleanup iommu_table disposal
  powerpc/vfio_spapr_tce: Add reference counting to iommu_table
  KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number
  KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently
  KVM: PPC: Pass kvm* to kvmppc_find_table()
  KVM: PPC: Use preregistered memory API to access TCE list
  KVM: PPC: iommu: Unify TCE checking
  KVM: PPC: VFIO: Add in-kernel acceleration for VFIO

 Documentation/virtual/kvm/devices/vfio.txt |  22 +-
 arch/powerpc/include/asm/iommu.h           |  32 ++-
 arch/powerpc/include/asm/kvm_host.h        |   8 +
 arch/powerpc/include/asm/kvm_ppc.h         |  12 +-
 arch/powerpc/include/asm/mmu_context.h     |   4 +
 include/uapi/linux/kvm.h                   |   9 +
 arch/powerpc/kernel/iommu.c                |  86 +++++---
 arch/powerpc/kvm/book3s_64_vio.c           | 330 ++++++++++++++++++++++++++++-
 arch/powerpc/kvm/book3s_64_vio_hv.c        | 303 ++++++++++++++++++++++----
 arch/powerpc/kvm/powerpc.c                 |   2 +
 arch/powerpc/mm/mmu_context_iommu.c        |  39 ++++
 arch/powerpc/platforms/powernv/pci-ioda.c  |  46 ++--
 arch/powerpc/platforms/powernv/pci.c       |   1 +
 arch/powerpc/platforms/pseries/iommu.c     |   3 +-
 arch/powerpc/platforms/pseries/vio.c       |   2 +-
 drivers/vfio/vfio_iommu_spapr_tce.c        |   2 +-
 virt/kvm/vfio.c                            |  60 ++++++
 arch/powerpc/kvm/Kconfig                   |   1 +
 18 files changed, 855 insertions(+), 107 deletions(-)

-- 
2.11.0

[PATCH kernel v8 01/10] powerpc/mmu: Add real mode support for IOMMU preregistered memory
From: Alexey Kardashevskiy @ 2017-03-10  3:53 UTC
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

This makes mm_iommu_lookup() work in realmode by replacing
list_for_each_entry_rcu() (whose debugging checks can fail in real mode)
with list_for_each_entry_lockless().

This adds a realmode version of mm_iommu_ua_to_hpa() which performs
an explicit vmalloc'd-to-linear address conversion.
Unlike mm_iommu_ua_to_hpa(), mm_iommu_ua_to_hpa_rm() can fail.

This changes mm_iommu_preregistered() to receive @mm since in real mode
@current does not always hold a correct pointer.

This adds a realmode version of mm_iommu_lookup() which receives @mm
(for the same reason as mm_iommu_preregistered()) and uses the lockless
variant of list_for_each_entry_rcu().
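
As a hedged illustration only (not part of this patch; the caller name
and the -EINVAL choice are hypothetical), a real-mode handler could
combine the two new helpers like this:

	/* Hypothetical real-mode caller: translate a preregistered
	 * userspace address to a host physical address. */
	static long rm_ua_to_hpa_sketch(struct mm_struct *mm,
			unsigned long ua, unsigned long *hpa)
	{
		struct mm_iommu_table_group_mem_t *mem;

		mem = mm_iommu_lookup_rm(mm, ua, 1UL << PAGE_SHIFT);
		if (!mem)
			return -EINVAL;	/* ua was not preregistered */

		/* Unlike mm_iommu_ua_to_hpa(), this can fail */
		return mm_iommu_ua_to_hpa_rm(mem, ua, hpa);
	}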

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 arch/powerpc/include/asm/mmu_context.h |  4 ++++
 arch/powerpc/mm/mmu_context_iommu.c    | 39 ++++++++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index b9e3f0aca261..c70c8272523d 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -29,10 +29,14 @@ extern void mm_iommu_init(struct mm_struct *mm);
 extern void mm_iommu_cleanup(struct mm_struct *mm);
 extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
 		unsigned long ua, unsigned long size);
+extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(
+		struct mm_struct *mm, unsigned long ua, unsigned long size);
 extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
 		unsigned long ua, unsigned long entries);
 extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
 		unsigned long ua, unsigned long *hpa);
+extern long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
+		unsigned long ua, unsigned long *hpa);
 extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem);
 extern void mm_iommu_mapped_dec(struct mm_iommu_table_group_mem_t *mem);
 #endif
diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index 497130c5c742..fc67bd766eaf 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -314,6 +314,25 @@ struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
 }
 EXPORT_SYMBOL_GPL(mm_iommu_lookup);
 
+struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(struct mm_struct *mm,
+		unsigned long ua, unsigned long size)
+{
+	struct mm_iommu_table_group_mem_t *mem, *ret = NULL;
+
+	list_for_each_entry_lockless(mem, &mm->context.iommu_group_mem_list,
+			next) {
+		if ((mem->ua <= ua) &&
+				(ua + size <= mem->ua +
+				 (mem->entries << PAGE_SHIFT))) {
+			ret = mem;
+			break;
+		}
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(mm_iommu_lookup_rm);
+
 struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
 		unsigned long ua, unsigned long entries)
 {
@@ -345,6 +364,26 @@ long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
 }
 EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa);
 
+long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
+		unsigned long ua, unsigned long *hpa)
+{
+	const long entry = (ua - mem->ua) >> PAGE_SHIFT;
+	void *va = &mem->hpas[entry];
+	unsigned long *pa;
+
+	if (entry >= mem->entries)
+		return -EFAULT;
+
+	pa = (void *) vmalloc_to_phys(va);
+	if (!pa)
+		return -EFAULT;
+
+	*hpa = *pa | (ua & ~PAGE_MASK);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa_rm);
+
 long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem)
 {
 	if (atomic64_inc_not_zero(&mem->mapped))
-- 
2.11.0

[PATCH kernel v8 02/10] powerpc/powernv/iommu: Add real mode version of iommu_table_ops::exchange()
From: Alexey Kardashevskiy @ 2017-03-10  3:53 UTC
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

In real mode, TCE tables are invalidated using special
cache-inhibited store instructions which are not available in
virtual mode.

This defines and implements an exchange_rm() callback. It does not
define set_rm/clear_rm/flush_rm callbacks as there are no users for
those: exchange/exchange_rm are only to be used by KVM for VFIO.

The exchange_rm callback is defined for the IODA1/IODA2 powernv platforms.

This replaces list_for_each_entry_rcu with its lockless version as
from now on pnv_pci_ioda2_tce_invalidate() can be called in
real mode too.
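
For illustration only (the caller below is hypothetical and not part of
this patch): iommu_tce_xchg_rm() treats @hpa and @direction as in/out
parameters, returning the values of the entry it replaced:

	/* Hypothetical real-mode caller: install a new TCE and receive
	 * the replaced mapping back through the same parameters. */
	static long rm_exchange_sketch(struct iommu_table *tbl,
			unsigned long entry, unsigned long new_hpa,
			enum dma_data_direction new_dir)
	{
		unsigned long hpa = new_hpa;
		enum dma_data_direction dir = new_dir;
		long ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);

		/* On failure (realmode_pfn_to_page() returned NULL),
		 * the old entry has been restored and -EFAULT returned. */
		if (ret)
			return ret;

		/* hpa/dir now describe the mapping that was replaced */
		return 0;
	}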

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 arch/powerpc/include/asm/iommu.h          |  7 +++++++
 arch/powerpc/kernel/iommu.c               | 23 +++++++++++++++++++++++
 arch/powerpc/platforms/powernv/pci-ioda.c | 26 +++++++++++++++++++++++++-
 3 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 2c1d50792944..4554699aec02 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -64,6 +64,11 @@ struct iommu_table_ops {
 			long index,
 			unsigned long *hpa,
 			enum dma_data_direction *direction);
+	/* Real mode */
+	int (*exchange_rm)(struct iommu_table *tbl,
+			long index,
+			unsigned long *hpa,
+			enum dma_data_direction *direction);
 #endif
 	void (*clear)(struct iommu_table *tbl,
 			long index, long npages);
@@ -208,6 +213,8 @@ extern void iommu_del_device(struct device *dev);
 extern int __init tce_iommu_bus_notifier_init(void);
 extern long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
 		unsigned long *hpa, enum dma_data_direction *direction);
+extern long iommu_tce_xchg_rm(struct iommu_table *tbl, unsigned long entry,
+		unsigned long *hpa, enum dma_data_direction *direction);
 #else
 static inline void iommu_register_group(struct iommu_table_group *table_group,
 					int pci_domain_number,
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 5f202a566ec5..9bace5df05d5 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1004,6 +1004,29 @@ long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
 }
 EXPORT_SYMBOL_GPL(iommu_tce_xchg);
 
+long iommu_tce_xchg_rm(struct iommu_table *tbl, unsigned long entry,
+		unsigned long *hpa, enum dma_data_direction *direction)
+{
+	long ret;
+
+	ret = tbl->it_ops->exchange_rm(tbl, entry, hpa, direction);
+
+	if (!ret && ((*direction == DMA_FROM_DEVICE) ||
+			(*direction == DMA_BIDIRECTIONAL))) {
+		struct page *pg = realmode_pfn_to_page(*hpa >> PAGE_SHIFT);
+
+		if (likely(pg)) {
+			SetPageDirty(pg);
+		} else {
+			tbl->it_ops->exchange_rm(tbl, entry, hpa, direction);
+			ret = -EFAULT;
+		}
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_tce_xchg_rm);
+
 int iommu_take_ownership(struct iommu_table *tbl)
 {
 	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index ec58b7f6b6cf..69c40b43daa3 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1861,6 +1861,17 @@ static int pnv_ioda1_tce_xchg(struct iommu_table *tbl, long index,
 
 	return ret;
 }
+
+static int pnv_ioda1_tce_xchg_rm(struct iommu_table *tbl, long index,
+		unsigned long *hpa, enum dma_data_direction *direction)
+{
+	long ret = pnv_tce_xchg(tbl, index, hpa, direction);
+
+	if (!ret)
+		pnv_pci_p7ioc_tce_invalidate(tbl, index, 1, true);
+
+	return ret;
+}
 #endif
 
 static void pnv_ioda1_tce_free(struct iommu_table *tbl, long index,
@@ -1875,6 +1886,7 @@ static struct iommu_table_ops pnv_ioda1_iommu_ops = {
 	.set = pnv_ioda1_tce_build,
 #ifdef CONFIG_IOMMU_API
 	.exchange = pnv_ioda1_tce_xchg,
+	.exchange_rm = pnv_ioda1_tce_xchg_rm,
 #endif
 	.clear = pnv_ioda1_tce_free,
 	.get = pnv_tce_get,
@@ -1949,7 +1961,7 @@ static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
 {
 	struct iommu_table_group_link *tgl;
 
-	list_for_each_entry_rcu(tgl, &tbl->it_group_list, next) {
+	list_for_each_entry_lockless(tgl, &tbl->it_group_list, next) {
 		struct pnv_ioda_pe *pe = container_of(tgl->table_group,
 				struct pnv_ioda_pe, table_group);
 		struct pnv_phb *phb = pe->phb;
@@ -2005,6 +2017,17 @@ static int pnv_ioda2_tce_xchg(struct iommu_table *tbl, long index,
 
 	return ret;
 }
+
+static int pnv_ioda2_tce_xchg_rm(struct iommu_table *tbl, long index,
+		unsigned long *hpa, enum dma_data_direction *direction)
+{
+	long ret = pnv_tce_xchg(tbl, index, hpa, direction);
+
+	if (!ret)
+		pnv_pci_ioda2_tce_invalidate(tbl, index, 1, true);
+
+	return ret;
+}
 #endif
 
 static void pnv_ioda2_tce_free(struct iommu_table *tbl, long index,
@@ -2025,6 +2048,7 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = {
 	.set = pnv_ioda2_tce_build,
 #ifdef CONFIG_IOMMU_API
 	.exchange = pnv_ioda2_tce_xchg,
+	.exchange_rm = pnv_ioda2_tce_xchg_rm,
 #endif
 	.clear = pnv_ioda2_tce_free,
 	.get = pnv_tce_get,
-- 
2.11.0

[PATCH kernel v8 03/10] powerpc/iommu/vfio_spapr_tce: Cleanup iommu_table disposal
From: Alexey Kardashevskiy @ 2017-03-10  3:53 UTC
  To: linuxppc-dev
  Cc: kvm, Alexey Kardashevskiy, kvm-ppc, Alex Williamson,
	Paul Mackerras, David Gibson

At the moment an iommu_table can be disposed of either by calling
iommu_free_table() directly or via it_ops::free(); the only implementation
of free() is in IODA2 - pnv_ioda2_table_free() - and it calls
iommu_free_table() anyway.

As we are going to have reference counting on tables, we need a unified
way of disposing of tables.

This moves the it_ops::free() call into iommu_free_table() and makes use
of the latter. The free() callback now handles only platform-specific
data.

Since from now on iommu_free_table() calls it_ops->free(), it_ops needs
to be initialized before iommu_free_table() can be called, so this moves
that initialization earlier in pnv_pci_ioda2_create_table().

This should cause no behavioral change.
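
A hedged sketch (parameter types are assumed, not taken from this
series) of the ordering this patch establishes in
pnv_pci_ioda2_create_table(): it_ops must be valid before any error
path can reach iommu_free_table():

	static long create_table_sketch(int nid, __u64 bus_offset,
			unsigned int page_shift, __u64 window_size,
			unsigned int levels, struct iommu_table **ptbl)
	{
		struct iommu_table *tbl = pnv_pci_table_alloc(nid);
		long ret;

		if (!tbl)
			return -ENOMEM;

		tbl->it_ops = &pnv_ioda2_iommu_ops;	/* set before any free */

		ret = pnv_pci_ioda2_table_alloc_pages(nid, bus_offset,
				page_shift, window_size, levels, tbl);
		if (ret) {
			iommu_free_table(tbl, "");	/* safely calls it_ops->free() */
			return ret;
		}

		*ptbl = tbl;
		return 0;
	}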

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
Changes:
v5:
* moved "tbl->it_ops = &pnv_ioda2_iommu_ops" earlier and updated
the commit log
---
 arch/powerpc/kernel/iommu.c               |  4 ++++
 arch/powerpc/platforms/powernv/pci-ioda.c | 10 ++++------
 drivers/vfio/vfio_iommu_spapr_tce.c       |  2 +-
 3 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 9bace5df05d5..bc142d87130f 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -719,6 +719,9 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 	if (!tbl)
 		return;
 
+	if (tbl->it_ops->free)
+		tbl->it_ops->free(tbl);
+
 	if (!tbl->it_map) {
 		kfree(tbl);
 		return;
@@ -745,6 +748,7 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 	/* free table */
 	kfree(tbl);
 }
+EXPORT_SYMBOL_GPL(iommu_free_table);
 
 /* Creates TCEs for a user provided buffer.  The user buffer must be
  * contiguous real kernel storage (not vmalloc).  The address passed here
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 69c40b43daa3..7916d0cb05fe 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1425,7 +1425,6 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe
 		iommu_group_put(pe->table_group.group);
 		BUG_ON(pe->table_group.group);
 	}
-	pnv_pci_ioda2_table_free_pages(tbl);
 	iommu_free_table(tbl, of_node_full_name(dev->dev.of_node));
 }
 
@@ -2041,7 +2040,6 @@ static void pnv_ioda2_tce_free(struct iommu_table *tbl, long index,
 static void pnv_ioda2_table_free(struct iommu_table *tbl)
 {
 	pnv_pci_ioda2_table_free_pages(tbl);
-	iommu_free_table(tbl, "pnv");
 }
 
 static struct iommu_table_ops pnv_ioda2_iommu_ops = {
@@ -2318,6 +2316,8 @@ static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
 	if (!tbl)
 		return -ENOMEM;
 
+	tbl->it_ops = &pnv_ioda2_iommu_ops;
+
 	ret = pnv_pci_ioda2_table_alloc_pages(nid,
 			bus_offset, page_shift, window_size,
 			levels, tbl);
@@ -2326,8 +2326,6 @@ static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
 		return ret;
 	}
 
-	tbl->it_ops = &pnv_ioda2_iommu_ops;
-
 	*ptbl = tbl;
 
 	return 0;
@@ -2368,7 +2366,7 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
 	if (rc) {
 		pe_err(pe, "Failed to configure 32-bit TCE table, err %ld\n",
 				rc);
-		pnv_ioda2_table_free(tbl);
+		iommu_free_table(tbl, "");
 		return rc;
 	}
 
@@ -2456,7 +2454,7 @@ static void pnv_ioda2_take_ownership(struct iommu_table_group *table_group)
 	pnv_pci_ioda2_unset_window(&pe->table_group, 0);
 	if (pe->pbus)
 		pnv_ioda_setup_bus_dma(pe, pe->pbus, false);
-	pnv_ioda2_table_free(tbl);
+	iommu_free_table(tbl, "pnv");
 }
 
 static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group)
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index cf3de91fbfe7..fbec7348a7e5 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -680,7 +680,7 @@ static void tce_iommu_free_table(struct tce_container *container,
 	unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT;
 
 	tce_iommu_userspace_view_free(tbl, container->mm);
-	tbl->it_ops->free(tbl);
+	iommu_free_table(tbl, "");
 	decrement_locked_vm(container->mm, pages);
 }
 
-- 
2.11.0

[PATCH kernel v8 04/10] powerpc/vfio_spapr_tce: Add reference counting to iommu_table
From: Alexey Kardashevskiy @ 2017-03-10  3:53 UTC
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

So far iommu_table objects were only used in virtual mode and had
a single owner. We are going to change this by implementing in-kernel
acceleration of DMA mapping requests. The proposed acceleration
will handle requests in real mode and KVM will keep references to tables.

This adds a kref to iommu_table and defines new helpers to update it.
This replaces external iommu_free_table() calls with iommu_table_put()
and turns the free routine into a static kref release callback.
iommu_table_get() is not used in this patch but it will be in
the following patch.

Since this touches the prototypes anyway, this also removes the
@node_name parameter: it has never been really useful on powernv, and
carrying it through the pseries platform code just for iommu_free_table()
is equally pointless.

This should cause no behavioral change.
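
A hedged sketch (function names hypothetical) of how a second owner
such as KVM would hold a reference under the new scheme:

	/* Hypothetical second owner taking a reference to a table. */
	static void kvm_grab_table_sketch(struct iommu_table *tbl)
	{
		iommu_table_get(tbl);	/* kref_get(): two owners now */
	}

	static void kvm_release_table_sketch(struct iommu_table *tbl)
	{
		/* kref_put(): the table is freed only when the last
		 * owner drops its reference. */
		iommu_table_put(tbl);
	}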

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 arch/powerpc/include/asm/iommu.h          |  5 +++--
 arch/powerpc/kernel/iommu.c               | 24 +++++++++++++++++++-----
 arch/powerpc/platforms/powernv/pci-ioda.c | 14 +++++++-------
 arch/powerpc/platforms/powernv/pci.c      |  1 +
 arch/powerpc/platforms/pseries/iommu.c    |  3 ++-
 arch/powerpc/platforms/pseries/vio.c      |  2 +-
 drivers/vfio/vfio_iommu_spapr_tce.c       |  2 +-
 7 files changed, 34 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 4554699aec02..82e77ebf85f4 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -119,6 +119,7 @@ struct iommu_table {
 	struct list_head it_group_list;/* List of iommu_table_group_link */
 	unsigned long *it_userspace; /* userspace view of the table */
 	struct iommu_table_ops *it_ops;
+	struct kref    it_kref;
 };
 
 #define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \
@@ -151,8 +152,8 @@ static inline void *get_iommu_table_base(struct device *dev)
 
 extern int dma_iommu_dma_supported(struct device *dev, u64 mask);
 
-/* Frees table for an individual device node */
-extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
+extern void iommu_table_get(struct iommu_table *tbl);
+extern void iommu_table_put(struct iommu_table *tbl);
 
 /* Initializes an iommu_table based in values set in the passed-in
  * structure
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index bc142d87130f..d02b8d22fb50 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -711,13 +711,13 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
 	return tbl;
 }
 
-void iommu_free_table(struct iommu_table *tbl, const char *node_name)
+static void iommu_table_free(struct kref *kref)
 {
 	unsigned long bitmap_sz;
 	unsigned int order;
+	struct iommu_table *tbl;
 
-	if (!tbl)
-		return;
+	tbl = container_of(kref, struct iommu_table, it_kref);
 
 	if (tbl->it_ops->free)
 		tbl->it_ops->free(tbl);
@@ -736,7 +736,7 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 
 	/* verify that table contains no entries */
 	if (!bitmap_empty(tbl->it_map, tbl->it_size))
-		pr_warn("%s: Unexpected TCEs for %s\n", __func__, node_name);
+		pr_warn("%s: Unexpected TCEs\n", __func__);
 
 	/* calculate bitmap size in bytes */
 	bitmap_sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
@@ -748,7 +748,21 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 	/* free table */
 	kfree(tbl);
 }
-EXPORT_SYMBOL_GPL(iommu_free_table);
+
+void iommu_table_get(struct iommu_table *tbl)
+{
+	kref_get(&tbl->it_kref);
+}
+EXPORT_SYMBOL_GPL(iommu_table_get);
+
+void iommu_table_put(struct iommu_table *tbl)
+{
+	if (!tbl)
+		return;
+
+	kref_put(&tbl->it_kref, iommu_table_free);
+}
+EXPORT_SYMBOL_GPL(iommu_table_put);
 
 /* Creates TCEs for a user provided buffer.  The user buffer must be
  * contiguous real kernel storage (not vmalloc).  The address passed here
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 7916d0cb05fe..ec3e565de511 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1425,7 +1425,7 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe
 		iommu_group_put(pe->table_group.group);
 		BUG_ON(pe->table_group.group);
 	}
-	iommu_free_table(tbl, of_node_full_name(dev->dev.of_node));
+	iommu_table_put(tbl);
 }
 
 static void pnv_ioda_release_vf_PE(struct pci_dev *pdev)
@@ -2226,7 +2226,7 @@ static void pnv_pci_ioda1_setup_dma_pe(struct pnv_phb *phb,
 		__free_pages(tce_mem, get_order(tce32_segsz * segs));
 	if (tbl) {
 		pnv_pci_unlink_table_and_group(tbl, &pe->table_group);
-		iommu_free_table(tbl, "pnv");
+		iommu_table_put(tbl);
 	}
 }
 
@@ -2322,7 +2322,7 @@ static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
 			bus_offset, page_shift, window_size,
 			levels, tbl);
 	if (ret) {
-		iommu_free_table(tbl, "pnv");
+		iommu_table_put(tbl);
 		return ret;
 	}
 
@@ -2366,7 +2366,7 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
 	if (rc) {
 		pe_err(pe, "Failed to configure 32-bit TCE table, err %ld\n",
 				rc);
-		iommu_free_table(tbl, "");
+		iommu_table_put(tbl);
 		return rc;
 	}
 
@@ -2454,7 +2454,7 @@ static void pnv_ioda2_take_ownership(struct iommu_table_group *table_group)
 	pnv_pci_ioda2_unset_window(&pe->table_group, 0);
 	if (pe->pbus)
 		pnv_ioda_setup_bus_dma(pe, pe->pbus, false);
-	iommu_free_table(tbl, "pnv");
+	iommu_table_put(tbl);
 }
 
 static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group)
@@ -3427,7 +3427,7 @@ static void pnv_pci_ioda1_release_pe_dma(struct pnv_ioda_pe *pe)
 	}
 
 	free_pages(tbl->it_base, get_order(tbl->it_size << 3));
-	iommu_free_table(tbl, "pnv");
+	iommu_table_put(tbl);
 }
 
 static void pnv_pci_ioda2_release_pe_dma(struct pnv_ioda_pe *pe)
@@ -3454,7 +3454,7 @@ static void pnv_pci_ioda2_release_pe_dma(struct pnv_ioda_pe *pe)
 	}
 
 	pnv_pci_ioda2_table_free_pages(tbl);
-	iommu_free_table(tbl, "pnv");
+	iommu_table_put(tbl);
 }
 
 static void pnv_ioda_free_pe_seg(struct pnv_ioda_pe *pe,
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index a43f22dc069e..9b2bdcad51ba 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -767,6 +767,7 @@ struct iommu_table *pnv_pci_table_alloc(int nid)
 
 	tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, nid);
 	INIT_LIST_HEAD_RCU(&tbl->it_group_list);
+	kref_init(&tbl->it_kref);
 
 	return tbl;
 }
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 0a733ddae926..a713e20311b8 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -74,6 +74,7 @@ static struct iommu_table_group *iommu_pseries_alloc_group(int node)
 		goto fail_exit;
 
 	INIT_LIST_HEAD_RCU(&tbl->it_group_list);
+	kref_init(&tbl->it_kref);
 	tgl->table_group = table_group;
 	list_add_rcu(&tgl->next, &tbl->it_group_list);
 
@@ -115,7 +116,7 @@ static void iommu_pseries_free_group(struct iommu_table_group *table_group,
 		BUG_ON(table_group->group);
 	}
 #endif
-	iommu_free_table(tbl, node_name);
+	iommu_table_put(tbl);
 
 	kfree(table_group);
 }
diff --git a/arch/powerpc/platforms/pseries/vio.c b/arch/powerpc/platforms/pseries/vio.c
index 720493932486..744d639da92c 100644
--- a/arch/powerpc/platforms/pseries/vio.c
+++ b/arch/powerpc/platforms/pseries/vio.c
@@ -1318,7 +1318,7 @@ static void vio_dev_release(struct device *dev)
 	struct iommu_table *tbl = get_iommu_table_base(dev);
 
 	if (tbl)
-		iommu_free_table(tbl, of_node_full_name(dev->of_node));
+		iommu_table_put(tbl);
 	of_node_put(dev->of_node);
 	kfree(to_vio_dev(dev));
 }
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index fbec7348a7e5..4f6ca9d80ead 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -680,7 +680,7 @@ static void tce_iommu_free_table(struct tce_container *container,
 	unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT;
 
 	tce_iommu_userspace_view_free(tbl, container->mm);
-	iommu_free_table(tbl, "");
+	iommu_table_put(tbl);
 	decrement_locked_vm(container->mm, pages);
 }
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH kernel v8 04/10] powerpc/vfio_spapr_tce: Add reference counting to iommu_table
@ 2017-03-10  3:53   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 53+ messages in thread
From: Alexey Kardashevskiy @ 2017-03-10  3:53 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

So far iommu_table obejcts were only used in virtual mode and had
a single owner. We are going to change this by implementing in-kernel
acceleration of DMA mapping requests. The proposed acceleration
will handle requests in real mode and KVM will keep references to tables.

This adds a kref to iommu_table and defines new helpers to update it.
This replaces iommu_free_table() with iommu_table_put() and makes
iommu_free_table() static. iommu_table_get() is not used in this patch
but it will be in the following patch.

Since this touches prototypes, this also removes @node_name parameter as
it has never been really useful on powernv and carrying it for
the pseries platform code to iommu_free_table() seems to be quite
useless as well.

This should cause no behavioral change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 arch/powerpc/include/asm/iommu.h          |  5 +++--
 arch/powerpc/kernel/iommu.c               | 24 +++++++++++++++++++-----
 arch/powerpc/platforms/powernv/pci-ioda.c | 14 +++++++-------
 arch/powerpc/platforms/powernv/pci.c      |  1 +
 arch/powerpc/platforms/pseries/iommu.c    |  3 ++-
 arch/powerpc/platforms/pseries/vio.c      |  2 +-
 drivers/vfio/vfio_iommu_spapr_tce.c       |  2 +-
 7 files changed, 34 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 4554699aec02..82e77ebf85f4 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -119,6 +119,7 @@ struct iommu_table {
 	struct list_head it_group_list;/* List of iommu_table_group_link */
 	unsigned long *it_userspace; /* userspace view of the table */
 	struct iommu_table_ops *it_ops;
+	struct kref    it_kref;
 };
 
 #define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \
@@ -151,8 +152,8 @@ static inline void *get_iommu_table_base(struct device *dev)
 
 extern int dma_iommu_dma_supported(struct device *dev, u64 mask);
 
-/* Frees table for an individual device node */
-extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
+extern void iommu_table_get(struct iommu_table *tbl);
+extern void iommu_table_put(struct iommu_table *tbl);
 
 /* Initializes an iommu_table based in values set in the passed-in
  * structure
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index bc142d87130f..d02b8d22fb50 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -711,13 +711,13 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
 	return tbl;
 }
 
-void iommu_free_table(struct iommu_table *tbl, const char *node_name)
+static void iommu_table_free(struct kref *kref)
 {
 	unsigned long bitmap_sz;
 	unsigned int order;
+	struct iommu_table *tbl;
 
-	if (!tbl)
-		return;
+	tbl = container_of(kref, struct iommu_table, it_kref);
 
 	if (tbl->it_ops->free)
 		tbl->it_ops->free(tbl);
@@ -736,7 +736,7 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 
 	/* verify that table contains no entries */
 	if (!bitmap_empty(tbl->it_map, tbl->it_size))
-		pr_warn("%s: Unexpected TCEs for %s\n", __func__, node_name);
+		pr_warn("%s: Unexpected TCEs\n", __func__);
 
 	/* calculate bitmap size in bytes */
 	bitmap_sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
@@ -748,7 +748,21 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 	/* free table */
 	kfree(tbl);
 }
-EXPORT_SYMBOL_GPL(iommu_free_table);
+
+void iommu_table_get(struct iommu_table *tbl)
+{
+	kref_get(&tbl->it_kref);
+}
+EXPORT_SYMBOL_GPL(iommu_table_get);
+
+void iommu_table_put(struct iommu_table *tbl)
+{
+	if (!tbl)
+		return;
+
+	kref_put(&tbl->it_kref, iommu_table_free);
+}
+EXPORT_SYMBOL_GPL(iommu_table_put);
 
 /* Creates TCEs for a user provided buffer.  The user buffer must be
  * contiguous real kernel storage (not vmalloc).  The address passed here
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 7916d0cb05fe..ec3e565de511 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1425,7 +1425,7 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe
 		iommu_group_put(pe->table_group.group);
 		BUG_ON(pe->table_group.group);
 	}
-	iommu_free_table(tbl, of_node_full_name(dev->dev.of_node));
+	iommu_table_put(tbl);
 }
 
 static void pnv_ioda_release_vf_PE(struct pci_dev *pdev)
@@ -2226,7 +2226,7 @@ static void pnv_pci_ioda1_setup_dma_pe(struct pnv_phb *phb,
 		__free_pages(tce_mem, get_order(tce32_segsz * segs));
 	if (tbl) {
 		pnv_pci_unlink_table_and_group(tbl, &pe->table_group);
-		iommu_free_table(tbl, "pnv");
+		iommu_table_put(tbl);
 	}
 }
 
@@ -2322,7 +2322,7 @@ static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
 			bus_offset, page_shift, window_size,
 			levels, tbl);
 	if (ret) {
-		iommu_free_table(tbl, "pnv");
+		iommu_table_put(tbl);
 		return ret;
 	}
 
@@ -2366,7 +2366,7 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
 	if (rc) {
 		pe_err(pe, "Failed to configure 32-bit TCE table, err %ld\n",
 				rc);
-		iommu_free_table(tbl, "");
+		iommu_table_put(tbl);
 		return rc;
 	}
 
@@ -2454,7 +2454,7 @@ static void pnv_ioda2_take_ownership(struct iommu_table_group *table_group)
 	pnv_pci_ioda2_unset_window(&pe->table_group, 0);
 	if (pe->pbus)
 		pnv_ioda_setup_bus_dma(pe, pe->pbus, false);
-	iommu_free_table(tbl, "pnv");
+	iommu_table_put(tbl);
 }
 
 static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group)
@@ -3427,7 +3427,7 @@ static void pnv_pci_ioda1_release_pe_dma(struct pnv_ioda_pe *pe)
 	}
 
 	free_pages(tbl->it_base, get_order(tbl->it_size << 3));
-	iommu_free_table(tbl, "pnv");
+	iommu_table_put(tbl);
 }
 
 static void pnv_pci_ioda2_release_pe_dma(struct pnv_ioda_pe *pe)
@@ -3454,7 +3454,7 @@ static void pnv_pci_ioda2_release_pe_dma(struct pnv_ioda_pe *pe)
 	}
 
 	pnv_pci_ioda2_table_free_pages(tbl);
-	iommu_free_table(tbl, "pnv");
+	iommu_table_put(tbl);
 }
 
 static void pnv_ioda_free_pe_seg(struct pnv_ioda_pe *pe,
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index a43f22dc069e..9b2bdcad51ba 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -767,6 +767,7 @@ struct iommu_table *pnv_pci_table_alloc(int nid)
 
 	tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, nid);
 	INIT_LIST_HEAD_RCU(&tbl->it_group_list);
+	kref_init(&tbl->it_kref);
 
 	return tbl;
 }
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 0a733ddae926..a713e20311b8 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -74,6 +74,7 @@ static struct iommu_table_group *iommu_pseries_alloc_group(int node)
 		goto fail_exit;
 
 	INIT_LIST_HEAD_RCU(&tbl->it_group_list);
+	kref_init(&tbl->it_kref);
 	tgl->table_group = table_group;
 	list_add_rcu(&tgl->next, &tbl->it_group_list);
 
@@ -115,7 +116,7 @@ static void iommu_pseries_free_group(struct iommu_table_group *table_group,
 		BUG_ON(table_group->group);
 	}
 #endif
-	iommu_free_table(tbl, node_name);
+	iommu_table_put(tbl);
 
 	kfree(table_group);
 }
diff --git a/arch/powerpc/platforms/pseries/vio.c b/arch/powerpc/platforms/pseries/vio.c
index 720493932486..744d639da92c 100644
--- a/arch/powerpc/platforms/pseries/vio.c
+++ b/arch/powerpc/platforms/pseries/vio.c
@@ -1318,7 +1318,7 @@ static void vio_dev_release(struct device *dev)
 	struct iommu_table *tbl = get_iommu_table_base(dev);
 
 	if (tbl)
-		iommu_free_table(tbl, of_node_full_name(dev->of_node));
+		iommu_table_put(tbl);
 	of_node_put(dev->of_node);
 	kfree(to_vio_dev(dev));
 }
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index fbec7348a7e5..4f6ca9d80ead 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -680,7 +680,7 @@ static void tce_iommu_free_table(struct tce_container *container,
 	unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT;
 
 	tce_iommu_userspace_view_free(tbl, container->mm);
-	iommu_free_table(tbl, "");
+	iommu_table_put(tbl);
 	decrement_locked_vm(container->mm, pages);
 }
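
A minimal sketch of the get/put pair used above (the assumed shape;
the actual definitions are added by the earlier "powerpc/vfio_spapr_tce:
Add reference counting to iommu_table" patch in this series):

	struct iommu_table *iommu_table_get(struct iommu_table *tbl)
	{
		/* paired with the kref_init() calls added above */
		kref_get(&tbl->it_kref);
		return tbl;
	}

	void iommu_table_put(struct iommu_table *tbl)
	{
		if (!tbl)
			return;
		/* frees the table once the last reference is dropped */
		kref_put(&tbl->it_kref, iommu_table_free);
	}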
 
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH kernel v8 05/10] KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number
  2017-03-10  3:53 ` Alexey Kardashevskiy
@ 2017-03-10  3:53   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 53+ messages in thread
From: Alexey Kardashevskiy @ 2017-03-10  3:53 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

This adds a capability number for in-kernel support for VFIO on
the sPAPR platform.

The capability tells user space whether the in-kernel handlers of
H_PUT_TCE can handle VFIO-targeted requests. If they cannot, user space
must not allocate a TCE table in the host kernel via the
KVM_CREATE_SPAPR_TCE ioctl: with such a table present, TCE requests
would never be passed on to user space, and passing them on is
the desired behaviour in that situation.
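
For illustration only (not part of this patch), user space would probe
the capability with the standard KVM_CHECK_EXTENSION ioctl before
relying on in-kernel VFIO TCE handling; a minimal sketch:

	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/* Returns nonzero if in-kernel handlers can serve VFIO TCEs. */
	static int have_spapr_tce_vfio(int kvm_fd)
	{
		return ioctl(kvm_fd, KVM_CHECK_EXTENSION,
				KVM_CAP_SPAPR_TCE_VFIO) > 0;
	}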

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 include/uapi/linux/kvm.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index f51d5082a377..f5a52ffb6b58 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -883,6 +883,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_PPC_MMU_RADIX 134
 #define KVM_CAP_PPC_MMU_HASH_V3 135
 #define KVM_CAP_IMMEDIATE_EXIT 136
+#define KVM_CAP_SPAPR_TCE_VFIO 137
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH kernel v8 05/10] KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number
@ 2017-03-10  3:53   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 53+ messages in thread
From: Alexey Kardashevskiy @ 2017-03-10  3:53 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

This adds a capability number for in-kernel support for VFIO on
the sPAPR platform.

The capability tells user space whether the in-kernel handlers of
H_PUT_TCE can handle VFIO-targeted requests. If they cannot, user space
must not allocate a TCE table in the host kernel via the
KVM_CREATE_SPAPR_TCE ioctl: with such a table present, TCE requests
would never be passed on to user space, and passing them on is
the desired behaviour in that situation.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 include/uapi/linux/kvm.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index f51d5082a377..f5a52ffb6b58 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -883,6 +883,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_PPC_MMU_RADIX 134
 #define KVM_CAP_PPC_MMU_HASH_V3 135
 #define KVM_CAP_IMMEDIATE_EXIT 136
+#define KVM_CAP_SPAPR_TCE_VFIO 137
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH kernel v8 06/10] KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently
  2017-03-10  3:53 ` Alexey Kardashevskiy
@ 2017-03-10  3:53   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 53+ messages in thread
From: Alexey Kardashevskiy @ 2017-03-10  3:53 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

It does not make much sense to have KVM on book3s-64 without
the IOMMU bits for PCI passthrough support: they cost little
and allow VFIO to function on book3s KVM.

Having IOMMU_API always enabled makes a lot of "#ifdef IOMMU_API" in
arch/powerpc/kvm/book3s_64_vio* unnecessary. With those ifdefs we could
only accelerate user space emulated devices (but not VFIO), which does
not seem very useful.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 arch/powerpc/kvm/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 029be26b5a17..65a471de96de 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -67,6 +67,7 @@ config KVM_BOOK3S_64
 	select KVM_BOOK3S_64_HANDLER
 	select KVM
 	select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
+	select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
 	---help---
 	  Support running unmodified book3s_64 and book3s_32 guest kernels
 	  in virtual machines on book3s_64 host processors.
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH kernel v8 06/10] KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently
@ 2017-03-10  3:53   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 53+ messages in thread
From: Alexey Kardashevskiy @ 2017-03-10  3:53 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

It does not make much sense to have KVM on book3s-64 without
the IOMMU bits for PCI passthrough support: they cost little
and allow VFIO to function on book3s KVM.

Having IOMMU_API always enabled makes a lot of "#ifdef IOMMU_API" in
arch/powerpc/kvm/book3s_64_vio* unnecessary. With those ifdefs we could
only accelerate user space emulated devices (but not VFIO), which does
not seem very useful.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 arch/powerpc/kvm/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 029be26b5a17..65a471de96de 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -67,6 +67,7 @@ config KVM_BOOK3S_64
 	select KVM_BOOK3S_64_HANDLER
 	select KVM
 	select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
+	select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
 	---help---
 	  Support running unmodified book3s_64 and book3s_32 guest kernels
 	  in virtual machines on book3s_64 host processors.
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH kernel v8 07/10] KVM: PPC: Pass kvm* to kvmppc_find_table()
  2017-03-10  3:53 ` Alexey Kardashevskiy
@ 2017-03-10  3:53   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 53+ messages in thread
From: Alexey Kardashevskiy @ 2017-03-10  3:53 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

The guest-view TCE tables are per KVM instance anyway (not per VCPU),
so pass kvm* there. This will be used in the following patches, where
we will be attaching VFIO containers to LIOBNs via an ioctl() to KVM
(rather than to a VCPU).

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 arch/powerpc/include/asm/kvm_ppc.h  |  2 +-
 arch/powerpc/kvm/book3s_64_vio.c    |  7 ++++---
 arch/powerpc/kvm/book3s_64_vio_hv.c | 13 +++++++------
 3 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index dd11c4c8c56a..eba8988d8443 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -168,7 +168,7 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce_64 *args);
 extern struct kvmppc_spapr_tce_table *kvmppc_find_table(
-		struct kvm_vcpu *vcpu, unsigned long liobn);
+		struct kvm *kvm, unsigned long liobn);
 extern long kvmppc_ioba_validate(struct kvmppc_spapr_tce_table *stt,
 		unsigned long ioba, unsigned long npages);
 extern long kvmppc_tce_validate(struct kvmppc_spapr_tce_table *tt,
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 3e26cd4979f9..e96a4590464c 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -214,12 +214,13 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba, unsigned long tce)
 {
-	struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
+	struct kvmppc_spapr_tce_table *stt;
 	long ret;
 
 	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
 	/* 	    liobn, ioba, tce); */
 
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
@@ -247,7 +248,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	u64 __user *tces;
 	u64 tce;
 
-	stt = kvmppc_find_table(vcpu, liobn);
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
@@ -301,7 +302,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret;
 
-	stt = kvmppc_find_table(vcpu, liobn);
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index e4c4ea973e57..918af76ab2b6 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -48,10 +48,9 @@
  * WARNING: This will be called in real or virtual mode on HV KVM and virtual
  *          mode on PR KVM
  */
-struct kvmppc_spapr_tce_table *kvmppc_find_table(struct kvm_vcpu *vcpu,
+struct kvmppc_spapr_tce_table *kvmppc_find_table(struct kvm *kvm,
 		unsigned long liobn)
 {
-	struct kvm *kvm = vcpu->kvm;
 	struct kvmppc_spapr_tce_table *stt;
 
 	list_for_each_entry_lockless(stt, &kvm->arch.spapr_tce_tables, list)
@@ -182,12 +181,13 @@ EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
 long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		unsigned long ioba, unsigned long tce)
 {
-	struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
+	struct kvmppc_spapr_tce_table *stt;
 	long ret;
 
 	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
 	/* 	    liobn, ioba, tce); */
 
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
@@ -240,7 +240,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	unsigned long tces, entry, ua = 0;
 	unsigned long *rmap = NULL;
 
-	stt = kvmppc_find_table(vcpu, liobn);
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
@@ -301,7 +301,7 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret;
 
-	stt = kvmppc_find_table(vcpu, liobn);
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
@@ -322,12 +322,13 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 long kvmppc_h_get_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba)
 {
-	struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
+	struct kvmppc_spapr_tce_table *stt;
 	long ret;
 	unsigned long idx;
 	struct page *page;
 	u64 *tbl;
 
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH kernel v8 07/10] KVM: PPC: Pass kvm* to kvmppc_find_table()
@ 2017-03-10  3:53   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 53+ messages in thread
From: Alexey Kardashevskiy @ 2017-03-10  3:53 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

The guest-view TCE tables are per KVM instance anyway (not per VCPU),
so pass kvm* there. This will be used in the following patches, where
we will be attaching VFIO containers to LIOBNs via an ioctl() to KVM
(rather than to a VCPU).

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 arch/powerpc/include/asm/kvm_ppc.h  |  2 +-
 arch/powerpc/kvm/book3s_64_vio.c    |  7 ++++---
 arch/powerpc/kvm/book3s_64_vio_hv.c | 13 +++++++------
 3 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index dd11c4c8c56a..eba8988d8443 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -168,7 +168,7 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce_64 *args);
 extern struct kvmppc_spapr_tce_table *kvmppc_find_table(
-		struct kvm_vcpu *vcpu, unsigned long liobn);
+		struct kvm *kvm, unsigned long liobn);
 extern long kvmppc_ioba_validate(struct kvmppc_spapr_tce_table *stt,
 		unsigned long ioba, unsigned long npages);
 extern long kvmppc_tce_validate(struct kvmppc_spapr_tce_table *tt,
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 3e26cd4979f9..e96a4590464c 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -214,12 +214,13 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba, unsigned long tce)
 {
-	struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
+	struct kvmppc_spapr_tce_table *stt;
 	long ret;
 
 	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
 	/* 	    liobn, ioba, tce); */
 
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
@@ -247,7 +248,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	u64 __user *tces;
 	u64 tce;
 
-	stt = kvmppc_find_table(vcpu, liobn);
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
@@ -301,7 +302,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret;
 
-	stt = kvmppc_find_table(vcpu, liobn);
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index e4c4ea973e57..918af76ab2b6 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -48,10 +48,9 @@
  * WARNING: This will be called in real or virtual mode on HV KVM and virtual
  *          mode on PR KVM
  */
-struct kvmppc_spapr_tce_table *kvmppc_find_table(struct kvm_vcpu *vcpu,
+struct kvmppc_spapr_tce_table *kvmppc_find_table(struct kvm *kvm,
 		unsigned long liobn)
 {
-	struct kvm *kvm = vcpu->kvm;
 	struct kvmppc_spapr_tce_table *stt;
 
 	list_for_each_entry_lockless(stt, &kvm->arch.spapr_tce_tables, list)
@@ -182,12 +181,13 @@ EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
 long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		unsigned long ioba, unsigned long tce)
 {
-	struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
+	struct kvmppc_spapr_tce_table *stt;
 	long ret;
 
 	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
 	/* 	    liobn, ioba, tce); */
 
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
@@ -240,7 +240,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	unsigned long tces, entry, ua = 0;
 	unsigned long *rmap = NULL;
 
-	stt = kvmppc_find_table(vcpu, liobn);
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
@@ -301,7 +301,7 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret;
 
-	stt = kvmppc_find_table(vcpu, liobn);
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
@@ -322,12 +322,13 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 long kvmppc_h_get_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba)
 {
-	struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
+	struct kvmppc_spapr_tce_table *stt;
 	long ret;
 	unsigned long idx;
 	struct page *page;
 	u64 *tbl;
 
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH kernel v8 08/10] KVM: PPC: Use preregistered memory API to access TCE list
  2017-03-10  3:53 ` Alexey Kardashevskiy
@ 2017-03-10  3:53   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 53+ messages in thread
From: Alexey Kardashevskiy @ 2017-03-10  3:53 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

VFIO on sPAPR already implements guest memory pre-registration,
in which the entire guest RAM gets pinned. This can be used to translate
the physical address of a guest page containing the TCE list
from H_PUT_TCE_INDIRECT.

This makes use of the pre-registered memory API to access TCE list
pages, in order to avoid unnecessary locking on the KVM memory
reverse map: as all of guest memory is pinned, we have a flat array
mapping GPA to HPA, and it is simpler and quicker to index into that
array (even with looking up the kernel page tables in vmalloc_to_phys)
than it is to find the memslot, lock the rmap entry, look up the user
page tables, and unlock the rmap entry. Note that the rmap pointer is
initialized to NULL where it is declared (not in this patch).

If a requested chunk of memory has not been pre-registered, this falls
back to the non-preregistered case and locks the rmap.
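
To illustrate why the flat array is quicker, here is a self-contained
model of the pre-registered lookup (simplified; the names and layout
are assumptions for illustration, not the kernel structures):

	/* One pre-registered region of 4K pages. */
	struct prereg_mem {
		unsigned long ua;	/* userspace address of the region */
		unsigned long entries;	/* number of 4K pages */
		unsigned long *hpas;	/* hpas[i] = host physical address */
	};

	static long ua_to_hpa(struct prereg_mem *mem, unsigned long ua,
			unsigned long *hpa)
	{
		unsigned long i = (ua - mem->ua) >> 12;

		if (ua < mem->ua || i >= mem->entries)
			return -1;	/* not covered: take the rmap path */
		*hpa = mem->hpas[i] | (ua & 0xfff);	/* keep page offset */
		return 0;
	}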

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
Changes:
v4:
* removed oneline inlines
* now falls back to locking rmap if TCE list is not in preregistered memory

v2:
* updated the commit log with David's comment
---
 arch/powerpc/kvm/book3s_64_vio_hv.c | 58 +++++++++++++++++++++++++++----------
 1 file changed, 42 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 918af76ab2b6..0f145fc7a3a5 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -239,6 +239,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	long i, ret = H_SUCCESS;
 	unsigned long tces, entry, ua = 0;
 	unsigned long *rmap = NULL;
+	bool prereg = false;
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -259,23 +260,47 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	if (ret != H_SUCCESS)
 		return ret;
 
-	if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
-		return H_TOO_HARD;
+	if (mm_iommu_preregistered(vcpu->kvm->mm)) {
+		/*
+		 * We get here if guest memory was pre-registered which
+		 * is normally VFIO case and gpa->hpa translation does not
+		 * depend on hpt.
+		 */
+		struct mm_iommu_table_group_mem_t *mem;
 
-	rmap = (void *) vmalloc_to_phys(rmap);
+		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
+			return H_TOO_HARD;
 
-	/*
-	 * Synchronize with the MMU notifier callbacks in
-	 * book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
-	 * While we have the rmap lock, code running on other CPUs
-	 * cannot finish unmapping the host real page that backs
-	 * this guest real page, so we are OK to access the host
-	 * real page.
-	 */
-	lock_rmap(rmap);
-	if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
-		ret = H_TOO_HARD;
-		goto unlock_exit;
+		mem = mm_iommu_lookup_rm(vcpu->kvm->mm, ua, IOMMU_PAGE_SIZE_4K);
+		if (mem)
+			prereg = mm_iommu_ua_to_hpa_rm(mem, ua, &tces) == 0;
+	}
+
+	if (!prereg) {
+		/*
+		 * This is usually a case of a guest with emulated devices only
+		 * when TCE list is not in preregistered memory.
+		 * We do not require memory to be preregistered in this case
+		 * so lock rmap and do __find_linux_pte_or_hugepte().
+		 */
+		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
+			return H_TOO_HARD;
+
+		rmap = (void *) vmalloc_to_phys(rmap);
+
+		/*
+		 * Synchronize with the MMU notifier callbacks in
+		 * book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
+		 * While we have the rmap lock, code running on other CPUs
+		 * cannot finish unmapping the host real page that backs
+		 * this guest real page, so we are OK to access the host
+		 * real page.
+		 */
+		lock_rmap(rmap);
+		if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
+			ret = H_TOO_HARD;
+			goto unlock_exit;
+		}
 	}
 
 	for (i = 0; i < npages; ++i) {
@@ -289,7 +314,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	}
 
 unlock_exit:
-	unlock_rmap(rmap);
+	if (rmap)
+		unlock_rmap(rmap);
 
 	return ret;
 }
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH kernel v8 08/10] KVM: PPC: Use preregistered memory API to access TCE list
@ 2017-03-10  3:53   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 53+ messages in thread
From: Alexey Kardashevskiy @ 2017-03-10  3:53 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

VFIO on sPAPR already implements guest memory pre-registration,
in which the entire guest RAM gets pinned. This can be used to translate
the physical address of a guest page containing the TCE list
from H_PUT_TCE_INDIRECT.

This makes use of the pre-registered memory API to access TCE list
pages, in order to avoid unnecessary locking on the KVM memory
reverse map: as all of guest memory is pinned, we have a flat array
mapping GPA to HPA, and it is simpler and quicker to index into that
array (even with looking up the kernel page tables in vmalloc_to_phys)
than it is to find the memslot, lock the rmap entry, look up the user
page tables, and unlock the rmap entry. Note that the rmap pointer is
initialized to NULL where it is declared (not in this patch).

If a requested chunk of memory has not been pre-registered, this falls
back to the non-preregistered case and locks the rmap.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
Changes:
v4:
* removed oneline inlines
* now falls back to locking rmap if TCE list is not in preregistered memory

v2:
* updated the commit log with David's comment
---
 arch/powerpc/kvm/book3s_64_vio_hv.c | 58 +++++++++++++++++++++++++++----------
 1 file changed, 42 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 918af76ab2b6..0f145fc7a3a5 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -239,6 +239,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	long i, ret = H_SUCCESS;
 	unsigned long tces, entry, ua = 0;
 	unsigned long *rmap = NULL;
+	bool prereg = false;
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -259,23 +260,47 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	if (ret != H_SUCCESS)
 		return ret;
 
-	if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
-		return H_TOO_HARD;
+	if (mm_iommu_preregistered(vcpu->kvm->mm)) {
+		/*
+		 * We get here if guest memory was pre-registered which
+		 * is normally VFIO case and gpa->hpa translation does not
+		 * depend on hpt.
+		 */
+		struct mm_iommu_table_group_mem_t *mem;
 
-	rmap = (void *) vmalloc_to_phys(rmap);
+		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
+			return H_TOO_HARD;
 
-	/*
-	 * Synchronize with the MMU notifier callbacks in
-	 * book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
-	 * While we have the rmap lock, code running on other CPUs
-	 * cannot finish unmapping the host real page that backs
-	 * this guest real page, so we are OK to access the host
-	 * real page.
-	 */
-	lock_rmap(rmap);
-	if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
-		ret = H_TOO_HARD;
-		goto unlock_exit;
+		mem = mm_iommu_lookup_rm(vcpu->kvm->mm, ua, IOMMU_PAGE_SIZE_4K);
+		if (mem)
+			prereg = mm_iommu_ua_to_hpa_rm(mem, ua, &tces) == 0;
+	}
+
+	if (!prereg) {
+		/*
+		 * This is usually a case of a guest with emulated devices only
+		 * when TCE list is not in preregistered memory.
+		 * We do not require memory to be preregistered in this case
+		 * so lock rmap and do __find_linux_pte_or_hugepte().
+		 */
+		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
+			return H_TOO_HARD;
+
+		rmap = (void *) vmalloc_to_phys(rmap);
+
+		/*
+		 * Synchronize with the MMU notifier callbacks in
+		 * book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
+		 * While we have the rmap lock, code running on other CPUs
+		 * cannot finish unmapping the host real page that backs
+		 * this guest real page, so we are OK to access the host
+		 * real page.
+		 */
+		lock_rmap(rmap);
+		if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
+			ret = H_TOO_HARD;
+			goto unlock_exit;
+		}
 	}
 
 	for (i = 0; i < npages; ++i) {
@@ -289,7 +314,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	}
 
 unlock_exit:
-	unlock_rmap(rmap);
+	if (rmap)
+		unlock_rmap(rmap);
 
 	return ret;
 }
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH kernel v8 09/10] KVM: PPC: iommu: Unify TCE checking
  2017-03-10  3:53 ` Alexey Kardashevskiy
@ 2017-03-10  3:53   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 53+ messages in thread
From: Alexey Kardashevskiy @ 2017-03-10  3:53 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

This reworks the helpers for checking TCE update parameters so that
they can be used in KVM.

This should cause no behavioral change.
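
For context, kvmppc_tce_validate() below relies on iommu_tce_direction()
to map the TCE permission bits to a DMA direction, roughly like this
(a sketch of the existing helper's assumed behaviour):

	static enum dma_data_direction tce_direction(unsigned long tce)
	{
		if ((tce & TCE_PCI_READ) && (tce & TCE_PCI_WRITE))
			return DMA_BIDIRECTIONAL;
		else if (tce & TCE_PCI_READ)
			return DMA_TO_DEVICE;
		else if (tce & TCE_PCI_WRITE)
			return DMA_FROM_DEVICE;
		/* no permission bits: an unmapped (or poisoned) entry */
		return DMA_NONE;
	}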

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
Changes:
v6:
* s/tce/gpa/ as TCE without permission bits is a GPA and this is what is
passed everywhere
---
 arch/powerpc/include/asm/iommu.h    | 20 +++++++++++++++-----
 arch/powerpc/include/asm/kvm_ppc.h  |  6 ++++--
 arch/powerpc/kernel/iommu.c         | 37 +++++++++++++------------------------
 arch/powerpc/kvm/book3s_64_vio_hv.c | 31 +++++++------------------------
 4 files changed, 39 insertions(+), 55 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 82e77ebf85f4..1e6b03339a68 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -296,11 +296,21 @@ static inline void iommu_restore(void)
 #endif
 
 /* The API to support IOMMU operations for VFIO */
-extern int iommu_tce_clear_param_check(struct iommu_table *tbl,
-		unsigned long ioba, unsigned long tce_value,
-		unsigned long npages);
-extern int iommu_tce_put_param_check(struct iommu_table *tbl,
-		unsigned long ioba, unsigned long tce);
+extern int iommu_tce_check_ioba(unsigned long page_shift,
+		unsigned long offset, unsigned long size,
+		unsigned long ioba, unsigned long npages);
+extern int iommu_tce_check_gpa(unsigned long page_shift,
+		unsigned long gpa);
+
+#define iommu_tce_clear_param_check(tbl, ioba, tce_value, npages) \
+		(iommu_tce_check_ioba((tbl)->it_page_shift,       \
+				(tbl)->it_offset, (tbl)->it_size, \
+				(ioba), (npages)) || (tce_value))
+#define iommu_tce_put_param_check(tbl, ioba, gpa)                 \
+		(iommu_tce_check_ioba((tbl)->it_page_shift,       \
+				(tbl)->it_offset, (tbl)->it_size, \
+				(ioba), 1) ||                     \
+		iommu_tce_check_gpa((tbl)->it_page_shift, (gpa)))
 
 extern void iommu_flush_tce(struct iommu_table *tbl);
 extern int iommu_take_ownership(struct iommu_table *tbl);
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index eba8988d8443..72c2a155641f 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -169,8 +169,10 @@ extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce_64 *args);
 extern struct kvmppc_spapr_tce_table *kvmppc_find_table(
 		struct kvm *kvm, unsigned long liobn);
-extern long kvmppc_ioba_validate(struct kvmppc_spapr_tce_table *stt,
-		unsigned long ioba, unsigned long npages);
+#define kvmppc_ioba_validate(stt, ioba, npages)                         \
+		(iommu_tce_check_ioba((stt)->page_shift, (stt)->offset, \
+				(stt)->size, (ioba), (npages)) ?        \
+				H_PARAMETER : H_SUCCESS)
 extern long kvmppc_tce_validate(struct kvmppc_spapr_tce_table *tt,
 		unsigned long tce);
 extern long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index d02b8d22fb50..4269f9f1623b 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -960,47 +960,36 @@ void iommu_flush_tce(struct iommu_table *tbl)
 }
 EXPORT_SYMBOL_GPL(iommu_flush_tce);
 
-int iommu_tce_clear_param_check(struct iommu_table *tbl,
-		unsigned long ioba, unsigned long tce_value,
-		unsigned long npages)
+int iommu_tce_check_ioba(unsigned long page_shift,
+		unsigned long offset, unsigned long size,
+		unsigned long ioba, unsigned long npages)
 {
-	/* tbl->it_ops->clear() does not support any value but 0 */
-	if (tce_value)
-		return -EINVAL;
+	unsigned long mask = (1UL << page_shift) - 1;
 
-	if (ioba & ~IOMMU_PAGE_MASK(tbl))
+	if (ioba & mask)
 		return -EINVAL;
 
-	ioba >>= tbl->it_page_shift;
-	if (ioba < tbl->it_offset)
+	ioba >>= page_shift;
+	if (ioba < offset)
 		return -EINVAL;
 
-	if ((ioba + npages) > (tbl->it_offset + tbl->it_size))
+	if ((ioba + 1) > (offset + size))
 		return -EINVAL;
 
 	return 0;
 }
-EXPORT_SYMBOL_GPL(iommu_tce_clear_param_check);
+EXPORT_SYMBOL_GPL(iommu_tce_check_ioba);
 
-int iommu_tce_put_param_check(struct iommu_table *tbl,
-		unsigned long ioba, unsigned long tce)
+int iommu_tce_check_gpa(unsigned long page_shift, unsigned long gpa)
 {
-	if (tce & ~IOMMU_PAGE_MASK(tbl))
-		return -EINVAL;
-
-	if (ioba & ~IOMMU_PAGE_MASK(tbl))
-		return -EINVAL;
-
-	ioba >>= tbl->it_page_shift;
-	if (ioba < tbl->it_offset)
-		return -EINVAL;
+	unsigned long mask = (1UL << page_shift) - 1;
 
-	if ((ioba + 1) > (tbl->it_offset + tbl->it_size))
+	if (gpa & mask)
 		return -EINVAL;
 
 	return 0;
 }
-EXPORT_SYMBOL_GPL(iommu_tce_put_param_check);
+EXPORT_SYMBOL_GPL(iommu_tce_check_gpa);
 
 long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
 		unsigned long *hpa, enum dma_data_direction *direction)
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 0f145fc7a3a5..440d3ab5dc32 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -62,27 +62,6 @@ struct kvmppc_spapr_tce_table *kvmppc_find_table(struct kvm *kvm,
 EXPORT_SYMBOL_GPL(kvmppc_find_table);
 
 /*
- * Validates IO address.
- *
- * WARNING: This will be called in real-mode on HV KVM and virtual
- *          mode on PR KVM
- */
-long kvmppc_ioba_validate(struct kvmppc_spapr_tce_table *stt,
-		unsigned long ioba, unsigned long npages)
-{
-	unsigned long mask = (1ULL << stt->page_shift) - 1;
-	unsigned long idx = ioba >> stt->page_shift;
-
-	if ((ioba & mask) || (idx < stt->offset) ||
-			(idx - stt->offset + npages > stt->size) ||
-			(idx + npages < idx))
-		return H_PARAMETER;
-
-	return H_SUCCESS;
-}
-EXPORT_SYMBOL_GPL(kvmppc_ioba_validate);
-
-/*
  * Validates TCE address.
  * At the moment flags and page mask are validated.
  * As the host kernel does not access those addresses (just puts them
@@ -95,10 +74,14 @@ EXPORT_SYMBOL_GPL(kvmppc_ioba_validate);
  */
 long kvmppc_tce_validate(struct kvmppc_spapr_tce_table *stt, unsigned long tce)
 {
-	unsigned long page_mask = ~((1ULL << stt->page_shift) - 1);
-	unsigned long mask = ~(page_mask | TCE_PCI_WRITE | TCE_PCI_READ);
+	unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+	enum dma_data_direction dir = iommu_tce_direction(tce);
 
-	if (tce & mask)
+	/* Allow userspace to poison TCE table */
+	if (dir == DMA_NONE)
+		return H_SUCCESS;
+
+	if (iommu_tce_check_gpa(stt->page_shift, gpa))
 		return H_PARAMETER;
 
 	return H_SUCCESS;
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH kernel v8 09/10] KVM: PPC: iommu: Unify TCE checking
@ 2017-03-10  3:53   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 53+ messages in thread
From: Alexey Kardashevskiy @ 2017-03-10  3:53 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

This reworks the helpers for checking TCE update parameters so that
they can be used in KVM.

This should cause no behavioral change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
Changes:
v6:
* s/tce/gpa/ as TCE without permission bits is a GPA and this is what is
passed everywhere
---
 arch/powerpc/include/asm/iommu.h    | 20 +++++++++++++++-----
 arch/powerpc/include/asm/kvm_ppc.h  |  6 ++++--
 arch/powerpc/kernel/iommu.c         | 37 +++++++++++++------------------------
 arch/powerpc/kvm/book3s_64_vio_hv.c | 31 +++++++------------------------
 4 files changed, 39 insertions(+), 55 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 82e77ebf85f4..1e6b03339a68 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -296,11 +296,21 @@ static inline void iommu_restore(void)
 #endif
 
 /* The API to support IOMMU operations for VFIO */
-extern int iommu_tce_clear_param_check(struct iommu_table *tbl,
-		unsigned long ioba, unsigned long tce_value,
-		unsigned long npages);
-extern int iommu_tce_put_param_check(struct iommu_table *tbl,
-		unsigned long ioba, unsigned long tce);
+extern int iommu_tce_check_ioba(unsigned long page_shift,
+		unsigned long offset, unsigned long size,
+		unsigned long ioba, unsigned long npages);
+extern int iommu_tce_check_gpa(unsigned long page_shift,
+		unsigned long gpa);
+
+#define iommu_tce_clear_param_check(tbl, ioba, tce_value, npages) \
+		(iommu_tce_check_ioba((tbl)->it_page_shift,       \
+				(tbl)->it_offset, (tbl)->it_size, \
+				(ioba), (npages)) || (tce_value))
+#define iommu_tce_put_param_check(tbl, ioba, gpa)                 \
+		(iommu_tce_check_ioba((tbl)->it_page_shift,       \
+				(tbl)->it_offset, (tbl)->it_size, \
+				(ioba), 1) ||                     \
+		iommu_tce_check_gpa((tbl)->it_page_shift, (gpa)))
 
 extern void iommu_flush_tce(struct iommu_table *tbl);
 extern int iommu_take_ownership(struct iommu_table *tbl);
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index eba8988d8443..72c2a155641f 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -169,8 +169,10 @@ extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce_64 *args);
 extern struct kvmppc_spapr_tce_table *kvmppc_find_table(
 		struct kvm *kvm, unsigned long liobn);
-extern long kvmppc_ioba_validate(struct kvmppc_spapr_tce_table *stt,
-		unsigned long ioba, unsigned long npages);
+#define kvmppc_ioba_validate(stt, ioba, npages)                         \
+		(iommu_tce_check_ioba((stt)->page_shift, (stt)->offset, \
+				(stt)->size, (ioba), (npages)) ?        \
+				H_PARAMETER : H_SUCCESS)
 extern long kvmppc_tce_validate(struct kvmppc_spapr_tce_table *tt,
 		unsigned long tce);
 extern long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index d02b8d22fb50..4269f9f1623b 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -960,47 +960,36 @@ void iommu_flush_tce(struct iommu_table *tbl)
 }
 EXPORT_SYMBOL_GPL(iommu_flush_tce);
 
-int iommu_tce_clear_param_check(struct iommu_table *tbl,
-		unsigned long ioba, unsigned long tce_value,
-		unsigned long npages)
+int iommu_tce_check_ioba(unsigned long page_shift,
+		unsigned long offset, unsigned long size,
+		unsigned long ioba, unsigned long npages)
 {
-	/* tbl->it_ops->clear() does not support any value but 0 */
-	if (tce_value)
-		return -EINVAL;
+	unsigned long mask = (1UL << page_shift) - 1;
 
-	if (ioba & ~IOMMU_PAGE_MASK(tbl))
+	if (ioba & mask)
 		return -EINVAL;
 
-	ioba >>= tbl->it_page_shift;
-	if (ioba < tbl->it_offset)
+	ioba >>= page_shift;
+	if (ioba < offset)
 		return -EINVAL;
 
-	if ((ioba + npages) > (tbl->it_offset + tbl->it_size))
+	if ((ioba + 1) > (offset + size))
 		return -EINVAL;
 
 	return 0;
 }
-EXPORT_SYMBOL_GPL(iommu_tce_clear_param_check);
+EXPORT_SYMBOL_GPL(iommu_tce_check_ioba);
 
-int iommu_tce_put_param_check(struct iommu_table *tbl,
-		unsigned long ioba, unsigned long tce)
+int iommu_tce_check_gpa(unsigned long page_shift, unsigned long gpa)
 {
-	if (tce & ~IOMMU_PAGE_MASK(tbl))
-		return -EINVAL;
-
-	if (ioba & ~IOMMU_PAGE_MASK(tbl))
-		return -EINVAL;
-
-	ioba >>= tbl->it_page_shift;
-	if (ioba < tbl->it_offset)
-		return -EINVAL;
+	unsigned long mask = (1UL << page_shift) - 1;
 
-	if ((ioba + 1) > (tbl->it_offset + tbl->it_size))
+	if (gpa & mask)
 		return -EINVAL;
 
 	return 0;
 }
-EXPORT_SYMBOL_GPL(iommu_tce_put_param_check);
+EXPORT_SYMBOL_GPL(iommu_tce_check_gpa);
 
 long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
 		unsigned long *hpa, enum dma_data_direction *direction)
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 0f145fc7a3a5..440d3ab5dc32 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -62,27 +62,6 @@ struct kvmppc_spapr_tce_table *kvmppc_find_table(struct kvm *kvm,
 EXPORT_SYMBOL_GPL(kvmppc_find_table);
 
 /*
- * Validates IO address.
- *
- * WARNING: This will be called in real-mode on HV KVM and virtual
- *          mode on PR KVM
- */
-long kvmppc_ioba_validate(struct kvmppc_spapr_tce_table *stt,
-		unsigned long ioba, unsigned long npages)
-{
-	unsigned long mask = (1ULL << stt->page_shift) - 1;
-	unsigned long idx = ioba >> stt->page_shift;
-
-	if ((ioba & mask) || (idx < stt->offset) ||
-			(idx - stt->offset + npages > stt->size) ||
-			(idx + npages < idx))
-		return H_PARAMETER;
-
-	return H_SUCCESS;
-}
-EXPORT_SYMBOL_GPL(kvmppc_ioba_validate);
-
-/*
  * Validates TCE address.
  * At the moment flags and page mask are validated.
  * As the host kernel does not access those addresses (just puts them
@@ -95,10 +74,14 @@ EXPORT_SYMBOL_GPL(kvmppc_ioba_validate);
  */
 long kvmppc_tce_validate(struct kvmppc_spapr_tce_table *stt, unsigned long tce)
 {
-	unsigned long page_mask = ~((1ULL << stt->page_shift) - 1);
-	unsigned long mask = ~(page_mask | TCE_PCI_WRITE | TCE_PCI_READ);
+	unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+	enum dma_data_direction dir = iommu_tce_direction(tce);
 
-	if (tce & mask)
+	/* Allow userspace to poison TCE table */
+	if (dir == DMA_NONE)
+		return H_SUCCESS;
+
+	if (iommu_tce_check_gpa(stt->page_shift, gpa))
 		return H_PARAMETER;
 
 	return H_SUCCESS;
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH kernel v8 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
  2017-03-10  3:53 ` Alexey Kardashevskiy
@ 2017-03-10  3:53   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 53+ messages in thread
From: Alexey Kardashevskiy @ 2017-03-10  3:53 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests targeting an IOMMU TCE table used for VFIO
without passing them to user space, which saves the time otherwise
spent switching to user space and back.

This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
KVM first tries to handle a TCE request in real mode; if that fails,
it passes the request to virtual mode to complete the operation.
If the virtual mode handler fails as well, the request is passed to
user space; this is not expected to happen though.
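
In outline, the fallback chain for a single H_PUT_TCE looks like this
(a sketch; pass_hcall_to_userspace() is a hypothetical stand-in for
KVM's exit-to-userspace path, not a real function):

	/* The real mode handler runs first, with the MMU off. */
	ret = kvmppc_rm_h_put_tce(vcpu, liobn, ioba, tce);
	if (ret == H_TOO_HARD)
		/* Too hard for real mode: retry with the MMU on. */
		ret = kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
	if (ret == H_TOO_HARD)
		/* Still too hard: let user space (QEMU) complete it. */
		ret = pass_hcall_to_userspace(vcpu);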

To avoid dealing with page use counters (which is tricky in real mode),
this only accelerates SPAPR TCE IOMMU v2 clients, which are required
to pre-register the userspace memory. The very first TCE request will
be handled in the VFIO SPAPR TCE driver anyway as the userspace view
of the TCE table (iommu_table::it_userspace) is not allocated until
the very first mapping happens and we cannot call vmalloc in real mode.

If we fail to update a hardware IOMMU table for an unexpected reason,
we just clear the entry and move on, as there is really nothing we can
do about it: for example, when we hot plug a VFIO device into a guest,
existing TCE tables are mirrored automatically to the hardware and
there is no interface for reporting possible failures to the guest.

This adds a new attribute, KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE, to
the VFIO KVM device. It takes a VFIO group fd and a SPAPR TCE table fd
and associates a physical IOMMU table with the SPAPR TCE table (which
is a guest view of the hardware IOMMU table). The iommu_table object
is cached and referenced so we do not have to look it up in real mode.
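
A sketch of how user space is expected to use the new attribute
(assuming vfio_kvm_device_fd came from KVM_CREATE_DEVICE with
KVM_DEV_TYPE_VFIO, and group_fd/table_fd are the VFIO group fd and
the fd returned by KVM_CREATE_SPAPR_TCE):

	struct kvm_vfio_spapr_tce param = {
		.argsz = sizeof(param),
		.groupfd = group_fd,
		.tablefd = table_fd,
	};
	struct kvm_device_attr attr = {
		.group = KVM_DEV_VFIO_GROUP,
		.attr = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE,
		.addr = (__u64)(unsigned long)&param,
	};

	if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr))
		/* no acceleration: keep handling TCEs in user space */
		perror("KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE");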

This does not implement the UNSET counterpart as there is no use for
it: once the acceleration is enabled, the existing userspace won't
disable it unless a VFIO container is destroyed; this adds the necessary
cleanup to the KVM_DEV_VFIO_GROUP_DEL handler instead.

As this creates a descriptor per IOMMU table-LIOBN pair (called
kvmppc_spapr_tce_iommu_table), it is possible to have several
descriptors with the same iommu_table (hardware IOMMU table) attached
to the same LIOBN; we do not remove duplicates though, as
iommu_table_ops::exchange does not just update a TCE entry (which is
shared among IOMMU groups) but also invalidates the TCE cache
(one per IOMMU group).

This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
space.

This adds a real mode version of WARN_ON_ONCE() as the generic version
causes problems with rcu_sched. Since we are testing what
vmalloc_to_phys() returns in the code, this also adds a check for the
already existing vmalloc_to_phys() call in kvmppc_rm_h_put_tce_indirect().

This finally makes use of vfio_external_user_iommu_id(), which was
introduced quite some time ago and was considered for removal.

Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb Ethernet card).

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v8:
* changed all (!pua) checks to return H_TOO_HARD as ioctl() is supposed
to handle them
* changed vmalloc_to_phys() callers to return H_HARDWARE
* changed real mode iommu_tce_xchg_rm() callers to return H_TOO_HARD
and added a comment about this in the code
* changed virtual mode iommu_tce_xchg() callers to return H_HARDWARE
and do WARN_ON
* added WARN_ON_ONCE_RM(!rmap) in kvmppc_rm_h_put_tce_indirect() to
have all vmalloc_to_phys() callsites covered

v7:
* added realmode-friendly WARN_ON_ONCE_RM

v6:
* changed handling of errors returned by kvmppc_(rm_)tce_iommu_(un)map()
* moved kvmppc_gpa_to_ua() to TCE validation

v5:
* changed error codes in multiple places
* added bunch of WARN_ON() in places which should not really happen
* added a check that an iommu table is not already attached to the LIOBN
* dropped explicit calls to iommu_tce_clear_param_check/
iommu_tce_put_param_check as kvmppc_tce_validate/kvmppc_ioba_validate
call them anyway (since the previous patch)
* if we fail to update a hardware IOMMU table for an unexpected reason,
this just clears the entry

v4:
* added note to the commit log about allowing multiple updates of
the same IOMMU table;
* instead of checking whether any memory was preregistered, this
returns H_TOO_HARD if a specific page was not;
* fixed comments from v3 about error handling in many places;
* simplified TCE handlers and merged IOMMU parts inline - for example,
there used to be kvmppc_h_put_tce_iommu(), now it is merged into
kvmppc_h_put_tce(); this allows checking IOBA boundaries against
the first attached table only (makes the code simpler);

v3:
* simplified not to use VFIO group notifiers
* reworked cleanup, should be cleaner/simpler now

v2:
* reworked to use new VFIO notifiers
* now same iommu_table may appear in the list several times, to be fixed later
---
 Documentation/virtual/kvm/devices/vfio.txt |  22 +-
 arch/powerpc/include/asm/kvm_host.h        |   8 +
 arch/powerpc/include/asm/kvm_ppc.h         |   4 +
 include/uapi/linux/kvm.h                   |   8 +
 arch/powerpc/kvm/book3s_64_vio.c           | 323 ++++++++++++++++++++++++++++-
 arch/powerpc/kvm/book3s_64_vio_hv.c        | 201 +++++++++++++++++-
 arch/powerpc/kvm/powerpc.c                 |   2 +
 virt/kvm/vfio.c                            |  60 ++++++
 8 files changed, 623 insertions(+), 5 deletions(-)

diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
index ef51740c67ca..f95d867168ea 100644
--- a/Documentation/virtual/kvm/devices/vfio.txt
+++ b/Documentation/virtual/kvm/devices/vfio.txt
@@ -16,7 +16,25 @@ Groups:
 
 KVM_DEV_VFIO_GROUP attributes:
   KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
+	kvm_device_attr.addr points to an int32_t file descriptor
+	for the VFIO group.
   KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
+	kvm_device_attr.addr points to an int32_t file descriptor
+	for the VFIO group.
+  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
+	allocated by sPAPR KVM.
+	kvm_device_attr.addr points to a struct:
 
-For each, kvm_device_attr.addr points to an int32_t file descriptor
-for the VFIO group.
+	struct kvm_vfio_spapr_tce {
+		__u32	argsz;
+		__u32	flags;
+		__s32	groupfd;
+		__s32	tablefd;
+	};
+
+	where
+	@argsz is the size of struct kvm_vfio_spapr_tce;
+	@flags are not supported now, must be zero;
+	@groupfd is a file descriptor for a VFIO group;
+	@tablefd is a file descriptor for a TCE table allocated via
+		KVM_CREATE_SPAPR_TCE.
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 7bba8f415627..857ae2c6aa39 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -191,6 +191,13 @@ struct kvmppc_pginfo {
 	atomic_t refcnt;
 };
 
+struct kvmppc_spapr_tce_iommu_table {
+	struct rcu_head rcu;
+	struct list_head next;
+	struct vfio_group *group;
+	struct iommu_table *tbl;
+};
+
 struct kvmppc_spapr_tce_table {
 	struct list_head list;
 	struct kvm *kvm;
@@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
 	u32 page_shift;
 	u64 offset;		/* in pages */
 	u64 size;		/* window size in pages */
+	struct list_head iommu_tables;
 	struct page *pages[0];
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 72c2a155641f..66de7e73b3d3 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -164,6 +164,10 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
 extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
 			struct kvm_memory_slot *memslot, unsigned long porder);
 extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
+extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
+		struct vfio_group *group);
+extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
+		struct vfio_group *group);
 
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce_64 *args);
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index f5a52ffb6b58..e743cb0d176e 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1088,6 +1088,7 @@ struct kvm_device_attr {
 #define  KVM_DEV_VFIO_GROUP			1
 #define   KVM_DEV_VFIO_GROUP_ADD			1
 #define   KVM_DEV_VFIO_GROUP_DEL			2
+#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
 
 enum kvm_device_type {
 	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
@@ -1109,6 +1110,13 @@ enum kvm_device_type {
 	KVM_DEV_TYPE_MAX,
 };
 
+struct kvm_vfio_spapr_tce {
+	__u32	argsz;
+	__u32	flags;
+	__s32	groupfd;
+	__s32	tablefd;
+};
+
 /*
  * ioctls for VM fds
  */
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index e96a4590464c..be18cda01e1b 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -28,6 +28,10 @@
 #include <linux/hugetlb.h>
 #include <linux/list.h>
 #include <linux/anon_inodes.h>
+#include <linux/iommu.h>
+#include <linux/file.h>
+#include <linux/vfio.h>
+#include <linux/module.h>
 
 #include <asm/tlbflush.h>
 #include <asm/kvm_ppc.h>
@@ -40,6 +44,36 @@
 #include <asm/udbg.h>
 #include <asm/iommu.h>
 #include <asm/tce.h>
+#include <asm/mmu_context.h>
+
+static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
+{
+	void (*fn)(struct vfio_group *);
+
+	fn = symbol_get(vfio_group_put_external_user);
+	if (WARN_ON(!fn))
+		return;
+
+	fn(vfio_group);
+
+	symbol_put(vfio_group_put_external_user);
+}
+
+static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
+{
+	int (*fn)(struct vfio_group *);
+	int ret = -1;
+
+	fn = symbol_get(vfio_external_user_iommu_id);
+	if (!fn)
+		return ret;
+
+	ret = fn(vfio_group);
+
+	symbol_put(vfio_external_user_iommu_id);
+
+	return ret;
+}
 
 static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
 {
@@ -91,6 +125,130 @@ static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
 	return ret;
 }
 
+static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
+{
+	struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
+			struct kvmppc_spapr_tce_iommu_table, rcu);
+
+	iommu_table_put(stit->tbl);
+	kvm_vfio_group_put_external_user(stit->group);
+
+	kfree(stit);
+}
+
+static void kvm_spapr_tce_liobn_release_iommu_group(
+		struct kvmppc_spapr_tce_table *stt,
+		struct vfio_group *group)
+{
+	struct kvmppc_spapr_tce_iommu_table *stit, *tmp;
+
+	list_for_each_entry_safe(stit, tmp, &stt->iommu_tables, next) {
+		if (group && (stit->group != group))
+			continue;
+
+		list_del_rcu(&stit->next);
+
+		call_rcu(&stit->rcu, kvm_spapr_tce_iommu_table_free);
+	}
+}
+
+extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
+		struct vfio_group *group)
+{
+	struct kvmppc_spapr_tce_table *stt;
+
+	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list)
+		kvm_spapr_tce_liobn_release_iommu_group(stt, group);
+}
+
+extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
+		struct vfio_group *group)
+{
+	struct kvmppc_spapr_tce_table *stt = NULL;
+	bool found = false;
+	struct iommu_table *tbl = NULL;
+	struct iommu_table_group *table_group;
+	long i, ret = 0;
+	struct kvmppc_spapr_tce_iommu_table *stit;
+	struct fd f;
+	int group_id;
+	struct iommu_group *grp;
+
+	group_id = kvm_vfio_external_user_iommu_id(group);
+	grp = iommu_group_get_by_id(group_id);
+	if (WARN_ON(!grp))
+		return -EIO;
+
+	f = fdget(tablefd);
+	if (!f.file) {
+		ret = -EBADF;
+		goto put_exit;
+	}
+
+	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
+		if (stt == f.file->private_data) {
+			found = true;
+			break;
+		}
+	}
+
+	fdput(f);
+
+	if (!found) {
+		ret = -EINVAL;
+		goto put_exit;
+	}
+
+	table_group = iommu_group_get_iommudata(grp);
+	if (WARN_ON(!table_group)) {
+		ret = -EFAULT;
+		goto put_exit;
+	}
+
+	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
+		struct iommu_table *tbltmp = table_group->tables[i];
+
+		if (!tbltmp)
+			continue;
+
+		/*
+		 * Make sure hardware table parameters are exactly the same;
+		 * this is used in the TCE handlers where boundary checks
+		 * use only the first attached table.
+		 */
+		if ((tbltmp->it_page_shift == stt->page_shift) &&
+				(tbltmp->it_offset == stt->offset) &&
+				(tbltmp->it_size == stt->size)) {
+			tbl = tbltmp;
+			break;
+		}
+	}
+	if (!tbl) {
+		ret = -EINVAL;
+		goto put_exit;
+	}
+
+	list_for_each_entry_rcu(stit, &stt->iommu_tables, next) {
+		if ((stit->tbl == tbl) && (stit->group == group)) {
+			ret = -EBUSY;
+			goto put_exit;
+		}
+	}
+
+	iommu_table_get(tbl);
+
+	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
+	stit->tbl = tbl;
+	stit->group = group;
+
+	list_add_rcu(&stit->next, &stt->iommu_tables);
+
+put_exit:
+	iommu_group_put(grp);
+
+	return ret;
+}
+
 static void release_spapr_tce_table(struct rcu_head *head)
 {
 	struct kvmppc_spapr_tce_table *stt = container_of(head,
@@ -133,6 +291,8 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
 
 	list_del_rcu(&stt->list);
 
+	kvm_spapr_tce_liobn_release_iommu_group(stt, NULL /* release all */);
+
 	kvm_put_kvm(stt->kvm);
 
 	kvmppc_account_memlimit(
@@ -183,6 +343,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 	stt->offset = args->offset;
 	stt->size = size;
 	stt->kvm = kvm;
+	INIT_LIST_HEAD_RCU(&stt->iommu_tables);
 
 	for (i = 0; i < npages; i++) {
 		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
@@ -211,11 +372,101 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 	return ret;
 }
 
+static void kvmppc_clear_tce(struct iommu_table *tbl, unsigned long entry)
+{
+	unsigned long hpa = 0;
+	enum dma_data_direction dir = DMA_NONE;
+
+	iommu_tce_xchg(tbl, entry, &hpa, &dir);
+}
+
+static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	struct mm_iommu_table_group_mem_t *mem = NULL;
+	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
+	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+
+	if (!pua)
+		/* it_userspace allocation might be delayed */
+		return H_TOO_HARD;
+
+	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
+	if (!mem)
+		return H_TOO_HARD;
+
+	mm_iommu_mapped_dec(mem);
+
+	*pua = 0;
+
+	return H_SUCCESS;
+}
+
+static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	enum dma_data_direction dir = DMA_NONE;
+	unsigned long hpa = 0;
+	long ret;
+
+	if (WARN_ON_ONCE(iommu_tce_xchg(tbl, entry, &hpa, &dir)))
+		return H_HARDWARE;
+
+	if (dir == DMA_NONE)
+		return H_SUCCESS;
+
+	ret = kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
+	if (ret != H_SUCCESS)
+		iommu_tce_xchg(tbl, entry, &hpa, &dir);
+
+	return ret;
+}
+
+long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
+		unsigned long entry, unsigned long ua,
+		enum dma_data_direction dir)
+{
+	long ret;
+	unsigned long hpa, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+	struct mm_iommu_table_group_mem_t *mem;
+
+	if (!pua)
+		/* it_userspace allocation might be delayed */
+		return H_TOO_HARD;
+
+	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
+	if (!mem)
+		/* This only handles v2 IOMMU type, v1 is handled via ioctl() */
+		return H_TOO_HARD;
+
+	if (WARN_ON_ONCE(mm_iommu_ua_to_hpa(mem, ua, &hpa)))
+		return H_HARDWARE;
+
+	if (mm_iommu_mapped_inc(mem))
+		return H_CLOSED;
+
+	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
+	if (WARN_ON_ONCE(ret)) {
+		mm_iommu_mapped_dec(mem);
+		return H_HARDWARE;
+	}
+
+	if (dir != DMA_NONE)
+		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
+
+	*pua = ua;
+
+	return 0;
+}
+
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba, unsigned long tce)
 {
 	struct kvmppc_spapr_tce_table *stt;
-	long ret;
+	long ret, idx;
+	struct kvmppc_spapr_tce_iommu_table *stit;
+	unsigned long entry, ua = 0;
+	enum dma_data_direction dir;
 
 	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
 	/* 	    liobn, ioba, tce); */
@@ -232,7 +483,35 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (ret != H_SUCCESS)
 		return ret;
 
-	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
+	dir = iommu_tce_direction(tce);
+	if ((dir != DMA_NONE) && kvmppc_gpa_to_ua(vcpu->kvm,
+			tce & ~(TCE_PCI_READ | TCE_PCI_WRITE), &ua, NULL))
+		return H_PARAMETER;
+
+	entry = ioba >> stt->page_shift;
+
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		if (dir == DMA_NONE) {
+			ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
+					stit->tbl, entry);
+		} else {
+			idx = srcu_read_lock(&vcpu->kvm->srcu);
+			ret = kvmppc_tce_iommu_map(vcpu->kvm, stit->tbl,
+					entry, ua, dir);
+			srcu_read_unlock(&vcpu->kvm->srcu, idx);
+		}
+
+		if (ret == H_SUCCESS)
+			continue;
+
+		if (ret == H_TOO_HARD)
+			return ret;
+
+		WARN_ON_ONCE(1);
+		kvmppc_clear_tce(stit->tbl, entry);
+	}
+
+	kvmppc_tce_put(stt, entry, tce);
 
 	return H_SUCCESS;
 }
@@ -247,6 +526,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	unsigned long entry, ua = 0;
 	u64 __user *tces;
 	u64 tce;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -285,6 +565,26 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 		if (ret != H_SUCCESS)
 			goto unlock_exit;
 
+		if (kvmppc_gpa_to_ua(vcpu->kvm,
+				tce & ~(TCE_PCI_READ | TCE_PCI_WRITE),
+				&ua, NULL))
+			return H_PARAMETER;
+
+		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+			ret = kvmppc_tce_iommu_map(vcpu->kvm,
+					stit->tbl, entry + i, ua,
+					iommu_tce_direction(tce));
+
+			if (ret == H_SUCCESS)
+				continue;
+
+			if (ret == H_TOO_HARD)
+				goto unlock_exit;
+
+			WARN_ON_ONCE(1);
+			kvmppc_clear_tce(stit->tbl, entry);
+		}
+
 		kvmppc_tce_put(stt, entry + i, tce);
 	}
 
@@ -301,6 +601,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -314,6 +615,24 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
 		return H_PARAMETER;
 
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		unsigned long entry = ioba >> stit->tbl->it_page_shift;
+
+		for (i = 0; i < npages; ++i) {
+			ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
+					stit->tbl, entry + i);
+
+			if (ret == H_SUCCESS)
+				continue;
+
+			if (ret == H_TOO_HARD)
+				return ret;
+
+			WARN_ON_ONCE(1);
+			kvmppc_clear_tce(stit->tbl, entry);
+		}
+	}
+
 	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
 		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
 
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 440d3ab5dc32..eda0a8f6fae8 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -40,6 +40,31 @@
 #include <asm/iommu.h>
 #include <asm/tce.h>
 
+#ifdef CONFIG_BUG
+
+#define WARN_ON_ONCE_RM(condition)	({			\
+	static bool __section(.data.unlikely) __warned;		\
+	int __ret_warn_once = !!(condition);			\
+								\
+	if (unlikely(__ret_warn_once && !__warned)) {		\
+		__warned = true;				\
+		pr_err("WARN_ON_ONCE_RM: (%s) at %s:%u\n",	\
+				__stringify(condition),		\
+				__func__, __LINE__);		\
+		dump_stack();					\
+	}							\
+	unlikely(__ret_warn_once);				\
+})
+
+#else
+
+#define WARN_ON_ONCE_RM(condition) ({				\
+	int __ret_warn_on = !!(condition);			\
+	unlikely(__ret_warn_on);				\
+})
+
+#endif
+
 #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
 
 /*
@@ -161,11 +186,117 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
 EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
 
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+static void kvmppc_rm_clear_tce(struct iommu_table *tbl, unsigned long entry)
+{
+	unsigned long hpa = 0;
+	enum dma_data_direction dir = DMA_NONE;
+
+	iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
+}
+
+static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm *kvm,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	struct mm_iommu_table_group_mem_t *mem = NULL;
+	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
+	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+
+	if (!pua)
+		/* it_userspace allocation might be delayed */
+		return H_TOO_HARD;
+
+	pua = (void *) vmalloc_to_phys(pua);
+	if (WARN_ON_ONCE_RM(!pua))
+		return H_HARDWARE;
+
+	mem = mm_iommu_lookup_rm(kvm->mm, *pua, pgsize);
+	if (!mem)
+		return H_TOO_HARD;
+
+	mm_iommu_mapped_dec(mem);
+
+	*pua = 0;
+
+	return H_SUCCESS;
+}
+
+static long kvmppc_rm_tce_iommu_unmap(struct kvm *kvm,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	enum dma_data_direction dir = DMA_NONE;
+	unsigned long hpa = 0;
+	long ret;
+
+	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
+		/*
+		 * real mode xchg can fail if struct page crosses
+		 * a page boundary
+		 */
+		return H_TOO_HARD;
+
+	if (dir == DMA_NONE)
+		return H_SUCCESS;
+
+	ret = kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
+	if (ret)
+		iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
+
+	return ret;
+}
+
+static long kvmppc_rm_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
+		unsigned long entry, unsigned long ua,
+		enum dma_data_direction dir)
+{
+	long ret;
+	unsigned long hpa = 0;
+	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+	struct mm_iommu_table_group_mem_t *mem;
+
+	if (!pua)
+		/* it_userspace allocation might be delayed */
+		return H_TOO_HARD;
+
+	mem = mm_iommu_lookup_rm(kvm->mm, ua, 1ULL << tbl->it_page_shift);
+	if (!mem)
+		return H_TOO_HARD;
+
+	if (WARN_ON_ONCE_RM(mm_iommu_ua_to_hpa_rm(mem, ua, &hpa)))
+		return H_HARDWARE;
+
+	pua = (void *) vmalloc_to_phys(pua);
+	if (WARN_ON_ONCE_RM(!pua))
+		return H_HARDWARE;
+
+	if (WARN_ON_ONCE_RM(mm_iommu_mapped_inc(mem)))
+		return H_CLOSED;
+
+	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
+	if (ret) {
+		mm_iommu_mapped_dec(mem);
+		/*
+		 * real mode xchg can fail if struct page crosses
+		 * a page boundary
+		 */
+		return H_TOO_HARD;
+	}
+
+	if (dir != DMA_NONE)
+		kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
+
+	*pua = ua;
+
+	return 0;
+}
+
 long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		unsigned long ioba, unsigned long tce)
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long ret;
+	struct kvmppc_spapr_tce_iommu_table *stit;
+	unsigned long entry, ua = 0;
+	enum dma_data_direction dir;
 
 	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
 	/* 	    liobn, ioba, tce); */
@@ -182,7 +313,32 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (ret != H_SUCCESS)
 		return ret;
 
-	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
+	dir = iommu_tce_direction(tce);
+	if ((dir != DMA_NONE) && kvmppc_gpa_to_ua(vcpu->kvm,
+			tce & ~(TCE_PCI_READ | TCE_PCI_WRITE), &ua, NULL))
+		return H_PARAMETER;
+
+	entry = ioba >> stt->page_shift;
+
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		if (dir == DMA_NONE)
+			ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
+					stit->tbl, entry);
+		else
+			ret = kvmppc_rm_tce_iommu_map(vcpu->kvm,
+					stit->tbl, entry, ua, dir);
+
+		if (ret == H_SUCCESS)
+			continue;
+
+		if (ret == H_TOO_HARD)
+			return ret;
+
+		WARN_ON_ONCE_RM(1);
+		kvmppc_rm_clear_tce(stit->tbl, entry);
+	}
+
+	kvmppc_tce_put(stt, entry, tce);
 
 	return H_SUCCESS;
 }
@@ -223,6 +379,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	unsigned long tces, entry, ua = 0;
 	unsigned long *rmap = NULL;
 	bool prereg = false;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -270,6 +427,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 			return H_TOO_HARD;
 
 		rmap = (void *) vmalloc_to_phys(rmap);
+		if (WARN_ON_ONCE_RM(!rmap))
+			return H_HARDWARE;
 
 		/*
 		 * Synchronize with the MMU notifier callbacks in
@@ -293,6 +452,27 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 		if (ret != H_SUCCESS)
 			goto unlock_exit;
 
+		ua = 0;
+		if (kvmppc_gpa_to_ua(vcpu->kvm,
+				tce & ~(TCE_PCI_READ | TCE_PCI_WRITE),
+				&ua, NULL))
+			return H_PARAMETER;
+
+		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+			ret = kvmppc_rm_tce_iommu_map(vcpu->kvm,
+					stit->tbl, entry + i, ua,
+					iommu_tce_direction(tce));
+
+			if (ret == H_SUCCESS)
+				continue;
+
+			if (ret == H_TOO_HARD)
+				goto unlock_exit;
+
+			WARN_ON_ONCE_RM(1);
+			kvmppc_rm_clear_tce(stit->tbl, entry);
+		}
+
 		kvmppc_tce_put(stt, entry + i, tce);
 	}
 
@@ -309,6 +489,7 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -322,6 +503,24 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
 		return H_PARAMETER;
 
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		unsigned long entry = ioba >> stit->tbl->it_page_shift;
+
+		for (i = 0; i < npages; ++i) {
+			ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
+					stit->tbl, entry + i);
+
+			if (ret == H_SUCCESS)
+				continue;
+
+			if (ret == H_TOO_HARD)
+				return ret;
+
+			WARN_ON_ONCE_RM(1);
+			kvmppc_rm_clear_tce(stit->tbl, entry);
+		}
+	}
+
 	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
 		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
 
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 95c91a9de351..62bdd6c48107 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -538,6 +538,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 #ifdef CONFIG_PPC_BOOK3S_64
 	case KVM_CAP_SPAPR_TCE:
 	case KVM_CAP_SPAPR_TCE_64:
+		/* fallthrough */
+	case KVM_CAP_SPAPR_TCE_VFIO:
 	case KVM_CAP_PPC_RTAS:
 	case KVM_CAP_PPC_FIXUP_HCALL:
 	case KVM_CAP_PPC_ENABLE_HCALL:
diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
index d32f239eb471..2b7dc22265fe 100644
--- a/virt/kvm/vfio.c
+++ b/virt/kvm/vfio.c
@@ -20,6 +20,10 @@
 #include <linux/vfio.h>
 #include "vfio.h"
 
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+#include <asm/kvm_ppc.h>
+#endif
+
 struct kvm_vfio_group {
 	struct list_head node;
 	struct vfio_group *vfio_group;
@@ -211,6 +215,9 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
 
 		mutex_unlock(&kv->lock);
 
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+		kvm_spapr_tce_release_iommu_group(dev->kvm, vfio_group);
+#endif
 		kvm_vfio_group_set_kvm(vfio_group, NULL);
 
 		kvm_vfio_group_put_external_user(vfio_group);
@@ -218,6 +225,53 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
 		kvm_vfio_update_coherency(dev);
 
 		return ret;
+
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
+		struct kvm_vfio_spapr_tce param;
+		unsigned long minsz;
+		struct kvm_vfio *kv = dev->private;
+		struct vfio_group *vfio_group;
+		struct kvm_vfio_group *kvg;
+		struct fd f;
+
+		minsz = offsetofend(struct kvm_vfio_spapr_tce, tablefd);
+
+		if (copy_from_user(&param, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (param.argsz < minsz || param.flags)
+			return -EINVAL;
+
+		f = fdget(param.groupfd);
+		if (!f.file)
+			return -EBADF;
+
+		vfio_group = kvm_vfio_group_get_external_user(f.file);
+		fdput(f);
+
+		if (IS_ERR(vfio_group))
+			return PTR_ERR(vfio_group);
+
+		ret = -ENOENT;
+
+		mutex_lock(&kv->lock);
+
+		list_for_each_entry(kvg, &kv->group_list, node) {
+			if (kvg->vfio_group != vfio_group)
+				continue;
+
+			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
+					param.tablefd, vfio_group);
+
+			break;
+		}
+
+		mutex_unlock(&kv->lock);
+
+		return ret;
+	}
+#endif /* CONFIG_SPAPR_TCE_IOMMU */
 	}
 
 	return -ENXIO;
@@ -242,6 +296,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
 		switch (attr->attr) {
 		case KVM_DEV_VFIO_GROUP_ADD:
 		case KVM_DEV_VFIO_GROUP_DEL:
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
+#endif
 			return 0;
 		}
 
@@ -257,6 +314,9 @@ static void kvm_vfio_destroy(struct kvm_device *dev)
 	struct kvm_vfio_group *kvg, *tmp;
 
 	list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+		kvm_spapr_tce_release_iommu_group(dev->kvm, kvg->vfio_group);
+#endif
 		kvm_vfio_group_set_kvm(kvg->vfio_group, NULL);
 		kvm_vfio_group_put_external_user(kvg->vfio_group);
 		list_del(&kvg->node);
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH kernel v8 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
@ 2017-03-10  3:53   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 53+ messages in thread
From: Alexey Kardashevskiy @ 2017-03-10  3:53 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
without passing them to user space which saves time on switching
to user space and back.

This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
KVM tries to handle a TCE request in real mode first; if that fails,
it passes the request to virtual mode to complete the operation.
If the virtual mode handler fails as well, the request is passed to
the user space; this is not expected to happen though.

To avoid dealing with page use counters (which is tricky in real mode),
this only accelerates SPAPR TCE IOMMU v2 clients which are required
to pre-register the userspace memory. The very first TCE request will
be handled in the VFIO SPAPR TCE driver anyway as the userspace view
of the TCE table (iommu_table::it_userspace) is not allocated till
the very first mapping happens and we cannot call vmalloc in real mode.
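
For reference, a v2 userspace client preregisters its DMA memory through
the VFIO container before any mapping can be accelerated; a minimal
sketch (container_fd, buf and size are illustrative, error handling
omitted):

	struct vfio_iommu_spapr_register_memory reg = {
		.argsz = sizeof(reg),
		.flags = 0,
		.vaddr = (__u64)(unsigned long) buf,
		.size = size,
	};

	/* pins the pages and makes them visible to mm_iommu_lookup() */
	ioctl(container_fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);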

If we fail to update a hardware IOMMU table for an unexpected reason, we just
clear it and move on as there is nothing really we can do about it -
for example, if we hot plug a VFIO device to a guest, existing TCE tables
will be mirrored automatically to the hardware and there is no interface
to report to the guest about possible failures.

This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
and associates a physical IOMMU table with the SPAPR TCE table (which
is a guest view of the hardware IOMMU table). The iommu_table object
is cached and referenced so we do not have to look it up in real mode.
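
A minimal sketch of the attachment from userspace (kvm_device_fd is an
fd of the KVM_DEV_TYPE_VFIO device, groupfd/tablefd are assumed to be
already open; error handling omitted):

	struct kvm_vfio_spapr_tce param = {
		.argsz = sizeof(param),
		.flags = 0,
		.groupfd = groupfd,	/* VFIO group fd */
		.tablefd = tablefd,	/* fd from KVM_CREATE_SPAPR_TCE{,_64} */
	};
	struct kvm_device_attr attr = {
		.group = KVM_DEV_VFIO_GROUP,
		.attr = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE,
		.addr = (__u64)(unsigned long) &param,
	};

	ioctl(kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr);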

This does not implement the UNSET counterpart as there is no use for it -
once the acceleration is enabled, the existing userspace won't
disable it unless a VFIO container is destroyed; this adds necessary
cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.

As this creates a descriptor per IOMMU table-LIOBN couple (called
kvmppc_spapr_tce_iommu_table), it is possible to have several
descriptors with the same iommu_table (hardware IOMMU table) attached
to the same LIOBN; we do not remove duplicates though as
iommu_table_ops::exchange does not just update a TCE entry (which is
shared among IOMMU groups) but also invalidates the TCE cache
(one per IOMMU group).

This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
space.
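
Userspace can probe for it with KVM_CHECK_EXTENSION before enabling
the fast path, roughly (kvm_fd and the flag name are illustrative):

	if (ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_SPAPR_TCE_VFIO) > 0)
		/* in-kernel H_PUT_TCE acceleration is available */
		cap_spapr_vfio = true;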

This adds real mode version of WARN_ON_ONCE() as the generic version
causes problems with rcu_sched. Since we are testing what vmalloc_to_phys()
returns in the code, this also adds a check for already existing
vmalloc_to_phys() call in kvmppc_rm_h_put_tce_indirect().

This finally makes use of vfio_external_user_iommu_id() which was
introduced quite some time ago and was considered for removal.

Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb Ethernet card).

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v8:
* changed all (!pua) checks to return H_TOO_HARD as ioctl() is supposed
to handle them
* changed vmalloc_to_phys() callers to return H_HARDWARE
* changed real mode iommu_tce_xchg_rm() callers to return H_TOO_HARD
and added a comment about this in the code
* changed virtual mode iommu_tce_xchg() callers to return H_HARDWARE
and do WARN_ON
* added WARN_ON_ONCE_RM(!rmap) in kvmppc_rm_h_put_tce_indirect() to
have all vmalloc_to_phys() callsites covered

v7:
* added realmode-friendly WARN_ON_ONCE_RM

v6:
* changed handling of errors returned by kvmppc_(rm_)tce_iommu_(un)map()
* moved kvmppc_gpa_to_ua() to TCE validation

v5:
* changed error codes in multiple places
* added a bunch of WARN_ON() in places which should not really happen
* added a check that an iommu table is not already attached to a LIOBN
* dropped explicit calls to iommu_tce_clear_param_check/
iommu_tce_put_param_check as kvmppc_tce_validate/kvmppc_ioba_validate
call them anyway (since the previous patch)
* if we fail to update a hardware IOMMU table for unexpected reason,
this just clears the entry

v4:
* added note to the commit log about allowing multiple updates of
the same IOMMU table;
* instead of checking whether any memory was preregistered, this
returns H_TOO_HARD if a specific page was not;
* fixed comments from v3 about error handling in many places;
* simplified TCE handlers and merged IOMMU parts inline - for example,
there used to be kvmppc_h_put_tce_iommu(), now it is merged into
kvmppc_h_put_tce(); this allows checking IOBA boundaries against
the first attached table only (makes the code simpler);

v3:
* simplified not to use VFIO group notifiers
* reworked cleanup, should be cleaner/simpler now

v2:
* reworked to use new VFIO notifiers
* now same iommu_table may appear in the list several times, to be fixed later
---
 Documentation/virtual/kvm/devices/vfio.txt |  22 +-
 arch/powerpc/include/asm/kvm_host.h        |   8 +
 arch/powerpc/include/asm/kvm_ppc.h         |   4 +
 include/uapi/linux/kvm.h                   |   8 +
 arch/powerpc/kvm/book3s_64_vio.c           | 323 ++++++++++++++++++++++++++++-
 arch/powerpc/kvm/book3s_64_vio_hv.c        | 201 +++++++++++++++++-
 arch/powerpc/kvm/powerpc.c                 |   2 +
 virt/kvm/vfio.c                            |  60 ++++++
 8 files changed, 623 insertions(+), 5 deletions(-)

diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
index ef51740c67ca..f95d867168ea 100644
--- a/Documentation/virtual/kvm/devices/vfio.txt
+++ b/Documentation/virtual/kvm/devices/vfio.txt
@@ -16,7 +16,25 @@ Groups:
 
 KVM_DEV_VFIO_GROUP attributes:
   KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
+	kvm_device_attr.addr points to an int32_t file descriptor
+	for the VFIO group.
   KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
+	kvm_device_attr.addr points to an int32_t file descriptor
+	for the VFIO group.
+  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
+	allocated by sPAPR KVM.
+	kvm_device_attr.addr points to a struct:
 
-For each, kvm_device_attr.addr points to an int32_t file descriptor
-for the VFIO group.
+	struct kvm_vfio_spapr_tce {
+		__u32	argsz;
+		__u32	flags;
+		__s32	groupfd;
+		__s32	tablefd;
+	};
+
+	where
+	@argsz is the size of struct kvm_vfio_spapr_tce;
+	@flags are not supported now, must be zero;
+	@groupfd is a file descriptor for a VFIO group;
+	@tablefd is a file descriptor for a TCE table allocated via
+		KVM_CREATE_SPAPR_TCE.
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 7bba8f415627..857ae2c6aa39 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -191,6 +191,13 @@ struct kvmppc_pginfo {
 	atomic_t refcnt;
 };
 
+struct kvmppc_spapr_tce_iommu_table {
+	struct rcu_head rcu;
+	struct list_head next;
+	struct vfio_group *group;
+	struct iommu_table *tbl;
+};
+
 struct kvmppc_spapr_tce_table {
 	struct list_head list;
 	struct kvm *kvm;
@@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
 	u32 page_shift;
 	u64 offset;		/* in pages */
 	u64 size;		/* window size in pages */
+	struct list_head iommu_tables;
 	struct page *pages[0];
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 72c2a155641f..66de7e73b3d3 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -164,6 +164,10 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
 extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
 			struct kvm_memory_slot *memslot, unsigned long porder);
 extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
+extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
+		struct vfio_group *group);
+extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
+		struct vfio_group *group);
 
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce_64 *args);
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index f5a52ffb6b58..e743cb0d176e 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1088,6 +1088,7 @@ struct kvm_device_attr {
 #define  KVM_DEV_VFIO_GROUP			1
 #define   KVM_DEV_VFIO_GROUP_ADD			1
 #define   KVM_DEV_VFIO_GROUP_DEL			2
+#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
 
 enum kvm_device_type {
 	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
@@ -1109,6 +1110,13 @@ enum kvm_device_type {
 	KVM_DEV_TYPE_MAX,
 };
 
+struct kvm_vfio_spapr_tce {
+	__u32	argsz;
+	__u32	flags;
+	__s32	groupfd;
+	__s32	tablefd;
+};
+
 /*
  * ioctls for VM fds
  */
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index e96a4590464c..be18cda01e1b 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -28,6 +28,10 @@
 #include <linux/hugetlb.h>
 #include <linux/list.h>
 #include <linux/anon_inodes.h>
+#include <linux/iommu.h>
+#include <linux/file.h>
+#include <linux/vfio.h>
+#include <linux/module.h>
 
 #include <asm/tlbflush.h>
 #include <asm/kvm_ppc.h>
@@ -40,6 +44,36 @@
 #include <asm/udbg.h>
 #include <asm/iommu.h>
 #include <asm/tce.h>
+#include <asm/mmu_context.h>
+
+static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
+{
+	void (*fn)(struct vfio_group *);
+
+	fn = symbol_get(vfio_group_put_external_user);
+	if (WARN_ON(!fn))
+		return;
+
+	fn(vfio_group);
+
+	symbol_put(vfio_group_put_external_user);
+}
+
+static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
+{
+	int (*fn)(struct vfio_group *);
+	int ret = -1;
+
+	fn = symbol_get(vfio_external_user_iommu_id);
+	if (!fn)
+		return ret;
+
+	ret = fn(vfio_group);
+
+	symbol_put(vfio_external_user_iommu_id);
+
+	return ret;
+}
 
 static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
 {
@@ -91,6 +125,130 @@ static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
 	return ret;
 }
 
+static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
+{
+	struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
+			struct kvmppc_spapr_tce_iommu_table, rcu);
+
+	iommu_table_put(stit->tbl);
+	kvm_vfio_group_put_external_user(stit->group);
+
+	kfree(stit);
+}
+
+static void kvm_spapr_tce_liobn_release_iommu_group(
+		struct kvmppc_spapr_tce_table *stt,
+		struct vfio_group *group)
+{
+	struct kvmppc_spapr_tce_iommu_table *stit, *tmp;
+
+	list_for_each_entry_safe(stit, tmp, &stt->iommu_tables, next) {
+		if (group && (stit->group != group))
+			continue;
+
+		list_del_rcu(&stit->next);
+
+		call_rcu(&stit->rcu, kvm_spapr_tce_iommu_table_free);
+	}
+}
+
+extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
+		struct vfio_group *group)
+{
+	struct kvmppc_spapr_tce_table *stt;
+
+	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list)
+		kvm_spapr_tce_liobn_release_iommu_group(stt, group);
+}
+
+extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
+		struct vfio_group *group)
+{
+	struct kvmppc_spapr_tce_table *stt = NULL;
+	bool found = false;
+	struct iommu_table *tbl = NULL;
+	struct iommu_table_group *table_group;
+	long i, ret = 0;
+	struct kvmppc_spapr_tce_iommu_table *stit;
+	struct fd f;
+	int group_id;
+	struct iommu_group *grp;
+
+	group_id = kvm_vfio_external_user_iommu_id(group);
+	grp = iommu_group_get_by_id(group_id);
+	if (WARN_ON(!grp))
+		return -EIO;
+
+	f = fdget(tablefd);
+	if (!f.file) {
+		ret = -EBADF;
+		goto put_exit;
+	}
+
+	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
+		if (stt == f.file->private_data) {
+			found = true;
+			break;
+		}
+	}
+
+	fdput(f);
+
+	if (!found) {
+		ret = -EINVAL;
+		goto put_exit;
+	}
+
+	table_group = iommu_group_get_iommudata(grp);
+	if (WARN_ON(!table_group)) {
+		ret = -EFAULT;
+		goto put_exit;
+	}
+
+	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
+		struct iommu_table *tbltmp = table_group->tables[i];
+
+		if (!tbltmp)
+			continue;
+
+		/*
+		 * Make sure hardware table parameters are exactly the same;
+		 * this is used in the TCE handlers where boundary checks
+		 * use only the first attached table.
+		 */
+		if ((tbltmp->it_page_shift == stt->page_shift) &&
+				(tbltmp->it_offset == stt->offset) &&
+				(tbltmp->it_size == stt->size)) {
+			tbl = tbltmp;
+			break;
+		}
+	}
+	if (!tbl) {
+		ret = -EINVAL;
+		goto put_exit;
+	}
+
+	list_for_each_entry_rcu(stit, &stt->iommu_tables, next) {
+		if ((stit->tbl == tbl) && (stit->group == group)) {
+			ret = -EBUSY;
+			goto put_exit;
+		}
+	}
+
+	iommu_table_get(tbl);
+
+	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
+	stit->tbl = tbl;
+	stit->group = group;
+
+	list_add_rcu(&stit->next, &stt->iommu_tables);
+
+put_exit:
+	iommu_group_put(grp);
+
+	return ret;
+}
+
 static void release_spapr_tce_table(struct rcu_head *head)
 {
 	struct kvmppc_spapr_tce_table *stt = container_of(head,
@@ -133,6 +291,8 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
 
 	list_del_rcu(&stt->list);
 
+	kvm_spapr_tce_liobn_release_iommu_group(stt, NULL /* release all */);
+
 	kvm_put_kvm(stt->kvm);
 
 	kvmppc_account_memlimit(
@@ -183,6 +343,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 	stt->offset = args->offset;
 	stt->size = size;
 	stt->kvm = kvm;
+	INIT_LIST_HEAD_RCU(&stt->iommu_tables);
 
 	for (i = 0; i < npages; i++) {
 		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
@@ -211,11 +372,101 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 	return ret;
 }
 
+static void kvmppc_clear_tce(struct iommu_table *tbl, unsigned long entry)
+{
+	unsigned long hpa = 0;
+	enum dma_data_direction dir = DMA_NONE;
+
+	iommu_tce_xchg(tbl, entry, &hpa, &dir);
+}
+
+static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	struct mm_iommu_table_group_mem_t *mem = NULL;
+	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
+	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+
+	if (!pua)
+		/* it_userspace allocation might be delayed */
+		return H_TOO_HARD;
+
+	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
+	if (!mem)
+		return H_TOO_HARD;
+
+	mm_iommu_mapped_dec(mem);
+
+	*pua = 0;
+
+	return H_SUCCESS;
+}
+
+static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	enum dma_data_direction dir = DMA_NONE;
+	unsigned long hpa = 0;
+	long ret;
+
+	if (WARN_ON_ONCE(iommu_tce_xchg(tbl, entry, &hpa, &dir)))
+		return H_HARDWARE;
+
+	if (dir == DMA_NONE)
+		return H_SUCCESS;
+
+	ret = kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
+	if (ret != H_SUCCESS)
+		iommu_tce_xchg(tbl, entry, &hpa, &dir);
+
+	return ret;
+}
+
+long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
+		unsigned long entry, unsigned long ua,
+		enum dma_data_direction dir)
+{
+	long ret;
+	unsigned long hpa, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+	struct mm_iommu_table_group_mem_t *mem;
+
+	if (!pua)
+		/* it_userspace allocation might be delayed */
+		return H_TOO_HARD;
+
+	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
+	if (!mem)
+		/* This only handles v2 IOMMU type, v1 is handled via ioctl() */
+		return H_TOO_HARD;
+
+	if (WARN_ON_ONCE(mm_iommu_ua_to_hpa(mem, ua, &hpa)))
+		return H_HARDWARE;
+
+	if (mm_iommu_mapped_inc(mem))
+		return H_CLOSED;
+
+	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
+	if (WARN_ON_ONCE(ret)) {
+		mm_iommu_mapped_dec(mem);
+		return H_HARDWARE;
+	}
+
+	if (dir != DMA_NONE)
+		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
+
+	*pua = ua;
+
+	return 0;
+}
+
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba, unsigned long tce)
 {
 	struct kvmppc_spapr_tce_table *stt;
-	long ret;
+	long ret, idx;
+	struct kvmppc_spapr_tce_iommu_table *stit;
+	unsigned long entry, ua = 0;
+	enum dma_data_direction dir;
 
 	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
 	/* 	    liobn, ioba, tce); */
@@ -232,7 +483,35 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (ret != H_SUCCESS)
 		return ret;
 
-	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
+	dir = iommu_tce_direction(tce);
+	if ((dir != DMA_NONE) && kvmppc_gpa_to_ua(vcpu->kvm,
+			tce & ~(TCE_PCI_READ | TCE_PCI_WRITE), &ua, NULL))
+		return H_PARAMETER;
+
+	entry = ioba >> stt->page_shift;
+
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		if (dir == DMA_NONE) {
+			ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
+					stit->tbl, entry);
+		} else {
+			idx = srcu_read_lock(&vcpu->kvm->srcu);
+			ret = kvmppc_tce_iommu_map(vcpu->kvm, stit->tbl,
+					entry, ua, dir);
+			srcu_read_unlock(&vcpu->kvm->srcu, idx);
+		}
+
+		if (ret == H_SUCCESS)
+			continue;
+
+		if (ret == H_TOO_HARD)
+			return ret;
+
+		WARN_ON_ONCE(1);
+		kvmppc_clear_tce(stit->tbl, entry);
+	}
+
+	kvmppc_tce_put(stt, entry, tce);
 
 	return H_SUCCESS;
 }
@@ -247,6 +526,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	unsigned long entry, ua = 0;
 	u64 __user *tces;
 	u64 tce;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -285,6 +565,26 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 		if (ret != H_SUCCESS)
 			goto unlock_exit;
 
+		if (kvmppc_gpa_to_ua(vcpu->kvm,
+				tce & ~(TCE_PCI_READ | TCE_PCI_WRITE),
+				&ua, NULL))
+			return H_PARAMETER;
+
+		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+			ret = kvmppc_tce_iommu_map(vcpu->kvm,
+					stit->tbl, entry + i, ua,
+					iommu_tce_direction(tce));
+
+			if (ret == H_SUCCESS)
+				continue;
+
+			if (ret == H_TOO_HARD)
+				goto unlock_exit;
+
+			WARN_ON_ONCE(1);
+			kvmppc_clear_tce(stit->tbl, entry);
+		}
+
 		kvmppc_tce_put(stt, entry + i, tce);
 	}
 
@@ -301,6 +601,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -314,6 +615,24 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
 		return H_PARAMETER;
 
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		unsigned long entry = ioba >> stit->tbl->it_page_shift;
+
+		for (i = 0; i < npages; ++i) {
+			ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
+					stit->tbl, entry + i);
+
+			if (ret == H_SUCCESS)
+				continue;
+
+			if (ret == H_TOO_HARD)
+				return ret;
+
+			WARN_ON_ONCE(1);
+			kvmppc_clear_tce(stit->tbl, entry);
+		}
+	}
+
 	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
 		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
 
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 440d3ab5dc32..eda0a8f6fae8 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -40,6 +40,31 @@
 #include <asm/iommu.h>
 #include <asm/tce.h>
 
+#ifdef CONFIG_BUG
+
+#define WARN_ON_ONCE_RM(condition)	({			\
+	static bool __section(.data.unlikely) __warned;		\
+	int __ret_warn_once = !!(condition);			\
+								\
+	if (unlikely(__ret_warn_once && !__warned)) {		\
+		__warned = true;				\
+		pr_err("WARN_ON_ONCE_RM: (%s) at %s:%u\n",	\
+				__stringify(condition),		\
+				__func__, __LINE__);		\
+		dump_stack();					\
+	}							\
+	unlikely(__ret_warn_once);				\
+})
+
+#else
+
+#define WARN_ON_ONCE_RM(condition) ({				\
+	int __ret_warn_on = !!(condition);			\
+	unlikely(__ret_warn_on);				\
+})
+
+#endif
+
 #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
 
 /*
@@ -161,11 +186,117 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
 EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
 
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+static void kvmppc_rm_clear_tce(struct iommu_table *tbl, unsigned long entry)
+{
+	unsigned long hpa = 0;
+	enum dma_data_direction dir = DMA_NONE;
+
+	iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
+}
+
+static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm *kvm,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	struct mm_iommu_table_group_mem_t *mem = NULL;
+	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
+	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+
+	if (!pua)
+		/* it_userspace allocation might be delayed */
+		return H_TOO_HARD;
+
+	pua = (void *) vmalloc_to_phys(pua);
+	if (WARN_ON_ONCE_RM(!pua))
+		return H_HARDWARE;
+
+	mem = mm_iommu_lookup_rm(kvm->mm, *pua, pgsize);
+	if (!mem)
+		return H_TOO_HARD;
+
+	mm_iommu_mapped_dec(mem);
+
+	*pua = 0;
+
+	return H_SUCCESS;
+}
+
+static long kvmppc_rm_tce_iommu_unmap(struct kvm *kvm,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	enum dma_data_direction dir = DMA_NONE;
+	unsigned long hpa = 0;
+	long ret;
+
+	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
+		/*
+		 * real mode xchg can fail if struct page crosses
+		 * a page boundary
+		 */
+		return H_TOO_HARD;
+
+	if (dir == DMA_NONE)
+		return H_SUCCESS;
+
+	ret = kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
+	if (ret)
+		iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
+
+	return ret;
+}
+
+static long kvmppc_rm_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
+		unsigned long entry, unsigned long ua,
+		enum dma_data_direction dir)
+{
+	long ret;
+	unsigned long hpa = 0;
+	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+	struct mm_iommu_table_group_mem_t *mem;
+
+	if (!pua)
+		/* it_userspace allocation might be delayed */
+		return H_TOO_HARD;
+
+	mem = mm_iommu_lookup_rm(kvm->mm, ua, 1ULL << tbl->it_page_shift);
+	if (!mem)
+		return H_TOO_HARD;
+
+	if (WARN_ON_ONCE_RM(mm_iommu_ua_to_hpa_rm(mem, ua, &hpa)))
+		return H_HARDWARE;
+
+	pua = (void *) vmalloc_to_phys(pua);
+	if (WARN_ON_ONCE_RM(!pua))
+		return H_HARDWARE;
+
+	if (WARN_ON_ONCE_RM(mm_iommu_mapped_inc(mem)))
+		return H_CLOSED;
+
+	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
+	if (ret) {
+		mm_iommu_mapped_dec(mem);
+		/*
+		 * real mode xchg can fail if struct page crosses
+		 * a page boundary
+		 */
+		return H_TOO_HARD;
+	}
+
+	if (dir != DMA_NONE)
+		kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
+
+	*pua = ua;
+
+	return 0;
+}
+
 long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		unsigned long ioba, unsigned long tce)
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long ret;
+	struct kvmppc_spapr_tce_iommu_table *stit;
+	unsigned long entry, ua = 0;
+	enum dma_data_direction dir;
 
 	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
 	/* 	    liobn, ioba, tce); */
@@ -182,7 +313,32 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (ret != H_SUCCESS)
 		return ret;
 
-	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
+	dir = iommu_tce_direction(tce);
+	if ((dir != DMA_NONE) && kvmppc_gpa_to_ua(vcpu->kvm,
+			tce & ~(TCE_PCI_READ | TCE_PCI_WRITE), &ua, NULL))
+		return H_PARAMETER;
+
+	entry = ioba >> stt->page_shift;
+
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		if (dir == DMA_NONE)
+			ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
+					stit->tbl, entry);
+		else
+			ret = kvmppc_rm_tce_iommu_map(vcpu->kvm,
+					stit->tbl, entry, ua, dir);
+
+		if (ret == H_SUCCESS)
+			continue;
+
+		if (ret == H_TOO_HARD)
+			return ret;
+
+		WARN_ON_ONCE_RM(1);
+		kvmppc_rm_clear_tce(stit->tbl, entry);
+	}
+
+	kvmppc_tce_put(stt, entry, tce);
 
 	return H_SUCCESS;
 }
@@ -223,6 +379,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	unsigned long tces, entry, ua = 0;
 	unsigned long *rmap = NULL;
 	bool prereg = false;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -270,6 +427,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 			return H_TOO_HARD;
 
 		rmap = (void *) vmalloc_to_phys(rmap);
+		if (WARN_ON_ONCE_RM(!rmap))
+			return H_HARDWARE;
 
 		/*
 		 * Synchronize with the MMU notifier callbacks in
@@ -293,6 +452,27 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 		if (ret != H_SUCCESS)
 			goto unlock_exit;
 
+		ua = 0;
+		if (kvmppc_gpa_to_ua(vcpu->kvm,
+				tce & ~(TCE_PCI_READ | TCE_PCI_WRITE),
+				&ua, NULL))
+			return H_PARAMETER;
+
+		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+			ret = kvmppc_rm_tce_iommu_map(vcpu->kvm,
+					stit->tbl, entry + i, ua,
+					iommu_tce_direction(tce));
+
+			if (ret == H_SUCCESS)
+				continue;
+
+			if (ret == H_TOO_HARD)
+				goto unlock_exit;
+
+			WARN_ON_ONCE_RM(1);
+			kvmppc_rm_clear_tce(stit->tbl, entry);
+		}
+
 		kvmppc_tce_put(stt, entry + i, tce);
 	}
 
@@ -309,6 +489,7 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -322,6 +503,24 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
 		return H_PARAMETER;
 
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		unsigned long entry = ioba >> stit->tbl->it_page_shift;
+
+		for (i = 0; i < npages; ++i) {
+			ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
+					stit->tbl, entry + i);
+
+			if (ret == H_SUCCESS)
+				continue;
+
+			if (ret == H_TOO_HARD)
+				return ret;
+
+			WARN_ON_ONCE_RM(1);
+			kvmppc_rm_clear_tce(stit->tbl, entry);
+		}
+	}
+
 	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
 		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
 
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 95c91a9de351..62bdd6c48107 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -538,6 +538,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 #ifdef CONFIG_PPC_BOOK3S_64
 	case KVM_CAP_SPAPR_TCE:
 	case KVM_CAP_SPAPR_TCE_64:
+		/* fallthrough */
+	case KVM_CAP_SPAPR_TCE_VFIO:
 	case KVM_CAP_PPC_RTAS:
 	case KVM_CAP_PPC_FIXUP_HCALL:
 	case KVM_CAP_PPC_ENABLE_HCALL:
diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
index d32f239eb471..2b7dc22265fe 100644
--- a/virt/kvm/vfio.c
+++ b/virt/kvm/vfio.c
@@ -20,6 +20,10 @@
 #include <linux/vfio.h>
 #include "vfio.h"
 
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+#include <asm/kvm_ppc.h>
+#endif
+
 struct kvm_vfio_group {
 	struct list_head node;
 	struct vfio_group *vfio_group;
@@ -211,6 +215,9 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
 
 		mutex_unlock(&kv->lock);
 
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+		kvm_spapr_tce_release_iommu_group(dev->kvm, vfio_group);
+#endif
 		kvm_vfio_group_set_kvm(vfio_group, NULL);
 
 		kvm_vfio_group_put_external_user(vfio_group);
@@ -218,6 +225,53 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
 		kvm_vfio_update_coherency(dev);
 
 		return ret;
+
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
+		struct kvm_vfio_spapr_tce param;
+		unsigned long minsz;
+		struct kvm_vfio *kv = dev->private;
+		struct vfio_group *vfio_group;
+		struct kvm_vfio_group *kvg;
+		struct fd f;
+
+		minsz = offsetofend(struct kvm_vfio_spapr_tce, tablefd);
+
+		if (copy_from_user(&param, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (param.argsz < minsz || param.flags)
+			return -EINVAL;
+
+		f = fdget(param.groupfd);
+		if (!f.file)
+			return -EBADF;
+
+		vfio_group = kvm_vfio_group_get_external_user(f.file);
+		fdput(f);
+
+		if (IS_ERR(vfio_group))
+			return PTR_ERR(vfio_group);
+
+		ret = -ENOENT;
+
+		mutex_lock(&kv->lock);
+
+		list_for_each_entry(kvg, &kv->group_list, node) {
+			if (kvg->vfio_group != vfio_group)
+				continue;
+
+			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
+					param.tablefd, vfio_group);
+
+			break;
+		}
+
+		mutex_unlock(&kv->lock);
+
+		return ret;
+	}
+#endif /* CONFIG_SPAPR_TCE_IOMMU */
 	}
 
 	return -ENXIO;
@@ -242,6 +296,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
 		switch (attr->attr) {
 		case KVM_DEV_VFIO_GROUP_ADD:
 		case KVM_DEV_VFIO_GROUP_DEL:
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
+#endif
 			return 0;
 		}
 
@@ -257,6 +314,9 @@ static void kvm_vfio_destroy(struct kvm_device *dev)
 	struct kvm_vfio_group *kvg, *tmp;
 
 	list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+		kvm_spapr_tce_release_iommu_group(dev->kvm, kvg->vfio_group);
+#endif
 		kvm_vfio_group_set_kvm(kvg->vfio_group, NULL);
 		kvm_vfio_group_put_external_user(kvg->vfio_group);
 		list_del(&kvg->node);
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH kernel v8 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
  2017-03-10  3:53   ` Alexey Kardashevskiy
@ 2017-03-10  4:47     ` David Gibson
  -1 siblings, 0 replies; 53+ messages in thread
From: David Gibson @ 2017-03-10  4:47 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Alex Williamson, Paul Mackerras, kvm-ppc, kvm

[-- Attachment #1: Type: text/plain, Size: 31802 bytes --]

On Fri, Mar 10, 2017 at 02:53:37PM +1100, Alexey Kardashevskiy wrote:
> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
> without passing them to user space which saves time on switching
> to user space and back.
> 
> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> KVM tries to handle a TCE request in real mode first; if that fails,
> it passes the request to virtual mode to complete the operation.
> If the virtual mode handler fails as well, the request is passed to
> the user space; this is not expected to happen though.
> 
> To avoid dealing with page use counters (which is tricky in real mode),
> this only accelerates SPAPR TCE IOMMU v2 clients which are required
> to pre-register the userspace memory. The very first TCE request will
> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> of the TCE table (iommu_table::it_userspace) is not allocated till
> the very first mapping happens and we cannot call vmalloc in real mode.
> 
> If we fail to update a hardware IOMMU table for an unexpected reason, we just
> clear it and move on as there is nothing really we can do about it -
> for example, if we hot plug a VFIO device to a guest, existing TCE tables
> will be mirrored automatically to the hardware and there is no interface
> to report to the guest about possible failures.
> 
> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> and associates a physical IOMMU table with the SPAPR TCE table (which
> is a guest view of the hardware IOMMU table). The iommu_table object
> is cached and referenced so we do not have to look it up in real mode.
> 
> This does not implement the UNSET counterpart as there is no use for it -
> once the acceleration is enabled, the existing userspace won't
> disable it unless a VFIO container is destroyed; this adds necessary
> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> 
> As this creates a descriptor per IOMMU table-LIOBN couple (called
> kvmppc_spapr_tce_iommu_table), it is possible to have several
> descriptors with the same iommu_table (hardware IOMMU table) attached
> to the same LIOBN; we do not remove duplicates though as
> iommu_table_ops::exchange does not just update a TCE entry (which is
> shared among IOMMU groups) but also invalidates the TCE cache
> (one per IOMMU group).
> 
> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> space.
> 
> This adds real mode version of WARN_ON_ONCE() as the generic version
> causes problems with rcu_sched. Since we are testing what vmalloc_to_phys()
> returns in the code, this also adds a check for already existing
> vmalloc_to_phys() call in kvmppc_rm_h_put_tce_indirect().
> 
> This finally makes use of vfio_external_user_iommu_id() which was
> introduced quite some time ago and was considered for removal.
> 
> Tests show that this patch increases transmission speed from 220MB/s
> to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb Ethernet card).
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
> Changes:
> v8:
> * changed all (!pua) checks to return H_TOO_HARD as ioctl() is supposed
> to handle them
> * changed vmalloc_to_phys() callers to return H_HARDWARE
> * changed real mode iommu_tce_xchg_rm() callers to return H_TOO_HARD
> and added a comment about this in the code
> * changed virtual mode iommu_tce_xchg() callers to return H_HARDWARE
> and do WARN_ON
> * added WARN_ON_ONCE_RM(!rmap) in kvmppc_rm_h_put_tce_indirect() to
> have all vmalloc_to_phys() callsites covered
> 
> v7:
> * added realmode-friendly WARN_ON_ONCE_RM
> 
> v6:
> * changed handling of errors returned by kvmppc_(rm_)tce_iommu_(un)map()
> * moved kvmppc_gpa_to_ua() to TCE validation
> 
> v5:
> * changed error codes in multiple places
> * added a bunch of WARN_ON() in places which should not really happen
> * added a check that an iommu table is not already attached to a LIOBN
> * dropped explicit calls to iommu_tce_clear_param_check/
> iommu_tce_put_param_check as kvmppc_tce_validate/kvmppc_ioba_validate
> call them anyway (since the previous patch)
> * if we fail to update a hardware IOMMU table for unexpected reason,
> this just clears the entry
> 
> v4:
> * added note to the commit log about allowing multiple updates of
> the same IOMMU table;
> * instead of checking whether any memory was preregistered, this
> returns H_TOO_HARD if a specific page was not;
> * fixed comments from v3 about error handling in many places;
> * simplified TCE handlers and merged IOMMU parts inline - for example,
> there used to be kvmppc_h_put_tce_iommu(), now it is merged into
> kvmppc_h_put_tce(); this allows checking IOBA boundaries against
> the first attached table only (makes the code simpler);
> 
> v3:
> * simplified not to use VFIO group notifiers
> * reworked cleanup, should be cleaner/simpler now
> 
> v2:
> * reworked to use new VFIO notifiers
> * now same iommu_table may appear in the list several times, to be fixed later
> ---
>  Documentation/virtual/kvm/devices/vfio.txt |  22 +-
>  arch/powerpc/include/asm/kvm_host.h        |   8 +
>  arch/powerpc/include/asm/kvm_ppc.h         |   4 +
>  include/uapi/linux/kvm.h                   |   8 +
>  arch/powerpc/kvm/book3s_64_vio.c           | 323 ++++++++++++++++++++++++++++-
>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 201 +++++++++++++++++-
>  arch/powerpc/kvm/powerpc.c                 |   2 +
>  virt/kvm/vfio.c                            |  60 ++++++
>  8 files changed, 623 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> index ef51740c67ca..f95d867168ea 100644
> --- a/Documentation/virtual/kvm/devices/vfio.txt
> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> @@ -16,7 +16,25 @@ Groups:
>  
>  KVM_DEV_VFIO_GROUP attributes:
>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> +	kvm_device_attr.addr points to an int32_t file descriptor
> +	for the VFIO group.
>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
> +	kvm_device_attr.addr points to an int32_t file descriptor
> +	for the VFIO group.
> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
> +	allocated by sPAPR KVM.
> +	kvm_device_attr.addr points to a struct:
>  
> -For each, kvm_device_attr.addr points to an int32_t file descriptor
> -for the VFIO group.
> +	struct kvm_vfio_spapr_tce {
> +		__u32	argsz;
> +		__u32	flags;
> +		__s32	groupfd;
> +		__s32	tablefd;
> +	};
> +
> +	where
> +	@argsz is the size of struct kvm_vfio_spapr_tce;
> +	@flags are not supported now, must be zero;
> +	@groupfd is a file descriptor for a VFIO group;
> +	@tablefd is a file descriptor for a TCE table allocated via
> +		KVM_CREATE_SPAPR_TCE.
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index 7bba8f415627..857ae2c6aa39 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
>  	atomic_t refcnt;
>  };
>  
> +struct kvmppc_spapr_tce_iommu_table {
> +	struct rcu_head rcu;
> +	struct list_head next;
> +	struct vfio_group *group;
> +	struct iommu_table *tbl;
> +};
> +
>  struct kvmppc_spapr_tce_table {
>  	struct list_head list;
>  	struct kvm *kvm;
> @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
>  	u32 page_shift;
>  	u64 offset;		/* in pages */
>  	u64 size;		/* window size in pages */
> +	struct list_head iommu_tables;
>  	struct page *pages[0];
>  };
>  
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> index 72c2a155641f..66de7e73b3d3 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -164,6 +164,10 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
>  extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
>  			struct kvm_memory_slot *memslot, unsigned long porder);
>  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
> +		struct vfio_group *group);
> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
> +		struct vfio_group *group);
>  
>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  				struct kvm_create_spapr_tce_64 *args);
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index f5a52ffb6b58..e743cb0d176e 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1088,6 +1088,7 @@ struct kvm_device_attr {
>  #define  KVM_DEV_VFIO_GROUP			1
>  #define   KVM_DEV_VFIO_GROUP_ADD			1
>  #define   KVM_DEV_VFIO_GROUP_DEL			2
> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
>  
>  enum kvm_device_type {
>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
> @@ -1109,6 +1110,13 @@ enum kvm_device_type {
>  	KVM_DEV_TYPE_MAX,
>  };
>  
> +struct kvm_vfio_spapr_tce {
> +	__u32	argsz;
> +	__u32	flags;
> +	__s32	groupfd;
> +	__s32	tablefd;
> +};
> +
>  /*
>   * ioctls for VM fds
>   */
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> index e96a4590464c..be18cda01e1b 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -28,6 +28,10 @@
>  #include <linux/hugetlb.h>
>  #include <linux/list.h>
>  #include <linux/anon_inodes.h>
> +#include <linux/iommu.h>
> +#include <linux/file.h>
> +#include <linux/vfio.h>
> +#include <linux/module.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/kvm_ppc.h>
> @@ -40,6 +44,36 @@
>  #include <asm/udbg.h>
>  #include <asm/iommu.h>
>  #include <asm/tce.h>
> +#include <asm/mmu_context.h>
> +
> +static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
> +{
> +	void (*fn)(struct vfio_group *);
> +
> +	fn = symbol_get(vfio_group_put_external_user);
> +	if (WARN_ON(!fn))
> +		return;
> +
> +	fn(vfio_group);
> +
> +	symbol_put(vfio_group_put_external_user);
> +}
> +
> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> +{
> +	int (*fn)(struct vfio_group *);
> +	int ret = -1;
> +
> +	fn = symbol_get(vfio_external_user_iommu_id);
> +	if (!fn)
> +		return ret;
> +
> +	ret = fn(vfio_group);
> +
> +	symbol_put(vfio_external_user_iommu_id);
> +
> +	return ret;
> +}
>  
>  static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
>  {
> @@ -91,6 +125,130 @@ static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
>  	return ret;
>  }
>  
> +static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
> +{
> +	struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
> +			struct kvmppc_spapr_tce_iommu_table, rcu);
> +
> +	iommu_table_put(stit->tbl);
> +	kvm_vfio_group_put_external_user(stit->group);
> +
> +	kfree(stit);
> +}
> +
> +static void kvm_spapr_tce_liobn_release_iommu_group(
> +		struct kvmppc_spapr_tce_table *stt,
> +		struct vfio_group *group)
> +{
> +	struct kvmppc_spapr_tce_iommu_table *stit, *tmp;
> +
> +	list_for_each_entry_safe(stit, tmp, &stt->iommu_tables, next) {
> +		if (group && (stit->group != group))
> +			continue;
> +
> +		list_del_rcu(&stit->next);
> +
> +		call_rcu(&stit->rcu, kvm_spapr_tce_iommu_table_free);
> +	}
> +}
> +
> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
> +		struct vfio_group *group)
> +{
> +	struct kvmppc_spapr_tce_table *stt;
> +
> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list)
> +		kvm_spapr_tce_liobn_release_iommu_group(stt, group);
> +}
> +
> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
> +		struct vfio_group *group)
> +{
> +	struct kvmppc_spapr_tce_table *stt = NULL;
> +	bool found = false;
> +	struct iommu_table *tbl = NULL;
> +	struct iommu_table_group *table_group;
> +	long i, ret = 0;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
> +	struct fd f;
> +	int group_id;
> +	struct iommu_group *grp;
> +
> +	group_id = kvm_vfio_external_user_iommu_id(group);
> +	grp = iommu_group_get_by_id(group_id);
> +	if (WARN_ON(!grp))
> +		return -EIO;
> +
> +	f = fdget(tablefd);
> +	if (!f.file) {
> +		ret = -EBADF;
> +		goto put_exit;
> +	}
> +
> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
> +		if (stt == f.file->private_data) {
> +			found = true;
> +			break;
> +		}
> +	}
> +
> +	fdput(f);
> +
> +	if (!found) {
> +		ret = -EINVAL;
> +		goto put_exit;
> +	}
> +
> +	table_group = iommu_group_get_iommudata(grp);
> +	if (WARN_ON(!table_group)) {
> +		ret = -EFAULT;
> +		goto put_exit;
> +	}
> +
> +	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
> +		struct iommu_table *tbltmp = table_group->tables[i];
> +
> +		if (!tbltmp)
> +			continue;
> +
> +		/*
> +		 * Make sure hardware table parameters are exactly the same;
> +		 * this is used in the TCE handlers where boundary checks
> +		 * use only the first attached table.
> +		 */
> +		if ((tbltmp->it_page_shift == stt->page_shift) &&
> +				(tbltmp->it_offset == stt->offset) &&
> +				(tbltmp->it_size == stt->size)) {
> +			tbl = tbltmp;
> +			break;
> +		}
> +	}
> +	if (!tbl) {
> +		ret = -EINVAL;
> +		goto put_exit;
> +	}
> +
> +	list_for_each_entry_rcu(stit, &stt->iommu_tables, next) {
> +		if ((stit->tbl == tbl) && (stit->group == group)) {
> +			ret = -EBUSY;
> +			goto put_exit;
> +		}
> +	}
> +
> +	iommu_table_get(tbl);
> +
> +	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
> +	stit->tbl = tbl;
> +	stit->group = group;
> +
> +	list_add_rcu(&stit->next, &stt->iommu_tables);
> +
> +put_exit:
> +	iommu_group_put(grp);
> +
> +	return ret;
> +}
> +
>  static void release_spapr_tce_table(struct rcu_head *head)
>  {
>  	struct kvmppc_spapr_tce_table *stt = container_of(head,
> @@ -133,6 +291,8 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
>  
>  	list_del_rcu(&stt->list);
>  
> +	kvm_spapr_tce_liobn_release_iommu_group(stt, NULL /* release all */);
> +
>  	kvm_put_kvm(stt->kvm);
>  
>  	kvmppc_account_memlimit(
> @@ -183,6 +343,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  	stt->offset = args->offset;
>  	stt->size = size;
>  	stt->kvm = kvm;
> +	INIT_LIST_HEAD_RCU(&stt->iommu_tables);
>  
>  	for (i = 0; i < npages; i++) {
>  		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
> @@ -211,11 +372,101 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  	return ret;
>  }
>  
> +static void kvmppc_clear_tce(struct iommu_table *tbl, unsigned long entry)
> +{
> +	unsigned long hpa = 0;
> +	enum dma_data_direction dir = DMA_NONE;
> +
> +	iommu_tce_xchg(tbl, entry, &hpa, &dir);
> +}
> +
> +static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	struct mm_iommu_table_group_mem_t *mem = NULL;
> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +
> +	if (!pua)
> +		/* it_userspace allocation might be delayed */
> +		return H_TOO_HARD;
> +
> +	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
> +	if (!mem)
> +		return H_TOO_HARD;
> +
> +	mm_iommu_mapped_dec(mem);
> +
> +	*pua = 0;
> +
> +	return H_SUCCESS;
> +}
> +
> +static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	enum dma_data_direction dir = DMA_NONE;
> +	unsigned long hpa = 0;
> +	long ret;
> +
> +	if (WARN_ON_ONCE(iommu_tce_xchg(tbl, entry, &hpa, &dir)))
> +		return H_HARDWARE;
> +
> +	if (dir == DMA_NONE)
> +		return H_SUCCESS;
> +
> +	ret = kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
> +	if (ret != H_SUCCESS)
> +		iommu_tce_xchg(tbl, entry, &hpa, &dir);
> +
> +	return ret;
> +}
> +
> +long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
> +		unsigned long entry, unsigned long ua,
> +		enum dma_data_direction dir)
> +{
> +	long ret;
> +	unsigned long hpa, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +	struct mm_iommu_table_group_mem_t *mem;
> +
> +	if (!pua)
> +		/* it_userspace allocation might be delayed */
> +		return H_TOO_HARD;
> +
> +	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
> +	if (!mem)
> +		/* This only handles v2 IOMMU type, v1 is handled via ioctl() */
> +		return H_TOO_HARD;
> +
> +	if (WARN_ON_ONCE(mm_iommu_ua_to_hpa(mem, ua, &hpa)))
> +		return H_HARDWARE;
> +
> +	if (mm_iommu_mapped_inc(mem))
> +		return H_CLOSED;
> +
> +	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
> +	if (WARN_ON_ONCE(ret)) {
> +		mm_iommu_mapped_dec(mem);
> +		return H_HARDWARE;
> +	}
> +
> +	if (dir != DMA_NONE)
> +		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
> +
> +	*pua = ua;
> +
> +	return 0;
> +}
> +
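The preregistered-memory lookup these helpers rely on lives in
arch/powerpc/mm/mmu_context_iommu.c (an earlier patch in the series adds the
real mode variant). A simplified sketch of its semantics, not the exact code:

	struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
			unsigned long ua, unsigned long size)
	{
		struct mm_iommu_table_group_mem_t *mem;

		/* Find a preregistered chunk covering [ua, ua + size) */
		list_for_each_entry_rcu(mem,
				&mm->context.iommu_group_mem_list, next)
			if ((mem->ua <= ua) &&
			    (ua + size <= mem->ua +
					(mem->entries << PAGE_SHIFT)))
				return mem;

		return NULL;
	}

A NULL result means the page was never preregistered by the v2 VFIO driver,
so the handlers punt to the user space path with H_TOO_HARD.
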
>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		      unsigned long ioba, unsigned long tce)
>  {
>  	struct kvmppc_spapr_tce_table *stt;
> -	long ret;
> +	long ret, idx;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
> +	unsigned long entry, ua = 0;
> +	enum dma_data_direction dir;
>  
>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>  	/* 	    liobn, ioba, tce); */
> @@ -232,7 +483,35 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  	if (ret != H_SUCCESS)
>  		return ret;
>  
> -	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
> +	dir = iommu_tce_direction(tce);
> +	if ((dir != DMA_NONE) && kvmppc_gpa_to_ua(vcpu->kvm,
> +			tce & ~(TCE_PCI_READ | TCE_PCI_WRITE), &ua, NULL))
> +		return H_PARAMETER;
> +
> +	entry = ioba >> stt->page_shift;
> +
> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +		if (dir == DMA_NONE) {
> +			ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
> +					stit->tbl, entry);
> +		} else {
> +			idx = srcu_read_lock(&vcpu->kvm->srcu);
> +			ret = kvmppc_tce_iommu_map(vcpu->kvm, stit->tbl,
> +					entry, ua, dir);
> +			srcu_read_unlock(&vcpu->kvm->srcu, idx);
> +		}
> +
> +		if (ret == H_SUCCESS)
> +			continue;
> +
> +		if (ret == H_TOO_HARD)
> +			return ret;
> +
> +		WARN_ON_ONCE(1);
> +		kvmppc_clear_tce(stit->tbl, entry);
> +	}
> +
> +	kvmppc_tce_put(stt, entry, tce);
>  
>  	return H_SUCCESS;
>  }
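For context, the direction derivation above maps the two TCE permission bits
onto DMA directions; iommu_tce_direction() from arch/powerpc/kernel/iommu.c
is essentially:

	enum dma_data_direction iommu_tce_direction(unsigned long tce)
	{
		if ((tce & TCE_PCI_READ) && (tce & TCE_PCI_WRITE))
			return DMA_BIDIRECTIONAL;
		else if (tce & TCE_PCI_READ)
			return DMA_TO_DEVICE;
		else if (tce & TCE_PCI_WRITE)
			return DMA_FROM_DEVICE;
		else
			return DMA_NONE;
	}

A TCE with neither permission bit set therefore turns H_PUT_TCE into an unmap.
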
> @@ -247,6 +526,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	unsigned long entry, ua = 0;
>  	u64 __user *tces;
>  	u64 tce;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -285,6 +565,26 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  		if (ret != H_SUCCESS)
>  			goto unlock_exit;
>  
> +		if (kvmppc_gpa_to_ua(vcpu->kvm,
> +				tce & ~(TCE_PCI_READ | TCE_PCI_WRITE),
> +				&ua, NULL)) {
> +			/* Cannot return directly, srcu is still held */
> +			ret = H_PARAMETER;
> +			goto unlock_exit;
> +		}
> +
> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +			ret = kvmppc_tce_iommu_map(vcpu->kvm,
> +					stit->tbl, entry + i, ua,
> +					iommu_tce_direction(tce));
> +
> +			if (ret == H_SUCCESS)
> +				continue;
> +
> +			if (ret == H_TOO_HARD)
> +				goto unlock_exit;
> +
> +			WARN_ON_ONCE(1);
> +			kvmppc_clear_tce(stit->tbl, entry + i);
> +		}
> +
>  		kvmppc_tce_put(stt, entry + i, tce);
>  	}
>  
> @@ -301,6 +601,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -314,6 +615,24 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>  		return H_PARAMETER;
>  
> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +		unsigned long entry = ioba >> stit->tbl->it_page_shift;
> +
> +		for (i = 0; i < npages; ++i) {
> +			ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
> +					stit->tbl, entry + i);
> +
> +			if (ret == H_SUCCESS)
> +				continue;
> +
> +			if (ret == H_TOO_HARD)
> +				return ret;
> +
> +			WARN_ON_ONCE(1);
> +			kvmppc_clear_tce(stit->tbl, entry + i);
> +		}
> +	}
> +
>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>  
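A quick worked illustration of the ioba arithmetic shared by all of these
handlers (numbers are made up): with 64K IOMMU pages, i.e. page_shift == 16,
ioba 0x30000 selects entry 3, and stuffing npages == 2 touches entries 3 and 4:

	#include <stdio.h>

	int main(void)
	{
		unsigned long page_shift = 16;	/* 64K IOMMU pages */
		unsigned long ioba = 0x30000, npages = 2;
		unsigned long i;

		for (i = 0; i < npages; ++i, ioba += 1UL << page_shift)
			printf("ioba 0x%lx -> entry %lu\n",
					ioba, ioba >> page_shift);

		return 0;	/* prints entries 3 and 4 */
	}
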
> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> index 440d3ab5dc32..eda0a8f6fae8 100644
> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -40,6 +40,31 @@
>  #include <asm/iommu.h>
>  #include <asm/tce.h>
>  
> +#ifdef CONFIG_BUG
> +
> +#define WARN_ON_ONCE_RM(condition)	({			\
> +	static bool __section(.data.unlikely) __warned;		\
> +	int __ret_warn_once = !!(condition);			\
> +								\
> +	if (unlikely(__ret_warn_once && !__warned)) {		\
> +		__warned = true;				\
> +		pr_err("WARN_ON_ONCE_RM: (%s) at %s:%u\n",	\
> +				__stringify(condition),		\
> +				__func__, __LINE__);		\
> +		dump_stack();					\
> +	}							\
> +	unlikely(__ret_warn_once);				\
> +})
> +
> +#else
> +
> +#define WARN_ON_ONCE_RM(condition) ({				\
> +	int __ret_warn_on = !!(condition);			\
> +	unlikely(__ret_warn_on);				\
> +})
> +
> +#endif
> +
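As the commit log notes, the stock WARN_ON_ONCE() is not safe to fire with
the MMU off as it upsets rcu_sched, hence the open-coded variant above. Its
intended use in the real mode helpers below follows this pattern (taken from
later in this patch):

	/* after converting a vmalloc address for real mode use */
	rmap = (void *) vmalloc_to_phys(rmap);
	if (WARN_ON_ONCE_RM(!rmap))
		return H_HARDWARE;
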
>  #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
>  
>  /*
> @@ -161,11 +186,117 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
>  EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
>  
>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> +static void kvmppc_rm_clear_tce(struct iommu_table *tbl, unsigned long entry)
> +{
> +	unsigned long hpa = 0;
> +	enum dma_data_direction dir = DMA_NONE;
> +
> +	iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> +}
> +
> +static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm *kvm,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	struct mm_iommu_table_group_mem_t *mem = NULL;
> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +
> +	if (!pua)
> +		/* it_userspace allocation might be delayed */
> +		return H_TOO_HARD;
> +
> +	pua = (void *) vmalloc_to_phys(pua);
> +	if (WARN_ON_ONCE_RM(!pua))
> +		return H_HARDWARE;
> +
> +	mem = mm_iommu_lookup_rm(kvm->mm, *pua, pgsize);
> +	if (!mem)
> +		return H_TOO_HARD;
> +
> +	mm_iommu_mapped_dec(mem);
> +
> +	*pua = 0;
> +
> +	return H_SUCCESS;
> +}
> +
> +static long kvmppc_rm_tce_iommu_unmap(struct kvm *kvm,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	enum dma_data_direction dir = DMA_NONE;
> +	unsigned long hpa = 0;
> +	long ret;
> +
> +	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
> +		/*
> +		 * real mode xchg can fail if struct page crosses
> +		 * a page boundary
> +		 */
> +		return H_TOO_HARD;
> +
> +	if (dir == DMA_NONE)
> +		return H_SUCCESS;
> +
> +	ret = kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
> +	if (ret)
> +		iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> +
> +	return ret;
> +}
> +
> +static long kvmppc_rm_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
> +		unsigned long entry, unsigned long ua,
> +		enum dma_data_direction dir)
> +{
> +	long ret;
> +	unsigned long hpa = 0;
> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +	struct mm_iommu_table_group_mem_t *mem;
> +
> +	if (!pua)
> +		/* it_userspace allocation might be delayed */
> +		return H_TOO_HARD;
> +
> +	mem = mm_iommu_lookup_rm(kvm->mm, ua, 1ULL << tbl->it_page_shift);
> +	if (!mem)
> +		return H_TOO_HARD;
> +
> +	if (WARN_ON_ONCE_RM(mm_iommu_ua_to_hpa_rm(mem, ua, &hpa)))
> +		return H_HARDWARE;
> +
> +	pua = (void *) vmalloc_to_phys(pua);
> +	if (WARN_ON_ONCE_RM(!pua))
> +		return H_HARDWARE;
> +
> +	if (WARN_ON_ONCE_RM(mm_iommu_mapped_inc(mem)))
> +		return H_CLOSED;
> +
> +	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> +	if (ret) {
> +		mm_iommu_mapped_dec(mem);
> +		/*
> +		 * real mode xchg can fail if struct page crosses
> +		 * a page boundary
> +		 */
> +		return H_TOO_HARD;
> +	}
> +
> +	if (dir != DMA_NONE)
> +		kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
> +
> +	*pua = ua;
> +
> +	return 0;
> +}
> +
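The vmalloc_to_phys() conversions above are needed because these handlers run
with translation off: it_userspace is vmalloc'ed, so its virtual address
cannot be dereferenced in real mode. Roughly, the helper behaves like this
(an approximation, not the exact kernel code):

	phys_addr_t vmalloc_to_phys(void *va)
	{
		unsigned long pfn = vmalloc_to_pfn(va);	/* backing page */

		if (!pfn)
			return 0;

		return ((phys_addr_t)pfn << PAGE_SHIFT) + offset_in_page(va);
	}
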
>  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		unsigned long ioba, unsigned long tce)
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long ret;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
> +	unsigned long entry, ua = 0;
> +	enum dma_data_direction dir;
>  
>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>  	/* 	    liobn, ioba, tce); */
> @@ -182,7 +313,32 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  	if (ret != H_SUCCESS)
>  		return ret;
>  
> -	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
> +	dir = iommu_tce_direction(tce);
> +	if ((dir != DMA_NONE) && kvmppc_gpa_to_ua(vcpu->kvm,
> +			tce & ~(TCE_PCI_READ | TCE_PCI_WRITE), &ua, NULL))
> +		return H_PARAMETER;
> +
> +	entry = ioba >> stt->page_shift;
> +
> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +		if (dir == DMA_NONE)
> +			ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
> +					stit->tbl, entry);
> +		else
> +			ret = kvmppc_rm_tce_iommu_map(vcpu->kvm,
> +					stit->tbl, entry, ua, dir);
> +
> +		if (ret == H_SUCCESS)
> +			continue;
> +
> +		if (ret == H_TOO_HARD)
> +			return ret;
> +
> +		WARN_ON_ONCE_RM(1);
> +		kvmppc_rm_clear_tce(stit->tbl, entry);
> +	}
> +
> +	kvmppc_tce_put(stt, entry, tce);
>  
>  	return H_SUCCESS;
>  }
> @@ -223,6 +379,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	unsigned long tces, entry, ua = 0;
>  	unsigned long *rmap = NULL;
>  	bool prereg = false;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -270,6 +427,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  			return H_TOO_HARD;
>  
>  		rmap = (void *) vmalloc_to_phys(rmap);
> +		if (WARN_ON_ONCE_RM(!rmap))
> +			return H_HARDWARE;
>  
>  		/*
>  		 * Synchronize with the MMU notifier callbacks in
> @@ -293,6 +452,27 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  		if (ret != H_SUCCESS)
>  			goto unlock_exit;
>  
> +		ua = 0;
> +		if (kvmppc_gpa_to_ua(vcpu->kvm,
> +				tce & ~(TCE_PCI_READ | TCE_PCI_WRITE),
> +				&ua, NULL)) {
> +			/* Cannot return directly, the rmap lock is held */
> +			ret = H_PARAMETER;
> +			goto unlock_exit;
> +		}
> +
> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +			ret = kvmppc_rm_tce_iommu_map(vcpu->kvm,
> +					stit->tbl, entry + i, ua,
> +					iommu_tce_direction(tce));
> +
> +			if (ret == H_SUCCESS)
> +				continue;
> +
> +			if (ret == H_TOO_HARD)
> +				goto unlock_exit;
> +
> +			WARN_ON_ONCE_RM(1);
> +			kvmppc_rm_clear_tce(stit->tbl, entry + i);
> +		}
> +
>  		kvmppc_tce_put(stt, entry + i, tce);
>  	}
>  
> @@ -309,6 +489,7 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -322,6 +503,24 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>  		return H_PARAMETER;
>  
> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +		unsigned long entry = ioba >> stit->tbl->it_page_shift;
> +
> +		for (i = 0; i < npages; ++i) {
> +			ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
> +					stit->tbl, entry + i);
> +
> +			if (ret == H_SUCCESS)
> +				continue;
> +
> +			if (ret == H_TOO_HARD)
> +				return ret;
> +
> +			WARN_ON_ONCE_RM(1);
> +			kvmppc_rm_clear_tce(stit->tbl, entry + i);
> +		}
> +	}
> +
>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>  
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index 95c91a9de351..62bdd6c48107 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -538,6 +538,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  #ifdef CONFIG_PPC_BOOK3S_64
>  	case KVM_CAP_SPAPR_TCE:
>  	case KVM_CAP_SPAPR_TCE_64:
> +		/* fallthrough */
> +	case KVM_CAP_SPAPR_TCE_VFIO:
>  	case KVM_CAP_PPC_RTAS:
>  	case KVM_CAP_PPC_FIXUP_HCALL:
>  	case KVM_CAP_PPC_ENABLE_HCALL:
> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> index d32f239eb471..2b7dc22265fe 100644
> --- a/virt/kvm/vfio.c
> +++ b/virt/kvm/vfio.c
> @@ -20,6 +20,10 @@
>  #include <linux/vfio.h>
>  #include "vfio.h"
>  
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +#include <asm/kvm_ppc.h>
> +#endif
> +
>  struct kvm_vfio_group {
>  	struct list_head node;
>  	struct vfio_group *vfio_group;
> @@ -211,6 +215,9 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>  
>  		mutex_unlock(&kv->lock);
>  
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +		kvm_spapr_tce_release_iommu_group(dev->kvm, vfio_group);
> +#endif
>  		kvm_vfio_group_set_kvm(vfio_group, NULL);
>  
>  		kvm_vfio_group_put_external_user(vfio_group);
> @@ -218,6 +225,53 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>  		kvm_vfio_update_coherency(dev);
>  
>  		return ret;
> +
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
> +		struct kvm_vfio_spapr_tce param;
> +		unsigned long minsz;
> +		struct kvm_vfio *kv = dev->private;
> +		struct vfio_group *vfio_group;
> +		struct kvm_vfio_group *kvg;
> +		struct fd f;
> +
> +		minsz = offsetofend(struct kvm_vfio_spapr_tce, tablefd);
> +
> +		if (copy_from_user(&param, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (param.argsz < minsz || param.flags)
> +			return -EINVAL;
> +
> +		f = fdget(param.groupfd);
> +		if (!f.file)
> +			return -EBADF;
> +
> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
> +		fdput(f);
> +
> +		if (IS_ERR(vfio_group))
> +			return PTR_ERR(vfio_group);
> +
> +		ret = -ENOENT;
> +
> +		mutex_lock(&kv->lock);
> +
> +		list_for_each_entry(kvg, &kv->group_list, node) {
> +			if (kvg->vfio_group != vfio_group)
> +				continue;
> +
> +			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
> +					param.tablefd, vfio_group);
> +
> +			break;
> +		}
> +
> +		mutex_unlock(&kv->lock);
> +
> +		return ret;
> +	}
> +#endif /* CONFIG_SPAPR_TCE_IOMMU */
>  	}
>  
>  	return -ENXIO;
> @@ -242,6 +296,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
>  		switch (attr->attr) {
>  		case KVM_DEV_VFIO_GROUP_ADD:
>  		case KVM_DEV_VFIO_GROUP_DEL:
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
> +#endif
>  			return 0;
>  		}
>  
> @@ -257,6 +314,9 @@ static void kvm_vfio_destroy(struct kvm_device *dev)
>  	struct kvm_vfio_group *kvg, *tmp;
>  
>  	list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +		kvm_spapr_tce_release_iommu_group(dev->kvm, kvg->vfio_group);
> +#endif
>  		kvm_vfio_group_set_kvm(kvg->vfio_group, NULL);
>  		kvm_vfio_group_put_external_user(kvg->vfio_group);
>  		list_del(&kvg->node);
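To make the new attribute concrete, here is a minimal hypothetical user space
sequence with error handling elided; vm_fd, group_fd and table_fd are assumed
to exist already, table_fd coming from KVM_CREATE_SPAPR_TCE. Note the group
must have been added with KVM_DEV_VFIO_GROUP_ADD first as the handler above
only searches kv->group_list:

	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	static int link_group_to_table(int vm_fd, int32_t group_fd,
			int32_t table_fd)
	{
		struct kvm_create_device cd = { .type = KVM_DEV_TYPE_VFIO };
		struct kvm_vfio_spapr_tce param = {
			.argsz = sizeof(param),
			.groupfd = group_fd,
			.tablefd = table_fd,
		};
		struct kvm_device_attr add = {
			.group = KVM_DEV_VFIO_GROUP,
			.attr = KVM_DEV_VFIO_GROUP_ADD,
			.addr = (uintptr_t)&group_fd,
		};
		struct kvm_device_attr set = {
			.group = KVM_DEV_VFIO_GROUP,
			.attr = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE,
			.addr = (uintptr_t)&param,
		};

		if (ioctl(vm_fd, KVM_CREATE_DEVICE, &cd))
			return -1;			/* VFIO KVM device */
		if (ioctl(cd.fd, KVM_SET_DEVICE_ATTR, &add))
			return -1;			/* track the group */
		return ioctl(cd.fd, KVM_SET_DEVICE_ATTR, &set);
	}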

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH kernel v8 00/10] powerpc/kvm/vfio: Enable in-kernel acceleration
  2017-03-10  3:53 ` Alexey Kardashevskiy
@ 2017-03-10  4:48   ` David Gibson
  -1 siblings, 0 replies; 53+ messages in thread
From: David Gibson @ 2017-03-10  4:48 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Alex Williamson, Paul Mackerras, kvm-ppc, kvm

[-- Attachment #1: Type: text/plain, Size: 3471 bytes --]

On Fri, Mar 10, 2017 at 02:53:27PM +1100, Alexey Kardashevskiy wrote:
> This is my current queue of patches to add acceleration of TCE
> updates in KVM.
> 
> This is based on Linus'es tree sha1 c1aa905a304e.

I think we're finally there - I've now sent an R-b for all patches.


> 
> Please comment. Thanks.
> 
> Changes:
> v8:
> * kept fixing oddities with error handling in 10/10
> 
> v7:
> * added realmode's WARN_ON_ONCE_RM in arch/powerpc/kvm/book3s_64_vio_hv.c
> 
> v6:
> * reworked the last patch in terms of error handling and parameters checking
> 
> v5:
> * replaced "KVM: PPC: Separate TCE validation from update" with
> "KVM: PPC: iommu: Unify TCE checking"
> * changed already reviewed "powerpc/iommu/vfio_spapr_tce: Cleanup iommu_table disposal"
> * reworked "KVM: PPC: VFIO: Add in-kernel acceleration for VFIO"
> * more details in individual commit logs
> 
> v4:
> * addressed comments from v3
> * updated subject lines with correct component names
> * regrouped the patchset in order:
> 	- powerpc fixes;
> 	- vfio_spapr_tce driver fixes;
> 	- KVM/PPC fixes;
> 	- KVM+PPC+VFIO;
> * everything except last 2 patches has "Reviewed-By: David"
> 
> v3:
> * there was no full repost, only last patch was posted
> 
> v2:
> * 11/11 reworked to use new notifiers, it is rather RFC as it still has
> an issue;
> * got 09/11, 10/11 to use notifiers in 11/11;
> * added rb: David to most of patches and added a comment in 05/11.
> 
> Alexey Kardashevskiy (10):
>   powerpc/mmu: Add real mode support for IOMMU preregistered memory
>   powerpc/powernv/iommu: Add real mode version of
>     iommu_table_ops::exchange()
>   powerpc/iommu/vfio_spapr_tce: Cleanup iommu_table disposal
>   powerpc/vfio_spapr_tce: Add reference counting to iommu_table
>   KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number
>   KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently
>   KVM: PPC: Pass kvm* to kvmppc_find_table()
>   KVM: PPC: Use preregistered memory API to access TCE list
>   KVM: PPC: iommu: Unify TCE checking
>   KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
> 
>  Documentation/virtual/kvm/devices/vfio.txt |  22 +-
>  arch/powerpc/include/asm/iommu.h           |  32 ++-
>  arch/powerpc/include/asm/kvm_host.h        |   8 +
>  arch/powerpc/include/asm/kvm_ppc.h         |  12 +-
>  arch/powerpc/include/asm/mmu_context.h     |   4 +
>  include/uapi/linux/kvm.h                   |   9 +
>  arch/powerpc/kernel/iommu.c                |  86 +++++---
>  arch/powerpc/kvm/book3s_64_vio.c           | 330 ++++++++++++++++++++++++++++-
>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 303 ++++++++++++++++++++++----
>  arch/powerpc/kvm/powerpc.c                 |   2 +
>  arch/powerpc/mm/mmu_context_iommu.c        |  39 ++++
>  arch/powerpc/platforms/powernv/pci-ioda.c  |  46 ++--
>  arch/powerpc/platforms/powernv/pci.c       |   1 +
>  arch/powerpc/platforms/pseries/iommu.c     |   3 +-
>  arch/powerpc/platforms/pseries/vio.c       |   2 +-
>  drivers/vfio/vfio_iommu_spapr_tce.c        |   2 +-
>  virt/kvm/vfio.c                            |  60 ++++++
>  arch/powerpc/kvm/Kconfig                   |   1 +
>  18 files changed, 855 insertions(+), 107 deletions(-)
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH kernel v8 00/10] powerpc/kvm/vfio: Enable in-kernel acceleration
  2017-03-14  0:54     ` Alexey Kardashevskiy
@ 2017-03-14  0:55       ` David Gibson
  -1 siblings, 0 replies; 53+ messages in thread
From: David Gibson @ 2017-03-14  0:55 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alex Williamson, linuxppc-dev, Paul Mackerras, kvm-ppc, kvm

[-- Attachment #1: Type: text/plain, Size: 4002 bytes --]

On Tue, Mar 14, 2017 at 11:54:03AM +1100, Alexey Kardashevskiy wrote:
> On 10/03/17 15:48, David Gibson wrote:
> > On Fri, Mar 10, 2017 at 02:53:27PM +1100, Alexey Kardashevskiy wrote:
> >> This is my current queue of patches to add acceleration of TCE
> >> updates in KVM.
> >>
> >> This is based on Linus's tree sha1 c1aa905a304e.
> > 
> > I think we're finally there - I've now sent an R-b for all patches.
> 
> Thanks for the patience.
> 
> 
> I suppose that in order to proceed I now need an ack from Alex, correct?

That, or simply for him to merge it.

> 
> 
> > 
> > 
> >>
> >> Please comment. Thanks.
> >>
> >> Changes:
> >> v8:
> >> * kept fixing oddities with error handling in 10/10
> >>
> >> v7:
> >> * added realmode's WARN_ON_ONCE_RM in arch/powerpc/kvm/book3s_64_vio_hv.c
> >>
> >> v6:
> >> * reworked the last patch in terms of error handling and parameters checking
> >>
> >> v5:
> >> * replaced "KVM: PPC: Separate TCE validation from update" with
> >> "KVM: PPC: iommu: Unify TCE checking"
> >> * changed already reviewed "powerpc/iommu/vfio_spapr_tce: Cleanup iommu_table disposal"
> >> * reworked "KVM: PPC: VFIO: Add in-kernel acceleration for VFIO"
> >> * more details in individual commit logs
> >>
> >> v4:
> >> * addressed comments from v3
> >> * updated subject lines with correct component names
> >> * regrouped the patchset in order:
> >> 	- powerpc fixes;
> >> 	- vfio_spapr_tce driver fixes;
> >> 	- KVM/PPC fixes;
> >> 	- KVM+PPC+VFIO;
> >> * everything except the last 2 patches has "Reviewed-By: David"
> >>
> >> v3:
> >> * there was no full repost, only last patch was posted
> >>
> >> v2:
> >> * 11/11 reworked to use new notifiers, it is rather RFC as it still has
> >> an issue;
> >> * got 09/11, 10/11 to use notifiers in 11/11;
> >> * added rb: David to most of the patches and added a comment in 05/11.
> >>
> >> Alexey Kardashevskiy (10):
> >>   powerpc/mmu: Add real mode support for IOMMU preregistered memory
> >>   powerpc/powernv/iommu: Add real mode version of
> >>     iommu_table_ops::exchange()
> >>   powerpc/iommu/vfio_spapr_tce: Cleanup iommu_table disposal
> >>   powerpc/vfio_spapr_tce: Add reference counting to iommu_table
> >>   KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number
> >>   KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently
> >>   KVM: PPC: Pass kvm* to kvmppc_find_table()
> >>   KVM: PPC: Use preregistered memory API to access TCE list
> >>   KVM: PPC: iommu: Unify TCE checking
> >>   KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
> >>
> >>  Documentation/virtual/kvm/devices/vfio.txt |  22 +-
> >>  arch/powerpc/include/asm/iommu.h           |  32 ++-
> >>  arch/powerpc/include/asm/kvm_host.h        |   8 +
> >>  arch/powerpc/include/asm/kvm_ppc.h         |  12 +-
> >>  arch/powerpc/include/asm/mmu_context.h     |   4 +
> >>  include/uapi/linux/kvm.h                   |   9 +
> >>  arch/powerpc/kernel/iommu.c                |  86 +++++---
> >>  arch/powerpc/kvm/book3s_64_vio.c           | 330 ++++++++++++++++++++++++++++-
> >>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 303 ++++++++++++++++++++++----
> >>  arch/powerpc/kvm/powerpc.c                 |   2 +
> >>  arch/powerpc/mm/mmu_context_iommu.c        |  39 ++++
> >>  arch/powerpc/platforms/powernv/pci-ioda.c  |  46 ++--
> >>  arch/powerpc/platforms/powernv/pci.c       |   1 +
> >>  arch/powerpc/platforms/pseries/iommu.c     |   3 +-
> >>  arch/powerpc/platforms/pseries/vio.c       |   2 +-
> >>  drivers/vfio/vfio_iommu_spapr_tce.c        |   2 +-
> >>  virt/kvm/vfio.c                            |  60 ++++++
> >>  arch/powerpc/kvm/Kconfig                   |   1 +
> >>  18 files changed, 855 insertions(+), 107 deletions(-)
> >>
> > 
> 
> 




-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH kernel v8 00/10] powerpc/kvm/vfio: Enable in-kernel acceleration
  2017-03-14  0:55       ` David Gibson
@ 2017-03-14 17:59         ` Alex Williamson
  -1 siblings, 0 replies; 53+ messages in thread
From: Alex Williamson @ 2017-03-14 17:59 UTC (permalink / raw)
  To: David Gibson
  Cc: Alexey Kardashevskiy, linuxppc-dev, Paul Mackerras, kvm-ppc, kvm

On Tue, 14 Mar 2017 11:55:33 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Tue, Mar 14, 2017 at 11:54:03AM +1100, Alexey Kardashevskiy wrote:
> > On 10/03/17 15:48, David Gibson wrote:  
> > > On Fri, Mar 10, 2017 at 02:53:27PM +1100, Alexey Kardashevskiy wrote:  
> > >> This is my current queue of patches to add acceleration of TCE
> > >> updates in KVM.
> > >>
> > >> This is based on Linus's tree sha1 c1aa905a304e.

Hmm, sure about that?  03/10 doesn't apply.

> > > 
> > > I think we're finally there - I've now sent an R-b for all patches.  
> > 
> > Thanks for the patience.
> > 
> > 
> > I suppose that in order to proceed I now need an ack from Alex, correct?
> 
> That, or simply for him to merge it.

Given the diffstat, I'd guess you're looking for acks from me and maybe
Paolo, but it looks like it should be merged through ppc trees.  Thanks,

Alex

> > >>
> > >> Please comment. Thanks.
> > >>
> > >> Changes:
> > >> v8:
> > >> * kept fixing oddities with error handling in 10/10
> > >>
> > >> v7:
> > >> * added realmode's WARN_ON_ONCE_RM in arch/powerpc/kvm/book3s_64_vio_hv.c
> > >>
> > >> v6:
> > >> * reworked the last patch in terms of error handling and parameters checking
> > >>
> > >> v5:
> > >> * replaced "KVM: PPC: Separate TCE validation from update" with
> > >> "KVM: PPC: iommu: Unify TCE checking"
> > >> * changed already reviewed "powerpc/iommu/vfio_spapr_tce: Cleanup iommu_table disposal"
> > >> * reworked "KVM: PPC: VFIO: Add in-kernel acceleration for VFIO"
> > >> * more details in individual commit logs
> > >>
> > >> v4:
> > >> * addressed comments from v3
> > >> * updated subject lines with correct component names
> > >> * regrouped the patchset in order:
> > >> 	- powerpc fixes;
> > >> 	- vfio_spapr_tce driver fixes;
> > >> 	- KVM/PPC fixes;
> > >> 	- KVM+PPC+VFIO;
> > >> * everything except the last 2 patches has "Reviewed-By: David"
> > >>
> > >> v3:
> > >> * there was no full repost, only last patch was posted
> > >>
> > >> v2:
> > >> * 11/11 reworked to use new notifiers, it is rather RFC as it still has
> > >> an issue;
> > >> * got 09/11, 10/11 to use notifiers in 11/11;
> > >> * added rb: David to most of the patches and added a comment in 05/11.
> > >>
> > >> Alexey Kardashevskiy (10):
> > >>   powerpc/mmu: Add real mode support for IOMMU preregistered memory
> > >>   powerpc/powernv/iommu: Add real mode version of
> > >>     iommu_table_ops::exchange()
> > >>   powerpc/iommu/vfio_spapr_tce: Cleanup iommu_table disposal
> > >>   powerpc/vfio_spapr_tce: Add reference counting to iommu_table
> > >>   KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number
> > >>   KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently
> > >>   KVM: PPC: Pass kvm* to kvmppc_find_table()
> > >>   KVM: PPC: Use preregistered memory API to access TCE list
> > >>   KVM: PPC: iommu: Unify TCE checking
> > >>   KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
> > >>
> > >>  Documentation/virtual/kvm/devices/vfio.txt |  22 +-
> > >>  arch/powerpc/include/asm/iommu.h           |  32 ++-
> > >>  arch/powerpc/include/asm/kvm_host.h        |   8 +
> > >>  arch/powerpc/include/asm/kvm_ppc.h         |  12 +-
> > >>  arch/powerpc/include/asm/mmu_context.h     |   4 +
> > >>  include/uapi/linux/kvm.h                   |   9 +
> > >>  arch/powerpc/kernel/iommu.c                |  86 +++++---
> > >>  arch/powerpc/kvm/book3s_64_vio.c           | 330 ++++++++++++++++++++++++++++-
> > >>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 303 ++++++++++++++++++++++----
> > >>  arch/powerpc/kvm/powerpc.c                 |   2 +
> > >>  arch/powerpc/mm/mmu_context_iommu.c        |  39 ++++
> > >>  arch/powerpc/platforms/powernv/pci-ioda.c  |  46 ++--
> > >>  arch/powerpc/platforms/powernv/pci.c       |   1 +
> > >>  arch/powerpc/platforms/pseries/iommu.c     |   3 +-
> > >>  arch/powerpc/platforms/pseries/vio.c       |   2 +-
> > >>  drivers/vfio/vfio_iommu_spapr_tce.c        |   2 +-
> > >>  virt/kvm/vfio.c                            |  60 ++++++
> > >>  arch/powerpc/kvm/Kconfig                   |   1 +
> > >>  18 files changed, 855 insertions(+), 107 deletions(-)
> > >>  
> > >   
> > 
> >   
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH kernel v8 03/10] powerpc/iommu/vfio_spapr_tce: Cleanup iommu_table disposal
  2017-03-10  3:53   ` Alexey Kardashevskiy
@ 2017-03-14 18:21     ` Alex Williamson
  -1 siblings, 0 replies; 53+ messages in thread
From: Alex Williamson @ 2017-03-14 18:21 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, David Gibson, Paul Mackerras, kvm-ppc, kvm

On Fri, 10 Mar 2017 14:53:30 +1100
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> At the moment an iommu_table can be disposed of by either calling
> iommu_free_table() directly or via it_ops::free(); the only implementation
> of free() is in IODA2 - pnv_ioda2_table_free() - and it calls
> iommu_free_table() anyway.
> 
> As we are going to have reference counting on tables, we need a unified
> way of disposing of them.
> 
> This moves it_ops::free() call into iommu_free_table() and makes use
> of the latter. The free() callback now handles only platform-specific
> data.
> 
> As iommu_free_table() now calls it_ops->free(), we need to have it_ops
> initialized before iommu_free_table() is called, so this moves that
> initialization into pnv_pci_ioda2_create_table().
> 
> This should cause no behavioral change.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> Changes:
> v5:
> * moved "tbl->it_ops = &pnv_ioda2_iommu_ops" earlier and updated
> the commit log
> ---
>  arch/powerpc/kernel/iommu.c               |  4 ++++
>  arch/powerpc/platforms/powernv/pci-ioda.c | 10 ++++------
>  drivers/vfio/vfio_iommu_spapr_tce.c       |  2 +-
>  3 files changed, 9 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index 9bace5df05d5..bc142d87130f 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -719,6 +719,9 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
>  	if (!tbl)
>  		return;
>  
> +	if (tbl->it_ops->free)
> +		tbl->it_ops->free(tbl);
> +
>  	if (!tbl->it_map) {
>  		kfree(tbl);
>  		return;
> @@ -745,6 +748,7 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
>  	/* free table */
>  	kfree(tbl);
>  }
> +EXPORT_SYMBOL_GPL(iommu_free_table);

A slightly cringe-worthy, generically named export in arch code.

>  
>  /* Creates TCEs for a user provided buffer.  The user buffer must be
>   * contiguous real kernel storage (not vmalloc).  The address passed here
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 69c40b43daa3..7916d0cb05fe 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1425,7 +1425,6 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe
>  		iommu_group_put(pe->table_group.group);
>  		BUG_ON(pe->table_group.group);
>  	}
> -	pnv_pci_ioda2_table_free_pages(tbl);
>  	iommu_free_table(tbl, of_node_full_name(dev->dev.of_node));
>  }
>  
> @@ -2041,7 +2040,6 @@ static void pnv_ioda2_tce_free(struct iommu_table *tbl, long index,
>  static void pnv_ioda2_table_free(struct iommu_table *tbl)
>  {
>  	pnv_pci_ioda2_table_free_pages(tbl);
> -	iommu_free_table(tbl, "pnv");
>  }
>  
>  static struct iommu_table_ops pnv_ioda2_iommu_ops = {
> @@ -2318,6 +2316,8 @@ static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
>  	if (!tbl)
>  		return -ENOMEM;
>  
> +	tbl->it_ops = &pnv_ioda2_iommu_ops;
> +
>  	ret = pnv_pci_ioda2_table_alloc_pages(nid,
>  			bus_offset, page_shift, window_size,
>  			levels, tbl);
> @@ -2326,8 +2326,6 @@ static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
>  		return ret;
>  	}
>  
> -	tbl->it_ops = &pnv_ioda2_iommu_ops;
> -
>  	*ptbl = tbl;
>  
>  	return 0;
> @@ -2368,7 +2366,7 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
>  	if (rc) {
>  		pe_err(pe, "Failed to configure 32-bit TCE table, err %ld\n",
>  				rc);
> -		pnv_ioda2_table_free(tbl);
> +		iommu_free_table(tbl, "");
>  		return rc;
>  	}
>  
> @@ -2456,7 +2454,7 @@ static void pnv_ioda2_take_ownership(struct iommu_table_group *table_group)
>  	pnv_pci_ioda2_unset_window(&pe->table_group, 0);
>  	if (pe->pbus)
>  		pnv_ioda_setup_bus_dma(pe, pe->pbus, false);
> -	pnv_ioda2_table_free(tbl);
> +	iommu_free_table(tbl, "pnv");
>  }
>  
>  static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group)
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index cf3de91fbfe7..fbec7348a7e5 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -680,7 +680,7 @@ static void tce_iommu_free_table(struct tce_container *container,
>  	unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT;
>  
>  	tce_iommu_userspace_view_free(tbl, container->mm);
> -	tbl->it_ops->free(tbl);
> +	iommu_free_table(tbl, "");
>  	decrement_locked_vm(container->mm, pages);
>  }
>  

Acked-by: Alex Williamson <alex.williamson@redhat.com>

^ permalink raw reply	[flat|nested] 53+ messages in thread
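
A condensed sketch of iommu_free_table() as it reads with this patch applied
(assembled from the hunks quoted above; the elided middle is untouched by
the patch, so treat this as a reading aid rather than literal tree content):

	void iommu_free_table(struct iommu_table *tbl, const char *node_name)
	{
		if (!tbl)
			return;

		/* New: platform-specific teardown (the only implementation
		 * being pnv_ioda2_table_free()) now runs here, so callers no
		 * longer choose between it_ops::free() and this function. */
		if (tbl->it_ops->free)
			tbl->it_ops->free(tbl);

		if (!tbl->it_map) {
			kfree(tbl);
			return;
		}

		/* ... bitmap sanity check and freeing of tbl as before ... */
	}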

* Re: [PATCH kernel v8 04/10] powerpc/vfio_spapr_tce: Add reference counting to iommu_table
  2017-03-10  3:53   ` Alexey Kardashevskiy
@ 2017-03-14 19:58     ` Alex Williamson
  -1 siblings, 0 replies; 53+ messages in thread
From: Alex Williamson @ 2017-03-14 19:58 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, David Gibson, Paul Mackerras, kvm-ppc, kvm

On Fri, 10 Mar 2017 14:53:31 +1100
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> So far iommu_table objects were only used in virtual mode and had
> a single owner. We are going to change this by implementing in-kernel
> acceleration of DMA mapping requests. The proposed acceleration
> will handle requests in real mode and KVM will keep references to tables.
> 
> This adds a kref to iommu_table and defines new helpers to update it.
> This replaces iommu_free_table() with iommu_table_put() and makes
> iommu_free_table() static. iommu_table_get() is not used in this patch
> but it will be in the following patch.
> 
> Since this touches prototypes, this also removes the @node_name parameter
> as it has never been really useful on powernv, and carrying it through
> the pseries platform code to iommu_free_table() seems to be quite
> useless as well.
> 
> This should cause no behavioral change.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  arch/powerpc/include/asm/iommu.h          |  5 +++--
>  arch/powerpc/kernel/iommu.c               | 24 +++++++++++++++++++-----
>  arch/powerpc/platforms/powernv/pci-ioda.c | 14 +++++++-------
>  arch/powerpc/platforms/powernv/pci.c      |  1 +
>  arch/powerpc/platforms/pseries/iommu.c    |  3 ++-
>  arch/powerpc/platforms/pseries/vio.c      |  2 +-
>  drivers/vfio/vfio_iommu_spapr_tce.c       |  2 +-
>  7 files changed, 34 insertions(+), 17 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> index 4554699aec02..82e77ebf85f4 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -119,6 +119,7 @@ struct iommu_table {
>  	struct list_head it_group_list;/* List of iommu_table_group_link */
>  	unsigned long *it_userspace; /* userspace view of the table */
>  	struct iommu_table_ops *it_ops;
> +	struct kref    it_kref;
>  };
>  
>  #define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \
> @@ -151,8 +152,8 @@ static inline void *get_iommu_table_base(struct device *dev)
>  
>  extern int dma_iommu_dma_supported(struct device *dev, u64 mask);
>  
> -/* Frees table for an individual device node */
> -extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
> +extern void iommu_table_get(struct iommu_table *tbl);
> +extern void iommu_table_put(struct iommu_table *tbl);
>  
>  /* Initializes an iommu_table based in values set in the passed-in
>   * structure
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index bc142d87130f..d02b8d22fb50 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -711,13 +711,13 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
>  	return tbl;
>  }
>  
> -void iommu_free_table(struct iommu_table *tbl, const char *node_name)
> +static void iommu_table_free(struct kref *kref)
>  {
>  	unsigned long bitmap_sz;
>  	unsigned int order;
> +	struct iommu_table *tbl;
>  
> -	if (!tbl)
> -		return;
> +	tbl = container_of(kref, struct iommu_table, it_kref);
>  
>  	if (tbl->it_ops->free)
>  		tbl->it_ops->free(tbl);
> @@ -736,7 +736,7 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
>  
>  	/* verify that table contains no entries */
>  	if (!bitmap_empty(tbl->it_map, tbl->it_size))
> -		pr_warn("%s: Unexpected TCEs for %s\n", __func__, node_name);
> +		pr_warn("%s: Unexpected TCEs\n", __func__);
>  
>  	/* calculate bitmap size in bytes */
>  	bitmap_sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
> @@ -748,7 +748,21 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
>  	/* free table */
>  	kfree(tbl);
>  }
> -EXPORT_SYMBOL_GPL(iommu_free_table);
> +
> +void iommu_table_get(struct iommu_table *tbl)
> +{
> +	kref_get(&tbl->it_kref);
> +}
> +EXPORT_SYMBOL_GPL(iommu_table_get);
> +
> +void iommu_table_put(struct iommu_table *tbl)
> +{
> +	if (!tbl)
> +		return;
> +
> +	kref_put(&tbl->it_kref, iommu_table_free);
> +}
> +EXPORT_SYMBOL_GPL(iommu_table_put);
>  


Maybe an opportunity for less cringe-worthy generic names exported from
arch code.  iommu_tce_table_get/put perhaps?


>  /* Creates TCEs for a user provided buffer.  The user buffer must be
>   * contiguous real kernel storage (not vmalloc).  The address passed here
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 7916d0cb05fe..ec3e565de511 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1425,7 +1425,7 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe
>  		iommu_group_put(pe->table_group.group);
>  		BUG_ON(pe->table_group.group);
>  	}
> -	iommu_free_table(tbl, of_node_full_name(dev->dev.of_node));
> +	iommu_table_put(tbl);
>  }
>  
>  static void pnv_ioda_release_vf_PE(struct pci_dev *pdev)
> @@ -2226,7 +2226,7 @@ static void pnv_pci_ioda1_setup_dma_pe(struct pnv_phb *phb,
>  		__free_pages(tce_mem, get_order(tce32_segsz * segs));
>  	if (tbl) {
>  		pnv_pci_unlink_table_and_group(tbl, &pe->table_group);
> -		iommu_free_table(tbl, "pnv");
> +		iommu_table_put(tbl);
>  	}
>  }
>  
> @@ -2322,7 +2322,7 @@ static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
>  			bus_offset, page_shift, window_size,
>  			levels, tbl);
>  	if (ret) {
> -		iommu_free_table(tbl, "pnv");
> +		iommu_table_put(tbl);
>  		return ret;
>  	}
>  
> @@ -2366,7 +2366,7 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
>  	if (rc) {
>  		pe_err(pe, "Failed to configure 32-bit TCE table, err %ld\n",
>  				rc);
> -		iommu_free_table(tbl, "");
> +		iommu_table_put(tbl);
>  		return rc;
>  	}
>  
> @@ -2454,7 +2454,7 @@ static void pnv_ioda2_take_ownership(struct iommu_table_group *table_group)
>  	pnv_pci_ioda2_unset_window(&pe->table_group, 0);
>  	if (pe->pbus)
>  		pnv_ioda_setup_bus_dma(pe, pe->pbus, false);
> -	iommu_free_table(tbl, "pnv");
> +	iommu_table_put(tbl);
>  }
>  
>  static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group)
> @@ -3427,7 +3427,7 @@ static void pnv_pci_ioda1_release_pe_dma(struct pnv_ioda_pe *pe)
>  	}
>  
>  	free_pages(tbl->it_base, get_order(tbl->it_size << 3));
> -	iommu_free_table(tbl, "pnv");
> +	iommu_table_put(tbl);
>  }
>  
>  static void pnv_pci_ioda2_release_pe_dma(struct pnv_ioda_pe *pe)
> @@ -3454,7 +3454,7 @@ static void pnv_pci_ioda2_release_pe_dma(struct pnv_ioda_pe *pe)
>  	}
>  
>  	pnv_pci_ioda2_table_free_pages(tbl);
> -	iommu_free_table(tbl, "pnv");
> +	iommu_table_put(tbl);
>  }
>  
>  static void pnv_ioda_free_pe_seg(struct pnv_ioda_pe *pe,
> diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
> index a43f22dc069e..9b2bdcad51ba 100644
> --- a/arch/powerpc/platforms/powernv/pci.c
> +++ b/arch/powerpc/platforms/powernv/pci.c
> @@ -767,6 +767,7 @@ struct iommu_table *pnv_pci_table_alloc(int nid)
>  
>  	tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, nid);
>  	INIT_LIST_HEAD_RCU(&tbl->it_group_list);
> +	kref_init(&tbl->it_kref);
>  
>  	return tbl;
>  }
> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> index 0a733ddae926..a713e20311b8 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -74,6 +74,7 @@ static struct iommu_table_group *iommu_pseries_alloc_group(int node)
>  		goto fail_exit;
>  
>  	INIT_LIST_HEAD_RCU(&tbl->it_group_list);
> +	kref_init(&tbl->it_kref);
>  	tgl->table_group = table_group;
>  	list_add_rcu(&tgl->next, &tbl->it_group_list);
>  
> @@ -115,7 +116,7 @@ static void iommu_pseries_free_group(struct iommu_table_group *table_group,
>  		BUG_ON(table_group->group);
>  	}
>  #endif
> -	iommu_free_table(tbl, node_name);
> +	iommu_table_put(tbl);
>  
>  	kfree(table_group);
>  }
> diff --git a/arch/powerpc/platforms/pseries/vio.c b/arch/powerpc/platforms/pseries/vio.c
> index 720493932486..744d639da92c 100644
> --- a/arch/powerpc/platforms/pseries/vio.c
> +++ b/arch/powerpc/platforms/pseries/vio.c
> @@ -1318,7 +1318,7 @@ static void vio_dev_release(struct device *dev)
>  	struct iommu_table *tbl = get_iommu_table_base(dev);
>  
>  	if (tbl)
> -		iommu_free_table(tbl, of_node_full_name(dev->of_node));
> +		iommu_table_put(tbl);
>  	of_node_put(dev->of_node);
>  	kfree(to_vio_dev(dev));
>  }
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index fbec7348a7e5..4f6ca9d80ead 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -680,7 +680,7 @@ static void tce_iommu_free_table(struct tce_container *container,
>  	unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT;
>  
>  	tce_iommu_userspace_view_free(tbl, container->mm);
> -	iommu_free_table(tbl, "");
> +	iommu_table_put(tbl);
>  	decrement_locked_vm(container->mm, pages);
>  }
>  

Acked-by: Alex Williamson <alex.williamson@redhat.com>

^ permalink raw reply	[flat|nested] 53+ messages in thread
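
To make the new lifetime rules concrete, a minimal kernel-context sketch
(a hypothetical caller, which would only build against the patched tree;
the KVM user that takes the extra reference only arrives in patch 10/10):

	static void example_iommu_table_lifetime(int nid)
	{
		struct iommu_table *tbl = pnv_pci_table_alloc(nid); /* kref_init(): refcount 1 */

		iommu_table_get(tbl);	/* a second user, e.g. KVM, caches the table: 2 */
		iommu_table_put(tbl);	/* that user drops its reference: 1 */
		iommu_table_put(tbl);	/* last put: iommu_table_free() tears it down */
	}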

* Re: [PATCH kernel v8 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
  2017-03-10  3:53   ` Alexey Kardashevskiy
@ 2017-03-14 21:05     ` Alex Williamson
  -1 siblings, 0 replies; 53+ messages in thread
From: Alex Williamson @ 2017-03-14 21:05 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, David Gibson, Paul Mackerras, kvm-ppc, kvm

On Fri, 10 Mar 2017 14:53:37 +1100
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
> without passing them to user space, which saves time on switching
> to user space and back.
> 
> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> KVM tries to handle a TCE request in real mode; if that fails,
> it passes the request to virtual mode to complete the operation.
> If the virtual mode handler fails as well, the request is passed to
> user space; this is not expected to happen though.
> 
> To avoid dealing with page use counters (which is tricky in real mode),
> this only accelerates SPAPR TCE IOMMU v2 clients which are required
> to pre-register the userspace memory. The very first TCE request will
> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> of the TCE table (iommu_table::it_userspace) is not allocated till
> the very first mapping happens and we cannot call vmalloc in real mode.
> 
> If we fail to update a hardware IOMMU table for an unexpected reason, we just
> clear the entry and move on as there is nothing really we can do about it -
> for example, if we hot plug a VFIO device into a guest, existing TCE tables
> will be mirrored automatically to the hardware and there is no interface
> to report possible failures to the guest.
> 
> This adds a new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> and associates a physical IOMMU table with the SPAPR TCE table (which
> is a guest view of the hardware IOMMU table). The iommu_table object
> is cached and referenced so we do not have to look up for it in real mode.
> 
> This does not implement the UNSET counterpart as there is no use for it -
> once the acceleration is enabled, the existing userspace won't
> disable it unless a VFIO container is destroyed; this adds the necessary
> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> 
> As this creates a descriptor per IOMMU table-LIOBN couple (called
> kvmppc_spapr_tce_iommu_table), it is possible to have several
> descriptors with the same iommu_table (hardware IOMMU table) attached
> to the same LIOBN; we do not remove duplicates though as
> iommu_table_ops::exchange does not just update a TCE entry (which is
> shared among IOMMU groups) but also invalidates the TCE cache
> (one per IOMMU group).
> 
> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to user
> space.
> 
> This adds a real mode version of WARN_ON_ONCE() as the generic version
> causes problems with rcu_sched. Since we are testing what vmalloc_to_phys()
> returns in the code, this also adds a check for the already existing
> vmalloc_to_phys() call in kvmppc_rm_h_put_tce_indirect().
> 
> This finally makes use of vfio_external_user_iommu_id() which was
> introduced quite some time ago and was considered for removal.
> 
> Tests show that this patch increases transmission speed from 220MB/s
> to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb Ethernet card).
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v8:
> * changed all (!pua) checks to return H_TOO_HARD as ioctl() is supposed
> to handle them
> * changed vmalloc_to_phys() callers to return H_HARDWARE
> * changed real mode iommu_tce_xchg_rm() callers to return H_TOO_HARD
> and added a comment about this in the code
> * changed virtual mode iommu_tce_xchg() callers to return H_HARDWARE
> and do WARN_ON
> * added WARN_ON_ONCE_RM(!rmap) in kvmppc_rm_h_put_tce_indirect() to
> have all vmalloc_to_phys() callsites covered
> 
> v7:
> * added realmode-friendly WARN_ON_ONCE_RM
> 
> v6:
> * changed handling of errors returned by kvmppc_(rm_)tce_iommu_(un)map()
> * moved kvmppc_gpa_to_ua() to TCE validation
> 
> v5:
> * changed error codes in multiple places
> * added a bunch of WARN_ON() in places which should not really happen
> * added a check that an iommu table is not already attached to the LIOBN
> * dropped explicit calls to iommu_tce_clear_param_check/
> iommu_tce_put_param_check as kvmppc_tce_validate/kvmppc_ioba_validate
> call them anyway (since the previous patch)
> * if we fail to update a hardware IOMMU table for an unexpected reason,
> this just clears the entry
> 
> v4:
> * added note to the commit log about allowing multiple updates of
> the same IOMMU table;
> * instead of checking whether any memory was preregistered, this
> returns H_TOO_HARD if a specific page was not;
> * fixed comments from v3 about error handling in many places;
> * simplified TCE handlers and merged IOMMU parts inline - for example,
> there used to be kvmppc_h_put_tce_iommu(), now it is merged into
> kvmppc_h_put_tce(); this allows checking IOBA boundaries against
> the first attached table only (makes the code simpler);
> 
> v3:
> * simplified not to use VFIO group notifiers
> * reworked cleanup, should be cleaner/simpler now
> 
> v2:
> * reworked to use new VFIO notifiers
> * now same iommu_table may appear in the list several times, to be fixed later
> ---
>  Documentation/virtual/kvm/devices/vfio.txt |  22 +-
>  arch/powerpc/include/asm/kvm_host.h        |   8 +
>  arch/powerpc/include/asm/kvm_ppc.h         |   4 +
>  include/uapi/linux/kvm.h                   |   8 +
>  arch/powerpc/kvm/book3s_64_vio.c           | 323 ++++++++++++++++++++++++++++-
>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 201 +++++++++++++++++-
>  arch/powerpc/kvm/powerpc.c                 |   2 +
>  virt/kvm/vfio.c                            |  60 ++++++
>  8 files changed, 623 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> index ef51740c67ca..f95d867168ea 100644
> --- a/Documentation/virtual/kvm/devices/vfio.txt
> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> @@ -16,7 +16,25 @@ Groups:
>  
>  KVM_DEV_VFIO_GROUP attributes:
>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> +	kvm_device_attr.addr points to an int32_t file descriptor
> +	for the VFIO group.
>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
> +	kvm_device_attr.addr points to an int32_t file descriptor
> +	for the VFIO group.
> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
> +	allocated by sPAPR KVM.
> +	kvm_device_attr.addr points to a struct:
>  
> -For each, kvm_device_attr.addr points to an int32_t file descriptor
> -for the VFIO group.
> +	struct kvm_vfio_spapr_tce {
> +		__u32	argsz;
> +		__u32	flags;
> +		__s32	groupfd;
> +		__s32	tablefd;
> +	};
> +
> +	where
> +	@argsz is the size of kvm_vfio_spapr_tce_liobn;

kvm_vfio_spapr_tce_liobn does not exist.  s/_liobn//?

> +	@flags are not supported now, must be zero;

We do this argsz/flags thing on vfio ioctls because ioctls are a bit
more of a restricted resource.  We don't want to burn through them so
we make them expandable.  I don't know that we have that restriction
here and the ADD/DEL support certainly doesn't include it.  Maybe this
isn't necessary?

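For reference, the argsz/flags pair would let the structure grow later
roughly as below; KVM_VFIO_SPAPR_TCE_FLAG_NEW and new_field are
hypothetical illustrations, not part of this patch:

	/* Hypothetical future extension: a new flag gates a new trailing
	 * field, and argsz proves userspace knows about that field. */
	if (param.flags & KVM_VFIO_SPAPR_TCE_FLAG_NEW) {
		if (param.argsz < offsetofend(struct kvm_vfio_spapr_tce,
				new_field))
			return -EINVAL;
		/* only now is it safe to copy and use param.new_field */
	}
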
> +	@groupfd is a file descriptor for a VFIO group;
> +	@tablefd is a file descriptor for a TCE table allocated via
> +		KVM_CREATE_SPAPR_TCE.
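
For completeness, a userspace caller would wire the two file descriptors
together roughly as below. This is a sketch only: kvm_vfio_fd is assumed
to be a KVM_DEV_TYPE_VFIO device fd obtained via KVM_CREATE_DEVICE, and
groupfd/tablefd are assumed to be already open; error handling is elided:

	struct kvm_vfio_spapr_tce param = {
		.argsz = sizeof(param),
		.flags = 0,
		.groupfd = groupfd,	/* VFIO group fd */
		.tablefd = tablefd,	/* fd from KVM_CREATE_SPAPR_TCE */
	};
	struct kvm_device_attr attr = {
		.group = KVM_DEV_VFIO_GROUP,
		.attr = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE,
		.addr = (__u64)(unsigned long)&param,
	};

	ioctl(kvm_vfio_fd, KVM_SET_DEVICE_ATTR, &attr);
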
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index 7bba8f415627..857ae2c6aa39 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
>  	atomic_t refcnt;
>  };
>  
> +struct kvmppc_spapr_tce_iommu_table {
> +	struct rcu_head rcu;
> +	struct list_head next;
> +	struct vfio_group *group;
> +	struct iommu_table *tbl;
> +};
> +
>  struct kvmppc_spapr_tce_table {
>  	struct list_head list;
>  	struct kvm *kvm;
> @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
>  	u32 page_shift;
>  	u64 offset;		/* in pages */
>  	u64 size;		/* window size in pages */
> +	struct list_head iommu_tables;
>  	struct page *pages[0];
>  };
>  
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> index 72c2a155641f..66de7e73b3d3 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -164,6 +164,10 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
>  extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
>  			struct kvm_memory_slot *memslot, unsigned long porder);
>  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
> +		struct vfio_group *group);
> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
> +		struct vfio_group *group);
>  
>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  				struct kvm_create_spapr_tce_64 *args);
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index f5a52ffb6b58..e743cb0d176e 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1088,6 +1088,7 @@ struct kvm_device_attr {
>  #define  KVM_DEV_VFIO_GROUP			1
>  #define   KVM_DEV_VFIO_GROUP_ADD			1
>  #define   KVM_DEV_VFIO_GROUP_DEL			2
> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
>  
>  enum kvm_device_type {
>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
> @@ -1109,6 +1110,13 @@ enum kvm_device_type {
>  	KVM_DEV_TYPE_MAX,
>  };
>  
> +struct kvm_vfio_spapr_tce {
> +	__u32	argsz;
> +	__u32	flags;
> +	__s32	groupfd;
> +	__s32	tablefd;
> +};
> +
>  /*
>   * ioctls for VM fds
>   */
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> index e96a4590464c..be18cda01e1b 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -28,6 +28,10 @@
>  #include <linux/hugetlb.h>
>  #include <linux/list.h>
>  #include <linux/anon_inodes.h>
> +#include <linux/iommu.h>
> +#include <linux/file.h>
> +#include <linux/vfio.h>
> +#include <linux/module.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/kvm_ppc.h>
> @@ -40,6 +44,36 @@
>  #include <asm/udbg.h>
>  #include <asm/iommu.h>
>  #include <asm/tce.h>
> +#include <asm/mmu_context.h>
> +
> +static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
> +{
> +	void (*fn)(struct vfio_group *);
> +
> +	fn = symbol_get(vfio_group_put_external_user);
> +	if (WARN_ON(!fn))
> +		return;
> +
> +	fn(vfio_group);
> +
> +	symbol_put(vfio_group_put_external_user);
> +}
> +
> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> +{
> +	int (*fn)(struct vfio_group *);
> +	int ret = -1;
> +
> +	fn = symbol_get(vfio_external_user_iommu_id);
> +	if (!fn)
> +		return ret;
> +
> +	ret = fn(vfio_group);
> +
> +	symbol_put(vfio_external_user_iommu_id);
> +
> +	return ret;
> +}


Ugh.  This feels so wrong.  Why can't you have kvm-vfio pass the
iommu_group?  Why do you need to hold this additional vfio_group
reference?

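A sketch of that alternative: resolve the iommu_group once in kvm-vfio
and pass it down, so the spapr code never holds a vfio_group at all.
vfio_group_get_iommu_group() below is an assumed export that VFIO does
not provide today:

	/* virt/kvm/vfio.c, next to the existing symbol_get() wrappers */
	static struct iommu_group *kvm_vfio_group_get_iommu_group(
			struct vfio_group *group)
	{
		struct iommu_group *(*fn)(struct vfio_group *);
		struct iommu_group *ret;

		fn = symbol_get(vfio_group_get_iommu_group);	/* assumed */
		if (!fn)
			return ERR_PTR(-EINVAL);

		ret = fn(group);

		symbol_put(vfio_group_get_iommu_group);

		return ret;
	}

kvm_spapr_tce_attach_iommu_group() would then take a struct iommu_group *
and the powerpc side would need no VFIO symbols at all.
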
>  
>  static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
>  {
> @@ -91,6 +125,130 @@ static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
>  	return ret;
>  }
>  
> +static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
> +{
> +	struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
> +			struct kvmppc_spapr_tce_iommu_table, rcu);
> +
> +	iommu_table_put(stit->tbl);
> +	kvm_vfio_group_put_external_user(stit->group);
> +
> +	kfree(stit);
> +}
> +
> +static void kvm_spapr_tce_liobn_release_iommu_group(
> +		struct kvmppc_spapr_tce_table *stt,
> +		struct vfio_group *group)
> +{
> +	struct kvmppc_spapr_tce_iommu_table *stit, *tmp;
> +
> +	list_for_each_entry_safe(stit, tmp, &stt->iommu_tables, next) {
> +		if (group && (stit->group != group))
> +			continue;
> +
> +		list_del_rcu(&stit->next);
> +
> +		call_rcu(&stit->rcu, kvm_spapr_tce_iommu_table_free);
> +	}
> +}
> +
> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
> +		struct vfio_group *group)
> +{
> +	struct kvmppc_spapr_tce_table *stt;
> +
> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list)
> +		kvm_spapr_tce_liobn_release_iommu_group(stt, group);
> +}
> +
> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
> +		struct vfio_group *group)
> +{
> +	struct kvmppc_spapr_tce_table *stt = NULL;
> +	bool found = false;
> +	struct iommu_table *tbl = NULL;
> +	struct iommu_table_group *table_group;
> +	long i, ret = 0;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
> +	struct fd f;
> +	int group_id;
> +	struct iommu_group *grp;
> +
> +	group_id = kvm_vfio_external_user_iommu_id(group);
> +	grp = iommu_group_get_by_id(group_id);
> +	if (WARN_ON(!grp))
> +		return -EIO;
> +
> +	f = fdget(tablefd);
> +	if (!f.file) {
> +		ret = -EBADF;
> +		goto put_exit;
> +	}
> +
> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
> +		if (stt == f.file->private_data) {
> +			found = true;
> +			break;
> +		}
> +	}
> +
> +	fdput(f);
> +
> +	if (!found) {
> +		ret = -EINVAL;
> +		goto put_exit;
> +	}
> +
> +	table_group = iommu_group_get_iommudata(grp);
> +	if (WARN_ON(!table_group)) {
> +		ret = -EFAULT;
> +		goto put_exit;
> +	}
> +
> +	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
> +		struct iommu_table *tbltmp = table_group->tables[i];
> +
> +		if (!tbltmp)
> +			continue;
> +
> +		/*
> +		 * Make sure hardware table parameters are exactly the same;
> +		 * this is used in the TCE handlers where boundary checks
> +		 * use only the first attached table.
> +		 */
> +		if ((tbltmp->it_page_shift == stt->page_shift) &&
> +				(tbltmp->it_offset == stt->offset) &&
> +				(tbltmp->it_size == stt->size)) {
> +			tbl = tbltmp;
> +			break;
> +		}
> +	}
> +	if (!tbl) {
> +		ret = -EINVAL;
> +		goto put_exit;
> +	}
> +
> +	list_for_each_entry_rcu(stit, &stt->iommu_tables, next) {
> +		if ((stit->tbl == tbl) && (stit->group == group)) {
> +			ret = -EBUSY;
> +			goto put_exit;
> +		}
> +	}
> +
> +	iommu_table_get(tbl);
> +
> +	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
> +	stit->tbl = tbl;
> +	stit->group = group;
> +
> +	list_add_rcu(&stit->next, &stt->iommu_tables);
> +
> +put_exit:
> +	iommu_group_put(grp);
> +
> +	return ret;
> +}
> +
>  static void release_spapr_tce_table(struct rcu_head *head)
>  {
>  	struct kvmppc_spapr_tce_table *stt = container_of(head,
> @@ -133,6 +291,8 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
>  
>  	list_del_rcu(&stt->list);
>  
> +	kvm_spapr_tce_liobn_release_iommu_group(stt, NULL /* release all */);
> +
>  	kvm_put_kvm(stt->kvm);
>  
>  	kvmppc_account_memlimit(
> @@ -183,6 +343,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  	stt->offset = args->offset;
>  	stt->size = size;
>  	stt->kvm = kvm;
> +	INIT_LIST_HEAD_RCU(&stt->iommu_tables);
>  
>  	for (i = 0; i < npages; i++) {
>  		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
> @@ -211,11 +372,101 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  	return ret;
>  }
>  
> +static void kvmppc_clear_tce(struct iommu_table *tbl, unsigned long entry)
> +{
> +	unsigned long hpa = 0;
> +	enum dma_data_direction dir = DMA_NONE;
> +
> +	iommu_tce_xchg(tbl, entry, &hpa, &dir);
> +}
> +
> +static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	struct mm_iommu_table_group_mem_t *mem = NULL;
> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +
> +	if (!pua)
> +		/* it_userspace allocation might be delayed */
> +		return H_TOO_HARD;
> +
> +	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
> +	if (!mem)
> +		return H_TOO_HARD;
> +
> +	mm_iommu_mapped_dec(mem);
> +
> +	*pua = 0;
> +
> +	return H_SUCCESS;
> +}
> +
> +static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	enum dma_data_direction dir = DMA_NONE;
> +	unsigned long hpa = 0;
> +	long ret;
> +
> +	if (WARN_ON_ONCE(iommu_tce_xchg(tbl, entry, &hpa, &dir)))
> +		return H_HARDWARE;
> +
> +	if (dir == DMA_NONE)
> +		return H_SUCCESS;
> +
> +	ret = kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
> +	if (ret != H_SUCCESS)
> +		iommu_tce_xchg(tbl, entry, &hpa, &dir);
> +
> +	return ret;
> +}
> +
> +long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
> +		unsigned long entry, unsigned long ua,
> +		enum dma_data_direction dir)
> +{
> +	long ret;
> +	unsigned long hpa, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +	struct mm_iommu_table_group_mem_t *mem;
> +
> +	if (!pua)
> +		/* it_userspace allocation might be delayed */
> +		return H_TOO_HARD;
> +
> +	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
> +	if (!mem)
> +		/* This only handles v2 IOMMU type, v1 is handled via ioctl() */
> +		return H_TOO_HARD;
> +
> +	if (WARN_ON_ONCE(mm_iommu_ua_to_hpa(mem, ua, &hpa)))
> +		return H_HARDWARE;
> +
> +	if (mm_iommu_mapped_inc(mem))
> +		return H_CLOSED;
> +
> +	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
> +	if (WARN_ON_ONCE(ret)) {
> +		mm_iommu_mapped_dec(mem);
> +		return H_HARDWARE;
> +	}
> +
> +	if (dir != DMA_NONE)
> +		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
> +
> +	*pua = ua;
> +
> +	return 0;
> +}
> +
>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		      unsigned long ioba, unsigned long tce)
>  {
>  	struct kvmppc_spapr_tce_table *stt;
> -	long ret;
> +	long ret, idx;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
> +	unsigned long entry, ua = 0;
> +	enum dma_data_direction dir;
>  
>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>  	/* 	    liobn, ioba, tce); */
> @@ -232,7 +483,35 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  	if (ret != H_SUCCESS)
>  		return ret;
>  
> -	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
> +	dir = iommu_tce_direction(tce);
> +	if ((dir != DMA_NONE) && kvmppc_gpa_to_ua(vcpu->kvm,
> +			tce & ~(TCE_PCI_READ | TCE_PCI_WRITE), &ua, NULL))
> +		return H_PARAMETER;
> +
> +	entry = ioba >> stt->page_shift;
> +
> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +		if (dir == DMA_NONE) {
> +			ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
> +					stit->tbl, entry);
> +		} else {
> +			idx = srcu_read_lock(&vcpu->kvm->srcu);
> +			ret = kvmppc_tce_iommu_map(vcpu->kvm, stit->tbl,
> +					entry, ua, dir);
> +			srcu_read_unlock(&vcpu->kvm->srcu, idx);
> +		}
> +
> +		if (ret == H_SUCCESS)
> +			continue;
> +
> +		if (ret == H_TOO_HARD)
> +			return ret;
> +
> +		WARN_ON_ONCE(1);
> +		kvmppc_clear_tce(stit->tbl, entry);
> +	}
> +
> +	kvmppc_tce_put(stt, entry, tce);
>  
>  	return H_SUCCESS;
>  }
> @@ -247,6 +526,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	unsigned long entry, ua = 0;
>  	u64 __user *tces;
>  	u64 tce;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -285,6 +565,26 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  		if (ret != H_SUCCESS)
>  			goto unlock_exit;
>  
> +		if (kvmppc_gpa_to_ua(vcpu->kvm,
> +				tce & ~(TCE_PCI_READ | TCE_PCI_WRITE),
> +				&ua, NULL))
> +			return H_PARAMETER;
> +
> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +			ret = kvmppc_tce_iommu_map(vcpu->kvm,
> +					stit->tbl, entry + i, ua,
> +					iommu_tce_direction(tce));
> +
> +			if (ret == H_SUCCESS)
> +				continue;
> +
> +			if (ret == H_TOO_HARD)
> +				goto unlock_exit;
> +
> +			WARN_ON_ONCE(1);
> +			kvmppc_clear_tce(stit->tbl, entry + i);
> +		}
> +
>  		kvmppc_tce_put(stt, entry + i, tce);
>  	}
>  
> @@ -301,6 +601,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -314,6 +615,24 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>  		return H_PARAMETER;
>  
> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +		unsigned long entry = ioba >> stit->tbl->it_page_shift;
> +
> +		for (i = 0; i < npages; ++i) {
> +			ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
> +					stit->tbl, entry + i);
> +
> +			if (ret == H_SUCCESS)
> +				continue;
> +
> +			if (ret == H_TOO_HARD)
> +				return ret;
> +
> +			WARN_ON_ONCE(1);
> +			kvmppc_clear_tce(stit->tbl, entry + i);
> +		}
> +	}
> +
>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>  
> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> index 440d3ab5dc32..eda0a8f6fae8 100644
> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -40,6 +40,31 @@
>  #include <asm/iommu.h>
>  #include <asm/tce.h>
>  
> +#ifdef CONFIG_BUG
> +
> +#define WARN_ON_ONCE_RM(condition)	({			\
> +	static bool __section(.data.unlikely) __warned;		\
> +	int __ret_warn_once = !!(condition);			\
> +								\
> +	if (unlikely(__ret_warn_once && !__warned)) {		\
> +		__warned = true;				\
> +		pr_err("WARN_ON_ONCE_RM: (%s) at %s:%u\n",	\
> +				__stringify(condition),		\
> +				__func__, __LINE__);		\
> +		dump_stack();					\
> +	}							\
> +	unlikely(__ret_warn_once);				\
> +})
> +
> +#else
> +
> +#define WARN_ON_ONCE_RM(condition) ({				\
> +	int __ret_warn_on = !!(condition);			\
> +	unlikely(__ret_warn_on);				\
> +})
> +
> +#endif
> +
>  #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
>  
>  /*
> @@ -161,11 +186,117 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
>  EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
>  
>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> +static void kvmppc_rm_clear_tce(struct iommu_table *tbl, unsigned long entry)
> +{
> +	unsigned long hpa = 0;
> +	enum dma_data_direction dir = DMA_NONE;
> +
> +	iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> +}
> +
> +static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm *kvm,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	struct mm_iommu_table_group_mem_t *mem = NULL;
> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +
> +	if (!pua)
> +		/* it_userspace allocation might be delayed */
> +		return H_TOO_HARD;
> +
> +	pua = (void *) vmalloc_to_phys(pua);
> +	if (WARN_ON_ONCE_RM(!pua))
> +		return H_HARDWARE;
> +
> +	mem = mm_iommu_lookup_rm(kvm->mm, *pua, pgsize);
> +	if (!mem)
> +		return H_TOO_HARD;
> +
> +	mm_iommu_mapped_dec(mem);
> +
> +	*pua = 0;
> +
> +	return H_SUCCESS;
> +}
> +
> +static long kvmppc_rm_tce_iommu_unmap(struct kvm *kvm,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	enum dma_data_direction dir = DMA_NONE;
> +	unsigned long hpa = 0;
> +	long ret;
> +
> +	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
> +		/*
> +		 * real mode xchg can fail if struct page crosses
> +		 * a page boundary
> +		 */
> +		return H_TOO_HARD;
> +
> +	if (dir == DMA_NONE)
> +		return H_SUCCESS;
> +
> +	ret = kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
> +	if (ret)
> +		iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> +
> +	return ret;
> +}
> +
> +static long kvmppc_rm_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
> +		unsigned long entry, unsigned long ua,
> +		enum dma_data_direction dir)
> +{
> +	long ret;
> +	unsigned long hpa = 0;
> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +	struct mm_iommu_table_group_mem_t *mem;
> +
> +	if (!pua)
> +		/* it_userspace allocation might be delayed */
> +		return H_TOO_HARD;
> +
> +	mem = mm_iommu_lookup_rm(kvm->mm, ua, 1ULL << tbl->it_page_shift);
> +	if (!mem)
> +		return H_TOO_HARD;
> +
> +	if (WARN_ON_ONCE_RM(mm_iommu_ua_to_hpa_rm(mem, ua, &hpa)))
> +		return H_HARDWARE;
> +
> +	pua = (void *) vmalloc_to_phys(pua);
> +	if (WARN_ON_ONCE_RM(!pua))
> +		return H_HARDWARE;
> +
> +	if (WARN_ON_ONCE_RM(mm_iommu_mapped_inc(mem)))
> +		return H_CLOSED;
> +
> +	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> +	if (ret) {
> +		mm_iommu_mapped_dec(mem);
> +		/*
> +		 * real mode xchg can fail if struct page crosses
> +		 * a page boundary
> +		 */
> +		return H_TOO_HARD;
> +	}
> +
> +	if (dir != DMA_NONE)
> +		kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
> +
> +	*pua = ua;
> +
> +	return 0;
> +}
> +
>  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		unsigned long ioba, unsigned long tce)
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long ret;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
> +	unsigned long entry, ua = 0;
> +	enum dma_data_direction dir;
>  
>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>  	/* 	    liobn, ioba, tce); */
> @@ -182,7 +313,32 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  	if (ret != H_SUCCESS)
>  		return ret;
>  
> -	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
> +	dir = iommu_tce_direction(tce);
> +	if ((dir != DMA_NONE) && kvmppc_gpa_to_ua(vcpu->kvm,
> +			tce & ~(TCE_PCI_READ | TCE_PCI_WRITE), &ua, NULL))
> +		return H_PARAMETER;
> +
> +	entry = ioba >> stt->page_shift;
> +
> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +		if (dir == DMA_NONE)
> +			ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
> +					stit->tbl, entry);
> +		else
> +			ret = kvmppc_rm_tce_iommu_map(vcpu->kvm,
> +					stit->tbl, entry, ua, dir);
> +
> +		if (ret == H_SUCCESS)
> +			continue;
> +
> +		if (ret == H_TOO_HARD)
> +			return ret;
> +
> +		WARN_ON_ONCE_RM(1);
> +		kvmppc_rm_clear_tce(stit->tbl, entry);
> +	}
> +
> +	kvmppc_tce_put(stt, entry, tce);
>  
>  	return H_SUCCESS;
>  }
> @@ -223,6 +379,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	unsigned long tces, entry, ua = 0;
>  	unsigned long *rmap = NULL;
>  	bool prereg = false;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -270,6 +427,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  			return H_TOO_HARD;
>  
>  		rmap = (void *) vmalloc_to_phys(rmap);
> +		if (WARN_ON_ONCE_RM(!rmap))
> +			return H_HARDWARE;
>  
>  		/*
>  		 * Synchronize with the MMU notifier callbacks in
> @@ -293,6 +452,27 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  		if (ret != H_SUCCESS)
>  			goto unlock_exit;
>  
> +		ua = 0;
> +		if (kvmppc_gpa_to_ua(vcpu->kvm,
> +				tce & ~(TCE_PCI_READ | TCE_PCI_WRITE),
> +				&ua, NULL))
> +			return H_PARAMETER;
> +
> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +			ret = kvmppc_rm_tce_iommu_map(vcpu->kvm,
> +					stit->tbl, entry + i, ua,
> +					iommu_tce_direction(tce));
> +
> +			if (ret == H_SUCCESS)
> +				continue;
> +
> +			if (ret == H_TOO_HARD)
> +				goto unlock_exit;
> +
> +			WARN_ON_ONCE_RM(1);
> +			kvmppc_rm_clear_tce(stit->tbl, entry + i);
> +		}
> +
>  		kvmppc_tce_put(stt, entry + i, tce);
>  	}
>  
> @@ -309,6 +489,7 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -322,6 +503,24 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>  		return H_PARAMETER;
>  
> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +		unsigned long entry = ioba >> stit->tbl->it_page_shift;
> +
> +		for (i = 0; i < npages; ++i) {
> +			ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
> +					stit->tbl, entry + i);
> +
> +			if (ret == H_SUCCESS)
> +				continue;
> +
> +			if (ret == H_TOO_HARD)
> +				return ret;
> +
> +			WARN_ON_ONCE_RM(1);
> +			kvmppc_rm_clear_tce(stit->tbl, entry + i);
> +		}
> +	}
> +
>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>  
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index 95c91a9de351..62bdd6c48107 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -538,6 +538,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  #ifdef CONFIG_PPC_BOOK3S_64
>  	case KVM_CAP_SPAPR_TCE:
>  	case KVM_CAP_SPAPR_TCE_64:
> +		/* fallthrough */
> +	case KVM_CAP_SPAPR_TCE_VFIO:
>  	case KVM_CAP_PPC_RTAS:
>  	case KVM_CAP_PPC_FIXUP_HCALL:
>  	case KVM_CAP_PPC_ENABLE_HCALL:
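
Userspace would probe the capability the hunk above advertises in the
usual way, e.g. (a sketch, assuming vmfd is a KVM VM file descriptor):

	if (ioctl(vmfd, KVM_CHECK_EXTENSION, KVM_CAP_SPAPR_TCE_VFIO) > 0)
		/* in-kernel VFIO TCE acceleration is available */;
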
> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> index d32f239eb471..2b7dc22265fe 100644
> --- a/virt/kvm/vfio.c
> +++ b/virt/kvm/vfio.c
> @@ -20,6 +20,10 @@
>  #include <linux/vfio.h>
>  #include "vfio.h"
>  
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +#include <asm/kvm_ppc.h>
> +#endif
> +
>  struct kvm_vfio_group {
>  	struct list_head node;
>  	struct vfio_group *vfio_group;
> @@ -211,6 +215,9 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>  
>  		mutex_unlock(&kv->lock);
>  
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +		kvm_spapr_tce_release_iommu_group(dev->kvm, vfio_group);
> +#endif
>  		kvm_vfio_group_set_kvm(vfio_group, NULL);
>  
>  		kvm_vfio_group_put_external_user(vfio_group);
> @@ -218,6 +225,53 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>  		kvm_vfio_update_coherency(dev);
>  
>  		return ret;
> +
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
> +		struct kvm_vfio_spapr_tce param;
> +		unsigned long minsz;
> +		struct kvm_vfio *kv = dev->private;
> +		struct vfio_group *vfio_group;
> +		struct kvm_vfio_group *kvg;
> +		struct fd f;
> +
> +		minsz = offsetofend(struct kvm_vfio_spapr_tce, tablefd);
> +
> +		if (copy_from_user(&param, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (param.argsz < minsz || param.flags)
> +			return -EINVAL;
> +
> +		f = fdget(param.groupfd);
> +		if (!f.file)
> +			return -EBADF;
> +
> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
> +		fdput(f);
> +
> +		if (IS_ERR(vfio_group))
> +			return PTR_ERR(vfio_group);
> +
> +		ret = -ENOENT;
> +
> +		mutex_lock(&kv->lock);
> +
> +		list_for_each_entry(kvg, &kv->group_list, node) {
> +			if (kvg->vfio_group != vfio_group)
> +				continue;
> +
> +			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
> +					param.tablefd, vfio_group);
> +
> +			break;
> +		}
> +
> +		mutex_unlock(&kv->lock);
> +
> +		return ret;
> +	}
> +#endif /* CONFIG_SPAPR_TCE_IOMMU */


The group reference is leaked if kvm_spapr_tce_attach_iommu_group()
fails.  My preference would be to not hold that separate group
reference in the spapr code anyway, having a parallel life cycle over
there is confusing and results in ugliness like duplicating 
kvm_vfio_group_put_external_user().  Thanks,

Alex

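The minimal fix for the leak itself, short of that restructuring, would
be to drop the reference whenever the attach does not succeed; a sketch
against the hunk above:

		mutex_unlock(&kv->lock);

		if (ret)
			/* attach failed or the group was never added */
			kvm_vfio_group_put_external_user(vfio_group);

		return ret;
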
>  	}
>  
>  	return -ENXIO;
> @@ -242,6 +296,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
>  		switch (attr->attr) {
>  		case KVM_DEV_VFIO_GROUP_ADD:
>  		case KVM_DEV_VFIO_GROUP_DEL:
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
> +#endif
>  			return 0;
>  		}
>  
> @@ -257,6 +314,9 @@ static void kvm_vfio_destroy(struct kvm_device *dev)
>  	struct kvm_vfio_group *kvg, *tmp;
>  
>  	list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +		kvm_spapr_tce_release_iommu_group(dev->kvm, kvg->vfio_group);
> +#endif
>  		kvm_vfio_group_set_kvm(kvg->vfio_group, NULL);
>  		kvm_vfio_group_put_external_user(kvg->vfio_group);
>  		list_del(&kvg->node);

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH kernel v8 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
@ 2017-03-14 21:05     ` Alex Williamson
  0 siblings, 0 replies; 53+ messages in thread
From: Alex Williamson @ 2017-03-14 21:05 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, David Gibson, Paul Mackerras, kvm-ppc, kvm

On Fri, 10 Mar 2017 14:53:37 +1100
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO
> without passing them to user space which saves time on switching
> to user space and back.
> 
> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> KVM tries to handle a TCE request in the real mode, if failed
> it passes the request to the virtual mode to complete the operation.
> If it a virtual mode handler fails, the request is passed to
> the user space; this is not expected to happen though.
> 
> To avoid dealing with page use counters (which is tricky in real mode),
> this only accelerates SPAPR TCE IOMMU v2 clients which are required
> to pre-register the userspace memory. The very first TCE request will
> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> of the TCE table (iommu_table::it_userspace) is not allocated till
> the very first mapping happens and we cannot call vmalloc in real mode.
> 
> If we fail to update a hardware IOMMU table unexpected reason, we just
> clear it and move on as there is nothing really we can do about it -
> for example, if we hot plug a VFIO device to a guest, existing TCE tables
> will be mirrored automatically to the hardware and there is no interface
> to report to the guest about possible failures.
> 
> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> and associates a physical IOMMU table with the SPAPR TCE table (which
> is a guest view of the hardware IOMMU table). The iommu_table object
> is cached and referenced so we do not have to look up for it in real mode.
> 
> This does not implement the UNSET counterpart as there is no use for it -
> once the acceleration is enabled, the existing userspace won't
> disable it unless a VFIO container is destroyed; this adds necessary
> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> 
> As this creates a descriptor per IOMMU table-LIOBN couple (called
> kvmppc_spapr_tce_iommu_table), it is possible to have several
> descriptors with the same iommu_table (hardware IOMMU table) attached
> to the same LIOBN; we do not remove duplicates though as
> iommu_table_ops::exchange not just update a TCE entry (which is
> shared among IOMMU groups) but also invalidates the TCE cache
> (one per IOMMU group).
> 
> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> space.
> 
> This adds real mode version of WARN_ON_ONCE() as the generic version
> causes problems with rcu_sched. Since we testing what vmalloc_to_phys()
> returns in the code, this also adds a check for already existing
> vmalloc_to_phys() call in kvmppc_rm_h_put_tce_indirect().
> 
> This finally makes use of vfio_external_user_iommu_id() which was
> introduced quite some time ago and was considered for removal.
> 
> Tests show that this patch increases transmission speed from 220MB/s
> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v8:
> * changed all (!pua) checks to return H_TOO_HARD as ioctl() is supposed
> to handle them
> * changed vmalloc_to_phys() callers to return H_HARDWARE
> * changed real mode iommu_tce_xchg_rm() callers to return H_TOO_HARD
> and added a comment about this in the code
> * changed virtual mode iommu_tce_xchg() callers to return H_HARDWARE
> and do WARN_ON
> * added WARN_ON_ONCE_RM(!rmap) in kvmppc_rm_h_put_tce_indirect() to
> have all vmalloc_to_phys() callsites covered
> 
> v7:
> * added realmode-friendly WARN_ON_ONCE_RM
> 
> v6:
> * changed handling of errors returned by kvmppc_(rm_)tce_iommu_(un)map()
> * moved kvmppc_gpa_to_ua() to TCE validation
> 
> v5:
> * changed error codes in multiple places
> * added bunch of WARN_ON() in places which should not really happen
> * adde a check that an iommu table is not attached already to LIOBN
> * dropped explicit calls to iommu_tce_clear_param_check/
> iommu_tce_put_param_check as kvmppc_tce_validate/kvmppc_ioba_validate
> call them anyway (since the previous patch)
> * if we fail to update a hardware IOMMU table for unexpected reason,
> this just clears the entry
> 
> v4:
> * added note to the commit log about allowing multiple updates of
> the same IOMMU table;
> * instead of checking for if any memory was preregistered, this
> returns H_TOO_HARD if a specific page was not;
> * fixed comments from v3 about error handling in many places;
> * simplified TCE handlers and merged IOMMU parts inline - for example,
> there used to be kvmppc_h_put_tce_iommu(), now it is merged into
> kvmppc_h_put_tce(); this allows to check IOBA boundaries against
> the first attached table only (makes the code simpler);
> 
> v3:
> * simplified not to use VFIO group notifiers
> * reworked cleanup, should be cleaner/simpler now
> 
> v2:
> * reworked to use new VFIO notifiers
> * now same iommu_table may appear in the list several times, to be fixed later
> ---
>  Documentation/virtual/kvm/devices/vfio.txt |  22 +-
>  arch/powerpc/include/asm/kvm_host.h        |   8 +
>  arch/powerpc/include/asm/kvm_ppc.h         |   4 +
>  include/uapi/linux/kvm.h                   |   8 +
>  arch/powerpc/kvm/book3s_64_vio.c           | 323 ++++++++++++++++++++++++++++-
>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 201 +++++++++++++++++-
>  arch/powerpc/kvm/powerpc.c                 |   2 +
>  virt/kvm/vfio.c                            |  60 ++++++
>  8 files changed, 623 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> index ef51740c67ca..f95d867168ea 100644
> --- a/Documentation/virtual/kvm/devices/vfio.txt
> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> @@ -16,7 +16,25 @@ Groups:
>  
>  KVM_DEV_VFIO_GROUP attributes:
>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> +	kvm_device_attr.addr points to an int32_t file descriptor
> +	for the VFIO group.
>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
> +	kvm_device_attr.addr points to an int32_t file descriptor
> +	for the VFIO group.
> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
> +	allocated by sPAPR KVM.
> +	kvm_device_attr.addr points to a struct:
>  
> -For each, kvm_device_attr.addr points to an int32_t file descriptor
> -for the VFIO group.
> +	struct kvm_vfio_spapr_tce {
> +		__u32	argsz;
> +		__u32	flags;
> +		__s32	groupfd;
> +		__s32	tablefd;
> +	};
> +
> +	where
> +	@argsz is the size of kvm_vfio_spapr_tce_liobn;

kvm_vfio_spapr_tce_liobn does not exist.  s/_liobn//?

> +	@flags are not supported now, must be zero;

We do this argsz/flags thing on vfio ioctls because ioctls are a bit
more of a restricted resource.  We don't want to burn through them so
we make them expandable.  I don't know that we have that restriction
here and the ADD/DEL support certainly doesn't include it.  Maybe this
isn't necessary?

> +	@groupfd is a file descriptor for a VFIO group;
> +	@tablefd is a file descriptor for a TCE table allocated via
> +		KVM_CREATE_SPAPR_TCE.
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index 7bba8f415627..857ae2c6aa39 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
>  	atomic_t refcnt;
>  };
>  
> +struct kvmppc_spapr_tce_iommu_table {
> +	struct rcu_head rcu;
> +	struct list_head next;
> +	struct vfio_group *group;
> +	struct iommu_table *tbl;
> +};
> +
>  struct kvmppc_spapr_tce_table {
>  	struct list_head list;
>  	struct kvm *kvm;
> @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
>  	u32 page_shift;
>  	u64 offset;		/* in pages */
>  	u64 size;		/* window size in pages */
> +	struct list_head iommu_tables;
>  	struct page *pages[0];
>  };
>  
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> index 72c2a155641f..66de7e73b3d3 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -164,6 +164,10 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
>  extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
>  			struct kvm_memory_slot *memslot, unsigned long porder);
>  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
> +		struct vfio_group *group);
> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
> +		struct vfio_group *group);
>  
>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  				struct kvm_create_spapr_tce_64 *args);
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index f5a52ffb6b58..e743cb0d176e 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1088,6 +1088,7 @@ struct kvm_device_attr {
>  #define  KVM_DEV_VFIO_GROUP			1
>  #define   KVM_DEV_VFIO_GROUP_ADD			1
>  #define   KVM_DEV_VFIO_GROUP_DEL			2
> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
>  
>  enum kvm_device_type {
>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
> @@ -1109,6 +1110,13 @@ enum kvm_device_type {
>  	KVM_DEV_TYPE_MAX,
>  };
>  
> +struct kvm_vfio_spapr_tce {
> +	__u32	argsz;
> +	__u32	flags;
> +	__s32	groupfd;
> +	__s32	tablefd;
> +};
> +
>  /*
>   * ioctls for VM fds
>   */
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> index e96a4590464c..be18cda01e1b 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -28,6 +28,10 @@
>  #include <linux/hugetlb.h>
>  #include <linux/list.h>
>  #include <linux/anon_inodes.h>
> +#include <linux/iommu.h>
> +#include <linux/file.h>
> +#include <linux/vfio.h>
> +#include <linux/module.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/kvm_ppc.h>
> @@ -40,6 +44,36 @@
>  #include <asm/udbg.h>
>  #include <asm/iommu.h>
>  #include <asm/tce.h>
> +#include <asm/mmu_context.h>
> +
> +static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
> +{
> +	void (*fn)(struct vfio_group *);
> +
> +	fn = symbol_get(vfio_group_put_external_user);
> +	if (WARN_ON(!fn))
> +		return;
> +
> +	fn(vfio_group);
> +
> +	symbol_put(vfio_group_put_external_user);
> +}
> +
> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> +{
> +	int (*fn)(struct vfio_group *);
> +	int ret = -1;
> +
> +	fn = symbol_get(vfio_external_user_iommu_id);
> +	if (!fn)
> +		return ret;
> +
> +	ret = fn(vfio_group);
> +
> +	symbol_put(vfio_external_user_iommu_id);
> +
> +	return ret;
> +}


Ugh.  This feels so wrong.  Why can't you have kvm-vfio pass the
iommu_group?  Why do you need to hold this additional vfio_group
reference?

>  
>  static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
>  {
> @@ -91,6 +125,130 @@ static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
>  	return ret;
>  }
>  
> +static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
> +{
> +	struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
> +			struct kvmppc_spapr_tce_iommu_table, rcu);
> +
> +	iommu_table_put(stit->tbl);
> +	kvm_vfio_group_put_external_user(stit->group);
> +
> +	kfree(stit);
> +}
> +
> +static void kvm_spapr_tce_liobn_release_iommu_group(
> +		struct kvmppc_spapr_tce_table *stt,
> +		struct vfio_group *group)
> +{
> +	struct kvmppc_spapr_tce_iommu_table *stit, *tmp;
> +
> +	list_for_each_entry_safe(stit, tmp, &stt->iommu_tables, next) {
> +		if (group && (stit->group != group))
> +			continue;
> +
> +		list_del_rcu(&stit->next);
> +
> +		call_rcu(&stit->rcu, kvm_spapr_tce_iommu_table_free);
> +	}
> +}
> +
> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
> +		struct vfio_group *group)
> +{
> +	struct kvmppc_spapr_tce_table *stt;
> +
> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list)
> +		kvm_spapr_tce_liobn_release_iommu_group(stt, group);
> +}
> +
> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
> +		struct vfio_group *group)
> +{
> +	struct kvmppc_spapr_tce_table *stt = NULL;
> +	bool found = false;
> +	struct iommu_table *tbl = NULL;
> +	struct iommu_table_group *table_group;
> +	long i, ret = 0;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
> +	struct fd f;
> +	int group_id;
> +	struct iommu_group *grp;
> +
> +	group_id = kvm_vfio_external_user_iommu_id(group);
> +	grp = iommu_group_get_by_id(group_id);
> +	if (WARN_ON(!grp))
> +		return -EIO;
> +
> +	f = fdget(tablefd);
> +	if (!f.file) {
> +		ret = -EBADF;
> +		goto put_exit;
> +	}
> +
> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
> +		if (stt = f.file->private_data) {
> +			found = true;
> +			break;
> +		}
> +	}
> +
> +	fdput(f);
> +
> +	if (!found) {
> +		ret = -EINVAL;
> +		goto put_exit;
> +	}
> +
> +	table_group = iommu_group_get_iommudata(grp);
> +	if (WARN_ON(!table_group)) {
> +		ret = -EFAULT;
> +		goto put_exit;
> +	}
> +
> +	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
> +		struct iommu_table *tbltmp = table_group->tables[i];
> +
> +		if (!tbltmp)
> +			continue;
> +
> +		/*
> +		 * Make sure hardware table parameters are exactly the same;
> +		 * this is used in the TCE handlers where boundary checks
> +		 * use only the first attached table.
> +		 */
> +		if ((tbltmp->it_page_shift = stt->page_shift) &&
> +				(tbltmp->it_offset = stt->offset) &&
> +				(tbltmp->it_size = stt->size)) {
> +			tbl = tbltmp;
> +			break;
> +		}
> +	}
> +	if (!tbl) {
> +		ret = -EINVAL;
> +		goto put_exit;
> +	}
> +
> +	list_for_each_entry_rcu(stit, &stt->iommu_tables, next) {
> +		if ((stit->tbl = tbl) && (stit->group = group)) {
> +			ret = -EBUSY;
> +			goto put_exit;
> +		}
> +	}
> +
> +	iommu_table_get(tbl);
> +
> +	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
> +	stit->tbl = tbl;
> +	stit->group = group;
> +
> +	list_add_rcu(&stit->next, &stt->iommu_tables);
> +
> +put_exit:
> +	iommu_group_put(grp);
> +
> +	return ret;
> +}
> +
>  static void release_spapr_tce_table(struct rcu_head *head)
>  {
>  	struct kvmppc_spapr_tce_table *stt = container_of(head,
> @@ -133,6 +291,8 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
>  
>  	list_del_rcu(&stt->list);
>  
> +	kvm_spapr_tce_liobn_release_iommu_group(stt, NULL /* release all */);
> +
>  	kvm_put_kvm(stt->kvm);
>  
>  	kvmppc_account_memlimit(
> @@ -183,6 +343,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  	stt->offset = args->offset;
>  	stt->size = size;
>  	stt->kvm = kvm;
> +	INIT_LIST_HEAD_RCU(&stt->iommu_tables);
>  
>  	for (i = 0; i < npages; i++) {
>  		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
> @@ -211,11 +372,101 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  	return ret;
>  }
>  
> +static void kvmppc_clear_tce(struct iommu_table *tbl, unsigned long entry)
> +{
> +	unsigned long hpa = 0;
> +	enum dma_data_direction dir = DMA_NONE;
> +
> +	iommu_tce_xchg(tbl, entry, &hpa, &dir);
> +}
> +
> +static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	struct mm_iommu_table_group_mem_t *mem = NULL;
> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +
> +	if (!pua)
> +		/* it_userspace allocation might be delayed */
> +		return H_TOO_HARD;
> +
> +	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
> +	if (!mem)
> +		return H_TOO_HARD;
> +
> +	mm_iommu_mapped_dec(mem);
> +
> +	*pua = 0;
> +
> +	return H_SUCCESS;
> +}
> +
> +static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	enum dma_data_direction dir = DMA_NONE;
> +	unsigned long hpa = 0;
> +	long ret;
> +
> +	if (WARN_ON_ONCE(iommu_tce_xchg(tbl, entry, &hpa, &dir)))
> +		return H_HARDWARE;
> +
> +	if (dir = DMA_NONE)
> +		return H_SUCCESS;
> +
> +	ret = kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
> +	if (ret != H_SUCCESS)
> +		iommu_tce_xchg(tbl, entry, &hpa, &dir);
> +
> +	return ret;
> +}
> +
> +long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
> +		unsigned long entry, unsigned long ua,
> +		enum dma_data_direction dir)
> +{
> +	long ret;
> +	unsigned long hpa, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +	struct mm_iommu_table_group_mem_t *mem;
> +
> +	if (!pua)
> +		/* it_userspace allocation might be delayed */
> +		return H_TOO_HARD;
> +
> +	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
> +	if (!mem)
> +		/* This only handles v2 IOMMU type, v1 is handled via ioctl() */
> +		return H_TOO_HARD;
> +
> +	if (WARN_ON_ONCE(mm_iommu_ua_to_hpa(mem, ua, &hpa)))
> +		return H_HARDWARE;
> +
> +	if (mm_iommu_mapped_inc(mem))
> +		return H_CLOSED;
> +
> +	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
> +	if (WARN_ON_ONCE(ret)) {
> +		mm_iommu_mapped_dec(mem);
> +		return H_HARDWARE;
> +	}
> +
> +	if (dir != DMA_NONE)
> +		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
> +
> +	*pua = ua;
> +
> +	return 0;
> +}
> +
>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		      unsigned long ioba, unsigned long tce)
>  {
>  	struct kvmppc_spapr_tce_table *stt;
> -	long ret;
> +	long ret, idx;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
> +	unsigned long entry, ua = 0;
> +	enum dma_data_direction dir;
>  
>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>  	/* 	    liobn, ioba, tce); */
> @@ -232,7 +483,35 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  	if (ret != H_SUCCESS)
>  		return ret;
>  
> -	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
> +	dir = iommu_tce_direction(tce);
> +	if ((dir != DMA_NONE) && kvmppc_gpa_to_ua(vcpu->kvm,
> +			tce & ~(TCE_PCI_READ | TCE_PCI_WRITE), &ua, NULL))
> +		return H_PARAMETER;
> +
> +	entry = ioba >> stt->page_shift;
> +
> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +		if (dir = DMA_NONE) {
> +			ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
> +					stit->tbl, entry);
> +		} else {
> +			idx = srcu_read_lock(&vcpu->kvm->srcu);
> +			ret = kvmppc_tce_iommu_map(vcpu->kvm, stit->tbl,
> +					entry, ua, dir);
> +			srcu_read_unlock(&vcpu->kvm->srcu, idx);
> +		}
> +
> +		if (ret = H_SUCCESS)
> +			continue;
> +
> +		if (ret = H_TOO_HARD)
> +			return ret;
> +
> +		WARN_ON_ONCE(1);
> +		kvmppc_clear_tce(stit->tbl, entry);
> +	}
> +
> +	kvmppc_tce_put(stt, entry, tce);
>  
>  	return H_SUCCESS;
>  }
> @@ -247,6 +526,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	unsigned long entry, ua = 0;
>  	u64 __user *tces;
>  	u64 tce;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -285,6 +565,26 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  		if (ret != H_SUCCESS)
>  			goto unlock_exit;
>  
> +		if (kvmppc_gpa_to_ua(vcpu->kvm,
> +				tce & ~(TCE_PCI_READ | TCE_PCI_WRITE),
> +				&ua, NULL))
> +			return H_PARAMETER;
> +
> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +			ret = kvmppc_tce_iommu_map(vcpu->kvm,
> +					stit->tbl, entry + i, ua,
> +					iommu_tce_direction(tce));
> +
> +			if (ret = H_SUCCESS)
> +				continue;
> +
> +			if (ret = H_TOO_HARD)
> +				goto unlock_exit;
> +
> +			WARN_ON_ONCE(1);
> +			kvmppc_clear_tce(stit->tbl, entry);
> +		}
> +
>  		kvmppc_tce_put(stt, entry + i, tce);
>  	}
>  
> @@ -301,6 +601,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -314,6 +615,24 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>  		return H_PARAMETER;
>  
> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +		unsigned long entry = ioba >> stit->tbl->it_page_shift;
> +
> +		for (i = 0; i < npages; ++i) {
> +			ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
> +					stit->tbl, entry + i);
> +
> +			if (ret = H_SUCCESS)
> +				continue;
> +
> +			if (ret = H_TOO_HARD)
> +				return ret;
> +
> +			WARN_ON_ONCE(1);
> +			kvmppc_clear_tce(stit->tbl, entry);
> +		}
> +	}
> +
>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>  
> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> index 440d3ab5dc32..eda0a8f6fae8 100644
> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -40,6 +40,31 @@
>  #include <asm/iommu.h>
>  #include <asm/tce.h>
>  
> +#ifdef CONFIG_BUG
> +
> +#define WARN_ON_ONCE_RM(condition)	({			\
> +	static bool __section(.data.unlikely) __warned;		\
> +	int __ret_warn_once = !!(condition);			\
> +								\
> +	if (unlikely(__ret_warn_once && !__warned)) {		\
> +		__warned = true;				\
> +		pr_err("WARN_ON_ONCE_RM: (%s) at %s:%u\n",	\
> +				__stringify(condition),		\
> +				__func__, __LINE__);		\
> +		dump_stack();					\
> +	}							\
> +	unlikely(__ret_warn_once);				\
> +})
> +
> +#else
> +
> +#define WARN_ON_ONCE_RM(condition) ({				\
> +	int __ret_warn_on = !!(condition);			\
> +	unlikely(__ret_warn_on);				\
> +})
> +
> +#endif
> +
>  #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
>  
>  /*
> @@ -161,11 +186,117 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
>  EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
>  
>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> +static void kvmppc_rm_clear_tce(struct iommu_table *tbl, unsigned long entry)
> +{
> +	unsigned long hpa = 0;
> +	enum dma_data_direction dir = DMA_NONE;
> +
> +	iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> +}
> +
> +static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm *kvm,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	struct mm_iommu_table_group_mem_t *mem = NULL;
> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +
> +	if (!pua)
> +		/* it_userspace allocation might be delayed */
> +		return H_TOO_HARD;
> +
> +	pua = (void *) vmalloc_to_phys(pua);
> +	if (WARN_ON_ONCE_RM(!pua))
> +		return H_HARDWARE;
> +
> +	mem = mm_iommu_lookup_rm(kvm->mm, *pua, pgsize);
> +	if (!mem)
> +		return H_TOO_HARD;
> +
> +	mm_iommu_mapped_dec(mem);
> +
> +	*pua = 0;
> +
> +	return H_SUCCESS;
> +}
> +
> +static long kvmppc_rm_tce_iommu_unmap(struct kvm *kvm,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	enum dma_data_direction dir = DMA_NONE;
> +	unsigned long hpa = 0;
> +	long ret;
> +
> +	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
> +		/*
> +		 * real mode xchg can fail if struct page crosses
> +		 * a page boundary
> +		 */
> +		return H_TOO_HARD;
> +
> +	if (dir = DMA_NONE)
> +		return H_SUCCESS;
> +
> +	ret = kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
> +	if (ret)
> +		iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> +
> +	return ret;
> +}
> +
> +static long kvmppc_rm_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
> +		unsigned long entry, unsigned long ua,
> +		enum dma_data_direction dir)
> +{
> +	long ret;
> +	unsigned long hpa = 0;
> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +	struct mm_iommu_table_group_mem_t *mem;
> +
> +	if (!pua)
> +		/* it_userspace allocation might be delayed */
> +		return H_TOO_HARD;
> +
> +	mem = mm_iommu_lookup_rm(kvm->mm, ua, 1ULL << tbl->it_page_shift);
> +	if (!mem)
> +		return H_TOO_HARD;
> +
> +	if (WARN_ON_ONCE_RM(mm_iommu_ua_to_hpa_rm(mem, ua, &hpa)))
> +		return H_HARDWARE;
> +
> +	pua = (void *) vmalloc_to_phys(pua);
> +	if (WARN_ON_ONCE_RM(!pua))
> +		return H_HARDWARE;
> +
> +	if (WARN_ON_ONCE_RM(mm_iommu_mapped_inc(mem)))
> +		return H_CLOSED;
> +
> +	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> +	if (ret) {
> +		mm_iommu_mapped_dec(mem);
> +		/*
> +		 * real mode xchg can fail if struct page crosses
> +		 * a page boundary
> +		 */
> +		return H_TOO_HARD;
> +	}
> +
> +	if (dir != DMA_NONE)
> +		kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
> +
> +	*pua = ua;
> +
> +	return 0;
> +}
> +
>  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		unsigned long ioba, unsigned long tce)
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long ret;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
> +	unsigned long entry, ua = 0;
> +	enum dma_data_direction dir;
>  
>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>  	/* 	    liobn, ioba, tce); */
> @@ -182,7 +313,32 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  	if (ret != H_SUCCESS)
>  		return ret;
>  
> -	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
> +	dir = iommu_tce_direction(tce);
> +	if ((dir != DMA_NONE) && kvmppc_gpa_to_ua(vcpu->kvm,
> +			tce & ~(TCE_PCI_READ | TCE_PCI_WRITE), &ua, NULL))
> +		return H_PARAMETER;
> +
> +	entry = ioba >> stt->page_shift;
> +
> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +		if (dir = DMA_NONE)
> +			ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
> +					stit->tbl, entry);
> +		else
> +			ret = kvmppc_rm_tce_iommu_map(vcpu->kvm,
> +					stit->tbl, entry, ua, dir);
> +
> +		if (ret = H_SUCCESS)
> +			continue;
> +
> +		if (ret = H_TOO_HARD)
> +			return ret;
> +
> +		WARN_ON_ONCE_RM(1);
> +		kvmppc_rm_clear_tce(stit->tbl, entry);
> +	}
> +
> +	kvmppc_tce_put(stt, entry, tce);
>  
>  	return H_SUCCESS;
>  }
> @@ -223,6 +379,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	unsigned long tces, entry, ua = 0;
>  	unsigned long *rmap = NULL;
>  	bool prereg = false;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -270,6 +427,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  			return H_TOO_HARD;
>  
>  		rmap = (void *) vmalloc_to_phys(rmap);
> +		if (WARN_ON_ONCE_RM(!rmap))
> +			return H_HARDWARE;
>  
>  		/*
>  		 * Synchronize with the MMU notifier callbacks in
> @@ -293,6 +452,27 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  		if (ret != H_SUCCESS)
>  			goto unlock_exit;
>  
> +		ua = 0;
> +		if (kvmppc_gpa_to_ua(vcpu->kvm,
> +				tce & ~(TCE_PCI_READ | TCE_PCI_WRITE),
> +				&ua, NULL))
> +			return H_PARAMETER;
> +
> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +			ret = kvmppc_rm_tce_iommu_map(vcpu->kvm,
> +					stit->tbl, entry + i, ua,
> +					iommu_tce_direction(tce));
> +
> +			if (ret = H_SUCCESS)
> +				continue;
> +
> +			if (ret = H_TOO_HARD)
> +				goto unlock_exit;
> +
> +			WARN_ON_ONCE_RM(1);
> +			kvmppc_rm_clear_tce(stit->tbl, entry);
> +		}
> +
>  		kvmppc_tce_put(stt, entry + i, tce);
>  	}
>  
> @@ -309,6 +489,7 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -322,6 +503,24 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>  		return H_PARAMETER;
>  
> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +		unsigned long entry = ioba >> stit->tbl->it_page_shift;
> +
> +		for (i = 0; i < npages; ++i) {
> +			ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
> +					stit->tbl, entry + i);
> +
> +			if (ret = H_SUCCESS)
> +				continue;
> +
> +			if (ret = H_TOO_HARD)
> +				return ret;
> +
> +			WARN_ON_ONCE_RM(1);
> +			kvmppc_rm_clear_tce(stit->tbl, entry);
> +		}
> +	}
> +
>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>  
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index 95c91a9de351..62bdd6c48107 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -538,6 +538,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  #ifdef CONFIG_PPC_BOOK3S_64
>  	case KVM_CAP_SPAPR_TCE:
>  	case KVM_CAP_SPAPR_TCE_64:
> +		/* fallthrough */
> +	case KVM_CAP_SPAPR_TCE_VFIO:
>  	case KVM_CAP_PPC_RTAS:
>  	case KVM_CAP_PPC_FIXUP_HCALL:
>  	case KVM_CAP_PPC_ENABLE_HCALL:
> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> index d32f239eb471..2b7dc22265fe 100644
> --- a/virt/kvm/vfio.c
> +++ b/virt/kvm/vfio.c
> @@ -20,6 +20,10 @@
>  #include <linux/vfio.h>
>  #include "vfio.h"
>  
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +#include <asm/kvm_ppc.h>
> +#endif
> +
>  struct kvm_vfio_group {
>  	struct list_head node;
>  	struct vfio_group *vfio_group;
> @@ -211,6 +215,9 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>  
>  		mutex_unlock(&kv->lock);
>  
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +		kvm_spapr_tce_release_iommu_group(dev->kvm, vfio_group);
> +#endif
>  		kvm_vfio_group_set_kvm(vfio_group, NULL);
>  
>  		kvm_vfio_group_put_external_user(vfio_group);
> @@ -218,6 +225,53 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>  		kvm_vfio_update_coherency(dev);
>  
>  		return ret;
> +
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
> +		struct kvm_vfio_spapr_tce param;
> +		unsigned long minsz;
> +		struct kvm_vfio *kv = dev->private;
> +		struct vfio_group *vfio_group;
> +		struct kvm_vfio_group *kvg;
> +		struct fd f;
> +
> +		minsz = offsetofend(struct kvm_vfio_spapr_tce, tablefd);
> +
> +		if (copy_from_user(&param, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (param.argsz < minsz || param.flags)
> +			return -EINVAL;
> +
> +		f = fdget(param.groupfd);
> +		if (!f.file)
> +			return -EBADF;
> +
> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
> +		fdput(f);
> +
> +		if (IS_ERR(vfio_group))
> +			return PTR_ERR(vfio_group);
> +
> +		ret = -ENOENT;
> +
> +		mutex_lock(&kv->lock);
> +
> +		list_for_each_entry(kvg, &kv->group_list, node) {
> +			if (kvg->vfio_group != vfio_group)
> +				continue;
> +
> +			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
> +					param.tablefd, vfio_group);
> +
> +			break;
> +		}
> +
> +		mutex_unlock(&kv->lock);
> +
> +		return ret;
> +	}
> +#endif /* CONFIG_SPAPR_TCE_IOMMU */


The group reference is leaked if kvm_spapr_tce_attach_iommu_group()
fails.  My preference would be to not hold that separate group
reference in the spapr code anyway, having a parallel life cycle over
there is confusing and results in ugliness like duplicating 
kvm_vfio_group_put_external_user().  Thanks,

Alex
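
One way to plug that leak, sketched against the handler as posted (the
exact placement is an assumption; the put just has to cover every path
where the group reference was taken but the attach did not succeed):

		mutex_unlock(&kv->lock);

		/* not attached: drop the reference taken by
		 * kvm_vfio_group_get_external_user() above */
		if (ret)
			kvm_vfio_group_put_external_user(vfio_group);

		return ret;
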

>  	}
>  
>  	return -ENXIO;
> @@ -242,6 +296,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
>  		switch (attr->attr) {
>  		case KVM_DEV_VFIO_GROUP_ADD:
>  		case KVM_DEV_VFIO_GROUP_DEL:
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
> +#endif
>  			return 0;
>  		}
>  
> @@ -257,6 +314,9 @@ static void kvm_vfio_destroy(struct kvm_device *dev)
>  	struct kvm_vfio_group *kvg, *tmp;
>  
>  	list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +		kvm_spapr_tce_release_iommu_group(dev->kvm, kvg->vfio_group);
> +#endif
>  		kvm_vfio_group_set_kvm(kvg->vfio_group, NULL);
>  		kvm_vfio_group_put_external_user(kvg->vfio_group);
>  		list_del(&kvg->node);


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH kernel v8 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
  2017-03-14 21:05     ` Alex Williamson
@ 2017-03-15  4:40       ` David Gibson
  -1 siblings, 0 replies; 53+ messages in thread
From: David Gibson @ 2017-03-15  4:40 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, linuxppc-dev, Paul Mackerras, kvm-ppc, kvm

[-- Attachment #1: Type: text/plain, Size: 35163 bytes --]

On Tue, Mar 14, 2017 at 03:05:27PM -0600, Alex Williamson wrote:
> On Fri, 10 Mar 2017 14:53:37 +1100
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
> > This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> > and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
> > without passing them to user space, which saves time on switching
> > to user space and back.
> > 
> > This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> > KVM tries to handle a TCE request in real mode; if that fails,
> > it passes the request to virtual mode to complete the operation.
> > If the virtual mode handler fails as well, the request is passed to
> > user space; this is not expected to happen though.
> > 
> > To avoid dealing with page use counters (which is tricky in real mode),
> > this only accelerates SPAPR TCE IOMMU v2 clients which are required
> > to pre-register the userspace memory. The very first TCE request will
> > be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> > of the TCE table (iommu_table::it_userspace) is not allocated till
> > the very first mapping happens and we cannot call vmalloc in real mode.
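
(For context: the pre-registration mentioned above is done from
userspace via VFIO_IOMMU_SPAPR_REGISTER_MEMORY on a v2 container; a
rough sketch, where container_fd, ram and ram_size are placeholders
standing in for the real userspace state:

	struct vfio_iommu_spapr_register_memory reg = {
		.argsz = sizeof(reg),
		.flags = 0,
		.vaddr = (__u64)(uintptr_t)ram,
		.size = ram_size,
	};

	if (ioctl(container_fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg))
		perror("VFIO_IOMMU_SPAPR_REGISTER_MEMORY");

Only guest addresses falling into such pre-registered regions can be
translated by the accelerated handlers.)
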
> > 
> > If we fail to update a hardware IOMMU table for an unexpected reason,
> > we just clear it and move on as there is nothing else we can really
> > do about it - for example, if we hot-plug a VFIO device to a guest,
> > existing TCE tables will be mirrored automatically to the hardware
> > and there is no interface to report possible failures to the guest.
> > 
> > This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> > the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> > and associates a physical IOMMU table with the SPAPR TCE table (which
> > is a guest view of the hardware IOMMU table). The iommu_table object
> > is cached and referenced so we do not have to look up for it in real mode.
> > 
> > This does not implement the UNSET counterpart as there is no use for it -
> > once the acceleration is enabled, the existing userspace won't
> > disable it unless a VFIO container is destroyed; this adds necessary
> > cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> > 
> > As this creates a descriptor per IOMMU table-LIOBN couple (called
> > kvmppc_spapr_tce_iommu_table), it is possible to have several
> > descriptors with the same iommu_table (hardware IOMMU table) attached
> > to the same LIOBN; we do not remove duplicates though as
> > iommu_table_ops::exchange does not just update a TCE entry (which is
> > shared among IOMMU groups) but also invalidates the TCE cache
> > (one per IOMMU group).
> > 
> > This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> > space.
> > 
> > This adds a real mode version of WARN_ON_ONCE() as the generic version
> > causes problems with rcu_sched. Since we are testing what
> > vmalloc_to_phys() returns in the code, this also adds a check for
> > the already existing vmalloc_to_phys() call in
> > kvmppc_rm_h_put_tce_indirect().
> > 
> > This finally makes use of vfio_external_user_iommu_id() which was
> > introduced quite some time ago and was considered for removal.
> > 
> > Tests show that this patch increases transmission speed from 220MB/s
> > to 750..1020MB/s on 10Gb network (Chelsio CXGB3 10Gb ethernet card).
> > 
> > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > ---
> > Changes:
> > v8:
> > * changed all (!pua) checks to return H_TOO_HARD as ioctl() is supposed
> > to handle them
> > * changed vmalloc_to_phys() callers to return H_HARDWARE
> > * changed real mode iommu_tce_xchg_rm() callers to return H_TOO_HARD
> > and added a comment about this in the code
> > * changed virtual mode iommu_tce_xchg() callers to return H_HARDWARE
> > and do WARN_ON
> > * added WARN_ON_ONCE_RM(!rmap) in kvmppc_rm_h_put_tce_indirect() to
> > have all vmalloc_to_phys() callsites covered
> > 
> > v7:
> > * added realmode-friendly WARN_ON_ONCE_RM
> > 
> > v6:
> > * changed handling of errors returned by kvmppc_(rm_)tce_iommu_(un)map()
> > * moved kvmppc_gpa_to_ua() to TCE validation
> > 
> > v5:
> > * changed error codes in multiple places
> > * added a bunch of WARN_ON() in places which should not really happen
> > * added a check that an iommu table is not already attached to the LIOBN
> > * dropped explicit calls to iommu_tce_clear_param_check/
> > iommu_tce_put_param_check as kvmppc_tce_validate/kvmppc_ioba_validate
> > call them anyway (since the previous patch)
> > * if we fail to update a hardware IOMMU table for unexpected reason,
> > this just clears the entry
> > 
> > v4:
> > * added note to the commit log about allowing multiple updates of
> > the same IOMMU table;
> > * instead of checking whether any memory was preregistered, this
> > returns H_TOO_HARD if a specific page was not;
> > * fixed comments from v3 about error handling in many places;
> > * simplified TCE handlers and merged IOMMU parts inline - for example,
> > there used to be kvmppc_h_put_tce_iommu(), now it is merged into
> > kvmppc_h_put_tce(); this allows checking IOBA boundaries against
> > the first attached table only (makes the code simpler);
> > 
> > v3:
> > * simplified not to use VFIO group notifiers
> > * reworked cleanup, should be cleaner/simpler now
> > 
> > v2:
> > * reworked to use new VFIO notifiers
> > * now same iommu_table may appear in the list several times, to be fixed later
> > ---
> >  Documentation/virtual/kvm/devices/vfio.txt |  22 +-
> >  arch/powerpc/include/asm/kvm_host.h        |   8 +
> >  arch/powerpc/include/asm/kvm_ppc.h         |   4 +
> >  include/uapi/linux/kvm.h                   |   8 +
> >  arch/powerpc/kvm/book3s_64_vio.c           | 323 ++++++++++++++++++++++++++++-
> >  arch/powerpc/kvm/book3s_64_vio_hv.c        | 201 +++++++++++++++++-
> >  arch/powerpc/kvm/powerpc.c                 |   2 +
> >  virt/kvm/vfio.c                            |  60 ++++++
> >  8 files changed, 623 insertions(+), 5 deletions(-)
> > 
> > diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> > index ef51740c67ca..f95d867168ea 100644
> > --- a/Documentation/virtual/kvm/devices/vfio.txt
> > +++ b/Documentation/virtual/kvm/devices/vfio.txt
> > @@ -16,7 +16,25 @@ Groups:
> >  
> >  KVM_DEV_VFIO_GROUP attributes:
> >    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> > +	kvm_device_attr.addr points to an int32_t file descriptor
> > +	for the VFIO group.
> >    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
> > +	kvm_device_attr.addr points to an int32_t file descriptor
> > +	for the VFIO group.
> > +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
> > +	allocated by sPAPR KVM.
> > +	kvm_device_attr.addr points to a struct:
> >  
> > -For each, kvm_device_attr.addr points to an int32_t file descriptor
> > -for the VFIO group.
> > +	struct kvm_vfio_spapr_tce {
> > +		__u32	argsz;
> > +		__u32	flags;
> > +		__s32	groupfd;
> > +		__s32	tablefd;
> > +	};
> > +
> > +	where
> > +	@argsz is the size of kvm_vfio_spapr_tce_liobn;
> 
> kvm_vfio_spapr_tce_liobn does not exist.  s/_liobn//?
> 
> > +	@flags are not supported now, must be zero;
> 
> We do this argsz/flags thing on vfio ioctls because ioctls are a bit
> more of a restricted resource.  We don't want to burn through them so
> we make them expandable.  I don't know that we have that restriction
> here and the ADD/DEL support certainly doesn't include it.  Maybe this
> isn't necessary?

I didn't comment on this before, but I tend to agree with Alex here.

> > +	@groupfd is a file descriptor for a VFIO group;
> > +	@tablefd is a file descriptor for a TCE table allocated via
> > +		KVM_CREATE_SPAPR_TCE.
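
The intended call sequence from userspace then looks roughly like the
following sketch, where kvm_vfio_fd is the fd of a KVM_DEV_TYPE_VFIO
device and group_fd/table_fd come from VFIO and KVM_CREATE_SPAPR_TCE_64:

	struct kvm_vfio_spapr_tce spapr = {
		.argsz = sizeof(spapr),
		.flags = 0,
		.groupfd = group_fd,
		.tablefd = table_fd,
	};
	struct kvm_device_attr attr = {
		.group = KVM_DEV_VFIO_GROUP,
		.attr = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE,
		.addr = (__u64)(uintptr_t)&spapr,
	};

	if (ioctl(kvm_vfio_fd, KVM_SET_DEVICE_ATTR, &attr))
		perror("KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE");
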
> > diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> > index 7bba8f415627..857ae2c6aa39 100644
> > --- a/arch/powerpc/include/asm/kvm_host.h
> > +++ b/arch/powerpc/include/asm/kvm_host.h
> > @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
> >  	atomic_t refcnt;
> >  };
> >  
> > +struct kvmppc_spapr_tce_iommu_table {
> > +	struct rcu_head rcu;
> > +	struct list_head next;
> > +	struct vfio_group *group;
> > +	struct iommu_table *tbl;
> > +};
> > +
> >  struct kvmppc_spapr_tce_table {
> >  	struct list_head list;
> >  	struct kvm *kvm;
> > @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
> >  	u32 page_shift;
> >  	u64 offset;		/* in pages */
> >  	u64 size;		/* window size in pages */
> > +	struct list_head iommu_tables;
> >  	struct page *pages[0];
> >  };
> >  
> > diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> > index 72c2a155641f..66de7e73b3d3 100644
> > --- a/arch/powerpc/include/asm/kvm_ppc.h
> > +++ b/arch/powerpc/include/asm/kvm_ppc.h
> > @@ -164,6 +164,10 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
> >  extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
> >  			struct kvm_memory_slot *memslot, unsigned long porder);
> >  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
> > +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
> > +		struct vfio_group *group);
> > +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
> > +		struct vfio_group *group);
> >  
> >  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> >  				struct kvm_create_spapr_tce_64 *args);
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index f5a52ffb6b58..e743cb0d176e 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -1088,6 +1088,7 @@ struct kvm_device_attr {
> >  #define  KVM_DEV_VFIO_GROUP			1
> >  #define   KVM_DEV_VFIO_GROUP_ADD			1
> >  #define   KVM_DEV_VFIO_GROUP_DEL			2
> > +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
> >  
> >  enum kvm_device_type {
> >  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
> > @@ -1109,6 +1110,13 @@ enum kvm_device_type {
> >  	KVM_DEV_TYPE_MAX,
> >  };
> >  
> > +struct kvm_vfio_spapr_tce {
> > +	__u32	argsz;
> > +	__u32	flags;
> > +	__s32	groupfd;
> > +	__s32	tablefd;
> > +};
> > +
> >  /*
> >   * ioctls for VM fds
> >   */
> > diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> > index e96a4590464c..be18cda01e1b 100644
> > --- a/arch/powerpc/kvm/book3s_64_vio.c
> > +++ b/arch/powerpc/kvm/book3s_64_vio.c
> > @@ -28,6 +28,10 @@
> >  #include <linux/hugetlb.h>
> >  #include <linux/list.h>
> >  #include <linux/anon_inodes.h>
> > +#include <linux/iommu.h>
> > +#include <linux/file.h>
> > +#include <linux/vfio.h>
> > +#include <linux/module.h>
> >  
> >  #include <asm/tlbflush.h>
> >  #include <asm/kvm_ppc.h>
> > @@ -40,6 +44,36 @@
> >  #include <asm/udbg.h>
> >  #include <asm/iommu.h>
> >  #include <asm/tce.h>
> > +#include <asm/mmu_context.h>
> > +
> > +static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
> > +{
> > +	void (*fn)(struct vfio_group *);
> > +
> > +	fn = symbol_get(vfio_group_put_external_user);
> > +	if (WARN_ON(!fn))
> > +		return;
> > +
> > +	fn(vfio_group);
> > +
> > +	symbol_put(vfio_group_put_external_user);
> > +}
> > +
> > +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> > +{
> > +	int (*fn)(struct vfio_group *);
> > +	int ret = -1;
> > +
> > +	fn = symbol_get(vfio_external_user_iommu_id);
> > +	if (!fn)
> > +		return ret;
> > +
> > +	ret = fn(vfio_group);
> > +
> > +	symbol_put(vfio_external_user_iommu_id);
> > +
> > +	return ret;
> > +}
> 
> 
> Ugh.  This feels so wrong.  Why can't you have kvm-vfio pass the
> iommu_group?  Why do you need to hold this additional vfio_group
> reference?

Keeping the vfio_group reference makes sense to me, since we don't
want the vfio context for the group to go away while it's attached to
the LIOBN.

However, going via the iommu_id rather than just having an interface
to directly grab the iommu group from the vfio_group seems bizarre to
me.  I'm ok with cleaning that up later, though.
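
A direct interface could be as small as the below - a hypothetical
helper, not in the tree; the name is made up here for illustration:

	/* in drivers/vfio/vfio.c, next to vfio_external_user_iommu_id() */
	struct iommu_group *vfio_external_group_iommu_group(
			struct vfio_group *group)
	{
		/* assumes the vfio_group holds a reference to its
		 * iommu_group for its whole lifetime, so the caller's
		 * external user reference is enough to keep this valid */
		return group->iommu_group;
	}
	EXPORT_SYMBOL_GPL(vfio_external_group_iommu_group);

That would let the kvm-vfio device hand the spapr code an iommu_group
directly instead of round-tripping through an id.
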

> 
> >  
> >  static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
> >  {
> > @@ -91,6 +125,130 @@ static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
> >  	return ret;
> >  }
> >  
> > +static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
> > +{
> > +	struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
> > +			struct kvmppc_spapr_tce_iommu_table, rcu);
> > +
> > +	iommu_table_put(stit->tbl);
> > +	kvm_vfio_group_put_external_user(stit->group);
> > +
> > +	kfree(stit);
> > +}
> > +
> > +static void kvm_spapr_tce_liobn_release_iommu_group(
> > +		struct kvmppc_spapr_tce_table *stt,
> > +		struct vfio_group *group)
> > +{
> > +	struct kvmppc_spapr_tce_iommu_table *stit, *tmp;
> > +
> > +	list_for_each_entry_safe(stit, tmp, &stt->iommu_tables, next) {
> > +		if (group && (stit->group != group))
> > +			continue;
> > +
> > +		list_del_rcu(&stit->next);
> > +
> > +		call_rcu(&stit->rcu, kvm_spapr_tce_iommu_table_free);
> > +	}
> > +}
> > +
> > +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
> > +		struct vfio_group *group)
> > +{
> > +	struct kvmppc_spapr_tce_table *stt;
> > +
> > +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list)
> > +		kvm_spapr_tce_liobn_release_iommu_group(stt, group);
> > +}
> > +
> > +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
> > +		struct vfio_group *group)
> > +{
> > +	struct kvmppc_spapr_tce_table *stt = NULL;
> > +	bool found = false;
> > +	struct iommu_table *tbl = NULL;
> > +	struct iommu_table_group *table_group;
> > +	long i, ret = 0;
> > +	struct kvmppc_spapr_tce_iommu_table *stit;
> > +	struct fd f;
> > +	int group_id;
> > +	struct iommu_group *grp;
> > +
> > +	group_id = kvm_vfio_external_user_iommu_id(group);
> > +	grp = iommu_group_get_by_id(group_id);
> > +	if (WARN_ON(!grp))
> > +		return -EIO;
> > +
> > +	f = fdget(tablefd);
> > +	if (!f.file) {
> > +		ret = -EBADF;
> > +		goto put_exit;
> > +	}
> > +
> > +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
> > +		if (stt == f.file->private_data) {
> > +			found = true;
> > +			break;
> > +		}
> > +	}
> > +
> > +	fdput(f);
> > +
> > +	if (!found) {
> > +		ret = -EINVAL;
> > +		goto put_exit;
> > +	}
> > +
> > +	table_group = iommu_group_get_iommudata(grp);
> > +	if (WARN_ON(!table_group)) {
> > +		ret = -EFAULT;
> > +		goto put_exit;
> > +	}
> > +
> > +	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
> > +		struct iommu_table *tbltmp = table_group->tables[i];
> > +
> > +		if (!tbltmp)
> > +			continue;
> > +
> > +		/*
> > +		 * Make sure hardware table parameters are exactly the same;
> > +		 * this is used in the TCE handlers where boundary checks
> > +		 * use only the first attached table.
> > +		 */
> > +		if ((tbltmp->it_page_shift == stt->page_shift) &&
> > +				(tbltmp->it_offset == stt->offset) &&
> > +				(tbltmp->it_size == stt->size)) {
> > +			tbl = tbltmp;
> > +			break;
> > +		}
> > +	}
> > +	if (!tbl) {
> > +		ret = -EINVAL;
> > +		goto put_exit;
> > +	}
> > +
> > +	list_for_each_entry_rcu(stit, &stt->iommu_tables, next) {
> > +		if ((stit->tbl == tbl) && (stit->group == group)) {
> > +			ret = -EBUSY;
> > +			goto put_exit;
> > +		}
> > +	}
> > +
> > +	iommu_table_get(tbl);
> > +
> > +	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
> > +	stit->tbl = tbl;
> > +	stit->group = group;
> > +
> > +	list_add_rcu(&stit->next, &stt->iommu_tables);
> > +
> > +put_exit:
> > +	iommu_group_put(grp);
> > +
> > +	return ret;
> > +}
> > +
> >  static void release_spapr_tce_table(struct rcu_head *head)
> >  {
> >  	struct kvmppc_spapr_tce_table *stt = container_of(head,
> > @@ -133,6 +291,8 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
> >  
> >  	list_del_rcu(&stt->list);
> >  
> > +	kvm_spapr_tce_liobn_release_iommu_group(stt, NULL /* release all */);
> > +
> >  	kvm_put_kvm(stt->kvm);
> >  
> >  	kvmppc_account_memlimit(
> > @@ -183,6 +343,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> >  	stt->offset = args->offset;
> >  	stt->size = size;
> >  	stt->kvm = kvm;
> > +	INIT_LIST_HEAD_RCU(&stt->iommu_tables);
> >  
> >  	for (i = 0; i < npages; i++) {
> >  		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
> > @@ -211,11 +372,101 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> >  	return ret;
> >  }
> >  
> > +static void kvmppc_clear_tce(struct iommu_table *tbl, unsigned long entry)
> > +{
> > +	unsigned long hpa = 0;
> > +	enum dma_data_direction dir = DMA_NONE;
> > +
> > +	iommu_tce_xchg(tbl, entry, &hpa, &dir);
> > +}
> > +
> > +static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
> > +		struct iommu_table *tbl, unsigned long entry)
> > +{
> > +	struct mm_iommu_table_group_mem_t *mem = NULL;
> > +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> > +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> > +
> > +	if (!pua)
> > +		/* it_userspace allocation might be delayed */
> > +		return H_TOO_HARD;
> > +
> > +	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
> > +	if (!mem)
> > +		return H_TOO_HARD;
> > +
> > +	mm_iommu_mapped_dec(mem);
> > +
> > +	*pua = 0;
> > +
> > +	return H_SUCCESS;
> > +}
> > +
> > +static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
> > +		struct iommu_table *tbl, unsigned long entry)
> > +{
> > +	enum dma_data_direction dir = DMA_NONE;
> > +	unsigned long hpa = 0;
> > +	long ret;
> > +
> > +	if (WARN_ON_ONCE(iommu_tce_xchg(tbl, entry, &hpa, &dir)))
> > +		return H_HARDWARE;
> > +
> > +	if (dir == DMA_NONE)
> > +		return H_SUCCESS;
> > +
> > +	ret = kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
> > +	if (ret != H_SUCCESS)
> > +		iommu_tce_xchg(tbl, entry, &hpa, &dir);
> > +
> > +	return ret;
> > +}
> > +
> > +long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
> > +		unsigned long entry, unsigned long ua,
> > +		enum dma_data_direction dir)
> > +{
> > +	long ret;
> > +	unsigned long hpa, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> > +	struct mm_iommu_table_group_mem_t *mem;
> > +
> > +	if (!pua)
> > +		/* it_userspace allocation might be delayed */
> > +		return H_TOO_HARD;
> > +
> > +	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
> > +	if (!mem)
> > +		/* This only handles v2 IOMMU type, v1 is handled via ioctl() */
> > +		return H_TOO_HARD;
> > +
> > +	if (WARN_ON_ONCE(mm_iommu_ua_to_hpa(mem, ua, &hpa)))
> > +		return H_HARDWARE;
> > +
> > +	if (mm_iommu_mapped_inc(mem))
> > +		return H_CLOSED;
> > +
> > +	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
> > +	if (WARN_ON_ONCE(ret)) {
> > +		mm_iommu_mapped_dec(mem);
> > +		return H_HARDWARE;
> > +	}
> > +
> > +	if (dir != DMA_NONE)
> > +		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
> > +
> > +	*pua = ua;
> > +
> > +	return 0;
> > +}
> > +
> >  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> >  		      unsigned long ioba, unsigned long tce)
> >  {
> >  	struct kvmppc_spapr_tce_table *stt;
> > -	long ret;
> > +	long ret, idx;
> > +	struct kvmppc_spapr_tce_iommu_table *stit;
> > +	unsigned long entry, ua = 0;
> > +	enum dma_data_direction dir;
> >  
> >  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
> >  	/* 	    liobn, ioba, tce); */
> > @@ -232,7 +483,35 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> >  	if (ret != H_SUCCESS)
> >  		return ret;
> >  
> > -	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
> > +	dir = iommu_tce_direction(tce);
> > +	if ((dir != DMA_NONE) && kvmppc_gpa_to_ua(vcpu->kvm,
> > +			tce & ~(TCE_PCI_READ | TCE_PCI_WRITE), &ua, NULL))
> > +		return H_PARAMETER;
> > +
> > +	entry = ioba >> stt->page_shift;
> > +
> > +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> > +		if (dir == DMA_NONE) {
> > +			ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
> > +					stit->tbl, entry);
> > +		} else {
> > +			idx = srcu_read_lock(&vcpu->kvm->srcu);
> > +			ret = kvmppc_tce_iommu_map(vcpu->kvm, stit->tbl,
> > +					entry, ua, dir);
> > +			srcu_read_unlock(&vcpu->kvm->srcu, idx);
> > +		}
> > +
> > +		if (ret == H_SUCCESS)
> > +			continue;
> > +
> > +		if (ret == H_TOO_HARD)
> > +			return ret;
> > +
> > +		WARN_ON_ONCE(1);
> > +		kvmppc_clear_tce(stit->tbl, entry);
> > +	}
> > +
> > +	kvmppc_tce_put(stt, entry, tce);
> >  
> >  	return H_SUCCESS;
> >  }
> > @@ -247,6 +526,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >  	unsigned long entry, ua = 0;
> >  	u64 __user *tces;
> >  	u64 tce;
> > +	struct kvmppc_spapr_tce_iommu_table *stit;
> >  
> >  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >  	if (!stt)
> > @@ -285,6 +565,26 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >  		if (ret != H_SUCCESS)
> >  			goto unlock_exit;
> >  
> > +		if (kvmppc_gpa_to_ua(vcpu->kvm,
> > +				tce & ~(TCE_PCI_READ | TCE_PCI_WRITE),
> > +				&ua, NULL))
> > +			return H_PARAMETER;
> > +
> > +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> > +			ret = kvmppc_tce_iommu_map(vcpu->kvm,
> > +					stit->tbl, entry + i, ua,
> > +					iommu_tce_direction(tce));
> > +
> > +			if (ret == H_SUCCESS)
> > +				continue;
> > +
> > +			if (ret == H_TOO_HARD)
> > +				goto unlock_exit;
> > +
> > +			WARN_ON_ONCE(1);
> > +			kvmppc_clear_tce(stit->tbl, entry);
> > +		}
> > +
> >  		kvmppc_tce_put(stt, entry + i, tce);
> >  	}
> >  
> > @@ -301,6 +601,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
> >  {
> >  	struct kvmppc_spapr_tce_table *stt;
> >  	long i, ret;
> > +	struct kvmppc_spapr_tce_iommu_table *stit;
> >  
> >  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >  	if (!stt)
> > @@ -314,6 +615,24 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
> >  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
> >  		return H_PARAMETER;
> >  
> > +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> > +		unsigned long entry = ioba >> stit->tbl->it_page_shift;
> > +
> > +		for (i = 0; i < npages; ++i) {
> > +			ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
> > +					stit->tbl, entry + i);
> > +
> > +			if (ret == H_SUCCESS)
> > +				continue;
> > +
> > +			if (ret == H_TOO_HARD)
> > +				return ret;
> > +
> > +			WARN_ON_ONCE(1);
> > +			kvmppc_clear_tce(stit->tbl, entry);
> > +		}
> > +	}
> > +
> >  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
> >  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
> >  
> > diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> > index 440d3ab5dc32..eda0a8f6fae8 100644
> > --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> > +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> > @@ -40,6 +40,31 @@
> >  #include <asm/iommu.h>
> >  #include <asm/tce.h>
> >  
> > +#ifdef CONFIG_BUG
> > +
> > +#define WARN_ON_ONCE_RM(condition)	({			\
> > +	static bool __section(.data.unlikely) __warned;		\
> > +	int __ret_warn_once = !!(condition);			\
> > +								\
> > +	if (unlikely(__ret_warn_once && !__warned)) {		\
> > +		__warned = true;				\
> > +		pr_err("WARN_ON_ONCE_RM: (%s) at %s:%u\n",	\
> > +				__stringify(condition),		\
> > +				__func__, __LINE__);		\
> > +		dump_stack();					\
> > +	}							\
> > +	unlikely(__ret_warn_once);				\
> > +})
> > +
> > +#else
> > +
> > +#define WARN_ON_ONCE_RM(condition) ({				\
> > +	int __ret_warn_on = !!(condition);			\
> > +	unlikely(__ret_warn_on);				\
> > +})
> > +
> > +#endif
> > +
> >  #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
> >  
> >  /*
> > @@ -161,11 +186,117 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
> >  EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
> >  
> >  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> > +static void kvmppc_rm_clear_tce(struct iommu_table *tbl, unsigned long entry)
> > +{
> > +	unsigned long hpa = 0;
> > +	enum dma_data_direction dir = DMA_NONE;
> > +
> > +	iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> > +}
> > +
> > +static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm *kvm,
> > +		struct iommu_table *tbl, unsigned long entry)
> > +{
> > +	struct mm_iommu_table_group_mem_t *mem = NULL;
> > +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> > +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> > +
> > +	if (!pua)
> > +		/* it_userspace allocation might be delayed */
> > +		return H_TOO_HARD;
> > +
> > +	pua = (void *) vmalloc_to_phys(pua);
> > +	if (WARN_ON_ONCE_RM(!pua))
> > +		return H_HARDWARE;
> > +
> > +	mem = mm_iommu_lookup_rm(kvm->mm, *pua, pgsize);
> > +	if (!mem)
> > +		return H_TOO_HARD;
> > +
> > +	mm_iommu_mapped_dec(mem);
> > +
> > +	*pua = 0;
> > +
> > +	return H_SUCCESS;
> > +}
> > +
> > +static long kvmppc_rm_tce_iommu_unmap(struct kvm *kvm,
> > +		struct iommu_table *tbl, unsigned long entry)
> > +{
> > +	enum dma_data_direction dir = DMA_NONE;
> > +	unsigned long hpa = 0;
> > +	long ret;
> > +
> > +	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
> > +		/*
> > +		 * real mode xchg can fail if struct page crosses
> > +		 * a page boundary
> > +		 */
> > +		return H_TOO_HARD;
> > +
> > +	if (dir == DMA_NONE)
> > +		return H_SUCCESS;
> > +
> > +	ret = kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
> > +	if (ret)
> > +		iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> > +
> > +	return ret;
> > +}
> > +
> > +static long kvmppc_rm_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
> > +		unsigned long entry, unsigned long ua,
> > +		enum dma_data_direction dir)
> > +{
> > +	long ret;
> > +	unsigned long hpa = 0;
> > +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> > +	struct mm_iommu_table_group_mem_t *mem;
> > +
> > +	if (!pua)
> > +		/* it_userspace allocation might be delayed */
> > +		return H_TOO_HARD;
> > +
> > +	mem = mm_iommu_lookup_rm(kvm->mm, ua, 1ULL << tbl->it_page_shift);
> > +	if (!mem)
> > +		return H_TOO_HARD;
> > +
> > +	if (WARN_ON_ONCE_RM(mm_iommu_ua_to_hpa_rm(mem, ua, &hpa)))
> > +		return H_HARDWARE;
> > +
> > +	pua = (void *) vmalloc_to_phys(pua);
> > +	if (WARN_ON_ONCE_RM(!pua))
> > +		return H_HARDWARE;
> > +
> > +	if (WARN_ON_ONCE_RM(mm_iommu_mapped_inc(mem)))
> > +		return H_CLOSED;
> > +
> > +	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> > +	if (ret) {
> > +		mm_iommu_mapped_dec(mem);
> > +		/*
> > +		 * real mode xchg can fail if struct page crosses
> > +		 * a page boundary
> > +		 */
> > +		return H_TOO_HARD;
> > +	}
> > +
> > +	if (dir != DMA_NONE)
> > +		kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
> > +
> > +	*pua = ua;
> > +
> > +	return 0;
> > +}
> > +
> >  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> >  		unsigned long ioba, unsigned long tce)
> >  {
> >  	struct kvmppc_spapr_tce_table *stt;
> >  	long ret;
> > +	struct kvmppc_spapr_tce_iommu_table *stit;
> > +	unsigned long entry, ua = 0;
> > +	enum dma_data_direction dir;
> >  
> >  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
> >  	/* 	    liobn, ioba, tce); */
> > @@ -182,7 +313,32 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> >  	if (ret != H_SUCCESS)
> >  		return ret;
> >  
> > -	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
> > +	dir = iommu_tce_direction(tce);
> > +	if ((dir != DMA_NONE) && kvmppc_gpa_to_ua(vcpu->kvm,
> > +			tce & ~(TCE_PCI_READ | TCE_PCI_WRITE), &ua, NULL))
> > +		return H_PARAMETER;
> > +
> > +	entry = ioba >> stt->page_shift;
> > +
> > +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> > +		if (dir == DMA_NONE)
> > +			ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
> > +					stit->tbl, entry);
> > +		else
> > +			ret = kvmppc_rm_tce_iommu_map(vcpu->kvm,
> > +					stit->tbl, entry, ua, dir);
> > +
> > +		if (ret == H_SUCCESS)
> > +			continue;
> > +
> > +		if (ret == H_TOO_HARD)
> > +			return ret;
> > +
> > +		WARN_ON_ONCE_RM(1);
> > +		kvmppc_rm_clear_tce(stit->tbl, entry);
> > +	}
> > +
> > +	kvmppc_tce_put(stt, entry, tce);
> >  
> >  	return H_SUCCESS;
> >  }
> > @@ -223,6 +379,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >  	unsigned long tces, entry, ua = 0;
> >  	unsigned long *rmap = NULL;
> >  	bool prereg = false;
> > +	struct kvmppc_spapr_tce_iommu_table *stit;
> >  
> >  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >  	if (!stt)
> > @@ -270,6 +427,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >  			return H_TOO_HARD;
> >  
> >  		rmap = (void *) vmalloc_to_phys(rmap);
> > +		if (WARN_ON_ONCE_RM(!rmap))
> > +			return H_HARDWARE;
> >  
> >  		/*
> >  		 * Synchronize with the MMU notifier callbacks in
> > @@ -293,6 +452,27 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >  		if (ret != H_SUCCESS)
> >  			goto unlock_exit;
> >  
> > +		ua = 0;
> > +		if (kvmppc_gpa_to_ua(vcpu->kvm,
> > +				tce & ~(TCE_PCI_READ | TCE_PCI_WRITE),
> > +				&ua, NULL))
> > +			return H_PARAMETER;
> > +
> > +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> > +			ret = kvmppc_rm_tce_iommu_map(vcpu->kvm,
> > +					stit->tbl, entry + i, ua,
> > +					iommu_tce_direction(tce));
> > +
> > +			if (ret == H_SUCCESS)
> > +				continue;
> > +
> > +			if (ret == H_TOO_HARD)
> > +				goto unlock_exit;
> > +
> > +			WARN_ON_ONCE_RM(1);
> > +			kvmppc_rm_clear_tce(stit->tbl, entry);
> > +		}
> > +
> >  		kvmppc_tce_put(stt, entry + i, tce);
> >  	}
> >  
> > @@ -309,6 +489,7 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
> >  {
> >  	struct kvmppc_spapr_tce_table *stt;
> >  	long i, ret;
> > +	struct kvmppc_spapr_tce_iommu_table *stit;
> >  
> >  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >  	if (!stt)
> > @@ -322,6 +503,24 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
> >  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
> >  		return H_PARAMETER;
> >  
> > +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> > +		unsigned long entry = ioba >> stit->tbl->it_page_shift;
> > +
> > +		for (i = 0; i < npages; ++i) {
> > +			ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
> > +					stit->tbl, entry + i);
> > +
> > +			if (ret == H_SUCCESS)
> > +				continue;
> > +
> > +			if (ret == H_TOO_HARD)
> > +				return ret;
> > +
> > +			WARN_ON_ONCE_RM(1);
> > +			kvmppc_rm_clear_tce(stit->tbl, entry);
> > +		}
> > +	}
> > +
> >  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
> >  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
> >  
> > diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> > index 95c91a9de351..62bdd6c48107 100644
> > --- a/arch/powerpc/kvm/powerpc.c
> > +++ b/arch/powerpc/kvm/powerpc.c
> > @@ -538,6 +538,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> >  #ifdef CONFIG_PPC_BOOK3S_64
> >  	case KVM_CAP_SPAPR_TCE:
> >  	case KVM_CAP_SPAPR_TCE_64:
> > +		/* fallthrough */
> > +	case KVM_CAP_SPAPR_TCE_VFIO:
> >  	case KVM_CAP_PPC_RTAS:
> >  	case KVM_CAP_PPC_FIXUP_HCALL:
> >  	case KVM_CAP_PPC_ENABLE_HCALL:
> > diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> > index d32f239eb471..2b7dc22265fe 100644
> > --- a/virt/kvm/vfio.c
> > +++ b/virt/kvm/vfio.c
> > @@ -20,6 +20,10 @@
> >  #include <linux/vfio.h>
> >  #include "vfio.h"
> >  
> > +#ifdef CONFIG_SPAPR_TCE_IOMMU
> > +#include <asm/kvm_ppc.h>
> > +#endif
> > +
> >  struct kvm_vfio_group {
> >  	struct list_head node;
> >  	struct vfio_group *vfio_group;
> > @@ -211,6 +215,9 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >  
> >  		mutex_unlock(&kv->lock);
> >  
> > +#ifdef CONFIG_SPAPR_TCE_IOMMU
> > +		kvm_spapr_tce_release_iommu_group(dev->kvm, vfio_group);
> > +#endif
> >  		kvm_vfio_group_set_kvm(vfio_group, NULL);
> >  
> >  		kvm_vfio_group_put_external_user(vfio_group);
> > @@ -218,6 +225,53 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >  		kvm_vfio_update_coherency(dev);
> >  
> >  		return ret;
> > +
> > +#ifdef CONFIG_SPAPR_TCE_IOMMU
> > +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
> > +		struct kvm_vfio_spapr_tce param;
> > +		unsigned long minsz;
> > +		struct kvm_vfio *kv = dev->private;
> > +		struct vfio_group *vfio_group;
> > +		struct kvm_vfio_group *kvg;
> > +		struct fd f;
> > +
> > +		minsz = offsetofend(struct kvm_vfio_spapr_tce, tablefd);
> > +
> > +		if (copy_from_user(&param, (void __user *)arg, minsz))
> > +			return -EFAULT;
> > +
> > +		if (param.argsz < minsz || param.flags)
> > +			return -EINVAL;
> > +
> > +		f = fdget(param.groupfd);
> > +		if (!f.file)
> > +			return -EBADF;
> > +
> > +		vfio_group = kvm_vfio_group_get_external_user(f.file);
> > +		fdput(f);
> > +
> > +		if (IS_ERR(vfio_group))
> > +			return PTR_ERR(vfio_group);
> > +
> > +		ret = -ENOENT;
> > +
> > +		mutex_lock(&kv->lock);
> > +
> > +		list_for_each_entry(kvg, &kv->group_list, node) {
> > +			if (kvg->vfio_group != vfio_group)
> > +				continue;
> > +
> > +			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
> > +					param.tablefd, vfio_group);
> > +
> > +			break;
> > +		}
> > +
> > +		mutex_unlock(&kv->lock);
> > +
> > +		return ret;
> > +	}
> > +#endif /* CONFIG_SPAPR_TCE_IOMMU */
> 
> 
> The group reference is leaked if kvm_spapr_tce_attach_iommu_group()
> fails.

Good catch.

> My preference would be to not hold that separate group
> reference in the spapr code anyway, having a parallel life cycle over
> there is confusing and results in ugliness like duplicating 
> kvm_vfio_group_put_external_user().  Thanks,
> 
> Alex
> 
> >  	}
> >  
> >  	return -ENXIO;
> > @@ -242,6 +296,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
> >  		switch (attr->attr) {
> >  		case KVM_DEV_VFIO_GROUP_ADD:
> >  		case KVM_DEV_VFIO_GROUP_DEL:
> > +#ifdef CONFIG_SPAPR_TCE_IOMMU
> > +		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
> > +#endif
> >  			return 0;
> >  		}
> >  
> > @@ -257,6 +314,9 @@ static void kvm_vfio_destroy(struct kvm_device *dev)
> >  	struct kvm_vfio_group *kvg, *tmp;
> >  
> >  	list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
> > +#ifdef CONFIG_SPAPR_TCE_IOMMU
> > +		kvm_spapr_tce_release_iommu_group(dev->kvm, kvg->vfio_group);
> > +#endif
> >  		kvm_vfio_group_set_kvm(kvg->vfio_group, NULL);
> >  		kvm_vfio_group_put_external_user(kvg->vfio_group);
> >  		list_del(&kvg->node);
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH kernel v8 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
@ 2017-03-15  4:40       ` David Gibson
  0 siblings, 0 replies; 53+ messages in thread
From: David Gibson @ 2017-03-15  4:40 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, linuxppc-dev, Paul Mackerras, kvm-ppc, kvm

[-- Attachment #1: Type: text/plain, Size: 35163 bytes --]

On Tue, Mar 14, 2017 at 03:05:27PM -0600, Alex Williamson wrote:
> On Fri, 10 Mar 2017 14:53:37 +1100
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
> > This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> > and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
> > without passing them to user space, which saves time on switching
> > to user space and back.
> > 
> > This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> > KVM tries to handle a TCE request in real mode; if that fails,
> > it passes the request to virtual mode to complete the operation.
> > If the virtual mode handler fails as well, the request is passed to
> > user space; this is not expected to happen though.
> > 
> > To avoid dealing with page use counters (which is tricky in real mode),
> > this only accelerates SPAPR TCE IOMMU v2 clients which are required
> > to pre-register the userspace memory. The very first TCE request will
> > be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> > of the TCE table (iommu_table::it_userspace) is not allocated till
> > the very first mapping happens and we cannot call vmalloc in real mode.
> > 
> > If we fail to update a hardware IOMMU table for an unexpected reason,
> > we just clear it and move on as there is nothing else we can really
> > do about it - for example, if we hot-plug a VFIO device to a guest,
> > existing TCE tables will be mirrored automatically to the hardware
> > and there is no interface to report possible failures to the guest.
> > 
> > This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> > the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> > and associates a physical IOMMU table with the SPAPR TCE table (which
> > is a guest view of the hardware IOMMU table). The iommu_table object
> > is cached and referenced so we do not have to look up for it in real mode.
> > 
> > This does not implement the UNSET counterpart as there is no use for it -
> > once the acceleration is enabled, the existing userspace won't
> > disable it unless a VFIO container is destroyed; this adds necessary
> > cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> > 
> > As this creates a descriptor per IOMMU table-LIOBN couple (called
> > kvmppc_spapr_tce_iommu_table), it is possible to have several
> > descriptors with the same iommu_table (hardware IOMMU table) attached
> > to the same LIOBN; we do not remove duplicates though as
> > iommu_table_ops::exchange does not just update a TCE entry (which is
> > shared among IOMMU groups) but also invalidates the TCE cache
> > (one per IOMMU group).
> > 
> > This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> > space.
> > 
> > This adds a real mode version of WARN_ON_ONCE() as the generic version
> > causes problems with rcu_sched. Since we are testing what
> > vmalloc_to_phys() returns in the code, this also adds a check for
> > the already existing vmalloc_to_phys() call in
> > kvmppc_rm_h_put_tce_indirect().
> > 
> > This finally makes use of vfio_external_user_iommu_id() which was
> > introduced quite some time ago and was considered for removal.
> > 
> > Tests show that this patch increases transmission speed from 220MB/s
> > to 750..1020MB/s on 10Gb network (Chelsio CXGB3 10Gb ethernet card).
> > 
> > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > ---
> > Changes:
> > v8:
> > * changed all (!pua) checks to return H_TOO_HARD as ioctl() is supposed
> > to handle them
> > * changed vmalloc_to_phys() callers to return H_HARDWARE
> > * changed real mode iommu_tce_xchg_rm() callers to return H_TOO_HARD
> > and added a comment about this in the code
> > * changed virtual mode iommu_tce_xchg() callers to return H_HARDWARE
> > and do WARN_ON
> > * added WARN_ON_ONCE_RM(!rmap) in kvmppc_rm_h_put_tce_indirect() to
> > have all vmalloc_to_phys() callsites covered
> > 
> > v7:
> > * added realmode-friendly WARN_ON_ONCE_RM
> > 
> > v6:
> > * changed handling of errors returned by kvmppc_(rm_)tce_iommu_(un)map()
> > * moved kvmppc_gpa_to_ua() to TCE validation
> > 
> > v5:
> > * changed error codes in multiple places
> > * added a bunch of WARN_ON() in places which should not really happen
> > * added a check that an iommu table is not already attached to the LIOBN
> > * dropped explicit calls to iommu_tce_clear_param_check/
> > iommu_tce_put_param_check as kvmppc_tce_validate/kvmppc_ioba_validate
> > call them anyway (since the previous patch)
> > * if we fail to update a hardware IOMMU table for unexpected reason,
> > this just clears the entry
> > 
> > v4:
> > * added note to the commit log about allowing multiple updates of
> > the same IOMMU table;
> > * instead of checking whether any memory was preregistered, this
> > returns H_TOO_HARD if a specific page was not;
> > * fixed comments from v3 about error handling in many places;
> > * simplified TCE handlers and merged IOMMU parts inline - for example,
> > there used to be kvmppc_h_put_tce_iommu(), now it is merged into
> > kvmppc_h_put_tce(); this allows checking IOBA boundaries against
> > the first attached table only (makes the code simpler);
> > 
> > v3:
> > * simplified not to use VFIO group notifiers
> > * reworked cleanup, should be cleaner/simpler now
> > 
> > v2:
> > * reworked to use new VFIO notifiers
> > * now same iommu_table may appear in the list several times, to be fixed later
> > ---
> >  Documentation/virtual/kvm/devices/vfio.txt |  22 +-
> >  arch/powerpc/include/asm/kvm_host.h        |   8 +
> >  arch/powerpc/include/asm/kvm_ppc.h         |   4 +
> >  include/uapi/linux/kvm.h                   |   8 +
> >  arch/powerpc/kvm/book3s_64_vio.c           | 323 ++++++++++++++++++++++++++++-
> >  arch/powerpc/kvm/book3s_64_vio_hv.c        | 201 +++++++++++++++++-
> >  arch/powerpc/kvm/powerpc.c                 |   2 +
> >  virt/kvm/vfio.c                            |  60 ++++++
> >  8 files changed, 623 insertions(+), 5 deletions(-)
> > 
> > diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> > index ef51740c67ca..f95d867168ea 100644
> > --- a/Documentation/virtual/kvm/devices/vfio.txt
> > +++ b/Documentation/virtual/kvm/devices/vfio.txt
> > @@ -16,7 +16,25 @@ Groups:
> >  
> >  KVM_DEV_VFIO_GROUP attributes:
> >    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> > +	kvm_device_attr.addr points to an int32_t file descriptor
> > +	for the VFIO group.
> >    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
> > +	kvm_device_attr.addr points to an int32_t file descriptor
> > +	for the VFIO group.
> > +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
> > +	allocated by sPAPR KVM.
> > +	kvm_device_attr.addr points to a struct:
> >  
> > -For each, kvm_device_attr.addr points to an int32_t file descriptor
> > -for the VFIO group.
> > +	struct kvm_vfio_spapr_tce {
> > +		__u32	argsz;
> > +		__u32	flags;
> > +		__s32	groupfd;
> > +		__s32	tablefd;
> > +	};
> > +
> > +	where
> > +	@argsz is the size of kvm_vfio_spapr_tce_liobn;
> 
> kvm_vfio_spapr_tce_liobn does not exist.  s/_liobn//?
> 
> > +	@flags are not supported now, must be zero;
> 
> We do this argsz/flags thing on vfio ioctls because ioctls are a bit
> more of a restricted resource.  We don't want to burn through them so
> we make them expandable.  I don't know that we have that restriction
> here and the ADD/DEL support certainly doesn't include it.  Maybe this
> isn't necessary?

I didn't comment on this before, but I tend to agree with Alex here.

> > +	@groupfd is a file descriptor for a VFIO group;
> > +	@tablefd is a file descriptor for a TCE table allocated via
> > +		KVM_CREATE_SPAPR_TCE.
> > diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> > index 7bba8f415627..857ae2c6aa39 100644
> > --- a/arch/powerpc/include/asm/kvm_host.h
> > +++ b/arch/powerpc/include/asm/kvm_host.h
> > @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
> >  	atomic_t refcnt;
> >  };
> >  
> > +struct kvmppc_spapr_tce_iommu_table {
> > +	struct rcu_head rcu;
> > +	struct list_head next;
> > +	struct vfio_group *group;
> > +	struct iommu_table *tbl;
> > +};
> > +
> >  struct kvmppc_spapr_tce_table {
> >  	struct list_head list;
> >  	struct kvm *kvm;
> > @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
> >  	u32 page_shift;
> >  	u64 offset;		/* in pages */
> >  	u64 size;		/* window size in pages */
> > +	struct list_head iommu_tables;
> >  	struct page *pages[0];
> >  };
> >  
> > diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> > index 72c2a155641f..66de7e73b3d3 100644
> > --- a/arch/powerpc/include/asm/kvm_ppc.h
> > +++ b/arch/powerpc/include/asm/kvm_ppc.h
> > @@ -164,6 +164,10 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
> >  extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
> >  			struct kvm_memory_slot *memslot, unsigned long porder);
> >  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
> > +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
> > +		struct vfio_group *group);
> > +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
> > +		struct vfio_group *group);
> >  
> >  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> >  				struct kvm_create_spapr_tce_64 *args);
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index f5a52ffb6b58..e743cb0d176e 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -1088,6 +1088,7 @@ struct kvm_device_attr {
> >  #define  KVM_DEV_VFIO_GROUP			1
> >  #define   KVM_DEV_VFIO_GROUP_ADD			1
> >  #define   KVM_DEV_VFIO_GROUP_DEL			2
> > +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
> >  
> >  enum kvm_device_type {
> >  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
> > @@ -1109,6 +1110,13 @@ enum kvm_device_type {
> >  	KVM_DEV_TYPE_MAX,
> >  };
> >  
> > +struct kvm_vfio_spapr_tce {
> > +	__u32	argsz;
> > +	__u32	flags;
> > +	__s32	groupfd;
> > +	__s32	tablefd;
> > +};
> > +
> >  /*
> >   * ioctls for VM fds
> >   */
> > diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> > index e96a4590464c..be18cda01e1b 100644
> > --- a/arch/powerpc/kvm/book3s_64_vio.c
> > +++ b/arch/powerpc/kvm/book3s_64_vio.c
> > @@ -28,6 +28,10 @@
> >  #include <linux/hugetlb.h>
> >  #include <linux/list.h>
> >  #include <linux/anon_inodes.h>
> > +#include <linux/iommu.h>
> > +#include <linux/file.h>
> > +#include <linux/vfio.h>
> > +#include <linux/module.h>
> >  
> >  #include <asm/tlbflush.h>
> >  #include <asm/kvm_ppc.h>
> > @@ -40,6 +44,36 @@
> >  #include <asm/udbg.h>
> >  #include <asm/iommu.h>
> >  #include <asm/tce.h>
> > +#include <asm/mmu_context.h>
> > +
> > +static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
> > +{
> > +	void (*fn)(struct vfio_group *);
> > +
> > +	fn = symbol_get(vfio_group_put_external_user);
> > +	if (WARN_ON(!fn))
> > +		return;
> > +
> > +	fn(vfio_group);
> > +
> > +	symbol_put(vfio_group_put_external_user);
> > +}
> > +
> > +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> > +{
> > +	int (*fn)(struct vfio_group *);
> > +	int ret = -1;
> > +
> > +	fn = symbol_get(vfio_external_user_iommu_id);
> > +	if (!fn)
> > +		return ret;
> > +
> > +	ret = fn(vfio_group);
> > +
> > +	symbol_put(vfio_external_user_iommu_id);
> > +
> > +	return ret;
> > +}
> 
> 
> Ugh.  This feels so wrong.  Why can't you have kvm-vfio pass the
> iommu_group?  Why do you need to hold this additional vfio_group
> reference?

Keeping the vfio_group reference makes sense to me, since we don't
want the vfio context for the group to go away while it's attached to
the LIOBN.

However, going via the iommu_id rather than just having an interface
to directly grab the iommu group from the vfio_group seems bizarre to
me.  I'm ok with cleaning that up later, though.

> 
> >  
> >  static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
> >  {
> > @@ -91,6 +125,130 @@ static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
> >  	return ret;
> >  }
> >  
> > +static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
> > +{
> > +	struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
> > +			struct kvmppc_spapr_tce_iommu_table, rcu);
> > +
> > +	iommu_table_put(stit->tbl);
> > +	kvm_vfio_group_put_external_user(stit->group);
> > +
> > +	kfree(stit);
> > +}
> > +
> > +static void kvm_spapr_tce_liobn_release_iommu_group(
> > +		struct kvmppc_spapr_tce_table *stt,
> > +		struct vfio_group *group)
> > +{
> > +	struct kvmppc_spapr_tce_iommu_table *stit, *tmp;
> > +
> > +	list_for_each_entry_safe(stit, tmp, &stt->iommu_tables, next) {
> > +		if (group && (stit->group != group))
> > +			continue;
> > +
> > +		list_del_rcu(&stit->next);
> > +
> > +		call_rcu(&stit->rcu, kvm_spapr_tce_iommu_table_free);
> > +	}
> > +}
> > +
> > +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
> > +		struct vfio_group *group)
> > +{
> > +	struct kvmppc_spapr_tce_table *stt;
> > +
> > +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list)
> > +		kvm_spapr_tce_liobn_release_iommu_group(stt, group);
> > +}
> > +
> > +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
> > +		struct vfio_group *group)
> > +{
> > +	struct kvmppc_spapr_tce_table *stt = NULL;
> > +	bool found = false;
> > +	struct iommu_table *tbl = NULL;
> > +	struct iommu_table_group *table_group;
> > +	long i, ret = 0;
> > +	struct kvmppc_spapr_tce_iommu_table *stit;
> > +	struct fd f;
> > +	int group_id;
> > +	struct iommu_group *grp;
> > +
> > +	group_id = kvm_vfio_external_user_iommu_id(group);
> > +	grp = iommu_group_get_by_id(group_id);
> > +	if (WARN_ON(!grp))
> > +		return -EIO;
> > +
> > +	f = fdget(tablefd);
> > +	if (!f.file) {
> > +		ret = -EBADF;
> > +		goto put_exit;
> > +	}
> > +
> > +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
> > +		if (stt == f.file->private_data) {
> > +			found = true;
> > +			break;
> > +		}
> > +	}
> > +
> > +	fdput(f);
> > +
> > +	if (!found) {
> > +		ret = -EINVAL;
> > +		goto put_exit;
> > +	}
> > +
> > +	table_group = iommu_group_get_iommudata(grp);
> > +	if (WARN_ON(!table_group)) {
> > +		ret = -EFAULT;
> > +		goto put_exit;
> > +	}
> > +
> > +	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
> > +		struct iommu_table *tbltmp = table_group->tables[i];
> > +
> > +		if (!tbltmp)
> > +			continue;
> > +
> > +		/*
> > +		 * Make sure hardware table parameters are exactly the same;
> > +		 * this is used in the TCE handlers where boundary checks
> > +		 * use only the first attached table.
> > +		 */
> > +		if ((tbltmp->it_page_shift == stt->page_shift) &&
> > +				(tbltmp->it_offset == stt->offset) &&
> > +				(tbltmp->it_size == stt->size)) {
> > +			tbl = tbltmp;
> > +			break;
> > +		}
> > +	}
> > +	if (!tbl) {
> > +		ret = -EINVAL;
> > +		goto put_exit;
> > +	}
> > +
> > +	list_for_each_entry_rcu(stit, &stt->iommu_tables, next) {
> > +		if ((stit->tbl == tbl) && (stit->group == group)) {
> > +			ret = -EBUSY;
> > +			goto put_exit;
> > +		}
> > +	}
> > +
> > +	iommu_table_get(tbl);
> > +
> > +	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
> > +	stit->tbl = tbl;
> > +	stit->group = group;
> > +
> > +	list_add_rcu(&stit->next, &stt->iommu_tables);
> > +
> > +put_exit:
> > +	iommu_group_put(grp);
> > +
> > +	return ret;
> > +}
> > +
> >  static void release_spapr_tce_table(struct rcu_head *head)
> >  {
> >  	struct kvmppc_spapr_tce_table *stt = container_of(head,
> > @@ -133,6 +291,8 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
> >  
> >  	list_del_rcu(&stt->list);
> >  
> > +	kvm_spapr_tce_liobn_release_iommu_group(stt, NULL /* release all */);
> > +
> >  	kvm_put_kvm(stt->kvm);
> >  
> >  	kvmppc_account_memlimit(
> > @@ -183,6 +343,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> >  	stt->offset = args->offset;
> >  	stt->size = size;
> >  	stt->kvm = kvm;
> > +	INIT_LIST_HEAD_RCU(&stt->iommu_tables);
> >  
> >  	for (i = 0; i < npages; i++) {
> >  		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
> > @@ -211,11 +372,101 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> >  	return ret;
> >  }
> >  
> > +static void kvmppc_clear_tce(struct iommu_table *tbl, unsigned long entry)
> > +{
> > +	unsigned long hpa = 0;
> > +	enum dma_data_direction dir = DMA_NONE;
> > +
> > +	iommu_tce_xchg(tbl, entry, &hpa, &dir);
> > +}
> > +
> > +static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
> > +		struct iommu_table *tbl, unsigned long entry)
> > +{
> > +	struct mm_iommu_table_group_mem_t *mem = NULL;
> > +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> > +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> > +
> > +	if (!pua)
> > +		/* it_userspace allocation might be delayed */
> > +		return H_TOO_HARD;
> > +
> > +	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
> > +	if (!mem)
> > +		return H_TOO_HARD;
> > +
> > +	mm_iommu_mapped_dec(mem);
> > +
> > +	*pua = 0;
> > +
> > +	return H_SUCCESS;
> > +}
> > +
> > +static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
> > +		struct iommu_table *tbl, unsigned long entry)
> > +{
> > +	enum dma_data_direction dir = DMA_NONE;
> > +	unsigned long hpa = 0;
> > +	long ret;
> > +
> > +	if (WARN_ON_ONCE(iommu_tce_xchg(tbl, entry, &hpa, &dir)))
> > +		return H_HARDWARE;
> > +
> > +	if (dir == DMA_NONE)
> > +		return H_SUCCESS;
> > +
> > +	ret = kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
> > +	if (ret != H_SUCCESS)
> > +		iommu_tce_xchg(tbl, entry, &hpa, &dir);
> > +
> > +	return ret;
> > +}
> > +
> > +long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
> > +		unsigned long entry, unsigned long ua,
> > +		enum dma_data_direction dir)
> > +{
> > +	long ret;
> > +	unsigned long hpa, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> > +	struct mm_iommu_table_group_mem_t *mem;
> > +
> > +	if (!pua)
> > +		/* it_userspace allocation might be delayed */
> > +		return H_TOO_HARD;
> > +
> > +	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
> > +	if (!mem)
> > +		/* This only handles v2 IOMMU type, v1 is handled via ioctl() */
> > +		return H_TOO_HARD;
> > +
> > +	if (WARN_ON_ONCE(mm_iommu_ua_to_hpa(mem, ua, &hpa)))
> > +		return H_HARDWARE;
> > +
> > +	if (mm_iommu_mapped_inc(mem))
> > +		return H_CLOSED;
> > +
> > +	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
> > +	if (WARN_ON_ONCE(ret)) {
> > +		mm_iommu_mapped_dec(mem);
> > +		return H_HARDWARE;
> > +	}
> > +
> > +	if (dir != DMA_NONE)
> > +		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
> > +
> > +	*pua = ua;
> > +
> > +	return 0;
> > +}
> > +
> >  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> >  		      unsigned long ioba, unsigned long tce)
> >  {
> >  	struct kvmppc_spapr_tce_table *stt;
> > -	long ret;
> > +	long ret, idx;
> > +	struct kvmppc_spapr_tce_iommu_table *stit;
> > +	unsigned long entry, ua = 0;
> > +	enum dma_data_direction dir;
> >  
> >  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
> >  	/* 	    liobn, ioba, tce); */
> > @@ -232,7 +483,35 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> >  	if (ret != H_SUCCESS)
> >  		return ret;
> >  
> > -	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
> > +	dir = iommu_tce_direction(tce);
> > +	if ((dir != DMA_NONE) && kvmppc_gpa_to_ua(vcpu->kvm,
> > +			tce & ~(TCE_PCI_READ | TCE_PCI_WRITE), &ua, NULL))
> > +		return H_PARAMETER;
> > +
> > +	entry = ioba >> stt->page_shift;
> > +
> > +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> > +		if (dir == DMA_NONE) {
> > +			ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
> > +					stit->tbl, entry);
> > +		} else {
> > +			idx = srcu_read_lock(&vcpu->kvm->srcu);
> > +			ret = kvmppc_tce_iommu_map(vcpu->kvm, stit->tbl,
> > +					entry, ua, dir);
> > +			srcu_read_unlock(&vcpu->kvm->srcu, idx);
> > +		}
> > +
> > +		if (ret == H_SUCCESS)
> > +			continue;
> > +
> > +		if (ret == H_TOO_HARD)
> > +			return ret;
> > +
> > +		WARN_ON_ONCE(1);
> > +		kvmppc_clear_tce(stit->tbl, entry);
> > +	}
> > +
> > +	kvmppc_tce_put(stt, entry, tce);
> >  
> >  	return H_SUCCESS;
> >  }
> > @@ -247,6 +526,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >  	unsigned long entry, ua = 0;
> >  	u64 __user *tces;
> >  	u64 tce;
> > +	struct kvmppc_spapr_tce_iommu_table *stit;
> >  
> >  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >  	if (!stt)
> > @@ -285,6 +565,26 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >  		if (ret != H_SUCCESS)
> >  			goto unlock_exit;
> >  
> > +		if (kvmppc_gpa_to_ua(vcpu->kvm,
> > +				tce & ~(TCE_PCI_READ | TCE_PCI_WRITE),
> > +				&ua, NULL))
> > +			return H_PARAMETER;
> > +
> > +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> > +			ret = kvmppc_tce_iommu_map(vcpu->kvm,
> > +					stit->tbl, entry + i, ua,
> > +					iommu_tce_direction(tce));
> > +
> > +			if (ret == H_SUCCESS)
> > +				continue;
> > +
> > +			if (ret == H_TOO_HARD)
> > +				goto unlock_exit;
> > +
> > +			WARN_ON_ONCE(1);
> > +			kvmppc_clear_tce(stit->tbl, entry + i);
> > +		}
> > +
> >  		kvmppc_tce_put(stt, entry + i, tce);
> >  	}
> >  
> > @@ -301,6 +601,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
> >  {
> >  	struct kvmppc_spapr_tce_table *stt;
> >  	long i, ret;
> > +	struct kvmppc_spapr_tce_iommu_table *stit;
> >  
> >  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >  	if (!stt)
> > @@ -314,6 +615,24 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
> >  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
> >  		return H_PARAMETER;
> >  
> > +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> > +		unsigned long entry = ioba >> stit->tbl->it_page_shift;
> > +
> > +		for (i = 0; i < npages; ++i) {
> > +			ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
> > +					stit->tbl, entry + i);
> > +
> > +			if (ret == H_SUCCESS)
> > +				continue;
> > +
> > +			if (ret == H_TOO_HARD)
> > +				return ret;
> > +
> > +			WARN_ON_ONCE(1);
> > +			kvmppc_clear_tce(stit->tbl, entry + i);
> > +		}
> > +	}
> > +
> >  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
> >  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
> >  
> > diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> > index 440d3ab5dc32..eda0a8f6fae8 100644
> > --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> > +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> > @@ -40,6 +40,31 @@
> >  #include <asm/iommu.h>
> >  #include <asm/tce.h>
> >  
> > +#ifdef CONFIG_BUG
> > +
> > +#define WARN_ON_ONCE_RM(condition)	({			\
> > +	static bool __section(.data.unlikely) __warned;		\
> > +	int __ret_warn_once = !!(condition);			\
> > +								\
> > +	if (unlikely(__ret_warn_once && !__warned)) {		\
> > +		__warned = true;				\
> > +		pr_err("WARN_ON_ONCE_RM: (%s) at %s:%u\n",	\
> > +				__stringify(condition),		\
> > +				__func__, __LINE__);		\
> > +		dump_stack();					\
> > +	}							\
> > +	unlikely(__ret_warn_once);				\
> > +})
> > +
> > +#else
> > +
> > +#define WARN_ON_ONCE_RM(condition) ({				\
> > +	int __ret_warn_on = !!(condition);			\
> > +	unlikely(__ret_warn_on);				\
> > +})
> > +
> > +#endif
> > +
> >  #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
> >  
> >  /*
> > @@ -161,11 +186,117 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
> >  EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
> >  
> >  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> > +static void kvmppc_rm_clear_tce(struct iommu_table *tbl, unsigned long entry)
> > +{
> > +	unsigned long hpa = 0;
> > +	enum dma_data_direction dir = DMA_NONE;
> > +
> > +	iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> > +}
> > +
> > +static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm *kvm,
> > +		struct iommu_table *tbl, unsigned long entry)
> > +{
> > +	struct mm_iommu_table_group_mem_t *mem = NULL;
> > +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> > +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> > +
> > +	if (!pua)
> > +		/* it_userspace allocation might be delayed */
> > +		return H_TOO_HARD;
> > +
> > +	pua = (void *) vmalloc_to_phys(pua);
> > +	if (WARN_ON_ONCE_RM(!pua))
> > +		return H_HARDWARE;
> > +
> > +	mem = mm_iommu_lookup_rm(kvm->mm, *pua, pgsize);
> > +	if (!mem)
> > +		return H_TOO_HARD;
> > +
> > +	mm_iommu_mapped_dec(mem);
> > +
> > +	*pua = 0;
> > +
> > +	return H_SUCCESS;
> > +}
> > +
> > +static long kvmppc_rm_tce_iommu_unmap(struct kvm *kvm,
> > +		struct iommu_table *tbl, unsigned long entry)
> > +{
> > +	enum dma_data_direction dir = DMA_NONE;
> > +	unsigned long hpa = 0;
> > +	long ret;
> > +
> > +	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
> > +		/*
> > +		 * real mode xchg can fail if struct page crosses
> > +		 * a page boundary
> > +		 */
> > +		return H_TOO_HARD;
> > +
> > +	if (dir == DMA_NONE)
> > +		return H_SUCCESS;
> > +
> > +	ret = kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
> > +	if (ret)
> > +		iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> > +
> > +	return ret;
> > +}
> > +
> > +static long kvmppc_rm_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
> > +		unsigned long entry, unsigned long ua,
> > +		enum dma_data_direction dir)
> > +{
> > +	long ret;
> > +	unsigned long hpa = 0;
> > +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> > +	struct mm_iommu_table_group_mem_t *mem;
> > +
> > +	if (!pua)
> > +		/* it_userspace allocation might be delayed */
> > +		return H_TOO_HARD;
> > +
> > +	mem = mm_iommu_lookup_rm(kvm->mm, ua, 1ULL << tbl->it_page_shift);
> > +	if (!mem)
> > +		return H_TOO_HARD;
> > +
> > +	if (WARN_ON_ONCE_RM(mm_iommu_ua_to_hpa_rm(mem, ua, &hpa)))
> > +		return H_HARDWARE;
> > +
> > +	pua = (void *) vmalloc_to_phys(pua);
> > +	if (WARN_ON_ONCE_RM(!pua))
> > +		return H_HARDWARE;
> > +
> > +	if (WARN_ON_ONCE_RM(mm_iommu_mapped_inc(mem)))
> > +		return H_CLOSED;
> > +
> > +	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> > +	if (ret) {
> > +		mm_iommu_mapped_dec(mem);
> > +		/*
> > +		 * real mode xchg can fail if struct page crosses
> > +		 * a page boundary
> > +		 */
> > +		return H_TOO_HARD;
> > +	}
> > +
> > +	if (dir != DMA_NONE)
> > +		kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
> > +
> > +	*pua = ua;
> > +
> > +	return 0;
> > +}
> > +
> >  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> >  		unsigned long ioba, unsigned long tce)
> >  {
> >  	struct kvmppc_spapr_tce_table *stt;
> >  	long ret;
> > +	struct kvmppc_spapr_tce_iommu_table *stit;
> > +	unsigned long entry, ua = 0;
> > +	enum dma_data_direction dir;
> >  
> >  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
> >  	/* 	    liobn, ioba, tce); */
> > @@ -182,7 +313,32 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> >  	if (ret != H_SUCCESS)
> >  		return ret;
> >  
> > -	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
> > +	dir = iommu_tce_direction(tce);
> > +	if ((dir != DMA_NONE) && kvmppc_gpa_to_ua(vcpu->kvm,
> > +			tce & ~(TCE_PCI_READ | TCE_PCI_WRITE), &ua, NULL))
> > +		return H_PARAMETER;
> > +
> > +	entry = ioba >> stt->page_shift;
> > +
> > +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> > +		if (dir == DMA_NONE)
> > +			ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
> > +					stit->tbl, entry);
> > +		else
> > +			ret = kvmppc_rm_tce_iommu_map(vcpu->kvm,
> > +					stit->tbl, entry, ua, dir);
> > +
> > +		if (ret == H_SUCCESS)
> > +			continue;
> > +
> > +		if (ret == H_TOO_HARD)
> > +			return ret;
> > +
> > +		WARN_ON_ONCE_RM(1);
> > +		kvmppc_rm_clear_tce(stit->tbl, entry);
> > +	}
> > +
> > +	kvmppc_tce_put(stt, entry, tce);
> >  
> >  	return H_SUCCESS;
> >  }
> > @@ -223,6 +379,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >  	unsigned long tces, entry, ua = 0;
> >  	unsigned long *rmap = NULL;
> >  	bool prereg = false;
> > +	struct kvmppc_spapr_tce_iommu_table *stit;
> >  
> >  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >  	if (!stt)
> > @@ -270,6 +427,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >  			return H_TOO_HARD;
> >  
> >  		rmap = (void *) vmalloc_to_phys(rmap);
> > +		if (WARN_ON_ONCE_RM(!rmap))
> > +			return H_HARDWARE;
> >  
> >  		/*
> >  		 * Synchronize with the MMU notifier callbacks in
> > @@ -293,6 +452,27 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >  		if (ret != H_SUCCESS)
> >  			goto unlock_exit;
> >  
> > +		ua = 0;
> > +		if (kvmppc_gpa_to_ua(vcpu->kvm,
> > +				tce & ~(TCE_PCI_READ | TCE_PCI_WRITE),
> > +				&ua, NULL))
> > +			return H_PARAMETER;
> > +
> > +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> > +			ret = kvmppc_rm_tce_iommu_map(vcpu->kvm,
> > +					stit->tbl, entry + i, ua,
> > +					iommu_tce_direction(tce));
> > +
> > +			if (ret == H_SUCCESS)
> > +				continue;
> > +
> > +			if (ret == H_TOO_HARD)
> > +				goto unlock_exit;
> > +
> > +			WARN_ON_ONCE_RM(1);
> > +			kvmppc_rm_clear_tce(stit->tbl, entry + i);
> > +		}
> > +
> >  		kvmppc_tce_put(stt, entry + i, tce);
> >  	}
> >  
> > @@ -309,6 +489,7 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
> >  {
> >  	struct kvmppc_spapr_tce_table *stt;
> >  	long i, ret;
> > +	struct kvmppc_spapr_tce_iommu_table *stit;
> >  
> >  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >  	if (!stt)
> > @@ -322,6 +503,24 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
> >  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
> >  		return H_PARAMETER;
> >  
> > +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> > +		unsigned long entry = ioba >> stit->tbl->it_page_shift;
> > +
> > +		for (i = 0; i < npages; ++i) {
> > +			ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
> > +					stit->tbl, entry + i);
> > +
> > +			if (ret == H_SUCCESS)
> > +				continue;
> > +
> > +			if (ret == H_TOO_HARD)
> > +				return ret;
> > +
> > +			WARN_ON_ONCE_RM(1);
> > +			kvmppc_rm_clear_tce(stit->tbl, entry + i);
> > +		}
> > +	}
> > +
> >  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
> >  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
> >  
> > diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> > index 95c91a9de351..62bdd6c48107 100644
> > --- a/arch/powerpc/kvm/powerpc.c
> > +++ b/arch/powerpc/kvm/powerpc.c
> > @@ -538,6 +538,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> >  #ifdef CONFIG_PPC_BOOK3S_64
> >  	case KVM_CAP_SPAPR_TCE:
> >  	case KVM_CAP_SPAPR_TCE_64:
> > +		/* fallthrough */
> > +	case KVM_CAP_SPAPR_TCE_VFIO:
> >  	case KVM_CAP_PPC_RTAS:
> >  	case KVM_CAP_PPC_FIXUP_HCALL:
> >  	case KVM_CAP_PPC_ENABLE_HCALL:
> > diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> > index d32f239eb471..2b7dc22265fe 100644
> > --- a/virt/kvm/vfio.c
> > +++ b/virt/kvm/vfio.c
> > @@ -20,6 +20,10 @@
> >  #include <linux/vfio.h>
> >  #include "vfio.h"
> >  
> > +#ifdef CONFIG_SPAPR_TCE_IOMMU
> > +#include <asm/kvm_ppc.h>
> > +#endif
> > +
> >  struct kvm_vfio_group {
> >  	struct list_head node;
> >  	struct vfio_group *vfio_group;
> > @@ -211,6 +215,9 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >  
> >  		mutex_unlock(&kv->lock);
> >  
> > +#ifdef CONFIG_SPAPR_TCE_IOMMU
> > +		kvm_spapr_tce_release_iommu_group(dev->kvm, vfio_group);
> > +#endif
> >  		kvm_vfio_group_set_kvm(vfio_group, NULL);
> >  
> >  		kvm_vfio_group_put_external_user(vfio_group);
> > @@ -218,6 +225,53 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >  		kvm_vfio_update_coherency(dev);
> >  
> >  		return ret;
> > +
> > +#ifdef CONFIG_SPAPR_TCE_IOMMU
> > +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
> > +		struct kvm_vfio_spapr_tce param;
> > +		unsigned long minsz;
> > +		struct kvm_vfio *kv = dev->private;
> > +		struct vfio_group *vfio_group;
> > +		struct kvm_vfio_group *kvg;
> > +		struct fd f;
> > +
> > +		minsz = offsetofend(struct kvm_vfio_spapr_tce, tablefd);
> > +
> > +		if (copy_from_user(&param, (void __user *)arg, minsz))
> > +			return -EFAULT;
> > +
> > +		if (param.argsz < minsz || param.flags)
> > +			return -EINVAL;
> > +
> > +		f = fdget(param.groupfd);
> > +		if (!f.file)
> > +			return -EBADF;
> > +
> > +		vfio_group = kvm_vfio_group_get_external_user(f.file);
> > +		fdput(f);
> > +
> > +		if (IS_ERR(vfio_group))
> > +			return PTR_ERR(vfio_group);
> > +
> > +		ret = -ENOENT;
> > +
> > +		mutex_lock(&kv->lock);
> > +
> > +		list_for_each_entry(kvg, &kv->group_list, node) {
> > +			if (kvg->vfio_group != vfio_group)
> > +				continue;
> > +
> > +			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
> > +					param.tablefd, vfio_group);
> > +
> > +			break;
> > +		}
> > +
> > +		mutex_unlock(&kv->lock);
> > +
> > +		return ret;
> > +	}
> > +#endif /* CONFIG_SPAPR_TCE_IOMMU */
> 
> 
> The group reference is leaked if kvm_spapr_tce_attach_iommu_group()
> fails.

Good catch.
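
Something along these lines on the exit path of the SET_SPAPR_TCE case would
plug it (just a sketch against the hunk above; it assumes the spapr side takes
its own reference for anything it caches, so the fd-derived reference can be
dropped here unconditionally):

	ret = -ENOENT;

	mutex_lock(&kv->lock);

	list_for_each_entry(kvg, &kv->group_list, node) {
		if (kvg->vfio_group != vfio_group)
			continue;

		ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
				param.tablefd, vfio_group);
		break;
	}

	mutex_unlock(&kv->lock);

	/* Drop the reference taken via kvm_vfio_group_get_external_user()
	 * above on every path; the group stays pinned by kv->group_list
	 * for as long as it is attached to the device.
	 */
	kvm_vfio_group_put_external_user(vfio_group);

	return ret;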

> My preference would be to not hold that separate group
> reference in the spapr code anyway; having a parallel life cycle over
> there is confusing and results in ugliness like duplicating
> kvm_vfio_group_put_external_user().  Thanks,
> 
> Alex
> 
> >  	}
> >  
> >  	return -ENXIO;
> > @@ -242,6 +296,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
> >  		switch (attr->attr) {
> >  		case KVM_DEV_VFIO_GROUP_ADD:
> >  		case KVM_DEV_VFIO_GROUP_DEL:
> > +#ifdef CONFIG_SPAPR_TCE_IOMMU
> > +		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
> > +#endif
> >  			return 0;
> >  		}
> >  
> > @@ -257,6 +314,9 @@ static void kvm_vfio_destroy(struct kvm_device *dev)
> >  	struct kvm_vfio_group *kvg, *tmp;
> >  
> >  	list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
> > +#ifdef CONFIG_SPAPR_TCE_IOMMU
> > +		kvm_spapr_tce_release_iommu_group(dev->kvm, kvg->vfio_group);
> > +#endif
> >  		kvm_vfio_group_set_kvm(kvg->vfio_group, NULL);
> >  		kvm_vfio_group_put_external_user(kvg->vfio_group);
> >  		list_del(&kvg->node);
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH kernel v8 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
  2017-03-14 21:05     ` Alex Williamson
@ 2017-03-15 13:21       ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 53+ messages in thread
From: Alexey Kardashevskiy @ 2017-03-15 13:21 UTC (permalink / raw)
  To: Alex Williamson; +Cc: linuxppc-dev, David Gibson, Paul Mackerras, kvm-ppc, kvm

On 15/03/17 08:05, Alex Williamson wrote:
> On Fri, 10 Mar 2017 14:53:37 +1100
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
>> and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
>> without passing them to user space, which saves time on switching
>> to user space and back.
>>
>> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
>> KVM tries to handle a TCE request in real mode; if that fails,
>> it passes the request to virtual mode to complete the operation.
>> If the virtual mode handler also fails, the request is passed on to
>> user space; this is not expected to happen though.
>>
>> To avoid dealing with page use counters (which is tricky in real mode),
>> this only accelerates SPAPR TCE IOMMU v2 clients, which are required
>> to pre-register the userspace memory. The very first TCE request will
>> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
>> of the TCE table (iommu_table::it_userspace) is not allocated till
>> the very first mapping happens and we cannot call vmalloc in real mode.
>>
>> If we fail to update a hardware IOMMU table for an unexpected reason, we just
>> clear the entry and move on as there is really nothing we can do about it -
>> for example, if we hot plug a VFIO device to a guest, existing TCE tables
>> will be mirrored automatically to the hardware and there is no interface
>> to report to the guest about possible failures.
>>
>> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
>> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
>> and associates a physical IOMMU table with the SPAPR TCE table (which
>> is a guest view of the hardware IOMMU table). The iommu_table object
>> is cached and referenced so we do not have to look up for it in real mode.
>>
>> This does not implement the UNSET counterpart as there is no use for it -
>> once the acceleration is enabled, the existing userspace won't
>> disable it unless a VFIO container is destroyed; this adds necessary
>> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
>>
>> As this creates a descriptor per IOMMU table-LIOBN couple (called
>> kvmppc_spapr_tce_iommu_table), it is possible to have several
>> descriptors with the same iommu_table (hardware IOMMU table) attached
>> to the same LIOBN; we do not remove duplicates though as
>> iommu_table_ops::exchange does not just update a TCE entry (which is
>> shared among IOMMU groups) but also invalidates the TCE cache
>> (one per IOMMU group).
>>
>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
>> space.
>>
>> This adds a real mode version of WARN_ON_ONCE() as the generic version
>> causes problems with rcu_sched. Since we are testing what vmalloc_to_phys()
>> returns in the code, this also adds a check for an already existing
>> vmalloc_to_phys() call in kvmppc_rm_h_put_tce_indirect().
>>
>> This finally makes use of vfio_external_user_iommu_id() which was
>> introduced quite some time ago and was considered for removal.
>>
>> Tests show that this patch increases transmission speed from 220MB/s
>> to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb ethernet card).
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>> Changes:
>> v8:
>> * changed all (!pua) checks to return H_TOO_HARD as ioctl() is supposed
>> to handle them
>> * changed vmalloc_to_phys() callers to return H_HARDWARE
>> * changed real mode iommu_tce_xchg_rm() callers to return H_TOO_HARD
>> and added a comment about this in the code
>> * changed virtual mode iommu_tce_xchg() callers to return H_HARDWARE
>> and do WARN_ON
>> * added WARN_ON_ONCE_RM(!rmap) in kvmppc_rm_h_put_tce_indirect() to
>> have all vmalloc_to_phys() callsites covered
>>
>> v7:
>> * added realmode-friendly WARN_ON_ONCE_RM
>>
>> v6:
>> * changed handling of errors returned by kvmppc_(rm_)tce_iommu_(un)map()
>> * moved kvmppc_gpa_to_ua() to TCE validation
>>
>> v5:
>> * changed error codes in multiple places
>> * added a bunch of WARN_ON() calls in places which should not really happen
>> * added a check that an iommu table is not already attached to a LIOBN
>> * dropped explicit calls to iommu_tce_clear_param_check/
>> iommu_tce_put_param_check as kvmppc_tce_validate/kvmppc_ioba_validate
>> call them anyway (since the previous patch)
>> * if we fail to update a hardware IOMMU table for an unexpected reason,
>> this just clears the entry
>>
>> v4:
>> * added note to the commit log about allowing multiple updates of
>> the same IOMMU table;
>> * instead of checking whether any memory was preregistered, this
>> returns H_TOO_HARD if a specific page was not;
>> * fixed comments from v3 about error handling in many places;
>> * simplified TCE handlers and merged IOMMU parts inline - for example,
>> there used to be kvmppc_h_put_tce_iommu(), now it is merged into
>> kvmppc_h_put_tce(); this allows checking IOBA boundaries against
>> the first attached table only (makes the code simpler);
>>
>> v3:
>> * simplified not to use VFIO group notifiers
>> * reworked cleanup, should be cleaner/simpler now
>>
>> v2:
>> * reworked to use new VFIO notifiers
>> * now the same iommu_table may appear in the list several times, to be fixed later
>> ---
>>  Documentation/virtual/kvm/devices/vfio.txt |  22 +-
>>  arch/powerpc/include/asm/kvm_host.h        |   8 +
>>  arch/powerpc/include/asm/kvm_ppc.h         |   4 +
>>  include/uapi/linux/kvm.h                   |   8 +
>>  arch/powerpc/kvm/book3s_64_vio.c           | 323 ++++++++++++++++++++++++++++-
>>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 201 +++++++++++++++++-
>>  arch/powerpc/kvm/powerpc.c                 |   2 +
>>  virt/kvm/vfio.c                            |  60 ++++++
>>  8 files changed, 623 insertions(+), 5 deletions(-)
>>
>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
>> index ef51740c67ca..f95d867168ea 100644
>> --- a/Documentation/virtual/kvm/devices/vfio.txt
>> +++ b/Documentation/virtual/kvm/devices/vfio.txt
>> @@ -16,7 +16,25 @@ Groups:
>>  
>>  KVM_DEV_VFIO_GROUP attributes:
>>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
>> +	kvm_device_attr.addr points to an int32_t file descriptor
>> +	for the VFIO group.
>>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
>> +	kvm_device_attr.addr points to an int32_t file descriptor
>> +	for the VFIO group.
>> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
>> +	allocated by sPAPR KVM.
>> +	kvm_device_attr.addr points to a struct:
>>  
>> -For each, kvm_device_attr.addr points to an int32_t file descriptor
>> -for the VFIO group.
>> +	struct kvm_vfio_spapr_tce {
>> +		__u32	argsz;
>> +		__u32	flags;
>> +		__s32	groupfd;
>> +		__s32	tablefd;
>> +	};
>> +
>> +	where
>> +	@argsz is the size of kvm_vfio_spapr_tce_liobn;
> 
> kvm_vfio_spapr_tce_liobn does not exist.  s/_liobn//?

Correct.
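
(i.e. the line should read "@argsz is the size of struct kvm_vfio_spapr_tce;")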


> 
>> +	@flags are not supported now, must be zero;
> 
> We do this argsz/flags thing on vfio ioctls because ioctls are a bit
> more of a restricted resource.  We don't want to burn through them so
> we make them expandable.  I don't know that we have that restriction
> here and the ADD/DEL support certainly doesn't include it.  Maybe this
> isn't necessary?


It is not, but since I am going to have padding there anyway, I thought I would
give it a name. Is that totally pointless, and "u8 padding[4]" better?
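
For reference, consuming the new attribute from userspace would look roughly
like this (a sketch; it assumes a VFIO-KVM device fd created via
KVM_CREATE_DEVICE with KVM_DEV_TYPE_VFIO, with group_fd opened beforehand and
table_fd coming from KVM_CREATE_SPAPR_TCE), so either way userspace
initializes the whole struct once:

	struct kvm_vfio_spapr_tce param = {
		.argsz = sizeof(param),
		.flags = 0,
		.groupfd = group_fd,
		.tablefd = table_fd,
	};
	struct kvm_device_attr attr = {
		.group = KVM_DEV_VFIO_GROUP,
		.attr = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE,
		.addr = (__u64)(unsigned long)&param,
	};

	/* on failure, userspace keeps handling TCEs via H_PUT_TCE exits */
	if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr))
		perror("KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE");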


>> +	@groupfd is a file descriptor for a VFIO group;
>> +	@tablefd is a file descriptor for a TCE table allocated via
>> +		KVM_CREATE_SPAPR_TCE.
>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
>> index 7bba8f415627..857ae2c6aa39 100644
>> --- a/arch/powerpc/include/asm/kvm_host.h
>> +++ b/arch/powerpc/include/asm/kvm_host.h
>> @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
>>  	atomic_t refcnt;
>>  };
>>  
>> +struct kvmppc_spapr_tce_iommu_table {
>> +	struct rcu_head rcu;
>> +	struct list_head next;
>> +	struct vfio_group *group;
>> +	struct iommu_table *tbl;
>> +};
>> +
>>  struct kvmppc_spapr_tce_table {
>>  	struct list_head list;
>>  	struct kvm *kvm;
>> @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
>>  	u32 page_shift;
>>  	u64 offset;		/* in pages */
>>  	u64 size;		/* window size in pages */
>> +	struct list_head iommu_tables;
>>  	struct page *pages[0];
>>  };
>>  
>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
>> index 72c2a155641f..66de7e73b3d3 100644
>> --- a/arch/powerpc/include/asm/kvm_ppc.h
>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
>> @@ -164,6 +164,10 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
>>  extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
>>  			struct kvm_memory_slot *memslot, unsigned long porder);
>>  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
>> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
>> +		struct vfio_group *group);
>> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
>> +		struct vfio_group *group);
>>  
>>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>  				struct kvm_create_spapr_tce_64 *args);
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index f5a52ffb6b58..e743cb0d176e 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -1088,6 +1088,7 @@ struct kvm_device_attr {
>>  #define  KVM_DEV_VFIO_GROUP			1
>>  #define   KVM_DEV_VFIO_GROUP_ADD			1
>>  #define   KVM_DEV_VFIO_GROUP_DEL			2
>> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
>>  
>>  enum kvm_device_type {
>>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
>> @@ -1109,6 +1110,13 @@ enum kvm_device_type {
>>  	KVM_DEV_TYPE_MAX,
>>  };
>>  
>> +struct kvm_vfio_spapr_tce {
>> +	__u32	argsz;
>> +	__u32	flags;
>> +	__s32	groupfd;
>> +	__s32	tablefd;
>> +};
>> +
>>  /*
>>   * ioctls for VM fds
>>   */
>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
>> index e96a4590464c..be18cda01e1b 100644
>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>> @@ -28,6 +28,10 @@
>>  #include <linux/hugetlb.h>
>>  #include <linux/list.h>
>>  #include <linux/anon_inodes.h>
>> +#include <linux/iommu.h>
>> +#include <linux/file.h>
>> +#include <linux/vfio.h>
>> +#include <linux/module.h>
>>  
>>  #include <asm/tlbflush.h>
>>  #include <asm/kvm_ppc.h>
>> @@ -40,6 +44,36 @@
>>  #include <asm/udbg.h>
>>  #include <asm/iommu.h>
>>  #include <asm/tce.h>
>> +#include <asm/mmu_context.h>
>> +
>> +static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
>> +{
>> +	void (*fn)(struct vfio_group *);
>> +
>> +	fn = symbol_get(vfio_group_put_external_user);
>> +	if (WARN_ON(!fn))
>> +		return;
>> +
>> +	fn(vfio_group);
>> +
>> +	symbol_put(vfio_group_put_external_user);
>> +}
>> +
>> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
>> +{
>> +	int (*fn)(struct vfio_group *);
>> +	int ret = -1;
>> +
>> +	fn = symbol_get(vfio_external_user_iommu_id);
>> +	if (!fn)
>> +		return ret;
>> +
>> +	ret = fn(vfio_group);
>> +
>> +	symbol_put(vfio_external_user_iommu_id);
>> +
>> +	return ret;
>> +}
> 
> 
> Ugh.  This feels so wrong.  Why can't you have kvm-vfio pass the
> iommu_group?  Why do you need to hold this additional vfio_group
> reference?


This is embarrassing, but now I am not even sure I really need to hold a
reference to the group in any form; I tend to think this is a leftover from
older versions in which iommu_table did not have reference counting.

The iommu_table struct has a reference counter now, and this patch is only
concerned with iommu_table structs, which are referenced by KVM.

iommu_group structs are referenced for as long as a pnv_ioda_pe (an IODA PE ==
an IOMMU group on powernv) lives, and iommu_group structs get attached/detached
to/from iommu_table structs when a DMA window is set/unset for a PE - this is
controlled by the vfio_iommu_spapr_tce ioctls() + RCU. If an iommu_table does
not have an attached iommu_group, it is just a table without any cache
invalidation.

So in v9 I'll remove the group reference - am I missing anything now?
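
To make that direction concrete, something like this (a rough sketch, not
tested): the descriptor would pin only the iommu_table, kvm-vfio would resolve
and pass the iommu_group itself, and the release path would match tables via
iommu_group_get_iommudata() instead of a cached vfio_group pointer:

	struct kvmppc_spapr_tce_iommu_table {
		struct rcu_head rcu;
		struct list_head next;
		struct iommu_table *tbl;	/* the only counted reference */
	};

	/* virt/kvm/vfio.c resolves the group id and passes the group on */
	extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm,
			int tablefd, struct iommu_group *grp);
	extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
			struct iommu_group *grp);

	static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
	{
		struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
				struct kvmppc_spapr_tce_iommu_table, rcu);

		iommu_table_put(stit->tbl);	/* no group put needed */
		kfree(stit);
	}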


>>  
>>  static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
>>  {
>> @@ -91,6 +125,130 @@ static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
>>  	return ret;
>>  }
>>  
>> +static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
>> +{
>> +	struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
>> +			struct kvmppc_spapr_tce_iommu_table, rcu);
>> +
>> +	iommu_table_put(stit->tbl);
>> +	kvm_vfio_group_put_external_user(stit->group);
>> +
>> +	kfree(stit);
>> +}
>> +
>> +static void kvm_spapr_tce_liobn_release_iommu_group(
>> +		struct kvmppc_spapr_tce_table *stt,
>> +		struct vfio_group *group)
>> +{
>> +	struct kvmppc_spapr_tce_iommu_table *stit, *tmp;
>> +
>> +	list_for_each_entry_safe(stit, tmp, &stt->iommu_tables, next) {
>> +		if (group && (stit->group != group))
>> +			continue;
>> +
>> +		list_del_rcu(&stit->next);
>> +
>> +		call_rcu(&stit->rcu, kvm_spapr_tce_iommu_table_free);
>> +	}
>> +}
>> +
>> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
>> +		struct vfio_group *group)
>> +{
>> +	struct kvmppc_spapr_tce_table *stt;
>> +
>> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list)
>> +		kvm_spapr_tce_liobn_release_iommu_group(stt, group);
>> +}
>> +
>> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
>> +		struct vfio_group *group)
>> +{
>> +	struct kvmppc_spapr_tce_table *stt = NULL;
>> +	bool found = false;
>> +	struct iommu_table *tbl = NULL;
>> +	struct iommu_table_group *table_group;
>> +	long i, ret = 0;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>> +	struct fd f;
>> +	int group_id;
>> +	struct iommu_group *grp;
>> +
>> +	group_id = kvm_vfio_external_user_iommu_id(group);
>> +	grp = iommu_group_get_by_id(group_id);
>> +	if (WARN_ON(!grp))
>> +		return -EIO;
>> +
>> +	f = fdget(tablefd);
>> +	if (!f.file) {
>> +		ret = -EBADF;
>> +		goto put_exit;
>> +	}
>> +
>> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
>> +		if (stt == f.file->private_data) {
>> +			found = true;
>> +			break;
>> +		}
>> +	}
>> +
>> +	fdput(f);
>> +
>> +	if (!found) {
>> +		ret = -EINVAL;
>> +		goto put_exit;
>> +	}
>> +
>> +	table_group = iommu_group_get_iommudata(grp);
>> +	if (WARN_ON(!table_group)) {
>> +		ret = -EFAULT;
>> +		goto put_exit;
>> +	}
>> +
>> +	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
>> +		struct iommu_table *tbltmp = table_group->tables[i];
>> +
>> +		if (!tbltmp)
>> +			continue;
>> +
>> +		/*
>> +		 * Make sure hardware table parameters are exactly the same;
>> +		 * this is used in the TCE handlers where boundary checks
>> +		 * use only the first attached table.
>> +		 */
>> +		if ((tbltmp->it_page_shift == stt->page_shift) &&
>> +				(tbltmp->it_offset == stt->offset) &&
>> +				(tbltmp->it_size == stt->size)) {
>> +			tbl = tbltmp;
>> +			break;
>> +		}
>> +	}
>> +	if (!tbl) {
>> +		ret = -EINVAL;
>> +		goto put_exit;
>> +	}
>> +
>> +	list_for_each_entry_rcu(stit, &stt->iommu_tables, next) {
>> +		if ((stit->tbl == tbl) && (stit->group == group)) {
>> +			ret = -EBUSY;
>> +			goto put_exit;
>> +		}
>> +	}
>> +
>> +	iommu_table_get(tbl);
>> +
>> +	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
>> +	stit->tbl = tbl;
>> +	stit->group = group;
>> +
>> +	list_add_rcu(&stit->next, &stt->iommu_tables);
>> +
>> +put_exit:
>> +	iommu_group_put(grp);
>> +
>> +	return ret;
>> +}
>> +
>>  static void release_spapr_tce_table(struct rcu_head *head)
>>  {
>>  	struct kvmppc_spapr_tce_table *stt = container_of(head,
>> @@ -133,6 +291,8 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
>>  
>>  	list_del_rcu(&stt->list);
>>  
>> +	kvm_spapr_tce_liobn_release_iommu_group(stt, NULL /* release all */);
>> +
>>  	kvm_put_kvm(stt->kvm);
>>  
>>  	kvmppc_account_memlimit(
>> @@ -183,6 +343,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>  	stt->offset = args->offset;
>>  	stt->size = size;
>>  	stt->kvm = kvm;
>> +	INIT_LIST_HEAD_RCU(&stt->iommu_tables);
>>  
>>  	for (i = 0; i < npages; i++) {
>>  		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
>> @@ -211,11 +372,101 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>  	return ret;
>>  }
>>  
>> +static void kvmppc_clear_tce(struct iommu_table *tbl, unsigned long entry)
>> +{
>> +	unsigned long hpa = 0;
>> +	enum dma_data_direction dir = DMA_NONE;
>> +
>> +	iommu_tce_xchg(tbl, entry, &hpa, &dir);
>> +}
>> +
>> +static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
>> +		struct iommu_table *tbl, unsigned long entry)
>> +{
>> +	struct mm_iommu_table_group_mem_t *mem = NULL;
>> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
>> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>> +
>> +	if (!pua)
>> +		/* it_userspace allocation might be delayed */
>> +		return H_TOO_HARD;
>> +
>> +	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
>> +	if (!mem)
>> +		return H_TOO_HARD;
>> +
>> +	mm_iommu_mapped_dec(mem);
>> +
>> +	*pua = 0;
>> +
>> +	return H_SUCCESS;
>> +}
>> +
>> +static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
>> +		struct iommu_table *tbl, unsigned long entry)
>> +{
>> +	enum dma_data_direction dir = DMA_NONE;
>> +	unsigned long hpa = 0;
>> +	long ret;
>> +
>> +	if (WARN_ON_ONCE(iommu_tce_xchg(tbl, entry, &hpa, &dir)))
>> +		return H_HARDWARE;
>> +
>> +	if (dir == DMA_NONE)
>> +		return H_SUCCESS;
>> +
>> +	ret = kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
>> +	if (ret != H_SUCCESS)
>> +		iommu_tce_xchg(tbl, entry, &hpa, &dir);
>> +
>> +	return ret;
>> +}
>> +
>> +long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
>> +		unsigned long entry, unsigned long ua,
>> +		enum dma_data_direction dir)
>> +{
>> +	long ret;
>> +	unsigned long hpa, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>> +	struct mm_iommu_table_group_mem_t *mem;
>> +
>> +	if (!pua)
>> +		/* it_userspace allocation might be delayed */
>> +		return H_TOO_HARD;
>> +
>> +	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
>> +	if (!mem)
>> +		/* This only handles v2 IOMMU type, v1 is handled via ioctl() */
>> +		return H_TOO_HARD;
>> +
>> +	if (WARN_ON_ONCE(mm_iommu_ua_to_hpa(mem, ua, &hpa)))
>> +		return H_HARDWARE;
>> +
>> +	if (mm_iommu_mapped_inc(mem))
>> +		return H_CLOSED;
>> +
>> +	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
>> +	if (WARN_ON_ONCE(ret)) {
>> +		mm_iommu_mapped_dec(mem);
>> +		return H_HARDWARE;
>> +	}
>> +
>> +	if (dir != DMA_NONE)
>> +		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
>> +
>> +	*pua = ua;
>> +
>> +	return 0;
>> +}
>> +
>>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>  		      unsigned long ioba, unsigned long tce)
>>  {
>>  	struct kvmppc_spapr_tce_table *stt;
>> -	long ret;
>> +	long ret, idx;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>> +	unsigned long entry, ua = 0;
>> +	enum dma_data_direction dir;
>>  
>>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>>  	/* 	    liobn, ioba, tce); */
>> @@ -232,7 +483,35 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>  	if (ret != H_SUCCESS)
>>  		return ret;
>>  
>> -	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
>> +	dir = iommu_tce_direction(tce);
>> +	if ((dir != DMA_NONE) && kvmppc_gpa_to_ua(vcpu->kvm,
>> +			tce & ~(TCE_PCI_READ | TCE_PCI_WRITE), &ua, NULL))
>> +		return H_PARAMETER;
>> +
>> +	entry = ioba >> stt->page_shift;
>> +
>> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>> +		if (dir == DMA_NONE) {
>> +			ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
>> +					stit->tbl, entry);
>> +		} else {
>> +			idx = srcu_read_lock(&vcpu->kvm->srcu);
>> +			ret = kvmppc_tce_iommu_map(vcpu->kvm, stit->tbl,
>> +					entry, ua, dir);
>> +			srcu_read_unlock(&vcpu->kvm->srcu, idx);
>> +		}
>> +
>> +		if (ret == H_SUCCESS)
>> +			continue;
>> +
>> +		if (ret == H_TOO_HARD)
>> +			return ret;
>> +
>> +		WARN_ON_ONCE(1);
>> +		kvmppc_clear_tce(stit->tbl, entry);
>> +	}
>> +
>> +	kvmppc_tce_put(stt, entry, tce);
>>  
>>  	return H_SUCCESS;
>>  }
>> @@ -247,6 +526,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>  	unsigned long entry, ua = 0;
>>  	u64 __user *tces;
>>  	u64 tce;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>  
>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>>  	if (!stt)
>> @@ -285,6 +565,26 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>  		if (ret != H_SUCCESS)
>>  			goto unlock_exit;
>>  
>> +		if (kvmppc_gpa_to_ua(vcpu->kvm,
>> +				tce & ~(TCE_PCI_READ | TCE_PCI_WRITE),
>> +				&ua, NULL))
>> +			return H_PARAMETER;
>> +
>> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>> +			ret = kvmppc_tce_iommu_map(vcpu->kvm,
>> +					stit->tbl, entry + i, ua,
>> +					iommu_tce_direction(tce));
>> +
>> +			if (ret == H_SUCCESS)
>> +				continue;
>> +
>> +			if (ret == H_TOO_HARD)
>> +				goto unlock_exit;
>> +
>> +			WARN_ON_ONCE(1);
>> +			kvmppc_clear_tce(stit->tbl, entry + i);
>> +		}
>> +
>>  		kvmppc_tce_put(stt, entry + i, tce);
>>  	}
>>  
>> @@ -301,6 +601,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>>  {
>>  	struct kvmppc_spapr_tce_table *stt;
>>  	long i, ret;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>  
>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>>  	if (!stt)
>> @@ -314,6 +615,24 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>>  		return H_PARAMETER;
>>  
>> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>> +		unsigned long entry = ioba >> stit->tbl->it_page_shift;
>> +
>> +		for (i = 0; i < npages; ++i) {
>> +			ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
>> +					stit->tbl, entry + i);
>> +
>> +			if (ret == H_SUCCESS)
>> +				continue;
>> +
>> +			if (ret == H_TOO_HARD)
>> +				return ret;
>> +
>> +			WARN_ON_ONCE(1);
>> +			kvmppc_clear_tce(stit->tbl, entry + i);
>> +		}
>> +	}
>> +
>>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>>  
>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
>> index 440d3ab5dc32..eda0a8f6fae8 100644
>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
>> @@ -40,6 +40,31 @@
>>  #include <asm/iommu.h>
>>  #include <asm/tce.h>
>>  
>> +#ifdef CONFIG_BUG
>> +
>> +#define WARN_ON_ONCE_RM(condition)	({			\
>> +	static bool __section(.data.unlikely) __warned;		\
>> +	int __ret_warn_once = !!(condition);			\
>> +								\
>> +	if (unlikely(__ret_warn_once && !__warned)) {		\
>> +		__warned = true;				\
>> +		pr_err("WARN_ON_ONCE_RM: (%s) at %s:%u\n",	\
>> +				__stringify(condition),		\
>> +				__func__, __LINE__);		\
>> +		dump_stack();					\
>> +	}							\
>> +	unlikely(__ret_warn_once);				\
>> +})
>> +
>> +#else
>> +
>> +#define WARN_ON_ONCE_RM(condition) ({				\
>> +	int __ret_warn_on = !!(condition);			\
>> +	unlikely(__ret_warn_on);				\
>> +})
>> +
>> +#endif
>> +
>>  #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
>>  
>>  /*
>> @@ -161,11 +186,117 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
>>  EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
>>  
>>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
>> +static void kvmppc_rm_clear_tce(struct iommu_table *tbl, unsigned long entry)
>> +{
>> +	unsigned long hpa = 0;
>> +	enum dma_data_direction dir = DMA_NONE;
>> +
>> +	iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
>> +}
>> +
>> +static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm *kvm,
>> +		struct iommu_table *tbl, unsigned long entry)
>> +{
>> +	struct mm_iommu_table_group_mem_t *mem = NULL;
>> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
>> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>> +
>> +	if (!pua)
>> +		/* it_userspace allocation might be delayed */
>> +		return H_TOO_HARD;
>> +
>> +	pua = (void *) vmalloc_to_phys(pua);
>> +	if (WARN_ON_ONCE_RM(!pua))
>> +		return H_HARDWARE;
>> +
>> +	mem = mm_iommu_lookup_rm(kvm->mm, *pua, pgsize);
>> +	if (!mem)
>> +		return H_TOO_HARD;
>> +
>> +	mm_iommu_mapped_dec(mem);
>> +
>> +	*pua = 0;
>> +
>> +	return H_SUCCESS;
>> +}
>> +
>> +static long kvmppc_rm_tce_iommu_unmap(struct kvm *kvm,
>> +		struct iommu_table *tbl, unsigned long entry)
>> +{
>> +	enum dma_data_direction dir = DMA_NONE;
>> +	unsigned long hpa = 0;
>> +	long ret;
>> +
>> +	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
>> +		/*
>> +		 * real mode xchg can fail if struct page crosses
>> +		 * a page boundary
>> +		 */
>> +		return H_TOO_HARD;
>> +
>> +	if (dir == DMA_NONE)
>> +		return H_SUCCESS;
>> +
>> +	ret = kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
>> +	if (ret)
>> +		iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
>> +
>> +	return ret;
>> +}
>> +
>> +static long kvmppc_rm_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
>> +		unsigned long entry, unsigned long ua,
>> +		enum dma_data_direction dir)
>> +{
>> +	long ret;
>> +	unsigned long hpa = 0;
>> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>> +	struct mm_iommu_table_group_mem_t *mem;
>> +
>> +	if (!pua)
>> +		/* it_userspace allocation might be delayed */
>> +		return H_TOO_HARD;
>> +
>> +	mem = mm_iommu_lookup_rm(kvm->mm, ua, 1ULL << tbl->it_page_shift);
>> +	if (!mem)
>> +		return H_TOO_HARD;
>> +
>> +	if (WARN_ON_ONCE_RM(mm_iommu_ua_to_hpa_rm(mem, ua, &hpa)))
>> +		return H_HARDWARE;
>> +
>> +	pua = (void *) vmalloc_to_phys(pua);
>> +	if (WARN_ON_ONCE_RM(!pua))
>> +		return H_HARDWARE;
>> +
>> +	if (WARN_ON_ONCE_RM(mm_iommu_mapped_inc(mem)))
>> +		return H_CLOSED;
>> +
>> +	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
>> +	if (ret) {
>> +		mm_iommu_mapped_dec(mem);
>> +		/*
>> +		 * real mode xchg can fail if struct page crosses
>> +		 * a page boundary
>> +		 */
>> +		return H_TOO_HARD;
>> +	}
>> +
>> +	if (dir != DMA_NONE)
>> +		kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
>> +
>> +	*pua = ua;
>> +
>> +	return 0;
>> +}
>> +
>>  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>  		unsigned long ioba, unsigned long tce)
>>  {
>>  	struct kvmppc_spapr_tce_table *stt;
>>  	long ret;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>> +	unsigned long entry, ua = 0;
>> +	enum dma_data_direction dir;
>>  
>>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>>  	/* 	    liobn, ioba, tce); */
>> @@ -182,7 +313,32 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>  	if (ret != H_SUCCESS)
>>  		return ret;
>>  
>> -	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
>> +	dir = iommu_tce_direction(tce);
>> +	if ((dir != DMA_NONE) && kvmppc_gpa_to_ua(vcpu->kvm,
>> +			tce & ~(TCE_PCI_READ | TCE_PCI_WRITE), &ua, NULL))
>> +		return H_PARAMETER;
>> +
>> +	entry = ioba >> stt->page_shift;
>> +
>> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>> +		if (dir == DMA_NONE)
>> +			ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
>> +					stit->tbl, entry);
>> +		else
>> +			ret = kvmppc_rm_tce_iommu_map(vcpu->kvm,
>> +					stit->tbl, entry, ua, dir);
>> +
>> +		if (ret == H_SUCCESS)
>> +			continue;
>> +
>> +		if (ret == H_TOO_HARD)
>> +			return ret;
>> +
>> +		WARN_ON_ONCE_RM(1);
>> +		kvmppc_rm_clear_tce(stit->tbl, entry);
>> +	}
>> +
>> +	kvmppc_tce_put(stt, entry, tce);
>>  
>>  	return H_SUCCESS;
>>  }
>> @@ -223,6 +379,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>  	unsigned long tces, entry, ua = 0;
>>  	unsigned long *rmap = NULL;
>>  	bool prereg = false;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>  
>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>>  	if (!stt)
>> @@ -270,6 +427,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>  			return H_TOO_HARD;
>>  
>>  		rmap = (void *) vmalloc_to_phys(rmap);
>> +		if (WARN_ON_ONCE_RM(!rmap))
>> +			return H_HARDWARE;
>>  
>>  		/*
>>  		 * Synchronize with the MMU notifier callbacks in
>> @@ -293,6 +452,27 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>  		if (ret != H_SUCCESS)
>>  			goto unlock_exit;
>>  
>> +		ua = 0;
>> +		if (kvmppc_gpa_to_ua(vcpu->kvm,
>> +				tce & ~(TCE_PCI_READ | TCE_PCI_WRITE),
>> +				&ua, NULL))
>> +			return H_PARAMETER;
>> +
>> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>> +			ret = kvmppc_rm_tce_iommu_map(vcpu->kvm,
>> +					stit->tbl, entry + i, ua,
>> +					iommu_tce_direction(tce));
>> +
>> +			if (ret == H_SUCCESS)
>> +				continue;
>> +
>> +			if (ret == H_TOO_HARD)
>> +				goto unlock_exit;
>> +
>> +			WARN_ON_ONCE_RM(1);
>> +			kvmppc_rm_clear_tce(stit->tbl, entry + i);
>> +		}
>> +
>>  		kvmppc_tce_put(stt, entry + i, tce);
>>  	}
>>  
>> @@ -309,6 +489,7 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>>  {
>>  	struct kvmppc_spapr_tce_table *stt;
>>  	long i, ret;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>  
>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>>  	if (!stt)
>> @@ -322,6 +503,24 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>>  		return H_PARAMETER;
>>  
>> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>> +		unsigned long entry = ioba >> stit->tbl->it_page_shift;
>> +
>> +		for (i = 0; i < npages; ++i) {
>> +			ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
>> +					stit->tbl, entry + i);
>> +
>> +			if (ret == H_SUCCESS)
>> +				continue;
>> +
>> +			if (ret == H_TOO_HARD)
>> +				return ret;
>> +
>> +			WARN_ON_ONCE_RM(1);
>> +			kvmppc_rm_clear_tce(stit->tbl, entry + i);
>> +		}
>> +	}
>> +
>>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>>  
>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>> index 95c91a9de351..62bdd6c48107 100644
>> --- a/arch/powerpc/kvm/powerpc.c
>> +++ b/arch/powerpc/kvm/powerpc.c
>> @@ -538,6 +538,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>>  #ifdef CONFIG_PPC_BOOK3S_64
>>  	case KVM_CAP_SPAPR_TCE:
>>  	case KVM_CAP_SPAPR_TCE_64:
>> +		/* fallthrough */
>> +	case KVM_CAP_SPAPR_TCE_VFIO:
>>  	case KVM_CAP_PPC_RTAS:
>>  	case KVM_CAP_PPC_FIXUP_HCALL:
>>  	case KVM_CAP_PPC_ENABLE_HCALL:
>> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
>> index d32f239eb471..2b7dc22265fe 100644
>> --- a/virt/kvm/vfio.c
>> +++ b/virt/kvm/vfio.c
>> @@ -20,6 +20,10 @@
>>  #include <linux/vfio.h>
>>  #include "vfio.h"
>>  
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +#include <asm/kvm_ppc.h>
>> +#endif
>> +
>>  struct kvm_vfio_group {
>>  	struct list_head node;
>>  	struct vfio_group *vfio_group;
>> @@ -211,6 +215,9 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>>  
>>  		mutex_unlock(&kv->lock);
>>  
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +		kvm_spapr_tce_release_iommu_group(dev->kvm, vfio_group);
>> +#endif
>>  		kvm_vfio_group_set_kvm(vfio_group, NULL);
>>  
>>  		kvm_vfio_group_put_external_user(vfio_group);
>> @@ -218,6 +225,53 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>>  		kvm_vfio_update_coherency(dev);
>>  
>>  		return ret;
>> +
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
>> +		struct kvm_vfio_spapr_tce param;
>> +		unsigned long minsz;
>> +		struct kvm_vfio *kv = dev->private;
>> +		struct vfio_group *vfio_group;
>> +		struct kvm_vfio_group *kvg;
>> +		struct fd f;
>> +
>> +		minsz = offsetofend(struct kvm_vfio_spapr_tce, tablefd);
>> +
>> +		if (copy_from_user(&param, (void __user *)arg, minsz))
>> +			return -EFAULT;
>> +
>> +		if (param.argsz < minsz || param.flags)
>> +			return -EINVAL;
>> +
>> +		f = fdget(param.groupfd);
>> +		if (!f.file)
>> +			return -EBADF;
>> +
>> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
>> +		fdput(f);
>> +
>> +		if (IS_ERR(vfio_group))
>> +			return PTR_ERR(vfio_group);
>> +
>> +		ret = -ENOENT;
>> +
>> +		mutex_lock(&kv->lock);
>> +
>> +		list_for_each_entry(kvg, &kv->group_list, node) {
>> +			if (kvg->vfio_group != vfio_group)
>> +				continue;
>> +
>> +			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
>> +					param.tablefd, vfio_group);
>> +
>> +			break;
>> +		}
>> +
>> +		mutex_unlock(&kv->lock);
>> +
>> +		return ret;
>> +	}
>> +#endif /* CONFIG_SPAPR_TCE_IOMMU */
> 
> 
> The group reference is leaked if kvm_spapr_tce_attach_iommu_group()
> fails.  My preference would be to not hold that separate group
> reference in the spapr code anyway, having a parallel life cycle over
> there is confusing and results in ugliness like duplicating 
> kvm_vfio_group_put_external_user().  Thanks,
> 
> Alex
> 
>>  	}
>>  
>>  	return -ENXIO;
>> @@ -242,6 +296,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
>>  		switch (attr->attr) {
>>  		case KVM_DEV_VFIO_GROUP_ADD:
>>  		case KVM_DEV_VFIO_GROUP_DEL:
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
>> +#endif
>>  			return 0;
>>  		}
>>  
>> @@ -257,6 +314,9 @@ static void kvm_vfio_destroy(struct kvm_device *dev)
>>  	struct kvm_vfio_group *kvg, *tmp;
>>  
>>  	list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +		kvm_spapr_tce_release_iommu_group(dev->kvm, kvg->vfio_group);
>> +#endif
>>  		kvm_vfio_group_set_kvm(kvg->vfio_group, NULL);
>>  		kvm_vfio_group_put_external_user(kvg->vfio_group);
>>  		list_del(&kvg->node);
> 


-- 
Alexey

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH kernel v8 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
@ 2017-03-15 13:21       ` Alexey Kardashevskiy
  0 siblings, 0 replies; 53+ messages in thread
From: Alexey Kardashevskiy @ 2017-03-15 13:21 UTC (permalink / raw)
  To: Alex Williamson; +Cc: linuxppc-dev, David Gibson, Paul Mackerras, kvm-ppc, kvm

On 15/03/17 08:05, Alex Williamson wrote:
> On Fri, 10 Mar 2017 14:53:37 +1100
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
>> and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO
>> without passing them to user space which saves time on switching
>> to user space and back.
>>
>> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
>> KVM tries to handle a TCE request in the real mode, if failed
>> it passes the request to the virtual mode to complete the operation.
>> If it a virtual mode handler fails, the request is passed to
>> the user space; this is not expected to happen though.
>>
>> To avoid dealing with page use counters (which is tricky in real mode),
>> this only accelerates SPAPR TCE IOMMU v2 clients which are required
>> to pre-register the userspace memory. The very first TCE request will
>> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
>> of the TCE table (iommu_table::it_userspace) is not allocated till
>> the very first mapping happens and we cannot call vmalloc in real mode.
>>
>> If we fail to update a hardware IOMMU table unexpected reason, we just
>> clear it and move on as there is nothing really we can do about it -
>> for example, if we hot plug a VFIO device to a guest, existing TCE tables
>> will be mirrored automatically to the hardware and there is no interface
>> to report to the guest about possible failures.
>>
>> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
>> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
>> and associates a physical IOMMU table with the SPAPR TCE table (which
>> is a guest view of the hardware IOMMU table). The iommu_table object
>> is cached and referenced so we do not have to look up for it in real mode.
>>
>> This does not implement the UNSET counterpart as there is no use for it -
>> once the acceleration is enabled, the existing userspace won't
>> disable it unless a VFIO container is destroyed; this adds necessary
>> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
>>
>> As this creates a descriptor per IOMMU table-LIOBN couple (called
>> kvmppc_spapr_tce_iommu_table), it is possible to have several
>> descriptors with the same iommu_table (hardware IOMMU table) attached
>> to the same LIOBN; we do not remove duplicates though as
>> iommu_table_ops::exchange not just update a TCE entry (which is
>> shared among IOMMU groups) but also invalidates the TCE cache
>> (one per IOMMU group).
>>
>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
>> space.
>>
>> This adds real mode version of WARN_ON_ONCE() as the generic version
>> causes problems with rcu_sched. Since we testing what vmalloc_to_phys()
>> returns in the code, this also adds a check for already existing
>> vmalloc_to_phys() call in kvmppc_rm_h_put_tce_indirect().
>>
>> This finally makes use of vfio_external_user_iommu_id() which was
>> introduced quite some time ago and was considered for removal.
>>
>> Tests show that this patch increases transmission speed from 220MB/s
>> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>> Changes:
>> v8:
>> * changed all (!pua) checks to return H_TOO_HARD as ioctl() is supposed
>> to handle them
>> * changed vmalloc_to_phys() callers to return H_HARDWARE
>> * changed real mode iommu_tce_xchg_rm() callers to return H_TOO_HARD
>> and added a comment about this in the code
>> * changed virtual mode iommu_tce_xchg() callers to return H_HARDWARE
>> and do WARN_ON
>> * added WARN_ON_ONCE_RM(!rmap) in kvmppc_rm_h_put_tce_indirect() to
>> have all vmalloc_to_phys() callsites covered
>>
>> v7:
>> * added realmode-friendly WARN_ON_ONCE_RM
>>
>> v6:
>> * changed handling of errors returned by kvmppc_(rm_)tce_iommu_(un)map()
>> * moved kvmppc_gpa_to_ua() to TCE validation
>>
>> v5:
>> * changed error codes in multiple places
>> * added bunch of WARN_ON() in places which should not really happen
>> * adde a check that an iommu table is not attached already to LIOBN
>> * dropped explicit calls to iommu_tce_clear_param_check/
>> iommu_tce_put_param_check as kvmppc_tce_validate/kvmppc_ioba_validate
>> call them anyway (since the previous patch)
>> * if we fail to update a hardware IOMMU table for unexpected reason,
>> this just clears the entry
>>
>> v4:
>> * added note to the commit log about allowing multiple updates of
>> the same IOMMU table;
>> * instead of checking for if any memory was preregistered, this
>> returns H_TOO_HARD if a specific page was not;
>> * fixed comments from v3 about error handling in many places;
>> * simplified TCE handlers and merged IOMMU parts inline - for example,
>> there used to be kvmppc_h_put_tce_iommu(), now it is merged into
>> kvmppc_h_put_tce(); this allows to check IOBA boundaries against
>> the first attached table only (makes the code simpler);
>>
>> v3:
>> * simplified not to use VFIO group notifiers
>> * reworked cleanup, should be cleaner/simpler now
>>
>> v2:
>> * reworked to use new VFIO notifiers
>> * now same iommu_table may appear in the list several times, to be fixed later
>> ---
>>  Documentation/virtual/kvm/devices/vfio.txt |  22 +-
>>  arch/powerpc/include/asm/kvm_host.h        |   8 +
>>  arch/powerpc/include/asm/kvm_ppc.h         |   4 +
>>  include/uapi/linux/kvm.h                   |   8 +
>>  arch/powerpc/kvm/book3s_64_vio.c           | 323 ++++++++++++++++++++++++++++-
>>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 201 +++++++++++++++++-
>>  arch/powerpc/kvm/powerpc.c                 |   2 +
>>  virt/kvm/vfio.c                            |  60 ++++++
>>  8 files changed, 623 insertions(+), 5 deletions(-)
>>
>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
>> index ef51740c67ca..f95d867168ea 100644
>> --- a/Documentation/virtual/kvm/devices/vfio.txt
>> +++ b/Documentation/virtual/kvm/devices/vfio.txt
>> @@ -16,7 +16,25 @@ Groups:
>>  
>>  KVM_DEV_VFIO_GROUP attributes:
>>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
>> +	kvm_device_attr.addr points to an int32_t file descriptor
>> +	for the VFIO group.
>>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
>> +	kvm_device_attr.addr points to an int32_t file descriptor
>> +	for the VFIO group.
>> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
>> +	allocated by sPAPR KVM.
>> +	kvm_device_attr.addr points to a struct:
>>  
>> -For each, kvm_device_attr.addr points to an int32_t file descriptor
>> -for the VFIO group.
>> +	struct kvm_vfio_spapr_tce {
>> +		__u32	argsz;
>> +		__u32	flags;
>> +		__s32	groupfd;
>> +		__s32	tablefd;
>> +	};
>> +
>> +	where
>> +	@argsz is the size of kvm_vfio_spapr_tce_liobn;
> 
> kvm_vfio_spapr_tce_liobn does not exist.  s/_liobn//?

Correct.


> 
>> +	@flags are not supported now, must be zero;
> 
> We do this argsz/flags thing on vfio ioctls because ioctls are a bit
> more of a restricted resource.  We don't want to burn through them so
> we make them expandable.  I don't know that we have that restriction
> here and the ADD/DEL support certainly doesn't include it.  Maybe this
> isn't necessary?


It is not but since I am going to have padding there, I thought I give it a
name. Totally pointless and "u8 padding[4]" is better?


>> +	@groupfd is a file descriptor for a VFIO group;
>> +	@tablefd is a file descriptor for a TCE table allocated via
>> +		KVM_CREATE_SPAPR_TCE.
>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
>> index 7bba8f415627..857ae2c6aa39 100644
>> --- a/arch/powerpc/include/asm/kvm_host.h
>> +++ b/arch/powerpc/include/asm/kvm_host.h
>> @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
>>  	atomic_t refcnt;
>>  };
>>  
>> +struct kvmppc_spapr_tce_iommu_table {
>> +	struct rcu_head rcu;
>> +	struct list_head next;
>> +	struct vfio_group *group;
>> +	struct iommu_table *tbl;
>> +};
>> +
>>  struct kvmppc_spapr_tce_table {
>>  	struct list_head list;
>>  	struct kvm *kvm;
>> @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
>>  	u32 page_shift;
>>  	u64 offset;		/* in pages */
>>  	u64 size;		/* window size in pages */
>> +	struct list_head iommu_tables;
>>  	struct page *pages[0];
>>  };
>>  
>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
>> index 72c2a155641f..66de7e73b3d3 100644
>> --- a/arch/powerpc/include/asm/kvm_ppc.h
>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
>> @@ -164,6 +164,10 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
>>  extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
>>  			struct kvm_memory_slot *memslot, unsigned long porder);
>>  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
>> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
>> +		struct vfio_group *group);
>> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
>> +		struct vfio_group *group);
>>  
>>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>  				struct kvm_create_spapr_tce_64 *args);
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index f5a52ffb6b58..e743cb0d176e 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -1088,6 +1088,7 @@ struct kvm_device_attr {
>>  #define  KVM_DEV_VFIO_GROUP			1
>>  #define   KVM_DEV_VFIO_GROUP_ADD			1
>>  #define   KVM_DEV_VFIO_GROUP_DEL			2
>> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
>>  
>>  enum kvm_device_type {
>>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
>> @@ -1109,6 +1110,13 @@ enum kvm_device_type {
>>  	KVM_DEV_TYPE_MAX,
>>  };
>>  
>> +struct kvm_vfio_spapr_tce {
>> +	__u32	argsz;
>> +	__u32	flags;
>> +	__s32	groupfd;
>> +	__s32	tablefd;
>> +};
>> +
>>  /*
>>   * ioctls for VM fds
>>   */
>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
>> index e96a4590464c..be18cda01e1b 100644
>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>> @@ -28,6 +28,10 @@
>>  #include <linux/hugetlb.h>
>>  #include <linux/list.h>
>>  #include <linux/anon_inodes.h>
>> +#include <linux/iommu.h>
>> +#include <linux/file.h>
>> +#include <linux/vfio.h>
>> +#include <linux/module.h>
>>  
>>  #include <asm/tlbflush.h>
>>  #include <asm/kvm_ppc.h>
>> @@ -40,6 +44,36 @@
>>  #include <asm/udbg.h>
>>  #include <asm/iommu.h>
>>  #include <asm/tce.h>
>> +#include <asm/mmu_context.h>
>> +
>> +static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
>> +{
>> +	void (*fn)(struct vfio_group *);
>> +
>> +	fn = symbol_get(vfio_group_put_external_user);
>> +	if (WARN_ON(!fn))
>> +		return;
>> +
>> +	fn(vfio_group);
>> +
>> +	symbol_put(vfio_group_put_external_user);
>> +}
>> +
>> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
>> +{
>> +	int (*fn)(struct vfio_group *);
>> +	int ret = -1;
>> +
>> +	fn = symbol_get(vfio_external_user_iommu_id);
>> +	if (!fn)
>> +		return ret;
>> +
>> +	ret = fn(vfio_group);
>> +
>> +	symbol_put(vfio_external_user_iommu_id);
>> +
>> +	return ret;
>> +}
> 
> 
> Ugh.  This feels so wrong.  Why can't you have kvm-vfio pass the
> iommu_group?  Why do you need to hold this additional vfio_group
> reference?


This is embarrassing, but now I am not even sure I really need to hold a
reference to the group in any form; I tend to think this is a leftover from
older versions when iommu_table did not have reference counting.

The iommu_table struct has a reference counter now, and this patch is only
concerned with iommu_table structs, which are referenced by KVM.

iommu_group structs are referenced for as long as a pnv_ioda_pe (an IODA PE ==
an IOMMU group on powernv) lives, and iommu_group structs get attached/detached
to/from iommu_table structs when a DMA window is set/unset on the PE - this is
controlled by the vfio_iommu_spapr_tce ioctls() + RCU. If an iommu_table has
no attached iommu_group, it is just a table without any cache
invalidation.

So in v9 I'll remove the group reference - am I missing anything now?
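
A minimal sketch of what that could look like (assuming the v9 direction
described above; this is not posted code):

	struct kvmppc_spapr_tce_iommu_table {
		struct rcu_head rcu;
		struct list_head next;
		struct iommu_table *tbl;	/* pinned via iommu_table_get() */
	};

	static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
	{
		struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
				struct kvmppc_spapr_tce_iommu_table, rcu);

		/* drop the only reference KVM holds; no vfio_group to put */
		iommu_table_put(stit->tbl);
		kfree(stit);
	}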


>>  
>>  static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
>>  {
>> @@ -91,6 +125,130 @@ static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
>>  	return ret;
>>  }
>>  
>> +static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
>> +{
>> +	struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
>> +			struct kvmppc_spapr_tce_iommu_table, rcu);
>> +
>> +	iommu_table_put(stit->tbl);
>> +	kvm_vfio_group_put_external_user(stit->group);
>> +
>> +	kfree(stit);
>> +}
>> +
>> +static void kvm_spapr_tce_liobn_release_iommu_group(
>> +		struct kvmppc_spapr_tce_table *stt,
>> +		struct vfio_group *group)
>> +{
>> +	struct kvmppc_spapr_tce_iommu_table *stit, *tmp;
>> +
>> +	list_for_each_entry_safe(stit, tmp, &stt->iommu_tables, next) {
>> +		if (group && (stit->group != group))
>> +			continue;
>> +
>> +		list_del_rcu(&stit->next);
>> +
>> +		call_rcu(&stit->rcu, kvm_spapr_tce_iommu_table_free);
>> +	}
>> +}
>> +
>> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
>> +		struct vfio_group *group)
>> +{
>> +	struct kvmppc_spapr_tce_table *stt;
>> +
>> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list)
>> +		kvm_spapr_tce_liobn_release_iommu_group(stt, group);
>> +}
>> +
>> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
>> +		struct vfio_group *group)
>> +{
>> +	struct kvmppc_spapr_tce_table *stt = NULL;
>> +	bool found = false;
>> +	struct iommu_table *tbl = NULL;
>> +	struct iommu_table_group *table_group;
>> +	long i, ret = 0;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>> +	struct fd f;
>> +	int group_id;
>> +	struct iommu_group *grp;
>> +
>> +	group_id = kvm_vfio_external_user_iommu_id(group);
>> +	grp = iommu_group_get_by_id(group_id);
>> +	if (WARN_ON(!grp))
>> +		return -EIO;
>> +
>> +	f = fdget(tablefd);
>> +	if (!f.file) {
>> +		ret = -EBADF;
>> +		goto put_exit;
>> +	}
>> +
>> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
>> +		if (stt == f.file->private_data) {
>> +			found = true;
>> +			break;
>> +		}
>> +	}
>> +
>> +	fdput(f);
>> +
>> +	if (!found) {
>> +		ret = -EINVAL;
>> +		goto put_exit;
>> +	}
>> +
>> +	table_group = iommu_group_get_iommudata(grp);
>> +	if (WARN_ON(!table_group)) {
>> +		ret = -EFAULT;
>> +		goto put_exit;
>> +	}
>> +
>> +	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
>> +		struct iommu_table *tbltmp = table_group->tables[i];
>> +
>> +		if (!tbltmp)
>> +			continue;
>> +
>> +		/*
>> +		 * Make sure hardware table parameters are exactly the same;
>> +		 * this is used in the TCE handlers where boundary checks
>> +		 * use only the first attached table.
>> +		 */
>> +		if ((tbltmp->it_page_shift == stt->page_shift) &&
>> +				(tbltmp->it_offset == stt->offset) &&
>> +				(tbltmp->it_size == stt->size)) {
>> +			tbl = tbltmp;
>> +			break;
>> +		}
>> +	}
>> +	if (!tbl) {
>> +		ret = -EINVAL;
>> +		goto put_exit;
>> +	}
>> +
>> +	list_for_each_entry_rcu(stit, &stt->iommu_tables, next) {
>> +		if ((stit->tbl == tbl) && (stit->group == group)) {
>> +			ret = -EBUSY;
>> +			goto put_exit;
>> +		}
>> +	}
>> +
>> +	iommu_table_get(tbl);
>> +
>> +	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
>> +	stit->tbl = tbl;
>> +	stit->group = group;
>> +
>> +	list_add_rcu(&stit->next, &stt->iommu_tables);
>> +
>> +put_exit:
>> +	iommu_group_put(grp);
>> +
>> +	return ret;
>> +}
>> +
>>  static void release_spapr_tce_table(struct rcu_head *head)
>>  {
>>  	struct kvmppc_spapr_tce_table *stt = container_of(head,
>> @@ -133,6 +291,8 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
>>  
>>  	list_del_rcu(&stt->list);
>>  
>> +	kvm_spapr_tce_liobn_release_iommu_group(stt, NULL /* release all */);
>> +
>>  	kvm_put_kvm(stt->kvm);
>>  
>>  	kvmppc_account_memlimit(
>> @@ -183,6 +343,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>  	stt->offset = args->offset;
>>  	stt->size = size;
>>  	stt->kvm = kvm;
>> +	INIT_LIST_HEAD_RCU(&stt->iommu_tables);
>>  
>>  	for (i = 0; i < npages; i++) {
>>  		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
>> @@ -211,11 +372,101 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>  	return ret;
>>  }
>>  
>> +static void kvmppc_clear_tce(struct iommu_table *tbl, unsigned long entry)
>> +{
>> +	unsigned long hpa = 0;
>> +	enum dma_data_direction dir = DMA_NONE;
>> +
>> +	iommu_tce_xchg(tbl, entry, &hpa, &dir);
>> +}
>> +
>> +static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
>> +		struct iommu_table *tbl, unsigned long entry)
>> +{
>> +	struct mm_iommu_table_group_mem_t *mem = NULL;
>> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
>> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>> +
>> +	if (!pua)
>> +		/* it_userspace allocation might be delayed */
>> +		return H_TOO_HARD;
>> +
>> +	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
>> +	if (!mem)
>> +		return H_TOO_HARD;
>> +
>> +	mm_iommu_mapped_dec(mem);
>> +
>> +	*pua = 0;
>> +
>> +	return H_SUCCESS;
>> +}
>> +
>> +static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
>> +		struct iommu_table *tbl, unsigned long entry)
>> +{
>> +	enum dma_data_direction dir = DMA_NONE;
>> +	unsigned long hpa = 0;
>> +	long ret;
>> +
>> +	if (WARN_ON_ONCE(iommu_tce_xchg(tbl, entry, &hpa, &dir)))
>> +		return H_HARDWARE;
>> +
>> +	if (dir == DMA_NONE)
>> +		return H_SUCCESS;
>> +
>> +	ret = kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
>> +	if (ret != H_SUCCESS)
>> +		iommu_tce_xchg(tbl, entry, &hpa, &dir);
>> +
>> +	return ret;
>> +}
>> +
>> +long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
>> +		unsigned long entry, unsigned long ua,
>> +		enum dma_data_direction dir)
>> +{
>> +	long ret;
>> +	unsigned long hpa, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>> +	struct mm_iommu_table_group_mem_t *mem;
>> +
>> +	if (!pua)
>> +		/* it_userspace allocation might be delayed */
>> +		return H_TOO_HARD;
>> +
>> +	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
>> +	if (!mem)
>> +		/* This only handles v2 IOMMU type, v1 is handled via ioctl() */
>> +		return H_TOO_HARD;
>> +
>> +	if (WARN_ON_ONCE(mm_iommu_ua_to_hpa(mem, ua, &hpa)))
>> +		return H_HARDWARE;
>> +
>> +	if (mm_iommu_mapped_inc(mem))
>> +		return H_CLOSED;
>> +
>> +	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
>> +	if (WARN_ON_ONCE(ret)) {
>> +		mm_iommu_mapped_dec(mem);
>> +		return H_HARDWARE;
>> +	}
>> +
>> +	if (dir != DMA_NONE)
>> +		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
>> +
>> +	*pua = ua;
>> +
>> +	return 0;
>> +}
>> +
>>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>  		      unsigned long ioba, unsigned long tce)
>>  {
>>  	struct kvmppc_spapr_tce_table *stt;
>> -	long ret;
>> +	long ret, idx;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>> +	unsigned long entry, ua = 0;
>> +	enum dma_data_direction dir;
>>  
>>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>>  	/* 	    liobn, ioba, tce); */
>> @@ -232,7 +483,35 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>  	if (ret != H_SUCCESS)
>>  		return ret;
>>  
>> -	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
>> +	dir = iommu_tce_direction(tce);
>> +	if ((dir != DMA_NONE) && kvmppc_gpa_to_ua(vcpu->kvm,
>> +			tce & ~(TCE_PCI_READ | TCE_PCI_WRITE), &ua, NULL))
>> +		return H_PARAMETER;
>> +
>> +	entry = ioba >> stt->page_shift;
>> +
>> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>> +		if (dir == DMA_NONE) {
>> +			ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
>> +					stit->tbl, entry);
>> +		} else {
>> +			idx = srcu_read_lock(&vcpu->kvm->srcu);
>> +			ret = kvmppc_tce_iommu_map(vcpu->kvm, stit->tbl,
>> +					entry, ua, dir);
>> +			srcu_read_unlock(&vcpu->kvm->srcu, idx);
>> +		}
>> +
>> +		if (ret == H_SUCCESS)
>> +			continue;
>> +
>> +		if (ret == H_TOO_HARD)
>> +			return ret;
>> +
>> +		WARN_ON_ONCE(1);
>> +		kvmppc_clear_tce(stit->tbl, entry);
>> +	}
>> +
>> +	kvmppc_tce_put(stt, entry, tce);
>>  
>>  	return H_SUCCESS;
>>  }
>> @@ -247,6 +526,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>  	unsigned long entry, ua = 0;
>>  	u64 __user *tces;
>>  	u64 tce;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>  
>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>>  	if (!stt)
>> @@ -285,6 +565,26 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>  		if (ret != H_SUCCESS)
>>  			goto unlock_exit;
>>  
>> +		if (kvmppc_gpa_to_ua(vcpu->kvm,
>> +				tce & ~(TCE_PCI_READ | TCE_PCI_WRITE),
>> +				&ua, NULL))
>> +			return H_PARAMETER;
>> +
>> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>> +			ret = kvmppc_tce_iommu_map(vcpu->kvm,
>> +					stit->tbl, entry + i, ua,
>> +					iommu_tce_direction(tce));
>> +
>> +			if (ret == H_SUCCESS)
>> +				continue;
>> +
>> +			if (ret == H_TOO_HARD)
>> +				goto unlock_exit;
>> +
>> +			WARN_ON_ONCE(1);
>> +			kvmppc_clear_tce(stit->tbl, entry);
>> +		}
>> +
>>  		kvmppc_tce_put(stt, entry + i, tce);
>>  	}
>>  
>> @@ -301,6 +601,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>>  {
>>  	struct kvmppc_spapr_tce_table *stt;
>>  	long i, ret;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>  
>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>>  	if (!stt)
>> @@ -314,6 +615,24 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>>  		return H_PARAMETER;
>>  
>> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>> +		unsigned long entry = ioba >> stit->tbl->it_page_shift;
>> +
>> +		for (i = 0; i < npages; ++i) {
>> +			ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
>> +					stit->tbl, entry + i);
>> +
>> +			if (ret == H_SUCCESS)
>> +				continue;
>> +
>> +			if (ret == H_TOO_HARD)
>> +				return ret;
>> +
>> +			WARN_ON_ONCE(1);
>> +			kvmppc_clear_tce(stit->tbl, entry);
>> +		}
>> +	}
>> +
>>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>>  
>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
>> index 440d3ab5dc32..eda0a8f6fae8 100644
>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
>> @@ -40,6 +40,31 @@
>>  #include <asm/iommu.h>
>>  #include <asm/tce.h>
>>  
>> +#ifdef CONFIG_BUG
>> +
>> +#define WARN_ON_ONCE_RM(condition)	({			\
>> +	static bool __section(.data.unlikely) __warned;		\
>> +	int __ret_warn_once = !!(condition);			\
>> +								\
>> +	if (unlikely(__ret_warn_once && !__warned)) {		\
>> +		__warned = true;				\
>> +		pr_err("WARN_ON_ONCE_RM: (%s) at %s:%u\n",	\
>> +				__stringify(condition),		\
>> +				__func__, __LINE__);		\
>> +		dump_stack();					\
>> +	}							\
>> +	unlikely(__ret_warn_once);				\
>> +})
>> +
>> +#else
>> +
>> +#define WARN_ON_ONCE_RM(condition) ({				\
>> +	int __ret_warn_on = !!(condition);			\
>> +	unlikely(__ret_warn_on);				\
>> +})
>> +
>> +#endif
>> +
>>  #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
>>  
>>  /*
>> @@ -161,11 +186,117 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
>>  EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
>>  
>>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
>> +static void kvmppc_rm_clear_tce(struct iommu_table *tbl, unsigned long entry)
>> +{
>> +	unsigned long hpa = 0;
>> +	enum dma_data_direction dir = DMA_NONE;
>> +
>> +	iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
>> +}
>> +
>> +static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm *kvm,
>> +		struct iommu_table *tbl, unsigned long entry)
>> +{
>> +	struct mm_iommu_table_group_mem_t *mem = NULL;
>> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
>> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>> +
>> +	if (!pua)
>> +		/* it_userspace allocation might be delayed */
>> +		return H_TOO_HARD;
>> +
>> +	pua = (void *) vmalloc_to_phys(pua);
>> +	if (WARN_ON_ONCE_RM(!pua))
>> +		return H_HARDWARE;
>> +
>> +	mem = mm_iommu_lookup_rm(kvm->mm, *pua, pgsize);
>> +	if (!mem)
>> +		return H_TOO_HARD;
>> +
>> +	mm_iommu_mapped_dec(mem);
>> +
>> +	*pua = 0;
>> +
>> +	return H_SUCCESS;
>> +}
>> +
>> +static long kvmppc_rm_tce_iommu_unmap(struct kvm *kvm,
>> +		struct iommu_table *tbl, unsigned long entry)
>> +{
>> +	enum dma_data_direction dir = DMA_NONE;
>> +	unsigned long hpa = 0;
>> +	long ret;
>> +
>> +	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
>> +		/*
>> +		 * real mode xchg can fail if struct page crosses
>> +		 * a page boundary
>> +		 */
>> +		return H_TOO_HARD;
>> +
>> +	if (dir == DMA_NONE)
>> +		return H_SUCCESS;
>> +
>> +	ret = kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
>> +	if (ret)
>> +		iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
>> +
>> +	return ret;
>> +}
>> +
>> +static long kvmppc_rm_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
>> +		unsigned long entry, unsigned long ua,
>> +		enum dma_data_direction dir)
>> +{
>> +	long ret;
>> +	unsigned long hpa = 0;
>> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>> +	struct mm_iommu_table_group_mem_t *mem;
>> +
>> +	if (!pua)
>> +		/* it_userspace allocation might be delayed */
>> +		return H_TOO_HARD;
>> +
>> +	mem = mm_iommu_lookup_rm(kvm->mm, ua, 1ULL << tbl->it_page_shift);
>> +	if (!mem)
>> +		return H_TOO_HARD;
>> +
>> +	if (WARN_ON_ONCE_RM(mm_iommu_ua_to_hpa_rm(mem, ua, &hpa)))
>> +		return H_HARDWARE;
>> +
>> +	pua = (void *) vmalloc_to_phys(pua);
>> +	if (WARN_ON_ONCE_RM(!pua))
>> +		return H_HARDWARE;
>> +
>> +	if (WARN_ON_ONCE_RM(mm_iommu_mapped_inc(mem)))
>> +		return H_CLOSED;
>> +
>> +	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
>> +	if (ret) {
>> +		mm_iommu_mapped_dec(mem);
>> +		/*
>> +		 * real mode xchg can fail if struct page crosses
>> +		 * a page boundary
>> +		 */
>> +		return H_TOO_HARD;
>> +	}
>> +
>> +	if (dir != DMA_NONE)
>> +		kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
>> +
>> +	*pua = ua;
>> +
>> +	return 0;
>> +}
>> +
>>  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>  		unsigned long ioba, unsigned long tce)
>>  {
>>  	struct kvmppc_spapr_tce_table *stt;
>>  	long ret;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>> +	unsigned long entry, ua = 0;
>> +	enum dma_data_direction dir;
>>  
>>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>>  	/* 	    liobn, ioba, tce); */
>> @@ -182,7 +313,32 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>  	if (ret != H_SUCCESS)
>>  		return ret;
>>  
>> -	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
>> +	dir = iommu_tce_direction(tce);
>> +	if ((dir != DMA_NONE) && kvmppc_gpa_to_ua(vcpu->kvm,
>> +			tce & ~(TCE_PCI_READ | TCE_PCI_WRITE), &ua, NULL))
>> +		return H_PARAMETER;
>> +
>> +	entry = ioba >> stt->page_shift;
>> +
>> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>> +		if (dir == DMA_NONE)
>> +			ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
>> +					stit->tbl, entry);
>> +		else
>> +			ret = kvmppc_rm_tce_iommu_map(vcpu->kvm,
>> +					stit->tbl, entry, ua, dir);
>> +
>> +		if (ret == H_SUCCESS)
>> +			continue;
>> +
>> +		if (ret == H_TOO_HARD)
>> +			return ret;
>> +
>> +		WARN_ON_ONCE_RM(1);
>> +		kvmppc_rm_clear_tce(stit->tbl, entry);
>> +	}
>> +
>> +	kvmppc_tce_put(stt, entry, tce);
>>  
>>  	return H_SUCCESS;
>>  }
>> @@ -223,6 +379,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>  	unsigned long tces, entry, ua = 0;
>>  	unsigned long *rmap = NULL;
>>  	bool prereg = false;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>  
>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>>  	if (!stt)
>> @@ -270,6 +427,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>  			return H_TOO_HARD;
>>  
>>  		rmap = (void *) vmalloc_to_phys(rmap);
>> +		if (WARN_ON_ONCE_RM(!rmap))
>> +			return H_HARDWARE;
>>  
>>  		/*
>>  		 * Synchronize with the MMU notifier callbacks in
>> @@ -293,6 +452,27 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>  		if (ret != H_SUCCESS)
>>  			goto unlock_exit;
>>  
>> +		ua = 0;
>> +		if (kvmppc_gpa_to_ua(vcpu->kvm,
>> +				tce & ~(TCE_PCI_READ | TCE_PCI_WRITE),
>> +				&ua, NULL))
>> +			return H_PARAMETER;
>> +
>> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>> +			ret = kvmppc_rm_tce_iommu_map(vcpu->kvm,
>> +					stit->tbl, entry + i, ua,
>> +					iommu_tce_direction(tce));
>> +
>> +			if (ret == H_SUCCESS)
>> +				continue;
>> +
>> +			if (ret == H_TOO_HARD)
>> +				goto unlock_exit;
>> +
>> +			WARN_ON_ONCE_RM(1);
>> +			kvmppc_rm_clear_tce(stit->tbl, entry);
>> +		}
>> +
>>  		kvmppc_tce_put(stt, entry + i, tce);
>>  	}
>>  
>> @@ -309,6 +489,7 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>>  {
>>  	struct kvmppc_spapr_tce_table *stt;
>>  	long i, ret;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>  
>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>>  	if (!stt)
>> @@ -322,6 +503,24 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>>  		return H_PARAMETER;
>>  
>> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>> +		unsigned long entry = ioba >> stit->tbl->it_page_shift;
>> +
>> +		for (i = 0; i < npages; ++i) {
>> +			ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
>> +					stit->tbl, entry + i);
>> +
>> +			if (ret == H_SUCCESS)
>> +				continue;
>> +
>> +			if (ret == H_TOO_HARD)
>> +				return ret;
>> +
>> +			WARN_ON_ONCE_RM(1);
>> +			kvmppc_rm_clear_tce(stit->tbl, entry);
>> +		}
>> +	}
>> +
>>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>>  
>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>> index 95c91a9de351..62bdd6c48107 100644
>> --- a/arch/powerpc/kvm/powerpc.c
>> +++ b/arch/powerpc/kvm/powerpc.c
>> @@ -538,6 +538,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>>  #ifdef CONFIG_PPC_BOOK3S_64
>>  	case KVM_CAP_SPAPR_TCE:
>>  	case KVM_CAP_SPAPR_TCE_64:
>> +		/* fallthrough */
>> +	case KVM_CAP_SPAPR_TCE_VFIO:
>>  	case KVM_CAP_PPC_RTAS:
>>  	case KVM_CAP_PPC_FIXUP_HCALL:
>>  	case KVM_CAP_PPC_ENABLE_HCALL:
>> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
>> index d32f239eb471..2b7dc22265fe 100644
>> --- a/virt/kvm/vfio.c
>> +++ b/virt/kvm/vfio.c
>> @@ -20,6 +20,10 @@
>>  #include <linux/vfio.h>
>>  #include "vfio.h"
>>  
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +#include <asm/kvm_ppc.h>
>> +#endif
>> +
>>  struct kvm_vfio_group {
>>  	struct list_head node;
>>  	struct vfio_group *vfio_group;
>> @@ -211,6 +215,9 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>>  
>>  		mutex_unlock(&kv->lock);
>>  
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +		kvm_spapr_tce_release_iommu_group(dev->kvm, vfio_group);
>> +#endif
>>  		kvm_vfio_group_set_kvm(vfio_group, NULL);
>>  
>>  		kvm_vfio_group_put_external_user(vfio_group);
>> @@ -218,6 +225,53 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>>  		kvm_vfio_update_coherency(dev);
>>  
>>  		return ret;
>> +
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
>> +		struct kvm_vfio_spapr_tce param;
>> +		unsigned long minsz;
>> +		struct kvm_vfio *kv = dev->private;
>> +		struct vfio_group *vfio_group;
>> +		struct kvm_vfio_group *kvg;
>> +		struct fd f;
>> +
>> +		minsz = offsetofend(struct kvm_vfio_spapr_tce, tablefd);
>> +
>> +		if (copy_from_user(&param, (void __user *)arg, minsz))
>> +			return -EFAULT;
>> +
>> +		if (param.argsz < minsz || param.flags)
>> +			return -EINVAL;
>> +
>> +		f = fdget(param.groupfd);
>> +		if (!f.file)
>> +			return -EBADF;
>> +
>> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
>> +		fdput(f);
>> +
>> +		if (IS_ERR(vfio_group))
>> +			return PTR_ERR(vfio_group);
>> +
>> +		ret = -ENOENT;
>> +
>> +		mutex_lock(&kv->lock);
>> +
>> +		list_for_each_entry(kvg, &kv->group_list, node) {
>> +			if (kvg->vfio_group != vfio_group)
>> +				continue;
>> +
>> +			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
>> +					param.tablefd, vfio_group);
>> +
>> +			break;
>> +		}
>> +
>> +		mutex_unlock(&kv->lock);
>> +
>> +		return ret;
>> +	}
>> +#endif /* CONFIG_SPAPR_TCE_IOMMU */
> 
> 
> The group reference is leaked if kvm_spapr_tce_attach_iommu_group()
> fails.  My preference would be to not hold that separate group
> reference in the spapr code anyway; having a parallel life cycle over
> there is confusing and results in ugliness like duplicating 
> kvm_vfio_group_put_external_user().  Thanks,
> 
> Alex
> 
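A sketch of one way to plug that leak (an assumption, not a posted fix):
the tail of the SET_SPAPR_TCE handler would drop the lookup reference
whenever the attach does not take ownership of it:

	mutex_unlock(&kv->lock);

	/* on success the reference is transferred to the stit descriptor */
	if (ret)
		kvm_vfio_group_put_external_user(vfio_group);

	return ret;
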
>>  	}
>>  
>>  	return -ENXIO;
>> @@ -242,6 +296,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
>>  		switch (attr->attr) {
>>  		case KVM_DEV_VFIO_GROUP_ADD:
>>  		case KVM_DEV_VFIO_GROUP_DEL:
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
>> +#endif
>>  			return 0;
>>  		}
>>  
>> @@ -257,6 +314,9 @@ static void kvm_vfio_destroy(struct kvm_device *dev)
>>  	struct kvm_vfio_group *kvg, *tmp;
>>  
>>  	list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +		kvm_spapr_tce_release_iommu_group(dev->kvm, kvg->vfio_group);
>> +#endif
>>  		kvm_vfio_group_set_kvm(kvg->vfio_group, NULL);
>>  		kvm_vfio_group_put_external_user(kvg->vfio_group);
>>  		list_del(&kvg->node);
> 


-- 
Alexey
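
For reference, a minimal userspace sketch of driving the new attribute
(assuming a device fd from KVM_CREATE_DEVICE with KVM_DEV_TYPE_VFIO, a VFIO
group fd, and a table fd from KVM_CREATE_SPAPR_TCE; the helper name is made up):

	#include <linux/kvm.h>
	#include <sys/ioctl.h>

	static int kvm_vfio_set_spapr_tce(int vfio_dev_fd, int groupfd, int tablefd)
	{
		struct kvm_vfio_spapr_tce param = {
			.argsz = sizeof(param),
			.flags = 0,
			.groupfd = groupfd,
			.tablefd = tablefd,
		};
		struct kvm_device_attr attr = {
			.group = KVM_DEV_VFIO_GROUP,
			.attr = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE,
			.addr = (__u64)(unsigned long)&param,
		};

		/* one call per VFIO group sharing the LIOBN */
		return ioctl(vfio_dev_fd, KVM_SET_DEVICE_ATTR, &attr);
	}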

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH kernel v8 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
  2017-03-15  4:40       ` David Gibson
  (?)
@ 2017-03-15 16:18         ` Alex Williamson
  -1 siblings, 0 replies; 53+ messages in thread
From: Alex Williamson @ 2017-03-15 16:18 UTC (permalink / raw)
  To: David Gibson
  Cc: Alexey Kardashevskiy, Paul Mackerras, linuxppc-dev, kvm, kvm-ppc

On Wed, 15 Mar 2017 15:40:14 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> > > diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> > > index e96a4590464c..be18cda01e1b 100644
> > > --- a/arch/powerpc/kvm/book3s_64_vio.c
> > > +++ b/arch/powerpc/kvm/book3s_64_vio.c
> > > @@ -28,6 +28,10 @@
> > >  #include <linux/hugetlb.h>
> > >  #include <linux/list.h>
> > >  #include <linux/anon_inodes.h>
> > > +#include <linux/iommu.h>
> > > +#include <linux/file.h>
> > > +#include <linux/vfio.h>
> > > +#include <linux/module.h>
> > >  
> > >  #include <asm/tlbflush.h>
> > >  #include <asm/kvm_ppc.h>
> > > @@ -40,6 +44,36 @@
> > >  #include <asm/udbg.h>
> > >  #include <asm/iommu.h>
> > >  #include <asm/tce.h>
> > > +#include <asm/mmu_context.h>
> > > +
> > > +static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
> > > +{
> > > +	void (*fn)(struct vfio_group *);
> > > +
> > > +	fn = symbol_get(vfio_group_put_external_user);
> > > +	if (WARN_ON(!fn))
> > > +		return;
> > > +
> > > +	fn(vfio_group);
> > > +
> > > +	symbol_put(vfio_group_put_external_user);
> > > +}
> > > +
> > > +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> > > +{
> > > +	int (*fn)(struct vfio_group *);
> > > +	int ret = -1;
> > > +
> > > +	fn = symbol_get(vfio_external_user_iommu_id);
> > > +	if (!fn)
> > > +		return ret;
> > > +
> > > +	ret = fn(vfio_group);
> > > +
> > > +	symbol_put(vfio_external_user_iommu_id);
> > > +
> > > +	return ret;
> > > +}  
> > 
> > 
> > Ugh.  This feels so wrong.  Why can't you have kvm-vfio pass the
> > iommu_group?  Why do you need to hold this additional vfio_group
> > reference?  
> 
> Keeping the vfio_group reference makes sense to me, since we don't
> want the vfio context for the group to go away while it's attached to
> the LIOBN.

But there's already a reference for that, it's taken by
KVM_DEV_VFIO_GROUP_ADD and held until KVM_DEV_VFIO_GROUP_DEL.  Both the
DEL path and the cleanup path call kvm_spapr_tce_release_iommu_group()
before releasing that reference, so it seems entirely redundant.

> However, going via the iommu_id rather than just having an interface
> to directly grab the iommu group from the vfio_group seems bizarre to
> me.  I'm ok with cleaning that up later, however.

We have kvm_spapr_tce_attach_iommu_group() and
kvm_spapr_tce_release_iommu_group(), but both take a vfio_group, not an
iommu_group as a parameter.  I don't particularly have a problem with
the vfio_group -> iommu ID -> iommu_group, but if we drop the extra
vfio_group reference and pass the iommu_group itself to these functions
then we can keep all the symbol reference stuff in the kvm-vfio glue
layer.  Thanks,

Alex
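
A sketch of the suggested interface change (an assumption about a follow-up
revision, not posted code) - take the iommu_group directly so all the vfio
symbol lookups stay in virt/kvm/vfio.c:

	extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
			struct iommu_group *grp);
	extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
			struct iommu_group *grp);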

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH kernel v8 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
  2017-03-15 13:21       ` Alexey Kardashevskiy
  (?)
@ 2017-03-15 16:39         ` Alex Williamson
  -1 siblings, 0 replies; 53+ messages in thread
From: Alex Williamson @ 2017-03-15 16:39 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Paul Mackerras, linuxppc-dev, kvm, kvm-ppc, David Gibson

On Thu, 16 Mar 2017 00:21:07 +1100
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 15/03/17 08:05, Alex Williamson wrote:
> > On Fri, 10 Mar 2017 14:53:37 +1100
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >   
> >> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> >> and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
> >> without passing them to user space, which saves time on switching
> >> to user space and back.
> >>
> >> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> >> KVM tries to handle a TCE request in real mode; if that fails,
> >> it passes the request to virtual mode to complete the operation.
> >> If the virtual mode handler fails as well, the request is passed to
> >> user space; this is not expected to happen though.
> >>
> >> To avoid dealing with page use counters (which is tricky in real mode),
> >> this only accelerates SPAPR TCE IOMMU v2 clients which are required
> >> to pre-register the userspace memory. The very first TCE request will
> >> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> >> of the TCE table (iommu_table::it_userspace) is not allocated till
> >> the very first mapping happens and we cannot call vmalloc in real mode.
> >>
> >> If we fail to update a hardware IOMMU table for an unexpected reason, we just
> >> clear it and move on as there is nothing really we can do about it -
> >> for example, if we hot plug a VFIO device to a guest, existing TCE tables
> >> will be mirrored automatically to the hardware and there is no interface
> >> to report to the guest about possible failures.
> >>
> >> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> >> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> >> and associates a physical IOMMU table with the SPAPR TCE table (which
> >> is a guest view of the hardware IOMMU table). The iommu_table object
> >> is cached and referenced so we do not have to look it up in real mode.
> >>
> >> This does not implement the UNSET counterpart as there is no use for it -
> >> once the acceleration is enabled, the existing userspace won't
> >> disable it unless a VFIO container is destroyed; this adds necessary
> >> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> >>
> >> As this creates a descriptor per IOMMU table-LIOBN couple (called
> >> kvmppc_spapr_tce_iommu_table), it is possible to have several
> >> descriptors with the same iommu_table (hardware IOMMU table) attached
> >> to the same LIOBN; we do not remove duplicates though, as
> >> iommu_table_ops::exchange does not just update a TCE entry (which is
> >> shared among IOMMU groups) but also invalidates the TCE cache
> >> (one per IOMMU group).
> >>
> >> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> >> space.
> >>
> >> This adds a real mode version of WARN_ON_ONCE() as the generic version
> >> causes problems with rcu_sched. Since we are testing what vmalloc_to_phys()
> >> returns in the code, this also adds a check for already existing
> >> vmalloc_to_phys() call in kvmppc_rm_h_put_tce_indirect().
> >>
> >> This finally makes use of vfio_external_user_iommu_id() which was
> >> introduced quite some time ago and was considered for removal.
> >>
> >> Tests show that this patch increases transmission speed from 220MB/s
> >> to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb ethernet card).
> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> ---
> >> Changes:
> >> v8:
> >> * changed all (!pua) checks to return H_TOO_HARD as ioctl() is supposed
> >> to handle them
> >> * changed vmalloc_to_phys() callers to return H_HARDWARE
> >> * changed real mode iommu_tce_xchg_rm() callers to return H_TOO_HARD
> >> and added a comment about this in the code
> >> * changed virtual mode iommu_tce_xchg() callers to return H_HARDWARE
> >> and do WARN_ON
> >> * added WARN_ON_ONCE_RM(!rmap) in kvmppc_rm_h_put_tce_indirect() to
> >> have all vmalloc_to_phys() callsites covered
> >>
> >> v7:
> >> * added realmode-friendly WARN_ON_ONCE_RM
> >>
> >> v6:
> >> * changed handling of errors returned by kvmppc_(rm_)tce_iommu_(un)map()
> >> * moved kvmppc_gpa_to_ua() to TCE validation
> >>
> >> v5:
> >> * changed error codes in multiple places
> >> * added a bunch of WARN_ON()s in places which should not really happen
> >> * added a check that an iommu table is not already attached to the LIOBN
> >> * dropped explicit calls to iommu_tce_clear_param_check/
> >> iommu_tce_put_param_check as kvmppc_tce_validate/kvmppc_ioba_validate
> >> call them anyway (since the previous patch)
> >> * if we fail to update a hardware IOMMU table for unexpected reason,
> >> this just clears the entry
> >>
> >> v4:
> >> * added note to the commit log about allowing multiple updates of
> >> the same IOMMU table;
> >> * instead of checking whether any memory was preregistered, this
> >> returns H_TOO_HARD if a specific page was not;
> >> * fixed comments from v3 about error handling in many places;
> >> * simplified TCE handlers and merged IOMMU parts inline - for example,
> >> there used to be kvmppc_h_put_tce_iommu(), now it is merged into
> >> kvmppc_h_put_tce(); this allows checking IOBA boundaries against
> >> the first attached table only (which makes the code simpler);
> >>
> >> v3:
> >> * simplified not to use VFIO group notifiers
> >> * reworked cleanup, should be cleaner/simpler now
> >>
> >> v2:
> >> * reworked to use new VFIO notifiers
> >> * now same iommu_table may appear in the list several times, to be fixed later
> >> ---
> >>  Documentation/virtual/kvm/devices/vfio.txt |  22 +-
> >>  arch/powerpc/include/asm/kvm_host.h        |   8 +
> >>  arch/powerpc/include/asm/kvm_ppc.h         |   4 +
> >>  include/uapi/linux/kvm.h                   |   8 +
> >>  arch/powerpc/kvm/book3s_64_vio.c           | 323 ++++++++++++++++++++++++++++-
> >>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 201 +++++++++++++++++-
> >>  arch/powerpc/kvm/powerpc.c                 |   2 +
> >>  virt/kvm/vfio.c                            |  60 ++++++
> >>  8 files changed, 623 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> >> index ef51740c67ca..f95d867168ea 100644
> >> --- a/Documentation/virtual/kvm/devices/vfio.txt
> >> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> >> @@ -16,7 +16,25 @@ Groups:
> >>  
> >>  KVM_DEV_VFIO_GROUP attributes:
> >>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> >> +	kvm_device_attr.addr points to an int32_t file descriptor
> >> +	for the VFIO group.
> >>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
> >> +	kvm_device_attr.addr points to an int32_t file descriptor
> >> +	for the VFIO group.
> >> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
> >> +	allocated by sPAPR KVM.
> >> +	kvm_device_attr.addr points to a struct:
> >>  
> >> -For each, kvm_device_attr.addr points to an int32_t file descriptor
> >> -for the VFIO group.
> >> +	struct kvm_vfio_spapr_tce {
> >> +		__u32	argsz;
> >> +		__u32	flags;
> >> +		__s32	groupfd;
> >> +		__s32	tablefd;
> >> +	};
> >> +
> >> +	where
> >> +	@argsz is the size of kvm_vfio_spapr_tce_liobn;  
> > 
> > kvm_vfio_spapr_tce_liobn does not exist.  s/_liobn//?  
> 
> Correct.
> 
> 
> >   
> >> +	@flags are not supported now, must be zero;  
> > 
> > We do this argsz/flags thing on vfio ioctls because ioctls are a bit
> > more of a restricted resource.  We don't want to burn through them so
> > we make them expandable.  I don't know that we have that restriction
> > here and the ADD/DEL support certainly doesn't include it.  Maybe this
> > isn't necessary?  
> 
> 
> It is not, but since I am going to have padding there, I thought I'd give it a
> name. Is that totally pointless, and would "u8 padding[4]" be better?

Or since these are not ioctls, we simply consider them to have a very
specific and limited purpose.  If we have a need to expand on that,
define a new one that either builds on the existing ones, as
GROUP_SET_SPAPR_TCE builds on GROUP_ADD, or supersedes them.  For
ioctls, it's more costly if we abandon one; here it doesn't seem like a
big deal.  Thanks,

Alex
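
In other words, a future extension would get its own attribute number instead
of argsz/flags versioning; a sketch (the _V2 value is purely hypothetical):

	#define   KVM_DEV_VFIO_GROUP_ADD			1
	#define   KVM_DEV_VFIO_GROUP_DEL			2
	#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
	#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_V2		4	/* hypothetical */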

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH kernel v8 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
@ 2017-03-15 16:39         ` Alex Williamson
  0 siblings, 0 replies; 53+ messages in thread
From: Alex Williamson @ 2017-03-15 16:39 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, David Gibson, Paul Mackerras, kvm-ppc, kvm

On Thu, 16 Mar 2017 00:21:07 +1100
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 15/03/17 08:05, Alex Williamson wrote:
> > On Fri, 10 Mar 2017 14:53:37 +1100
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >   
> >> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> >> and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO
> >> without passing them to user space which saves time on switching
> >> to user space and back.
> >>
> >> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> >> KVM tries to handle a TCE request in the real mode, if failed
> >> it passes the request to the virtual mode to complete the operation.
> >> If it a virtual mode handler fails, the request is passed to
> >> the user space; this is not expected to happen though.
> >>
> >> To avoid dealing with page use counters (which is tricky in real mode),
> >> this only accelerates SPAPR TCE IOMMU v2 clients which are required
> >> to pre-register the userspace memory. The very first TCE request will
> >> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> >> of the TCE table (iommu_table::it_userspace) is not allocated till
> >> the very first mapping happens and we cannot call vmalloc in real mode.
> >>
> >> If we fail to update a hardware IOMMU table unexpected reason, we just
> >> clear it and move on as there is nothing really we can do about it -
> >> for example, if we hot plug a VFIO device to a guest, existing TCE tables
> >> will be mirrored automatically to the hardware and there is no interface
> >> to report to the guest about possible failures.
> >>
> >> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> >> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> >> and associates a physical IOMMU table with the SPAPR TCE table (which
> >> is a guest view of the hardware IOMMU table). The iommu_table object
> >> is cached and referenced so we do not have to look up for it in real mode.
> >>
> >> This does not implement the UNSET counterpart as there is no use for it -
> >> once the acceleration is enabled, the existing userspace won't
> >> disable it unless a VFIO container is destroyed; this adds necessary
> >> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> >>
> >> As this creates a descriptor per IOMMU table-LIOBN couple (called
> >> kvmppc_spapr_tce_iommu_table), it is possible to have several
> >> descriptors with the same iommu_table (hardware IOMMU table) attached
> >> to the same LIOBN; we do not remove duplicates though as
> >> iommu_table_ops::exchange not just update a TCE entry (which is
> >> shared among IOMMU groups) but also invalidates the TCE cache
> >> (one per IOMMU group).
> >>
> >> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> >> space.
> >>
> >> This adds real mode version of WARN_ON_ONCE() as the generic version
> >> causes problems with rcu_sched. Since we testing what vmalloc_to_phys()
> >> returns in the code, this also adds a check for already existing
> >> vmalloc_to_phys() call in kvmppc_rm_h_put_tce_indirect().
> >>
> >> This finally makes use of vfio_external_user_iommu_id() which was
> >> introduced quite some time ago and was considered for removal.
> >>
> >> Tests show that this patch increases transmission speed from 220MB/s
> >> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> ---
> >> Changes:
> >> v8:
> >> * changed all (!pua) checks to return H_TOO_HARD as ioctl() is supposed
> >> to handle them
> >> * changed vmalloc_to_phys() callers to return H_HARDWARE
> >> * changed real mode iommu_tce_xchg_rm() callers to return H_TOO_HARD
> >> and added a comment about this in the code
> >> * changed virtual mode iommu_tce_xchg() callers to return H_HARDWARE
> >> and do WARN_ON
> >> * added WARN_ON_ONCE_RM(!rmap) in kvmppc_rm_h_put_tce_indirect() to
> >> have all vmalloc_to_phys() callsites covered
> >>
> >> v7:
> >> * added realmode-friendly WARN_ON_ONCE_RM
> >>
> >> v6:
> >> * changed handling of errors returned by kvmppc_(rm_)tce_iommu_(un)map()
> >> * moved kvmppc_gpa_to_ua() to TCE validation
> >>
> >> v5:
> >> * changed error codes in multiple places
> >> * added bunch of WARN_ON() in places which should not really happen
> >> * adde a check that an iommu table is not attached already to LIOBN
> >> * dropped explicit calls to iommu_tce_clear_param_check/
> >> iommu_tce_put_param_check as kvmppc_tce_validate/kvmppc_ioba_validate
> >> call them anyway (since the previous patch)
> >> * if we fail to update a hardware IOMMU table for unexpected reason,
> >> this just clears the entry
> >>
> >> v4:
> >> * added note to the commit log about allowing multiple updates of
> >> the same IOMMU table;
> >> * instead of checking for if any memory was preregistered, this
> >> returns H_TOO_HARD if a specific page was not;
> >> * fixed comments from v3 about error handling in many places;
> >> * simplified TCE handlers and merged IOMMU parts inline - for example,
> >> there used to be kvmppc_h_put_tce_iommu(), now it is merged into
> >> kvmppc_h_put_tce(); this allows to check IOBA boundaries against
> >> the first attached table only (makes the code simpler);
> >>
> >> v3:
> >> * simplified not to use VFIO group notifiers
> >> * reworked cleanup, should be cleaner/simpler now
> >>
> >> v2:
> >> * reworked to use new VFIO notifiers
> >> * now same iommu_table may appear in the list several times, to be fixed later
> >> ---
> >>  Documentation/virtual/kvm/devices/vfio.txt |  22 +-
> >>  arch/powerpc/include/asm/kvm_host.h        |   8 +
> >>  arch/powerpc/include/asm/kvm_ppc.h         |   4 +
> >>  include/uapi/linux/kvm.h                   |   8 +
> >>  arch/powerpc/kvm/book3s_64_vio.c           | 323 ++++++++++++++++++++++++++++-
> >>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 201 +++++++++++++++++-
> >>  arch/powerpc/kvm/powerpc.c                 |   2 +
> >>  virt/kvm/vfio.c                            |  60 ++++++
> >>  8 files changed, 623 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> >> index ef51740c67ca..f95d867168ea 100644
> >> --- a/Documentation/virtual/kvm/devices/vfio.txt
> >> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> >> @@ -16,7 +16,25 @@ Groups:
> >>  
> >>  KVM_DEV_VFIO_GROUP attributes:
> >>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> >> +	kvm_device_attr.addr points to an int32_t file descriptor
> >> +	for the VFIO group.
> >>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
> >> +	kvm_device_attr.addr points to an int32_t file descriptor
> >> +	for the VFIO group.
> >> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
> >> +	allocated by sPAPR KVM.
> >> +	kvm_device_attr.addr points to a struct:
> >>  
> >> -For each, kvm_device_attr.addr points to an int32_t file descriptor
> >> -for the VFIO group.
> >> +	struct kvm_vfio_spapr_tce {
> >> +		__u32	argsz;
> >> +		__u32	flags;
> >> +		__s32	groupfd;
> >> +		__s32	tablefd;
> >> +	};
> >> +
> >> +	where
> >> +	@argsz is the size of kvm_vfio_spapr_tce_liobn;  
> > 
> > kvm_vfio_spapr_tce_liobn does not exist.  s/_liobn//?  
> 
> Correct.
> 
> 
> >   
> >> +	@flags are not supported now, must be zero;  
> > 
> > We do this argsz/flags thing on vfio ioctls because ioctls are a bit
> > more of a restricted resource.  We don't want to burn through them so
> > we make them expandable.  I don't know that we have that restriction
> > here and the ADD/DEL support certainly doesn't include it.  Maybe this
> > isn't necessary?  
> 
> 
> It is not but since I am going to have padding there, I thought I give it a
> name. Totally pointless and "u8 padding[4]" is better?

Or since these are not ioctls, we simply consider them to have a very
specific and limited purpose.  If we have a need to expand on that,
define a new one that either builds on the existing ones, as
GROUP_SET_SPAPR_TCE builds on GROUP_ADD, or supersedes them.  For
ioctls, it's more costly if we abandon one, here it doesn't seem like a
big deal.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH kernel v8 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
@ 2017-03-15 16:39         ` Alex Williamson
  0 siblings, 0 replies; 53+ messages in thread
From: Alex Williamson @ 2017-03-15 16:39 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Paul Mackerras, linuxppc-dev, kvm, kvm-ppc, David Gibson

On Thu, 16 Mar 2017 00:21:07 +1100
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 15/03/17 08:05, Alex Williamson wrote:
> > On Fri, 10 Mar 2017 14:53:37 +1100
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >   
> >> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> >> and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO
> >> without passing them to user space which saves time on switching
> >> to user space and back.
> >>
> >> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> >> KVM tries to handle a TCE request in the real mode, if failed
> >> it passes the request to the virtual mode to complete the operation.
> >> If it a virtual mode handler fails, the request is passed to
> >> the user space; this is not expected to happen though.
> >>
> >> To avoid dealing with page use counters (which is tricky in real mode),
> >> this only accelerates SPAPR TCE IOMMU v2 clients which are required
> >> to pre-register the userspace memory. The very first TCE request will
> >> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> >> of the TCE table (iommu_table::it_userspace) is not allocated till
> >> the very first mapping happens and we cannot call vmalloc in real mode.
> >>
> >> If we fail to update a hardware IOMMU table unexpected reason, we just
> >> clear it and move on as there is nothing really we can do about it -
> >> for example, if we hot plug a VFIO device to a guest, existing TCE tables
> >> will be mirrored automatically to the hardware and there is no interface
> >> to report to the guest about possible failures.
> >>
> >> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> >> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> >> and associates a physical IOMMU table with the SPAPR TCE table (which
> >> is a guest view of the hardware IOMMU table). The iommu_table object
> >> is cached and referenced so we do not have to look up for it in real mode.
> >>
> >> This does not implement the UNSET counterpart as there is no use for it -
> >> once the acceleration is enabled, the existing userspace won't
> >> disable it unless a VFIO container is destroyed; this adds necessary
> >> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> >>
> >> As this creates a descriptor per IOMMU table-LIOBN couple (called
> >> kvmppc_spapr_tce_iommu_table), it is possible to have several
> >> descriptors with the same iommu_table (hardware IOMMU table) attached
> >> to the same LIOBN; we do not remove duplicates though as
> >> iommu_table_ops::exchange not just update a TCE entry (which is
> >> shared among IOMMU groups) but also invalidates the TCE cache
> >> (one per IOMMU group).
> >>
> >> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> >> space.
> >>
> >> This adds real mode version of WARN_ON_ONCE() as the generic version
> >> causes problems with rcu_sched. Since we testing what vmalloc_to_phys()
> >> returns in the code, this also adds a check for already existing
> >> vmalloc_to_phys() call in kvmppc_rm_h_put_tce_indirect().
> >>
> >> This finally makes use of vfio_external_user_iommu_id() which was
> >> introduced quite some time ago and was considered for removal.
> >>
> >> Tests show that this patch increases transmission speed from 220MB/s
> >> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> ---
> >> Changes:
> >> v8:
> >> * changed all (!pua) checks to return H_TOO_HARD as ioctl() is supposed
> >> to handle them
> >> * changed vmalloc_to_phys() callers to return H_HARDWARE
> >> * changed real mode iommu_tce_xchg_rm() callers to return H_TOO_HARD
> >> and added a comment about this in the code
> >> * changed virtual mode iommu_tce_xchg() callers to return H_HARDWARE
> >> and do WARN_ON
> >> * added WARN_ON_ONCE_RM(!rmap) in kvmppc_rm_h_put_tce_indirect() to
> >> have all vmalloc_to_phys() callsites covered
> >>
> >> v7:
> >> * added realmode-friendly WARN_ON_ONCE_RM
> >>
> >> v6:
> >> * changed handling of errors returned by kvmppc_(rm_)tce_iommu_(un)map()
> >> * moved kvmppc_gpa_to_ua() to TCE validation
> >>
> >> v5:
> >> * changed error codes in multiple places
> >> * added bunch of WARN_ON() in places which should not really happen
> >> * adde a check that an iommu table is not attached already to LIOBN
> >> * dropped explicit calls to iommu_tce_clear_param_check/
> >> iommu_tce_put_param_check as kvmppc_tce_validate/kvmppc_ioba_validate
> >> call them anyway (since the previous patch)
> >> * if we fail to update a hardware IOMMU table for unexpected reason,
> >> this just clears the entry
> >>
> >> v4:
> >> * added note to the commit log about allowing multiple updates of
> >> the same IOMMU table;
> >> * instead of checking whether any memory was preregistered, this
> >> returns H_TOO_HARD if a specific page was not;
> >> * fixed comments from v3 about error handling in many places;
> >> * simplified TCE handlers and merged IOMMU parts inline - for example,
> >> there used to be kvmppc_h_put_tce_iommu(), now it is merged into
> >> kvmppc_h_put_tce(); this allows checking IOBA boundaries against
> >> the first attached table only (makes the code simpler);
> >>
> >> v3:
> >> * simplified not to use VFIO group notifiers
> >> * reworked cleanup, should be cleaner/simpler now
> >>
> >> v2:
> >> * reworked to use new VFIO notifiers
> >> * now same iommu_table may appear in the list several times, to be fixed later
> >> ---
> >>  Documentation/virtual/kvm/devices/vfio.txt |  22 +-
> >>  arch/powerpc/include/asm/kvm_host.h        |   8 +
> >>  arch/powerpc/include/asm/kvm_ppc.h         |   4 +
> >>  include/uapi/linux/kvm.h                   |   8 +
> >>  arch/powerpc/kvm/book3s_64_vio.c           | 323 ++++++++++++++++++++++++++++-
> >>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 201 +++++++++++++++++-
> >>  arch/powerpc/kvm/powerpc.c                 |   2 +
> >>  virt/kvm/vfio.c                            |  60 ++++++
> >>  8 files changed, 623 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> >> index ef51740c67ca..f95d867168ea 100644
> >> --- a/Documentation/virtual/kvm/devices/vfio.txt
> >> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> >> @@ -16,7 +16,25 @@ Groups:
> >>  
> >>  KVM_DEV_VFIO_GROUP attributes:
> >>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> >> +	kvm_device_attr.addr points to an int32_t file descriptor
> >> +	for the VFIO group.
> >>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
> >> +	kvm_device_attr.addr points to an int32_t file descriptor
> >> +	for the VFIO group.
> >> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
> >> +	allocated by sPAPR KVM.
> >> +	kvm_device_attr.addr points to a struct:
> >>  
> >> -For each, kvm_device_attr.addr points to an int32_t file descriptor
> >> -for the VFIO group.
> >> +	struct kvm_vfio_spapr_tce {
> >> +		__u32	argsz;
> >> +		__u32	flags;
> >> +		__s32	groupfd;
> >> +		__s32	tablefd;
> >> +	};
> >> +
> >> +	where
> >> +	@argsz is the size of kvm_vfio_spapr_tce_liobn;  
> > 
> > kvm_vfio_spapr_tce_liobn does not exist.  s/_liobn//?  
> 
> Correct.
> 
> 
> >   
> >> +	@flags are not supported now, must be zero;  
> > 
> > We do this argsz/flags thing on vfio ioctls because ioctls are a bit
> > more of a restricted resource.  We don't want to burn through them so
> > we make them expandable.  I don't know that we have that restriction
> > here and the ADD/DEL support certainly doesn't include it.  Maybe this
> > isn't necessary?  
> 
> 
> It is not, but since I am going to have padding there, I thought I'd give it a
> name. Is that totally pointless, and is "u8 padding[4]" better?

Or since these are not ioctls, we simply consider them to have a very
specific and limited purpose.  If we have a need to expand on that,
define a new one that either builds on the existing ones, as
GROUP_SET_SPAPR_TCE builds on GROUP_ADD, or supersedes them.  For
ioctls, it's more costly if we abandon one; here it doesn't seem like a
big deal.  Thanks,

Alex
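
For reference, a minimal userspace sketch of driving the new attribute
through the standard KVM device-attr ioctl. This is illustrative only,
not code from the patch: it assumes the kvm-vfio device fd came from
KVM_CREATE_DEVICE, the group fd from VFIO, and the table fd from
KVM_CREATE_SPAPR_TCE_64, and it elides error handling.

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int set_spapr_tce(int vfio_dev_fd, int groupfd, int tablefd)
{
	/* The struct documented above; flags must be zero for now. */
	struct kvm_vfio_spapr_tce param = {
		.argsz = sizeof(param),
		.flags = 0,
		.groupfd = groupfd,
		.tablefd = tablefd,
	};
	struct kvm_device_attr attr = {
		.group = KVM_DEV_VFIO_GROUP,
		.attr = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE,
		.addr = (uintptr_t)&param,
	};

	return ioctl(vfio_dev_fd, KVM_SET_DEVICE_ATTR, &attr);
}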

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH kernel v8 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
  2017-03-15 16:39         ` Alex Williamson
@ 2017-03-15 23:39           ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 53+ messages in thread
From: Alexey Kardashevskiy @ 2017-03-15 23:39 UTC (permalink / raw)
  To: Alex Williamson; +Cc: linuxppc-dev, David Gibson, Paul Mackerras, kvm-ppc, kvm

On 16/03/17 03:39, Alex Williamson wrote:
> On Thu, 16 Mar 2017 00:21:07 +1100
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> On 15/03/17 08:05, Alex Williamson wrote:
>>> On Fri, 10 Mar 2017 14:53:37 +1100
>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>   
>>>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
>>>> and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
>>>> without passing them to user space which saves time on switching
>>>> to user space and back.
>>>>
>>>> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
>>>> KVM tries to handle a TCE request in real mode; if that fails,
>>>> it passes the request to virtual mode to complete the operation.
>>>> If the virtual mode handler fails, the request is passed to
>>>> user space; this is not expected to happen though.
>>>>
>>>> To avoid dealing with page use counters (which is tricky in real mode),
>>>> this only accelerates SPAPR TCE IOMMU v2 clients which are required
>>>> to pre-register the userspace memory. The very first TCE request will
>>>> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
>>>> of the TCE table (iommu_table::it_userspace) is not allocated till
>>>> the very first mapping happens and we cannot call vmalloc in real mode.
>>>>
>>>> If we fail to update a hardware IOMMU table for an unexpected reason, we just
>>>> clear it and move on as there is nothing really we can do about it -
>>>> for example, if we hot plug a VFIO device to a guest, existing TCE tables
>>>> will be mirrored automatically to the hardware and there is no interface
>>>> to report to the guest about possible failures.
>>>>
>>>> This adds a new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
>>>> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
>>>> and associates a physical IOMMU table with the SPAPR TCE table (which
>>>> is a guest view of the hardware IOMMU table). The iommu_table object
>>>> is cached and referenced so we do not have to look it up in real mode.
>>>>
>>>> This does not implement the UNSET counterpart as there is no use for it -
>>>> once the acceleration is enabled, the existing userspace won't
>>>> disable it unless a VFIO container is destroyed; this adds necessary
>>>> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
>>>>
>>>> As this creates a descriptor per IOMMU table-LIOBN pair (called
>>>> kvmppc_spapr_tce_iommu_table), it is possible to have several
>>>> descriptors with the same iommu_table (hardware IOMMU table) attached
>>>> to the same LIOBN; we do not remove duplicates though as
>>>> iommu_table_ops::exchange does not just update a TCE entry (which is
>>>> shared among IOMMU groups) but also invalidates the TCE cache
>>>> (one per IOMMU group).
>>>>
>>>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
>>>> space.
>>>>
>>>> This adds a real mode version of WARN_ON_ONCE() as the generic version
>>>> causes problems with rcu_sched. Since we are testing what vmalloc_to_phys()
>>>> returns in the code, this also adds a check for the already existing
>>>> vmalloc_to_phys() call in kvmppc_rm_h_put_tce_indirect().
>>>>
>>>> This finally makes use of vfio_external_user_iommu_id() which was
>>>> introduced quite some time ago and was considered for removal.
>>>>
>>>> Tests show that this patch increases transmission speed from 220MB/s
>>>> to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb ethernet card).
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> ---
>>>> Changes:
>>>> v8:
>>>> * changed all (!pua) checks to return H_TOO_HARD as ioctl() is supposed
>>>> to handle them
>>>> * changed vmalloc_to_phys() callers to return H_HARDWARE
>>>> * changed real mode iommu_tce_xchg_rm() callers to return H_TOO_HARD
>>>> and added a comment about this in the code
>>>> * changed virtual mode iommu_tce_xchg() callers to return H_HARDWARE
>>>> and do WARN_ON
>>>> * added WARN_ON_ONCE_RM(!rmap) in kvmppc_rm_h_put_tce_indirect() to
>>>> have all vmalloc_to_phys() callsites covered
>>>>
>>>> v7:
>>>> * added realmode-friendly WARN_ON_ONCE_RM
>>>>
>>>> v6:
>>>> * changed handling of errors returned by kvmppc_(rm_)tce_iommu_(un)map()
>>>> * moved kvmppc_gpa_to_ua() to TCE validation
>>>>
>>>> v5:
>>>> * changed error codes in multiple places
>>>> * added a bunch of WARN_ON() in places which should not really happen
>>>> * added a check that an iommu table is not already attached to a LIOBN
>>>> * dropped explicit calls to iommu_tce_clear_param_check/
>>>> iommu_tce_put_param_check as kvmppc_tce_validate/kvmppc_ioba_validate
>>>> call them anyway (since the previous patch)
>>>> * if we fail to update a hardware IOMMU table for an unexpected reason,
>>>> this just clears the entry
>>>>
>>>> v4:
>>>> * added note to the commit log about allowing multiple updates of
>>>> the same IOMMU table;
>>>> * instead of checking for if any memory was preregistered, this
>>>> returns H_TOO_HARD if a specific page was not;
>>>> * fixed comments from v3 about error handling in many places;
>>>> * simplified TCE handlers and merged IOMMU parts inline - for example,
>>>> there used to be kvmppc_h_put_tce_iommu(), now it is merged into
>>>> kvmppc_h_put_tce(); this allows checking IOBA boundaries against
>>>> the first attached table only (makes the code simpler);
>>>>
>>>> v3:
>>>> * simplified not to use VFIO group notifiers
>>>> * reworked cleanup, should be cleaner/simpler now
>>>>
>>>> v2:
>>>> * reworked to use new VFIO notifiers
>>>> * now same iommu_table may appear in the list several times, to be fixed later
>>>> ---
>>>>  Documentation/virtual/kvm/devices/vfio.txt |  22 +-
>>>>  arch/powerpc/include/asm/kvm_host.h        |   8 +
>>>>  arch/powerpc/include/asm/kvm_ppc.h         |   4 +
>>>>  include/uapi/linux/kvm.h                   |   8 +
>>>>  arch/powerpc/kvm/book3s_64_vio.c           | 323 ++++++++++++++++++++++++++++-
>>>>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 201 +++++++++++++++++-
>>>>  arch/powerpc/kvm/powerpc.c                 |   2 +
>>>>  virt/kvm/vfio.c                            |  60 ++++++
>>>>  8 files changed, 623 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
>>>> index ef51740c67ca..f95d867168ea 100644
>>>> --- a/Documentation/virtual/kvm/devices/vfio.txt
>>>> +++ b/Documentation/virtual/kvm/devices/vfio.txt
>>>> @@ -16,7 +16,25 @@ Groups:
>>>>  
>>>>  KVM_DEV_VFIO_GROUP attributes:
>>>>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
>>>> +	kvm_device_attr.addr points to an int32_t file descriptor
>>>> +	for the VFIO group.
>>>>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
>>>> +	kvm_device_attr.addr points to an int32_t file descriptor
>>>> +	for the VFIO group.
>>>> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
>>>> +	allocated by sPAPR KVM.
>>>> +	kvm_device_attr.addr points to a struct:
>>>>  
>>>> -For each, kvm_device_attr.addr points to an int32_t file descriptor
>>>> -for the VFIO group.
>>>> +	struct kvm_vfio_spapr_tce {
>>>> +		__u32	argsz;
>>>> +		__u32	flags;
>>>> +		__s32	groupfd;
>>>> +		__s32	tablefd;
>>>> +	};
>>>> +
>>>> +	where
>>>> +	@argsz is the size of kvm_vfio_spapr_tce_liobn;  
>>>
>>> kvm_vfio_spapr_tce_liobn does not exist.  s/_liobn//?  
>>
>> Correct.
>>
>>
>>>   
>>>> +	@flags are not supported now, must be zero;  
>>>
>>> We do this argsz/flags thing on vfio ioctls because ioctls are a bit
>>> more of a restricted resource.  We don't want to burn through them so
>>> we make them expandable.  I don't know that we have that restriction
>>> here and the ADD/DEL support certainly doesn't include it.  Maybe this
>>> isn't necessary?  
>>
>>
>> It is not, but since I am going to have padding there, I thought I'd give it a
>> name. Is that totally pointless, and is "u8 padding[4]" better?
> 
> Or since these are not ioctls, we simply consider them to have a very
> specific and limited purpose.  If we have a need to expand on that,
> define a new one that either builds on the existing ones, as
> GROUP_SET_SPAPR_TCE builds on GROUP_ADD, or supersedes them.  For
> ioctls, it's more costly if we abandon one; here it doesn't seem like a
> big deal.  Thanks,

Did you just suggest ditching the extra 4 bytes now? Sorry, I am not
following you here. I am just saying that having 3*u32 while the actual
size is 4*u32 is not good for a binary interface.
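
To illustrate the layout point with hypothetical struct names (these are
not the uapi definitions; a typical ABI where u32 needs 4-byte alignment
is assumed):

#include <assert.h>
#include <stdint.h>

/* 3*u32: 12 bytes, no named spare slot for future use. */
struct tce_attr_short {
	uint32_t argsz;
	int32_t groupfd;
	int32_t tablefd;
};

/* 4*u32: the flags word pins the size at 16 bytes and reserves an
 * explicit, named slot, so the binary layout cannot drift later. */
struct tce_attr_full {
	uint32_t argsz;
	uint32_t flags;
	int32_t groupfd;
	int32_t tablefd;
};

static_assert(sizeof(struct tce_attr_short) == 12, "nothing spare");
static_assert(sizeof(struct tce_attr_full) == 16, "explicit 16-byte ABI");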

Also, have you not noticed my other comment about ditching the group
reference, or did you decide to leave it for David to comment? :) Thanks.



-- 
Alexey

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH kernel v8 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
  2017-03-15 16:18         ` Alex Williamson
@ 2017-03-16  3:42           ` David Gibson
  -1 siblings, 0 replies; 53+ messages in thread
From: David Gibson @ 2017-03-16  3:42 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, linuxppc-dev, Paul Mackerras, kvm-ppc, kvm

On Wed, Mar 15, 2017 at 10:18:18AM -0600, Alex Williamson wrote:
> On Wed, 15 Mar 2017 15:40:14 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> > > > index e96a4590464c..be18cda01e1b 100644
> > > > --- a/arch/powerpc/kvm/book3s_64_vio.c
> > > > +++ b/arch/powerpc/kvm/book3s_64_vio.c
> > > > @@ -28,6 +28,10 @@
> > > >  #include <linux/hugetlb.h>
> > > >  #include <linux/list.h>
> > > >  #include <linux/anon_inodes.h>
> > > > +#include <linux/iommu.h>
> > > > +#include <linux/file.h>
> > > > +#include <linux/vfio.h>
> > > > +#include <linux/module.h>
> > > >  
> > > >  #include <asm/tlbflush.h>
> > > >  #include <asm/kvm_ppc.h>
> > > > @@ -40,6 +44,36 @@
> > > >  #include <asm/udbg.h>
> > > >  #include <asm/iommu.h>
> > > >  #include <asm/tce.h>
> > > > +#include <asm/mmu_context.h>
> > > > +
> > > > +static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
> > > > +{
> > > > +	void (*fn)(struct vfio_group *);
> > > > +
> > > > +	fn = symbol_get(vfio_group_put_external_user);
> > > > +	if (WARN_ON(!fn))
> > > > +		return;
> > > > +
> > > > +	fn(vfio_group);
> > > > +
> > > > +	symbol_put(vfio_group_put_external_user);
> > > > +}
> > > > +
> > > > +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> > > > +{
> > > > +	int (*fn)(struct vfio_group *);
> > > > +	int ret = -1;
> > > > +
> > > > +	fn = symbol_get(vfio_external_user_iommu_id);
> > > > +	if (!fn)
> > > > +		return ret;
> > > > +
> > > > +	ret = fn(vfio_group);
> > > > +
> > > > +	symbol_put(vfio_external_user_iommu_id);
> > > > +
> > > > +	return ret;
> > > > +}  
> > > 
> > > 
> > > Ugh.  This feels so wrong.  Why can't you have kvm-vfio pass the
> > > iommu_group?  Why do you need to hold this additional vfio_group
> > > reference?  
> > 
> > Keeping the vfio_group reference makes sense to me, since we don't
> > want the vfio context for the group to go away while it's attached to
> > the LIOBN.
> 
> But there's already a reference for that, it's taken by
> KVM_DEV_VFIO_GROUP_ADD and held until KVM_DEV_VFIO_GROUP_DEL.  Both the
> DEL path and the cleanup path call kvm_spapr_tce_release_iommu_group()
> before releasing that reference, so it seems entirely redundant.

Oh, good point.  And we already verify that the group has been ADDed
before setting the LIOBN association.

> > However, going via the iommu_id rather than just having an interface
> > to directly grab the iommu group from the vfio_group seems bizarre to
> > me.  I'm ok with cleaning that up later, however.
> 
> We have kvm_spapr_tce_attach_iommu_group() and
> kvm_spapr_tce_release_iommu_group(), but both take a vfio_group, not an
> iommu_group as a parameter.  I don't particularly have a problem with
> the vfio_group -> iommu ID -> iommu_group, but if we drop the extra
> vfio_group reference and pass the iommu_group itself to these functions
> then we can keep all the symbol reference stuff in the kvm-vfio glue
> layer.  Thanks,

Makes sense.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
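
A sketch of the glue-layer helper that suggestion implies, kept in
virt/kvm/vfio.c. It reuses kvm_vfio_external_user_iommu_id() from the
patch above and assumes iommu_group_get_by_id() from the IOMMU core;
names and placement are approximate, not the final code.

#include <linux/iommu.h>	/* iommu_group_get_by_id() */
#include <linux/vfio.h>

static struct iommu_group *kvm_vfio_group_get_iommu_group(
		struct vfio_group *group)
{
	int group_id = kvm_vfio_external_user_iommu_id(group);

	if (group_id < 0)
		return NULL;

	/* Takes a reference on the iommu_group; drop it with
	 * iommu_group_put() when the LIOBN association goes away. */
	return iommu_group_get_by_id(group_id);
}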

^ permalink raw reply	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2017-03-16  3:53 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-10  3:53 [PATCH kernel v8 00/10] powerpc/kvm/vfio: Enable in-kernel acceleration Alexey Kardashevskiy
2017-03-10  3:53 ` [PATCH kernel v8 01/10] powerpc/mmu: Add real mode support for IOMMU preregistered memory Alexey Kardashevskiy
2017-03-10  3:53 ` [PATCH kernel v8 02/10] powerpc/powernv/iommu: Add real mode version of iommu_table_ops::exchange() Alexey Kardashevskiy
2017-03-10  3:53 ` [PATCH kernel v8 03/10] powerpc/iommu/vfio_spapr_tce: Cleanup iommu_table disposal Alexey Kardashevskiy
2017-03-14 18:21   ` Alex Williamson
2017-03-10  3:53 ` [PATCH kernel v8 04/10] powerpc/vfio_spapr_tce: Add reference counting to iommu_table Alexey Kardashevskiy
2017-03-14 19:58   ` Alex Williamson
2017-03-10  3:53 ` [PATCH kernel v8 05/10] KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number Alexey Kardashevskiy
2017-03-10  3:53 ` [PATCH kernel v8 06/10] KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently Alexey Kardashevskiy
2017-03-10  3:53 ` [PATCH kernel v8 07/10] KVM: PPC: Pass kvm* to kvmppc_find_table() Alexey Kardashevskiy
2017-03-10  3:53 ` [PATCH kernel v8 08/10] KVM: PPC: Use preregistered memory API to access TCE list Alexey Kardashevskiy
2017-03-10  3:53 ` [PATCH kernel v8 09/10] KVM: PPC: iommu: Unify TCE checking Alexey Kardashevskiy
2017-03-10  3:53 ` [PATCH kernel v8 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO Alexey Kardashevskiy
2017-03-10  4:47   ` David Gibson
2017-03-14 21:05   ` Alex Williamson
2017-03-15  4:40     ` David Gibson
2017-03-15 16:18       ` Alex Williamson
2017-03-16  3:42         ` David Gibson
2017-03-15 13:21     ` Alexey Kardashevskiy
2017-03-15 16:39       ` Alex Williamson
2017-03-15 23:39         ` Alexey Kardashevskiy
2017-03-10  4:48 ` [PATCH kernel v8 00/10] powerpc/kvm/vfio: Enable in-kernel acceleration David Gibson
2017-03-14  0:54   ` Alexey Kardashevskiy
2017-03-14  0:55     ` David Gibson
2017-03-14 17:59       ` Alex Williamson
