* [PATCH kernel v4 00/10] powerpc/kvm/vfio: Enable in-kernel acceleration
@ 2017-02-07  7:17 ` Alexey Kardashevskiy
  0 siblings, 0 replies; 49+ messages in thread
From: Alexey Kardashevskiy @ 2017-02-07  7:17 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kvm, Alexey Kardashevskiy, kvm-ppc, Alex Williamson,
	Paul Mackerras, David Gibson

This is my current queue of patches to add acceleration of TCE
updates in KVM.

This is based on commit 283725af0bd2, which was the tip of Linus'
master tree 6 days ago.

Please comment. Thanks.

Changes:
v4:
* addressed comments from v3
* updated subject lines with correct component names
* regrouped the patchset in order:
   - powerpc fixes;
   - vfio_spapr_tce driver fixes;
   - KVM/PPC fixes;
   - KVM+PPC+VFIO;
* every patch except the last two carries "Reviewed-by: David"

v3:
* there was no full repost, only the last patch was posted

v2:
* 11/11 reworked to use new notifiers; it is rather an RFC as it still has
an issue;
* got 09/11 and 10/11 to use notifiers in 11/11;
* added Reviewed-by: David to most of the patches and added a comment in 05/11.

Alexey Kardashevskiy (10):
  powerpc/mmu: Add real mode support for IOMMU preregistered memory
  powerpc/powernv/iommu: Add real mode version of
    iommu_table_ops::exchange()
  powerpc/iommu/vfio_spapr_tce: Cleanup iommu_table disposal
  powerpc/vfio_spapr_tce: Add reference counting to iommu_table
  KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number
  KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently
  KVM: PPC: Pass kvm* to kvmppc_find_table()
  KVM: PPC: Separate TCE validation from update
  KVM: PPC: Use preregistered memory API to access TCE list
  KVM: PPC: VFIO: Add in-kernel acceleration for VFIO

 Documentation/virtual/kvm/devices/vfio.txt |  22 +-
 arch/powerpc/include/asm/iommu.h           |  12 +-
 arch/powerpc/include/asm/kvm_host.h        |   8 +
 arch/powerpc/include/asm/kvm_ppc.h         |   6 +-
 arch/powerpc/include/asm/mmu_context.h     |   4 +
 include/uapi/linux/kvm.h                   |   9 +
 arch/powerpc/kernel/iommu.c                |  49 ++++-
 arch/powerpc/kvm/book3s_64_vio.c           | 334 ++++++++++++++++++++++++++++-
 arch/powerpc/kvm/book3s_64_vio_hv.c        | 249 ++++++++++++++++++---
 arch/powerpc/kvm/powerpc.c                 |   2 +
 arch/powerpc/mm/mmu_context_iommu.c        |  39 ++++
 arch/powerpc/platforms/powernv/pci-ioda.c  |  42 +++-
 arch/powerpc/platforms/powernv/pci.c       |   1 +
 arch/powerpc/platforms/pseries/iommu.c     |   3 +-
 arch/powerpc/platforms/pseries/vio.c       |   2 +-
 drivers/vfio/vfio_iommu_spapr_tce.c        |   2 +-
 virt/kvm/vfio.c                            |  60 ++++++
 arch/powerpc/kvm/Kconfig                   |   1 +
 18 files changed, 794 insertions(+), 51 deletions(-)

-- 
2.11.0

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH kernel v4 01/10] powerpc/mmu: Add real mode support for IOMMU preregistered memory
  2017-02-07  7:17 ` Alexey Kardashevskiy
@ 2017-02-07  7:17   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 49+ messages in thread
From: Alexey Kardashevskiy @ 2017-02-07  7:17 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

This makes it possible to look up preregistered IOMMU memory in real
mode by replacing list_for_each_entry_rcu() (which can invoke debugging
code that may fail in real mode) with list_for_each_entry_lockless().

This adds a real mode version of mm_iommu_ua_to_hpa() which performs
an explicit vmalloc'd-to-linear address conversion.
Unlike mm_iommu_ua_to_hpa(), mm_iommu_ua_to_hpa_rm() can fail.

This changes mm_iommu_preregistered() to receive @mm as in real mode
@current does not always hold a valid pointer.

This adds a real mode version of mm_iommu_lookup() which receives @mm
(for the same reason as mm_iommu_preregistered()) and uses the lockless
version of list_for_each_entry_rcu().
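
For illustration only (not part of this patch), the two new helpers are
expected to be used together from real mode roughly as in the sketch
below; the wrapper name and the error handling are made-up placeholders,
the real caller appears later in this series:

static long example_rm_ua_to_hpa(struct mm_struct *mm,
		unsigned long ua, unsigned long *hpa)
{
	struct mm_iommu_table_group_mem_t *mem;

	/* lockless walk over the preregistered chunks covering @ua */
	mem = mm_iommu_lookup_rm(mm, ua, PAGE_SIZE);
	if (!mem)
		return -ENXIO;

	/*
	 * May fail, e.g. when the page backing the HPA array cannot be
	 * translated; a real caller would fall back to virtual mode then.
	 */
	return mm_iommu_ua_to_hpa_rm(mem, ua, hpa);
}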

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 arch/powerpc/include/asm/mmu_context.h |  4 ++++
 arch/powerpc/mm/mmu_context_iommu.c    | 39 ++++++++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index b9e3f0aca261..c70c8272523d 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -29,10 +29,14 @@ extern void mm_iommu_init(struct mm_struct *mm);
 extern void mm_iommu_cleanup(struct mm_struct *mm);
 extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
 		unsigned long ua, unsigned long size);
+extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(
+		struct mm_struct *mm, unsigned long ua, unsigned long size);
 extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
 		unsigned long ua, unsigned long entries);
 extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
 		unsigned long ua, unsigned long *hpa);
+extern long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
+		unsigned long ua, unsigned long *hpa);
 extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem);
 extern void mm_iommu_mapped_dec(struct mm_iommu_table_group_mem_t *mem);
 #endif
diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index 104bad029ce9..631d32f5937b 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -314,6 +314,25 @@ struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
 }
 EXPORT_SYMBOL_GPL(mm_iommu_lookup);
 
+struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(struct mm_struct *mm,
+		unsigned long ua, unsigned long size)
+{
+	struct mm_iommu_table_group_mem_t *mem, *ret = NULL;
+
+	list_for_each_entry_lockless(mem, &mm->context.iommu_group_mem_list,
+			next) {
+		if ((mem->ua <= ua) &&
+				(ua + size <= mem->ua +
+				 (mem->entries << PAGE_SHIFT))) {
+			ret = mem;
+			break;
+		}
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(mm_iommu_lookup_rm);
+
 struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
 		unsigned long ua, unsigned long entries)
 {
@@ -345,6 +364,26 @@ long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
 }
 EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa);
 
+long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
+		unsigned long ua, unsigned long *hpa)
+{
+	const long entry = (ua - mem->ua) >> PAGE_SHIFT;
+	void *va = &mem->hpas[entry];
+	unsigned long *pa;
+
+	if (entry >= mem->entries)
+		return -EFAULT;
+
+	pa = (void *) vmalloc_to_phys(va);
+	if (!pa)
+		return -EFAULT;
+
+	*hpa = *pa | (ua & ~PAGE_MASK);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa_rm);
+
 long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem)
 {
 	if (atomic64_inc_not_zero(&mem->mapped))
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH kernel v4 02/10] powerpc/powernv/iommu: Add real mode version of iommu_table_ops::exchange()
  2017-02-07  7:17 ` Alexey Kardashevskiy
@ 2017-02-07  7:17   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 49+ messages in thread
From: Alexey Kardashevskiy @ 2017-02-07  7:17 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

In real mode, TCE tables are invalidated using special
cache-inhibited store instructions which are not available in
virtual mode.

This defines and implements an exchange_rm() callback. It does not
define set_rm/clear_rm/flush_rm callbacks as there are no users for
those; exchange/exchange_rm are only to be used by KVM for VFIO.

The exchange_rm callback is defined for the IODA1/IODA2 powernv platforms.

This replaces list_for_each_entry_rcu() with its lockless version as
from now on pnv_pci_ioda2_tce_invalidate() can be called in
real mode too.
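
As a rough sketch of how the new helper is meant to be called from
a real mode H_PUT_TCE path later in this series (the function name is
made up and the error handling is simplified):

static long example_rm_put_tce(struct iommu_table *tbl, unsigned long entry,
		unsigned long hpa, enum dma_data_direction dir)
{
	long ret;

	/* exchange_rm() installs the new TCE, returns the old HPA/direction */
	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
	if (ret)
		return H_TOO_HARD;	/* let the virtual mode handler retry */

	return H_SUCCESS;
}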

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 arch/powerpc/include/asm/iommu.h          |  7 +++++++
 arch/powerpc/kernel/iommu.c               | 23 +++++++++++++++++++++++
 arch/powerpc/platforms/powernv/pci-ioda.c | 26 +++++++++++++++++++++++++-
 3 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 2c1d50792944..4554699aec02 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -64,6 +64,11 @@ struct iommu_table_ops {
 			long index,
 			unsigned long *hpa,
 			enum dma_data_direction *direction);
+	/* Real mode */
+	int (*exchange_rm)(struct iommu_table *tbl,
+			long index,
+			unsigned long *hpa,
+			enum dma_data_direction *direction);
 #endif
 	void (*clear)(struct iommu_table *tbl,
 			long index, long npages);
@@ -208,6 +213,8 @@ extern void iommu_del_device(struct device *dev);
 extern int __init tce_iommu_bus_notifier_init(void);
 extern long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
 		unsigned long *hpa, enum dma_data_direction *direction);
+extern long iommu_tce_xchg_rm(struct iommu_table *tbl, unsigned long entry,
+		unsigned long *hpa, enum dma_data_direction *direction);
 #else
 static inline void iommu_register_group(struct iommu_table_group *table_group,
 					int pci_domain_number,
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 5f202a566ec5..9bace5df05d5 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1004,6 +1004,29 @@ long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
 }
 EXPORT_SYMBOL_GPL(iommu_tce_xchg);
 
+long iommu_tce_xchg_rm(struct iommu_table *tbl, unsigned long entry,
+		unsigned long *hpa, enum dma_data_direction *direction)
+{
+	long ret;
+
+	ret = tbl->it_ops->exchange_rm(tbl, entry, hpa, direction);
+
+	if (!ret && ((*direction == DMA_FROM_DEVICE) ||
+			(*direction == DMA_BIDIRECTIONAL))) {
+		struct page *pg = realmode_pfn_to_page(*hpa >> PAGE_SHIFT);
+
+		if (likely(pg)) {
+			SetPageDirty(pg);
+		} else {
+			tbl->it_ops->exchange_rm(tbl, entry, hpa, direction);
+			ret = -EFAULT;
+		}
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_tce_xchg_rm);
+
 int iommu_take_ownership(struct iommu_table *tbl)
 {
 	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 5fcae29107e1..1c2826fa812e 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1856,6 +1856,17 @@ static int pnv_ioda1_tce_xchg(struct iommu_table *tbl, long index,
 
 	return ret;
 }
+
+static int pnv_ioda1_tce_xchg_rm(struct iommu_table *tbl, long index,
+		unsigned long *hpa, enum dma_data_direction *direction)
+{
+	long ret = pnv_tce_xchg(tbl, index, hpa, direction);
+
+	if (!ret)
+		pnv_pci_p7ioc_tce_invalidate(tbl, index, 1, true);
+
+	return ret;
+}
 #endif
 
 static void pnv_ioda1_tce_free(struct iommu_table *tbl, long index,
@@ -1870,6 +1881,7 @@ static struct iommu_table_ops pnv_ioda1_iommu_ops = {
 	.set = pnv_ioda1_tce_build,
 #ifdef CONFIG_IOMMU_API
 	.exchange = pnv_ioda1_tce_xchg,
+	.exchange_rm = pnv_ioda1_tce_xchg_rm,
 #endif
 	.clear = pnv_ioda1_tce_free,
 	.get = pnv_tce_get,
@@ -1944,7 +1956,7 @@ static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
 {
 	struct iommu_table_group_link *tgl;
 
-	list_for_each_entry_rcu(tgl, &tbl->it_group_list, next) {
+	list_for_each_entry_lockless(tgl, &tbl->it_group_list, next) {
 		struct pnv_ioda_pe *pe = container_of(tgl->table_group,
 				struct pnv_ioda_pe, table_group);
 		struct pnv_phb *phb = pe->phb;
@@ -2000,6 +2012,17 @@ static int pnv_ioda2_tce_xchg(struct iommu_table *tbl, long index,
 
 	return ret;
 }
+
+static int pnv_ioda2_tce_xchg_rm(struct iommu_table *tbl, long index,
+		unsigned long *hpa, enum dma_data_direction *direction)
+{
+	long ret = pnv_tce_xchg(tbl, index, hpa, direction);
+
+	if (!ret)
+		pnv_pci_ioda2_tce_invalidate(tbl, index, 1, true);
+
+	return ret;
+}
 #endif
 
 static void pnv_ioda2_tce_free(struct iommu_table *tbl, long index,
@@ -2020,6 +2043,7 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = {
 	.set = pnv_ioda2_tce_build,
 #ifdef CONFIG_IOMMU_API
 	.exchange = pnv_ioda2_tce_xchg,
+	.exchange_rm = pnv_ioda2_tce_xchg_rm,
 #endif
 	.clear = pnv_ioda2_tce_free,
 	.get = pnv_tce_get,
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH kernel v4 03/10] powerpc/iommu/vfio_spapr_tce: Cleanup iommu_table disposal
  2017-02-07  7:17 ` Alexey Kardashevskiy
@ 2017-02-07  7:17   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 49+ messages in thread
From: Alexey Kardashevskiy @ 2017-02-07  7:17 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

At the moment an iommu_table can be disposed of either by calling
iommu_free_table() directly or via it_ops::free(); the only implementation
of free() is in IODA2 - pnv_ioda2_table_free() - and it calls
iommu_free_table() anyway.

As we are going to have reference counting on tables, we need a unified
way of disposing of tables.

This moves the it_ops::free() call into iommu_free_table() and makes use
of the latter everywhere. The free() callback now handles only
platform-specific data.

This should cause no behavioral change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 arch/powerpc/kernel/iommu.c               | 4 ++++
 arch/powerpc/platforms/powernv/pci-ioda.c | 6 ++----
 drivers/vfio/vfio_iommu_spapr_tce.c       | 2 +-
 3 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 9bace5df05d5..bc142d87130f 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -719,6 +719,9 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 	if (!tbl)
 		return;
 
+	if (tbl->it_ops->free)
+		tbl->it_ops->free(tbl);
+
 	if (!tbl->it_map) {
 		kfree(tbl);
 		return;
@@ -745,6 +748,7 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 	/* free table */
 	kfree(tbl);
 }
+EXPORT_SYMBOL_GPL(iommu_free_table);
 
 /* Creates TCEs for a user provided buffer.  The user buffer must be
  * contiguous real kernel storage (not vmalloc).  The address passed here
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 1c2826fa812e..cf0d8aed652a 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1422,7 +1422,6 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe
 		iommu_group_put(pe->table_group.group);
 		BUG_ON(pe->table_group.group);
 	}
-	pnv_pci_ioda2_table_free_pages(tbl);
 	iommu_free_table(tbl, of_node_full_name(dev->dev.of_node));
 }
 
@@ -2036,7 +2035,6 @@ static void pnv_ioda2_tce_free(struct iommu_table *tbl, long index,
 static void pnv_ioda2_table_free(struct iommu_table *tbl)
 {
 	pnv_pci_ioda2_table_free_pages(tbl);
-	iommu_free_table(tbl, "pnv");
 }
 
 static struct iommu_table_ops pnv_ioda2_iommu_ops = {
@@ -2363,7 +2361,7 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
 	if (rc) {
 		pe_err(pe, "Failed to configure 32-bit TCE table, err %ld\n",
 				rc);
-		pnv_ioda2_table_free(tbl);
+		iommu_free_table(tbl, "");
 		return rc;
 	}
 
@@ -2449,7 +2447,7 @@ static void pnv_ioda2_take_ownership(struct iommu_table_group *table_group)
 
 	pnv_pci_ioda2_set_bypass(pe, false);
 	pnv_pci_ioda2_unset_window(&pe->table_group, 0);
-	pnv_ioda2_table_free(tbl);
+	iommu_free_table(tbl, "pnv");
 }
 
 static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group)
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 14c62a8495a2..6e79be9dfb43 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -677,7 +677,7 @@ static void tce_iommu_free_table(struct tce_container *container,
 	unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT;
 
 	tce_iommu_userspace_view_free(tbl, container->mm);
-	tbl->it_ops->free(tbl);
+	iommu_free_table(tbl, "");
 	decrement_locked_vm(container->mm, pages);
 }
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH kernel v4 04/10] powerpc/vfio_spapr_tce: Add reference counting to iommu_table
  2017-02-07  7:17 ` Alexey Kardashevskiy
@ 2017-02-07  7:17   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 49+ messages in thread
From: Alexey Kardashevskiy @ 2017-02-07  7:17 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

So far iommu_table objects were only used in virtual mode and had
a single owner. We are going to change this by implementing in-kernel
acceleration of DMA mapping requests. The proposed acceleration
will handle requests in real mode and KVM will keep references to tables.

This adds a kref to iommu_table and defines new helpers to update it.
External iommu_free_table() calls are replaced with iommu_table_put(),
and the free routine becomes a static kref release callback.
iommu_table_get() is not used in this patch but it will be in the
following patch.

Since this touches prototypes, this also removes the @node_name parameter
as it has never been really useful on powernv, and carrying it through
the pseries platform code just to pass it to iommu_free_table() seems
quite useless as well.

This should cause no behavioral change.
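
A minimal sketch of the intended get/put pairing, assuming a KVM-style
consumer as described above (the function names are illustrative only):

static void example_consumer_attach(struct iommu_table *tbl)
{
	/* a new user of the table (e.g. KVM) takes a reference */
	iommu_table_get(tbl);
}

static void example_consumer_detach(struct iommu_table *tbl)
{
	/*
	 * Dropping the last reference frees the table via the kref
	 * release callback (the former iommu_free_table()).
	 */
	iommu_table_put(tbl);
}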

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 arch/powerpc/include/asm/iommu.h          |  5 +++--
 arch/powerpc/kernel/iommu.c               | 24 +++++++++++++++++++-----
 arch/powerpc/platforms/powernv/pci-ioda.c | 14 +++++++-------
 arch/powerpc/platforms/powernv/pci.c      |  1 +
 arch/powerpc/platforms/pseries/iommu.c    |  3 ++-
 arch/powerpc/platforms/pseries/vio.c      |  2 +-
 drivers/vfio/vfio_iommu_spapr_tce.c       |  2 +-
 7 files changed, 34 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 4554699aec02..82e77ebf85f4 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -119,6 +119,7 @@ struct iommu_table {
 	struct list_head it_group_list;/* List of iommu_table_group_link */
 	unsigned long *it_userspace; /* userspace view of the table */
 	struct iommu_table_ops *it_ops;
+	struct kref    it_kref;
 };
 
 #define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \
@@ -151,8 +152,8 @@ static inline void *get_iommu_table_base(struct device *dev)
 
 extern int dma_iommu_dma_supported(struct device *dev, u64 mask);
 
-/* Frees table for an individual device node */
-extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
+extern void iommu_table_get(struct iommu_table *tbl);
+extern void iommu_table_put(struct iommu_table *tbl);
 
 /* Initializes an iommu_table based in values set in the passed-in
  * structure
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index bc142d87130f..d02b8d22fb50 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -711,13 +711,13 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
 	return tbl;
 }
 
-void iommu_free_table(struct iommu_table *tbl, const char *node_name)
+static void iommu_table_free(struct kref *kref)
 {
 	unsigned long bitmap_sz;
 	unsigned int order;
+	struct iommu_table *tbl;
 
-	if (!tbl)
-		return;
+	tbl = container_of(kref, struct iommu_table, it_kref);
 
 	if (tbl->it_ops->free)
 		tbl->it_ops->free(tbl);
@@ -736,7 +736,7 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 
 	/* verify that table contains no entries */
 	if (!bitmap_empty(tbl->it_map, tbl->it_size))
-		pr_warn("%s: Unexpected TCEs for %s\n", __func__, node_name);
+		pr_warn("%s: Unexpected TCEs\n", __func__);
 
 	/* calculate bitmap size in bytes */
 	bitmap_sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
@@ -748,7 +748,21 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 	/* free table */
 	kfree(tbl);
 }
-EXPORT_SYMBOL_GPL(iommu_free_table);
+
+void iommu_table_get(struct iommu_table *tbl)
+{
+	kref_get(&tbl->it_kref);
+}
+EXPORT_SYMBOL_GPL(iommu_table_get);
+
+void iommu_table_put(struct iommu_table *tbl)
+{
+	if (!tbl)
+		return;
+
+	kref_put(&tbl->it_kref, iommu_table_free);
+}
+EXPORT_SYMBOL_GPL(iommu_table_put);
 
 /* Creates TCEs for a user provided buffer.  The user buffer must be
  * contiguous real kernel storage (not vmalloc).  The address passed here
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index cf0d8aed652a..f2c2ab8fbb3e 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1422,7 +1422,7 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe
 		iommu_group_put(pe->table_group.group);
 		BUG_ON(pe->table_group.group);
 	}
-	iommu_free_table(tbl, of_node_full_name(dev->dev.of_node));
+	iommu_table_put(tbl);
 }
 
 static void pnv_ioda_release_vf_PE(struct pci_dev *pdev)
@@ -2221,7 +2221,7 @@ static void pnv_pci_ioda1_setup_dma_pe(struct pnv_phb *phb,
 		__free_pages(tce_mem, get_order(tce32_segsz * segs));
 	if (tbl) {
 		pnv_pci_unlink_table_and_group(tbl, &pe->table_group);
-		iommu_free_table(tbl, "pnv");
+		iommu_table_put(tbl);
 	}
 }
 
@@ -2315,7 +2315,7 @@ static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
 			bus_offset, page_shift, window_size,
 			levels, tbl);
 	if (ret) {
-		iommu_free_table(tbl, "pnv");
+		iommu_table_put(tbl);
 		return ret;
 	}
 
@@ -2361,7 +2361,7 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
 	if (rc) {
 		pe_err(pe, "Failed to configure 32-bit TCE table, err %ld\n",
 				rc);
-		iommu_free_table(tbl, "");
+		iommu_table_put(tbl);
 		return rc;
 	}
 
@@ -2447,7 +2447,7 @@ static void pnv_ioda2_take_ownership(struct iommu_table_group *table_group)
 
 	pnv_pci_ioda2_set_bypass(pe, false);
 	pnv_pci_ioda2_unset_window(&pe->table_group, 0);
-	iommu_free_table(tbl, "pnv");
+	iommu_table_put(tbl);
 }
 
 static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group)
@@ -3417,7 +3417,7 @@ static void pnv_pci_ioda1_release_pe_dma(struct pnv_ioda_pe *pe)
 	}
 
 	free_pages(tbl->it_base, get_order(tbl->it_size << 3));
-	iommu_free_table(tbl, "pnv");
+	iommu_table_put(tbl);
 }
 
 static void pnv_pci_ioda2_release_pe_dma(struct pnv_ioda_pe *pe)
@@ -3444,7 +3444,7 @@ static void pnv_pci_ioda2_release_pe_dma(struct pnv_ioda_pe *pe)
 	}
 
 	pnv_pci_ioda2_table_free_pages(tbl);
-	iommu_free_table(tbl, "pnv");
+	iommu_table_put(tbl);
 }
 
 static void pnv_ioda_free_pe_seg(struct pnv_ioda_pe *pe,
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index c6d554fe585c..471210913e42 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -767,6 +767,7 @@ struct iommu_table *pnv_pci_table_alloc(int nid)
 
 	tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, nid);
 	INIT_LIST_HEAD_RCU(&tbl->it_group_list);
+	kref_init(&tbl->it_kref);
 
 	return tbl;
 }
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index dc2577fc5fbb..47f0501a94f9 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -74,6 +74,7 @@ static struct iommu_table_group *iommu_pseries_alloc_group(int node)
 		goto fail_exit;
 
 	INIT_LIST_HEAD_RCU(&tbl->it_group_list);
+	kref_init(&tbl->it_kref);
 	tgl->table_group = table_group;
 	list_add_rcu(&tgl->next, &tbl->it_group_list);
 
@@ -115,7 +116,7 @@ static void iommu_pseries_free_group(struct iommu_table_group *table_group,
 		BUG_ON(table_group->group);
 	}
 #endif
-	iommu_free_table(tbl, node_name);
+	iommu_table_put(tbl);
 
 	kfree(table_group);
 }
diff --git a/arch/powerpc/platforms/pseries/vio.c b/arch/powerpc/platforms/pseries/vio.c
index 2c8fb3ec989e..41e8aa5c0d6a 100644
--- a/arch/powerpc/platforms/pseries/vio.c
+++ b/arch/powerpc/platforms/pseries/vio.c
@@ -1318,7 +1318,7 @@ static void vio_dev_release(struct device *dev)
 	struct iommu_table *tbl = get_iommu_table_base(dev);
 
 	if (tbl)
-		iommu_free_table(tbl, of_node_full_name(dev->of_node));
+		iommu_table_put(tbl);
 	of_node_put(dev->of_node);
 	kfree(to_vio_dev(dev));
 }
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 6e79be9dfb43..b29ccb54aa7a 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -677,7 +677,7 @@ static void tce_iommu_free_table(struct tce_container *container,
 	unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT;
 
 	tce_iommu_userspace_view_free(tbl, container->mm);
-	iommu_free_table(tbl, "");
+	iommu_table_put(tbl);
 	decrement_locked_vm(container->mm, pages);
 }
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH kernel v4 05/10] KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number
  2017-02-07  7:17 ` Alexey Kardashevskiy
@ 2017-02-07  7:17   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 49+ messages in thread
From: Alexey Kardashevskiy @ 2017-02-07  7:17 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

This adds a capability number for in-kernel support for VFIO on
the SPAPR platform.

The capability tells user space whether the in-kernel H_PUT_TCE handlers
can handle VFIO-targeted requests or not. If they cannot, user space must
not create a TCE table in the host kernel via the KVM_CREATE_SPAPR_TCE
ioctl, because TCE requests targeting such a table would not be passed to
user space, and handling them in user space is the desired behaviour in
that situation.
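
A hedged user space sketch of how the capability is expected to be
probed via the standard KVM_CHECK_EXTENSION ioctl (the helper name is
made up):

#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Returns non-zero if the in-kernel H_PUT_TCE handlers can serve
 * VFIO-targeted requests; if zero, do not create an in-kernel TCE
 * table for VFIO-backed LIOBNs and keep handling TCEs in user space.
 */
static int spapr_tce_vfio_supported(int kvm_fd)
{
	return ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_SPAPR_TCE_VFIO) > 0;
}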

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 include/uapi/linux/kvm.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index cac48eda1075..a2c9bb5a0ead 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -871,6 +871,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_S390_USER_INSTR0 130
 #define KVM_CAP_MSI_DEVID 131
 #define KVM_CAP_PPC_HTM 132
+#define KVM_CAP_SPAPR_TCE_VFIO 133
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH kernel v4 06/10] KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently
  2017-02-07  7:17 ` Alexey Kardashevskiy
@ 2017-02-07  7:17   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 49+ messages in thread
From: Alexey Kardashevskiy @ 2017-02-07  7:17 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

It does not make much sense to have KVM on book3s-64 without the IOMMU
bits for PCI passthrough support, as they cost little and allow VFIO to
function on book3s KVM.

Having IOMMU_API always enabled makes it unnecessary to have a lot of
"#ifdef IOMMU_API" in arch/powerpc/kvm/book3s_64_vio*. With those
ifdefs only user space emulated devices could be accelerated
(but not VFIO), which does not seem to be very useful.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 arch/powerpc/kvm/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 029be26b5a17..65a471de96de 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -67,6 +67,7 @@ config KVM_BOOK3S_64
 	select KVM_BOOK3S_64_HANDLER
 	select KVM
 	select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
+	select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
 	---help---
 	  Support running unmodified book3s_64 and book3s_32 guest kernels
 	  in virtual machines on book3s_64 host processors.
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH kernel v4 07/10] KVM: PPC: Pass kvm* to kvmppc_find_table()
  2017-02-07  7:17 ` Alexey Kardashevskiy
@ 2017-02-07  7:17   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 49+ messages in thread
From: Alexey Kardashevskiy @ 2017-02-07  7:17 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

The guest-view TCE tables are per KVM anyway (not per VCPU), so pass
kvm* there. This will be used in the following patches, where we will be
attaching VFIO containers to LIOBNs via an ioctl() to KVM (rather than
to a VCPU).

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 arch/powerpc/include/asm/kvm_ppc.h  |  2 +-
 arch/powerpc/kvm/book3s_64_vio.c    |  7 ++++---
 arch/powerpc/kvm/book3s_64_vio_hv.c | 13 +++++++------
 3 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 2da67bf1f2ec..37bc9e7e90ba 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -167,7 +167,7 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce_64 *args);
 extern struct kvmppc_spapr_tce_table *kvmppc_find_table(
-		struct kvm_vcpu *vcpu, unsigned long liobn);
+		struct kvm *kvm, unsigned long liobn);
 extern long kvmppc_ioba_validate(struct kvmppc_spapr_tce_table *stt,
 		unsigned long ioba, unsigned long npages);
 extern long kvmppc_tce_validate(struct kvmppc_spapr_tce_table *tt,
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index c379ff5a4438..15df8ae627d9 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -212,12 +212,13 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba, unsigned long tce)
 {
-	struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
+	struct kvmppc_spapr_tce_table *stt;
 	long ret;
 
 	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
 	/* 	    liobn, ioba, tce); */
 
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
@@ -245,7 +246,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	u64 __user *tces;
 	u64 tce;
 
-	stt = kvmppc_find_table(vcpu, liobn);
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
@@ -299,7 +300,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret;
 
-	stt = kvmppc_find_table(vcpu, liobn);
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index e4c4ea973e57..918af76ab2b6 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -48,10 +48,9 @@
  * WARNING: This will be called in real or virtual mode on HV KVM and virtual
  *          mode on PR KVM
  */
-struct kvmppc_spapr_tce_table *kvmppc_find_table(struct kvm_vcpu *vcpu,
+struct kvmppc_spapr_tce_table *kvmppc_find_table(struct kvm *kvm,
 		unsigned long liobn)
 {
-	struct kvm *kvm = vcpu->kvm;
 	struct kvmppc_spapr_tce_table *stt;
 
 	list_for_each_entry_lockless(stt, &kvm->arch.spapr_tce_tables, list)
@@ -182,12 +181,13 @@ EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
 long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		unsigned long ioba, unsigned long tce)
 {
-	struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
+	struct kvmppc_spapr_tce_table *stt;
 	long ret;
 
 	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
 	/* 	    liobn, ioba, tce); */
 
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
@@ -240,7 +240,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	unsigned long tces, entry, ua = 0;
 	unsigned long *rmap = NULL;
 
-	stt = kvmppc_find_table(vcpu, liobn);
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
@@ -301,7 +301,7 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret;
 
-	stt = kvmppc_find_table(vcpu, liobn);
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
@@ -322,12 +322,13 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 long kvmppc_h_get_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba)
 {
-	struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
+	struct kvmppc_spapr_tce_table *stt;
 	long ret;
 	unsigned long idx;
 	struct page *page;
 	u64 *tbl;
 
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH kernel v4 08/10] KVM: PPC: Separate TCE validation from update
  2017-02-07  7:17 ` Alexey Kardashevskiy
@ 2017-02-07  7:17   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 49+ messages in thread
From: Alexey Kardashevskiy @ 2017-02-07  7:17 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

For emulated devices it does not matter much if we hit a broken TCE
halfway through handling a TCE list, but for VFIO it does matter as the
operation has more chances to fail, so we do our best and check as much
as we can before proceeding.

This separates the guest-view table update from validation. No change in
behavior is expected.
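
For illustration only, a tiny stand-alone C sketch (not the kernel code
in the diff) of the validate-then-commit pattern this patch introduces:
nothing is written to the table until every entry in the list has been
checked.

#include <stdbool.h>
#include <stddef.h>

static bool put_tce_list(unsigned long *table, const unsigned long *tces,
			 size_t npages, bool (*validate)(unsigned long))
{
	size_t i;

	for (i = 0; i < npages; ++i)
		if (!validate(tces[i]))
			return false;	/* reject the whole list up front */

	for (i = 0; i < npages; ++i)
		table[i] = tces[i];	/* commit only after full validation */

	return true;
}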

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/kvm/book3s_64_vio.c    | 8 ++++++++
 arch/powerpc/kvm/book3s_64_vio_hv.c | 8 ++++++--
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 15df8ae627d9..9a7b7fca5e84 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -282,6 +282,14 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 		ret = kvmppc_tce_validate(stt, tce);
 		if (ret != H_SUCCESS)
 			goto unlock_exit;
+	}
+
+	for (i = 0; i < npages; ++i) {
+		if (get_user(tce, tces + i)) {
+			ret = H_TOO_HARD;
+			goto unlock_exit;
+		}
+		tce = be64_to_cpu(tce);
 
 		kvmppc_tce_put(stt, entry + i, tce);
 	}
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 918af76ab2b6..f8a54b7c788e 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -237,7 +237,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret = H_SUCCESS;
-	unsigned long tces, entry, ua = 0;
+	unsigned long tces, entry, tce, ua = 0;
 	unsigned long *rmap = NULL;
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
@@ -279,11 +279,15 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	}
 
 	for (i = 0; i < npages; ++i) {
-		unsigned long tce = be64_to_cpu(((u64 *)tces)[i]);
+		tce = be64_to_cpu(((u64 *)tces)[i]);
 
 		ret = kvmppc_tce_validate(stt, tce);
 		if (ret != H_SUCCESS)
 			goto unlock_exit;
+	}
+
+	for (i = 0; i < npages; ++i) {
+		tce = be64_to_cpu(((u64 *)tces)[i]);
 
 		kvmppc_tce_put(stt, entry + i, tce);
 	}
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH kernel v4 09/10] KVM: PPC: Use preregistered memory API to access TCE list
  2017-02-07  7:17 ` Alexey Kardashevskiy
@ 2017-02-07  7:17   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 49+ messages in thread
From: Alexey Kardashevskiy @ 2017-02-07  7:17 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

VFIO on sPAPR already implements guest memory pre-registration,
whereby the entire guest RAM gets pinned. This can be used to translate
the physical address of a guest page containing the TCE list
from H_PUT_TCE_INDIRECT.

This makes use of the pre-registered memory API to access TCE list
pages in order to avoid unnecessary locking on the KVM memory
reverse map: we know that all of guest memory is pinned and
we have a flat array mapping GPA to HPA, which makes it simpler and
quicker to index into that array (even with the extra lookup of the
kernel page tables in vmalloc_to_phys) than it is to find the memslot,
lock the rmap entry, look up the user page tables, and unlock the rmap
entry. Note that the rmap pointer is initialized to NULL
where declared (not in this patch).

If a requested chunk of memory has not been preregistered, this falls
back to the non-preregistered case and locks the rmap.
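
Condensed for illustration only (the actual change is in the diff
below), the new real-mode lookup order is roughly:

	if (mm_iommu_preregistered(vcpu->kvm->mm)) {
		/* VFIO case: translate the TCE list address through the
		 * preregistered-memory list, no rmap lock needed. */
		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
			return H_TOO_HARD;
		mem = mm_iommu_lookup_rm(vcpu->kvm->mm, ua, IOMMU_PAGE_SIZE_4K);
		if (mem)
			prereg = mm_iommu_ua_to_hpa_rm(mem, ua, &tces) == 0;
	}

	if (!prereg) {
		/* Guest with emulated devices only: fall back to locking
		 * the rmap and translating via kvmppc_rm_ua_to_hpa(). */
	}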

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v4:
* removed oneline inlines
* now falls back to locking rmap if TCE list is not in preregistered memory

v2:
* updated the commit log with David's comment
---
 arch/powerpc/kvm/book3s_64_vio_hv.c | 58 +++++++++++++++++++++++++++----------
 1 file changed, 42 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index f8a54b7c788e..dc1c66fda941 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -239,6 +239,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	long i, ret = H_SUCCESS;
 	unsigned long tces, entry, tce, ua = 0;
 	unsigned long *rmap = NULL;
+	bool prereg = false;
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -259,23 +260,47 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	if (ret != H_SUCCESS)
 		return ret;
 
-	if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
-		return H_TOO_HARD;
+	if (mm_iommu_preregistered(vcpu->kvm->mm)) {
+		/*
+		 * We get here if guest memory was pre-registered which
+		 * is normally VFIO case and gpa->hpa translation does not
+		 * depend on hpt.
+		 */
+		struct mm_iommu_table_group_mem_t *mem;
 
-	rmap = (void *) vmalloc_to_phys(rmap);
+		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
+			return H_TOO_HARD;
 
-	/*
-	 * Synchronize with the MMU notifier callbacks in
-	 * book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
-	 * While we have the rmap lock, code running on other CPUs
-	 * cannot finish unmapping the host real page that backs
-	 * this guest real page, so we are OK to access the host
-	 * real page.
-	 */
-	lock_rmap(rmap);
-	if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
-		ret = H_TOO_HARD;
-		goto unlock_exit;
+		mem = mm_iommu_lookup_rm(vcpu->kvm->mm, ua, IOMMU_PAGE_SIZE_4K);
+		if (mem)
+			prereg = mm_iommu_ua_to_hpa_rm(mem, ua, &tces) == 0;
+	}
+
+	if (!prereg) {
+		/*
+		 * This is usually a case of a guest with emulated devices only
+		 * when TCE list is not in preregistered memory.
+		 * We do not require memory to be preregistered in this case
+		 * so lock rmap and do __find_linux_pte_or_hugepte().
+		 */
+		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
+			return H_TOO_HARD;
+
+		rmap = (void *) vmalloc_to_phys(rmap);
+
+		/*
+		 * Synchronize with the MMU notifier callbacks in
+		 * book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
+		 * While we have the rmap lock, code running on other CPUs
+		 * cannot finish unmapping the host real page that backs
+		 * this guest real page, so we are OK to access the host
+		 * real page.
+		 */
+		lock_rmap(rmap);
+		if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
+			ret = H_TOO_HARD;
+			goto unlock_exit;
+		}
 	}
 
 	for (i = 0; i < npages; ++i) {
@@ -293,7 +318,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	}
 
 unlock_exit:
-	unlock_rmap(rmap);
+	if (rmap)
+		unlock_rmap(rmap);
 
 	return ret;
 }
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH kernel v4 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
  2017-02-07  7:17 ` Alexey Kardashevskiy
@ 2017-02-07  7:17   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 49+ messages in thread
From: Alexey Kardashevskiy @ 2017-02-07  7:17 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
without passing them to user space, which saves time on switching
to user space and back.

This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
KVM tries to handle a TCE request in real mode; if that fails,
it passes the request to virtual mode to complete the operation.
If the virtual mode handler fails as well, the request is passed to
user space; this is not expected to happen though.

To avoid dealing with page use counters (which is tricky in real mode),
this only accelerates SPAPR TCE IOMMU v2 clients which are required
to pre-register the userspace memory. The very first TCE request will
be handled in the VFIO SPAPR TCE driver anyway as the userspace view
of the TCE table (iommu_table::it_userspace) is not allocated till
the very first mapping happens and we cannot call vmalloc in real mode.
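
For context, a minimal sketch (not part of the patch) of the
pre-registration step user space performs with the SPAPR TCE IOMMU v2
backend; container_fd is assumed to be a VFIO container fd using
VFIO_SPAPR_TCE_v2_IOMMU, and error handling is omitted:

#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Pin a chunk of guest RAM so later DMA mappings (and the in-kernel
 * TCE handlers) can translate it without faulting. */
static int spapr_register_ram(int container_fd, void *ram, unsigned long size)
{
	struct vfio_iommu_spapr_register_memory reg = {
		.argsz = sizeof(reg),
		.flags = 0,
		.vaddr = (__u64)(unsigned long)ram,
		.size = size,
	};

	return ioctl(container_fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
}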

This adds a new attribute, KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE, to
the VFIO KVM device. It takes a VFIO group fd and an SPAPR TCE table fd
and associates a physical IOMMU table with the SPAPR TCE table (which
is a guest view of the hardware IOMMU table). The iommu_table object
is cached and referenced so we do not have to look it up in real mode.
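
For illustration, a minimal user space sketch (not part of the patch) of
setting the new attribute, assuming vfio_kvm_dev_fd was created with
KVM_CREATE_DEVICE(KVM_DEV_TYPE_VFIO), group_fd is the VFIO group fd,
tce_fd comes from KVM_CREATE_SPAPR_TCE_64, and the headers carry this
patch; error handling is omitted:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int kvm_vfio_set_spapr_tce(int vfio_kvm_dev_fd, int group_fd, int tce_fd)
{
	struct kvm_vfio_spapr_tce param = {
		.argsz = sizeof(param),
		.flags = 0,
		.groupfd = group_fd,
		.tablefd = tce_fd,
	};
	struct kvm_device_attr attr = {
		.group = KVM_DEV_VFIO_GROUP,
		.attr = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE,
		.addr = (uintptr_t)&param,
	};

	/* Associate the group's hardware IOMMU table with the TCE table. */
	return ioctl(vfio_kvm_dev_fd, KVM_SET_DEVICE_ATTR, &attr);
}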

This does not implement the UNSET counterpart as there is no use for it:
once the acceleration is enabled, the existing userspace won't
disable it unless a VFIO container is destroyed; this adds the necessary
cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.

As this creates a descriptor per IOMMU table/LIOBN pair (called
kvmppc_spapr_tce_iommu_table), it is possible to have several
descriptors with the same iommu_table (hardware IOMMU table) attached
to the same LIOBN; we do not remove duplicates though, as
iommu_table_ops::exchange does not just update a TCE entry (which is
shared among IOMMU groups) but also invalidates the TCE cache
(one per IOMMU group).

This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
space.

This finally makes use of vfio_external_user_iommu_id() which was
introduced quite some time ago and was considered for removal.

Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb Ethernet card).

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v4:
* added note to the commit log about allowing multiple updates of
the same IOMMU table;
* instead of checking whether any memory was preregistered, this
returns H_TOO_HARD if a specific page was not;
* fixed comments from v3 about error handling in many places;
* simplified TCE handlers and merged IOMMU parts inline - for example,
there used to be kvmppc_h_put_tce_iommu(), now it is merged into
kvmppc_h_put_tce(); this allows checking IOBA boundaries against
the first attached table only (makes the code simpler);

v3:
* simplified not to use VFIO group notifiers
* reworked cleanup, should be cleaner/simpler now

v2:
* reworked to use new VFIO notifiers
* now same iommu_table may appear in the list several times, to be fixed later
---

This has separate copies of the handlers for real and virtual modes;
H_PUT_TCE and H_STUFF_TCE could in fact share a lot (common helpers
would take a "realmode" flag), but H_PUT_TCE_INDIRECT uses get_user()
in virtual mode and direct access in real mode, and having a common
helper for it would make things uglier imho.


---
 Documentation/virtual/kvm/devices/vfio.txt |  22 +-
 arch/powerpc/include/asm/kvm_host.h        |   8 +
 arch/powerpc/include/asm/kvm_ppc.h         |   4 +
 include/uapi/linux/kvm.h                   |   8 +
 arch/powerpc/kvm/book3s_64_vio.c           | 319 ++++++++++++++++++++++++++++-
 arch/powerpc/kvm/book3s_64_vio_hv.c        | 172 +++++++++++++++-
 arch/powerpc/kvm/powerpc.c                 |   2 +
 virt/kvm/vfio.c                            |  60 ++++++
 8 files changed, 590 insertions(+), 5 deletions(-)

diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
index ef51740c67ca..f95d867168ea 100644
--- a/Documentation/virtual/kvm/devices/vfio.txt
+++ b/Documentation/virtual/kvm/devices/vfio.txt
@@ -16,7 +16,25 @@ Groups:
 
 KVM_DEV_VFIO_GROUP attributes:
   KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
+	kvm_device_attr.addr points to an int32_t file descriptor
+	for the VFIO group.
   KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
+	kvm_device_attr.addr points to an int32_t file descriptor
+	for the VFIO group.
+  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
+	allocated by sPAPR KVM.
+	kvm_device_attr.addr points to a struct:
 
-For each, kvm_device_attr.addr points to an int32_t file descriptor
-for the VFIO group.
+	struct kvm_vfio_spapr_tce {
+		__u32	argsz;
+		__u32	flags;
+		__s32	groupfd;
+		__s32	tablefd;
+	};
+
+	where
+	@argsz is the size of struct kvm_vfio_spapr_tce;
+	@flags are not supported now, must be zero;
+	@groupfd is a file descriptor for a VFIO group;
+	@tablefd is a file descriptor for a TCE table allocated via
+		KVM_CREATE_SPAPR_TCE.
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index e59b172666cd..a827006941f8 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -191,6 +191,13 @@ struct kvmppc_pginfo {
 	atomic_t refcnt;
 };
 
+struct kvmppc_spapr_tce_iommu_table {
+	struct rcu_head rcu;
+	struct list_head next;
+	struct vfio_group *group;
+	struct iommu_table *tbl;
+};
+
 struct kvmppc_spapr_tce_table {
 	struct list_head list;
 	struct kvm *kvm;
@@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
 	u32 page_shift;
 	u64 offset;		/* in pages */
 	u64 size;		/* window size in pages */
+	struct list_head iommu_tables;
 	struct page *pages[0];
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 37bc9e7e90ba..da1410bd6b36 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -163,6 +163,10 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
 extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
 			struct kvm_memory_slot *memslot, unsigned long porder);
 extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
+extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
+		struct vfio_group *group);
+extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
+		struct vfio_group *group);
 
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce_64 *args);
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index a2c9bb5a0ead..cdfa01169bd2 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1076,6 +1076,7 @@ struct kvm_device_attr {
 #define  KVM_DEV_VFIO_GROUP			1
 #define   KVM_DEV_VFIO_GROUP_ADD			1
 #define   KVM_DEV_VFIO_GROUP_DEL			2
+#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
 
 enum kvm_device_type {
 	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
@@ -1097,6 +1098,13 @@ enum kvm_device_type {
 	KVM_DEV_TYPE_MAX,
 };
 
+struct kvm_vfio_spapr_tce {
+	__u32	argsz;
+	__u32	flags;
+	__s32	groupfd;
+	__s32	tablefd;
+};
+
 /*
  * ioctls for VM fds
  */
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 9a7b7fca5e84..cb0469151e35 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -27,6 +27,10 @@
 #include <linux/hugetlb.h>
 #include <linux/list.h>
 #include <linux/anon_inodes.h>
+#include <linux/iommu.h>
+#include <linux/file.h>
+#include <linux/vfio.h>
+#include <linux/module.h>
 
 #include <asm/tlbflush.h>
 #include <asm/kvm_ppc.h>
@@ -39,6 +43,36 @@
 #include <asm/udbg.h>
 #include <asm/iommu.h>
 #include <asm/tce.h>
+#include <asm/mmu_context.h>
+
+static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
+{
+	void (*fn)(struct vfio_group *);
+
+	fn = symbol_get(vfio_group_put_external_user);
+	if (WARN_ON(!fn))
+		return;
+
+	fn(vfio_group);
+
+	symbol_put(vfio_group_put_external_user);
+}
+
+static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
+{
+	int (*fn)(struct vfio_group *);
+	int ret = -1;
+
+	fn = symbol_get(vfio_external_user_iommu_id);
+	if (!fn)
+		return ret;
+
+	ret = fn(vfio_group);
+
+	symbol_put(vfio_external_user_iommu_id);
+
+	return ret;
+}
 
 static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
 {
@@ -90,6 +124,123 @@ static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
 	return ret;
 }
 
+static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
+{
+	struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
+			struct kvmppc_spapr_tce_iommu_table, rcu);
+
+	iommu_table_put(stit->tbl);
+	kvm_vfio_group_put_external_user(stit->group);
+
+	kfree(stit);
+}
+
+static void kvm_spapr_tce_liobn_release_iommu_group(
+		struct kvmppc_spapr_tce_table *stt,
+		struct vfio_group *group)
+{
+	struct kvmppc_spapr_tce_iommu_table *stit, *tmp;
+
+	list_for_each_entry_safe(stit, tmp, &stt->iommu_tables, next) {
+		if (group && (stit->group != group))
+			continue;
+
+		list_del_rcu(&stit->next);
+
+		call_rcu(&stit->rcu, kvm_spapr_tce_iommu_table_free);
+	}
+}
+
+extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
+		struct vfio_group *group)
+{
+	struct kvmppc_spapr_tce_table *stt;
+
+	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list)
+		kvm_spapr_tce_liobn_release_iommu_group(stt, group);
+}
+
+extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
+		struct vfio_group *group)
+{
+	struct kvmppc_spapr_tce_table *stt = NULL;
+	bool found = false;
+	struct iommu_table *tbl = NULL;
+	struct iommu_table_group *table_group;
+	long i, ret = 0;
+	struct kvmppc_spapr_tce_iommu_table *stit;
+	struct fd f;
+	int group_id;
+	struct iommu_group *grp;
+
+	group_id = kvm_vfio_external_user_iommu_id(group);
+	grp = iommu_group_get_by_id(group_id);
+	if (!grp)
+		return -EFAULT;
+
+	f = fdget(tablefd);
+	if (!f.file) {
+		ret = -EBADF;
+		goto put_exit;
+	}
+
+	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
+		if (stt == f.file->private_data) {
+			found = true;
+			break;
+		}
+	}
+
+	fdput(f);
+
+	if (!found) {
+		ret = -ENODEV;
+		goto put_exit;
+	}
+
+	table_group = iommu_group_get_iommudata(grp);
+	if (WARN_ON(!table_group)) {
+		ret = -EFAULT;
+		goto put_exit;
+	}
+
+	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
+		struct iommu_table *tbltmp = table_group->tables[i];
+
+		if (!tbltmp)
+			continue;
+
+		/*
+		 * Make sure hardware table parameters are exactly the same;
+		 * this is used in the TCE handlers where boundary checks
+		 * use only the first attached table.
+		 */
+		if ((tbltmp->it_page_shift == stt->page_shift) &&
+				(tbltmp->it_offset == stt->offset) &&
+				(tbltmp->it_size == stt->size)) {
+			tbl = tbltmp;
+			break;
+		}
+	}
+	if (!tbl) {
+		ret = -ENODEV;
+		goto put_exit;
+	}
+
+	iommu_table_get(tbl);
+
+	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
+	stit->tbl = tbl;
+	stit->group = group;
+
+	list_add_rcu(&stit->next, &stt->iommu_tables);
+
+put_exit:
+	iommu_group_put(grp);
+
+	return ret;
+}
+
 static void release_spapr_tce_table(struct rcu_head *head)
 {
 	struct kvmppc_spapr_tce_table *stt = container_of(head,
@@ -132,6 +283,8 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
 
 	list_del_rcu(&stt->list);
 
+	kvm_spapr_tce_liobn_release_iommu_group(stt, NULL /* release all */);
+
 	kvm_put_kvm(stt->kvm);
 
 	kvmppc_account_memlimit(
@@ -181,6 +334,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 	stt->offset = args->offset;
 	stt->size = size;
 	stt->kvm = kvm;
+	INIT_LIST_HEAD_RCU(&stt->iommu_tables);
 
 	for (i = 0; i < npages; i++) {
 		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
@@ -209,11 +363,94 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 	return ret;
 }
 
+static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	struct mm_iommu_table_group_mem_t *mem = NULL;
+	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
+	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+
+	if (!pua)
+		return H_HARDWARE;
+
+	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
+	if (!mem)
+		return H_TOO_HARD;
+
+	mm_iommu_mapped_dec(mem);
+
+	*pua = 0;
+
+	return H_SUCCESS;
+}
+
+static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	enum dma_data_direction dir = DMA_NONE;
+	unsigned long hpa = 0;
+	long ret;
+
+	if (iommu_tce_xchg(tbl, entry, &hpa, &dir))
+		return H_HARDWARE;
+
+	if (dir == DMA_NONE)
+		return H_SUCCESS;
+
+	ret = kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
+	if (ret != H_SUCCESS)
+		iommu_tce_xchg(tbl, entry, &hpa, &dir);
+
+	return ret;
+}
+
+long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
+		unsigned long entry, unsigned long gpa,
+		enum dma_data_direction dir)
+{
+	long ret;
+	unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+	struct mm_iommu_table_group_mem_t *mem;
+
+	if (!pua)
+		/* it_userspace allocation might be delayed */
+		return H_TOO_HARD;
+
+	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
+		return H_PARAMETER;
+
+	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
+	if (!mem)
+		return H_TOO_HARD;
+
+	if (mm_iommu_ua_to_hpa(mem, ua, &hpa))
+		return H_HARDWARE;
+
+	if (mm_iommu_mapped_inc(mem))
+		return H_HARDWARE;
+
+	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
+	if (ret) {
+		mm_iommu_mapped_dec(mem);
+		return H_TOO_HARD;
+	}
+
+	if (dir != DMA_NONE)
+		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
+
+	*pua = ua;
+
+	return 0;
+}
+
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba, unsigned long tce)
 {
 	struct kvmppc_spapr_tce_table *stt;
-	long ret;
+	long ret, idx;
+	struct kvmppc_spapr_tce_iommu_table *stit;
+	unsigned long entry, gpa;
+	enum dma_data_direction dir;
 
 	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
 	/* 	    liobn, ioba, tce); */
@@ -230,6 +467,36 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (ret != H_SUCCESS)
 		return ret;
 
+	stit = list_first_entry_or_null(&stt->iommu_tables,
+			struct kvmppc_spapr_tce_iommu_table, next);
+	if (stit) {
+		entry = ioba >> stit->tbl->it_page_shift;
+		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+		dir = iommu_tce_direction(tce);
+
+		if (dir == DMA_NONE) {
+			if (iommu_tce_clear_param_check(stit->tbl, ioba, 0, 1))
+				return H_PARAMETER;
+		} else {
+			if (iommu_tce_put_param_check(stit->tbl, ioba, gpa))
+				return H_PARAMETER;
+		}
+
+		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+			if (dir == DMA_NONE) {
+				ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
+						stit->tbl, entry);
+			} else {
+				idx = srcu_read_lock(&vcpu->kvm->srcu);
+				ret = kvmppc_tce_iommu_map(vcpu->kvm, stit->tbl,
+						entry, gpa, dir);
+				srcu_read_unlock(&vcpu->kvm->srcu, idx);
+			}
+			if (ret != H_SUCCESS)
+				return ret;
+		}
+	}
+
 	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
 
 	return H_SUCCESS;
@@ -242,9 +509,10 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret = H_SUCCESS, idx;
-	unsigned long entry, ua = 0;
+	unsigned long entry, gpa, ua = 0;
 	u64 __user *tces;
 	u64 tce;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -272,6 +540,9 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	}
 	tces = (u64 __user *) ua;
 
+	stit = list_first_entry_or_null(&stt->iommu_tables,
+			struct kvmppc_spapr_tce_iommu_table, next);
+
 	for (i = 0; i < npages; ++i) {
 		if (get_user(tce, tces + i)) {
 			ret = H_TOO_HARD;
@@ -282,6 +553,15 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 		ret = kvmppc_tce_validate(stt, tce);
 		if (ret != H_SUCCESS)
 			goto unlock_exit;
+
+		if (stit) {
+			gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+			ret = iommu_tce_put_param_check(stit->tbl,
+					ioba + (i << stit->tbl->it_page_shift),
+					gpa);
+			if (ret != H_SUCCESS)
+				goto unlock_exit;
+		}
 	}
 
 	for (i = 0; i < npages; ++i) {
@@ -291,6 +571,21 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 		}
 		tce = be64_to_cpu(tce);
 
+		if (stit) {
+			for (i = 0; i < npages; ++i) {
+				gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+				list_for_each_entry_lockless(stit,
+						&stt->iommu_tables, next) {
+					ret = kvmppc_tce_iommu_map(vcpu->kvm,
+						stit->tbl, entry + i, gpa,
+						iommu_tce_direction(tce));
+					if (ret != H_SUCCESS)
+						goto unlock_exit;
+				}
+			}
+		}
+
 		kvmppc_tce_put(stt, entry + i, tce);
 	}
 
@@ -307,6 +602,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -320,6 +616,25 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
 		return H_PARAMETER;
 
+	stit = list_first_entry_or_null(&stt->iommu_tables,
+			struct kvmppc_spapr_tce_iommu_table, next);
+	if (stit) {
+		if (iommu_tce_clear_param_check(stit->tbl, ioba,
+					tce_value, npages))
+			return H_PARAMETER;
+
+		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+			unsigned long entry = ioba >> stit->tbl->it_page_shift;
+
+			for (i = 0; i < npages; ++i) {
+				ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
+						stit->tbl, entry + i);
+				if (ret)
+					return ret;
+			}
+		}
+	}
+
 	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
 		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
 
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index dc1c66fda941..018c7d94a575 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -178,11 +178,104 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
 EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
 
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm *kvm,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	struct mm_iommu_table_group_mem_t *mem = NULL;
+	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
+	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+
+	if (!pua)
+		return H_HARDWARE;
+
+	pua = (void *) vmalloc_to_phys(pua);
+	if (!pua)
+		return H_TOO_HARD;
+
+	mem = mm_iommu_lookup_rm(kvm->mm, *pua, pgsize);
+	if (!mem)
+		return H_TOO_HARD;
+
+	mm_iommu_mapped_dec(mem);
+
+	*pua = 0;
+
+	return H_SUCCESS;
+}
+
+static long kvmppc_rm_tce_iommu_unmap(struct kvm *kvm,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	enum dma_data_direction dir = DMA_NONE;
+	unsigned long hpa = 0;
+	long ret;
+
+	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
+		return H_HARDWARE;
+
+	if (dir == DMA_NONE)
+		return H_SUCCESS;
+
+	ret = kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
+	if (ret)
+		iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
+
+	return ret;
+}
+
+long kvmppc_rm_tce_iommu_map(struct kvm_vcpu *vcpu, struct iommu_table *tbl,
+		unsigned long entry, unsigned long gpa,
+		enum dma_data_direction dir)
+{
+	long ret;
+	unsigned long hpa = 0, ua;
+	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+	struct mm_iommu_table_group_mem_t *mem;
+
+	if (!pua)
+		/* it_userspace allocation might be delayed */
+		return H_TOO_HARD;
+
+	if (kvmppc_gpa_to_ua(vcpu->kvm, gpa, &ua, NULL))
+		return H_PARAMETER;
+
+	mem = mm_iommu_lookup_rm(vcpu->kvm->mm, ua, 1ULL << tbl->it_page_shift);
+	if (!mem)
+		return H_TOO_HARD;
+
+	if (mm_iommu_ua_to_hpa_rm(mem, ua, &hpa))
+		return H_HARDWARE;
+
+	pua = (void *) vmalloc_to_phys(pua);
+	if (!pua)
+		return H_HARDWARE;
+
+	if (mm_iommu_mapped_inc(mem))
+		return H_HARDWARE;
+
+	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
+	if (ret) {
+		mm_iommu_mapped_dec(mem);
+		return H_TOO_HARD;
+	}
+
+	if (dir != DMA_NONE)
+		kvmppc_rm_tce_iommu_mapped_dec(vcpu->kvm, tbl, entry);
+
+	*pua = ua;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(kvmppc_rm_tce_iommu_map);
+
 long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		unsigned long ioba, unsigned long tce)
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long ret;
+	struct kvmppc_spapr_tce_iommu_table *stit;
+	unsigned long entry, gpa;
+	enum dma_data_direction dir;
 
 	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
 	/* 	    liobn, ioba, tce); */
@@ -199,6 +292,33 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (ret != H_SUCCESS)
 		return ret;
 
+	stit = list_first_entry_or_null(&stt->iommu_tables,
+			struct kvmppc_spapr_tce_iommu_table, next);
+	if (stit) {
+		entry = ioba >> stit->tbl->it_page_shift;
+		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+		dir = iommu_tce_direction(tce);
+
+		if (dir == DMA_NONE) {
+			if (iommu_tce_clear_param_check(stit->tbl, ioba, 0, 1))
+				return H_PARAMETER;
+		} else {
+			if (iommu_tce_put_param_check(stit->tbl, ioba, gpa))
+				return H_PARAMETER;
+		}
+
+		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+			if (dir == DMA_NONE)
+				ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
+						stit->tbl, entry);
+			else
+				ret = kvmppc_rm_tce_iommu_map(vcpu, stit->tbl,
+						entry, gpa, dir);
+			if (ret != H_SUCCESS)
+				return ret;
+		}
+	}
+
 	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
 
 	return H_SUCCESS;
@@ -237,9 +357,10 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret = H_SUCCESS;
-	unsigned long tces, entry, tce, ua = 0;
+	unsigned long tces, entry, gpa, tce, ua = 0;
 	unsigned long *rmap = NULL;
 	bool prereg = false;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -303,17 +424,45 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 		}
 	}
 
+	stit = list_first_entry_or_null(&stt->iommu_tables,
+			struct kvmppc_spapr_tce_iommu_table, next);
+
 	for (i = 0; i < npages; ++i) {
 		tce = be64_to_cpu(((u64 *)tces)[i]);
 
 		ret = kvmppc_tce_validate(stt, tce);
 		if (ret != H_SUCCESS)
 			goto unlock_exit;
+
+		if (stit) {
+			gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+			ret = iommu_tce_put_param_check(stit->tbl,
+					ioba + (i << stit->tbl->it_page_shift),
+					gpa);
+			if (ret != H_SUCCESS)
+				goto unlock_exit;
+
+		}
 	}
 
 	for (i = 0; i < npages; ++i) {
 		tce = be64_to_cpu(((u64 *)tces)[i]);
 
+		if (stit) {
+			for (i = 0; i < npages; ++i) {
+				gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+				list_for_each_entry_lockless(stit,
+						&stt->iommu_tables, next) {
+					ret = kvmppc_rm_tce_iommu_map(vcpu,
+						stit->tbl, entry + i, gpa,
+						iommu_tce_direction(tce));
+					if (ret != H_SUCCESS)
+						goto unlock_exit;
+				}
+			}
+		}
+
 		kvmppc_tce_put(stt, entry + i, tce);
 	}
 
@@ -330,6 +479,8 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret;
+	struct kvmppc_spapr_tce_iommu_table *stit;
+
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -343,6 +494,25 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
 		return H_PARAMETER;
 
+	stit = list_first_entry_or_null(&stt->iommu_tables,
+			struct kvmppc_spapr_tce_iommu_table, next);
+	if (stit) {
+		if (iommu_tce_clear_param_check(stit->tbl, ioba,
+					tce_value, npages))
+			return H_PARAMETER;
+
+		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+			unsigned long entry = ioba >> stit->tbl->it_page_shift;
+
+			for (i = 0; i < npages; ++i) {
+				ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
+						stit->tbl, entry + i);
+				if (ret)
+					return ret;
+			}
+		}
+	}
+
 	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
 		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
 
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index cd892dec7cb6..f3127dc87912 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -536,6 +536,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 #ifdef CONFIG_PPC_BOOK3S_64
 	case KVM_CAP_SPAPR_TCE:
 	case KVM_CAP_SPAPR_TCE_64:
+		/* fallthrough */
+	case KVM_CAP_SPAPR_TCE_VFIO:
 	case KVM_CAP_PPC_RTAS:
 	case KVM_CAP_PPC_FIXUP_HCALL:
 	case KVM_CAP_PPC_ENABLE_HCALL:
diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
index d32f239eb471..2b7dc22265fe 100644
--- a/virt/kvm/vfio.c
+++ b/virt/kvm/vfio.c
@@ -20,6 +20,10 @@
 #include <linux/vfio.h>
 #include "vfio.h"
 
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+#include <asm/kvm_ppc.h>
+#endif
+
 struct kvm_vfio_group {
 	struct list_head node;
 	struct vfio_group *vfio_group;
@@ -211,6 +215,9 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
 
 		mutex_unlock(&kv->lock);
 
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+		kvm_spapr_tce_release_iommu_group(dev->kvm, vfio_group);
+#endif
 		kvm_vfio_group_set_kvm(vfio_group, NULL);
 
 		kvm_vfio_group_put_external_user(vfio_group);
@@ -218,6 +225,53 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
 		kvm_vfio_update_coherency(dev);
 
 		return ret;
+
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
+		struct kvm_vfio_spapr_tce param;
+		unsigned long minsz;
+		struct kvm_vfio *kv = dev->private;
+		struct vfio_group *vfio_group;
+		struct kvm_vfio_group *kvg;
+		struct fd f;
+
+		minsz = offsetofend(struct kvm_vfio_spapr_tce, tablefd);
+
+		if (copy_from_user(&param, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (param.argsz < minsz || param.flags)
+			return -EINVAL;
+
+		f = fdget(param.groupfd);
+		if (!f.file)
+			return -EBADF;
+
+		vfio_group = kvm_vfio_group_get_external_user(f.file);
+		fdput(f);
+
+		if (IS_ERR(vfio_group))
+			return PTR_ERR(vfio_group);
+
+		ret = -ENOENT;
+
+		mutex_lock(&kv->lock);
+
+		list_for_each_entry(kvg, &kv->group_list, node) {
+			if (kvg->vfio_group != vfio_group)
+				continue;
+
+			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
+					param.tablefd, vfio_group);
+
+			break;
+		}
+
+		mutex_unlock(&kv->lock);
+
+		return ret;
+	}
+#endif /* CONFIG_SPAPR_TCE_IOMMU */
 	}
 
 	return -ENXIO;
@@ -242,6 +296,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
 		switch (attr->attr) {
 		case KVM_DEV_VFIO_GROUP_ADD:
 		case KVM_DEV_VFIO_GROUP_DEL:
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
+#endif
 			return 0;
 		}
 
@@ -257,6 +314,9 @@ static void kvm_vfio_destroy(struct kvm_device *dev)
 	struct kvm_vfio_group *kvg, *tmp;
 
 	list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+		kvm_spapr_tce_release_iommu_group(dev->kvm, kvg->vfio_group);
+#endif
 		kvm_vfio_group_set_kvm(kvg->vfio_group, NULL);
 		kvm_vfio_group_put_external_user(kvg->vfio_group);
 		list_del(&kvg->node);
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH kernel v4 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
@ 2017-02-07  7:17   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 49+ messages in thread
From: Alexey Kardashevskiy @ 2017-02-07  7:17 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Paul Mackerras, kvm-ppc, kvm

This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
without passing them to user space, which saves time on switching
to user space and back.

This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
KVM tries to handle a TCE request in real mode; if that fails,
it passes the request to virtual mode to complete the operation.
If the virtual mode handler fails as well, the request is passed to
user space; this is not expected to happen though.

To avoid dealing with page use counters (which is tricky in real mode),
this only accelerates SPAPR TCE IOMMU v2 clients which are required
to pre-register the userspace memory. The very first TCE request will
be handled in the VFIO SPAPR TCE driver anyway as the userspace view
of the TCE table (iommu_table::it_userspace) is not allocated till
the very first mapping happens and we cannot call vmalloc in real mode.

This adds a new attribute, KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE, to
the VFIO KVM device. It takes a VFIO group fd and an SPAPR TCE table fd
and associates a physical IOMMU table with the SPAPR TCE table (which
is a guest view of the hardware IOMMU table). The iommu_table object
is cached and referenced so we do not have to look it up in real mode.

This does not implement the UNSET counterpart as there is no use for it:
once the acceleration is enabled, the existing userspace won't
disable it unless a VFIO container is destroyed; this adds the necessary
cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.

As this creates a descriptor per IOMMU table/LIOBN pair (called
kvmppc_spapr_tce_iommu_table), it is possible to have several
descriptors with the same iommu_table (hardware IOMMU table) attached
to the same LIOBN; we do not remove duplicates though, as
iommu_table_ops::exchange does not just update a TCE entry (which is
shared among IOMMU groups) but also invalidates the TCE cache
(one per IOMMU group).

This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
space.

This finally makes use of vfio_external_user_iommu_id() which was
introduced quite some time ago and was considered for removal.

Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb Ethernet card).

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v4:
* added note to the commit log about allowing multiple updates of
the same IOMMU table;
* instead of checking whether any memory was preregistered, this
returns H_TOO_HARD if a specific page was not;
* fixed comments from v3 about error handling in many places;
* simplified TCE handlers and merged IOMMU parts inline - for example,
there used to be kvmppc_h_put_tce_iommu(), now it is merged into
kvmppc_h_put_tce(); this allows checking IOBA boundaries against
the first attached table only (makes the code simpler);

v3:
* simplified not to use VFIO group notifiers
* reworked cleanup, should be cleaner/simpler now

v2:
* reworked to use new VFIO notifiers
* now same iommu_table may appear in the list several times, to be fixed later
---

This has separate copies of the handlers for real and virtual modes;
H_PUT_TCE and H_STUFF_TCE could in fact share a lot (common helpers
would take a "realmode" flag), but H_PUT_TCE_INDIRECT uses get_user()
in virtual mode and direct access in real mode, and having a common
helper for it would make things uglier imho.


---
 Documentation/virtual/kvm/devices/vfio.txt |  22 +-
 arch/powerpc/include/asm/kvm_host.h        |   8 +
 arch/powerpc/include/asm/kvm_ppc.h         |   4 +
 include/uapi/linux/kvm.h                   |   8 +
 arch/powerpc/kvm/book3s_64_vio.c           | 319 ++++++++++++++++++++++++++++-
 arch/powerpc/kvm/book3s_64_vio_hv.c        | 172 +++++++++++++++-
 arch/powerpc/kvm/powerpc.c                 |   2 +
 virt/kvm/vfio.c                            |  60 ++++++
 8 files changed, 590 insertions(+), 5 deletions(-)

diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
index ef51740c67ca..f95d867168ea 100644
--- a/Documentation/virtual/kvm/devices/vfio.txt
+++ b/Documentation/virtual/kvm/devices/vfio.txt
@@ -16,7 +16,25 @@ Groups:
 
 KVM_DEV_VFIO_GROUP attributes:
   KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
+	kvm_device_attr.addr points to an int32_t file descriptor
+	for the VFIO group.
   KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
+	kvm_device_attr.addr points to an int32_t file descriptor
+	for the VFIO group.
+  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
+	allocated by sPAPR KVM.
+	kvm_device_attr.addr points to a struct:
 
-For each, kvm_device_attr.addr points to an int32_t file descriptor
-for the VFIO group.
+	struct kvm_vfio_spapr_tce {
+		__u32	argsz;
+		__u32	flags;
+		__s32	groupfd;
+		__s32	tablefd;
+	};
+
+	where
+	@argsz is the size of struct kvm_vfio_spapr_tce;
+	@flags are not supported now, must be zero;
+	@groupfd is a file descriptor for a VFIO group;
+	@tablefd is a file descriptor for a TCE table allocated via
+		KVM_CREATE_SPAPR_TCE.
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index e59b172666cd..a827006941f8 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -191,6 +191,13 @@ struct kvmppc_pginfo {
 	atomic_t refcnt;
 };
 
+struct kvmppc_spapr_tce_iommu_table {
+	struct rcu_head rcu;
+	struct list_head next;
+	struct vfio_group *group;
+	struct iommu_table *tbl;
+};
+
 struct kvmppc_spapr_tce_table {
 	struct list_head list;
 	struct kvm *kvm;
@@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
 	u32 page_shift;
 	u64 offset;		/* in pages */
 	u64 size;		/* window size in pages */
+	struct list_head iommu_tables;
 	struct page *pages[0];
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 37bc9e7e90ba..da1410bd6b36 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -163,6 +163,10 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
 extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
 			struct kvm_memory_slot *memslot, unsigned long porder);
 extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
+extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
+		struct vfio_group *group);
+extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
+		struct vfio_group *group);
 
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce_64 *args);
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index a2c9bb5a0ead..cdfa01169bd2 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1076,6 +1076,7 @@ struct kvm_device_attr {
 #define  KVM_DEV_VFIO_GROUP			1
 #define   KVM_DEV_VFIO_GROUP_ADD			1
 #define   KVM_DEV_VFIO_GROUP_DEL			2
+#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
 
 enum kvm_device_type {
 	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
@@ -1097,6 +1098,13 @@ enum kvm_device_type {
 	KVM_DEV_TYPE_MAX,
 };
 
+struct kvm_vfio_spapr_tce {
+	__u32	argsz;
+	__u32	flags;
+	__s32	groupfd;
+	__s32	tablefd;
+};
+
 /*
  * ioctls for VM fds
  */
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 9a7b7fca5e84..cb0469151e35 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -27,6 +27,10 @@
 #include <linux/hugetlb.h>
 #include <linux/list.h>
 #include <linux/anon_inodes.h>
+#include <linux/iommu.h>
+#include <linux/file.h>
+#include <linux/vfio.h>
+#include <linux/module.h>
 
 #include <asm/tlbflush.h>
 #include <asm/kvm_ppc.h>
@@ -39,6 +43,36 @@
 #include <asm/udbg.h>
 #include <asm/iommu.h>
 #include <asm/tce.h>
+#include <asm/mmu_context.h>
+
+static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
+{
+	void (*fn)(struct vfio_group *);
+
+	fn = symbol_get(vfio_group_put_external_user);
+	if (WARN_ON(!fn))
+		return;
+
+	fn(vfio_group);
+
+	symbol_put(vfio_group_put_external_user);
+}
+
+static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
+{
+	int (*fn)(struct vfio_group *);
+	int ret = -1;
+
+	fn = symbol_get(vfio_external_user_iommu_id);
+	if (!fn)
+		return ret;
+
+	ret = fn(vfio_group);
+
+	symbol_put(vfio_external_user_iommu_id);
+
+	return ret;
+}
 
 static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
 {
@@ -90,6 +124,123 @@ static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
 	return ret;
 }
 
+static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
+{
+	struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
+			struct kvmppc_spapr_tce_iommu_table, rcu);
+
+	iommu_table_put(stit->tbl);
+	kvm_vfio_group_put_external_user(stit->group);
+
+	kfree(stit);
+}
+
+static void kvm_spapr_tce_liobn_release_iommu_group(
+		struct kvmppc_spapr_tce_table *stt,
+		struct vfio_group *group)
+{
+	struct kvmppc_spapr_tce_iommu_table *stit, *tmp;
+
+	list_for_each_entry_safe(stit, tmp, &stt->iommu_tables, next) {
+		if (group && (stit->group != group))
+			continue;
+
+		list_del_rcu(&stit->next);
+
+		call_rcu(&stit->rcu, kvm_spapr_tce_iommu_table_free);
+	}
+}
+
+extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
+		struct vfio_group *group)
+{
+	struct kvmppc_spapr_tce_table *stt;
+
+	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list)
+		kvm_spapr_tce_liobn_release_iommu_group(stt, group);
+}
+
+extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
+		struct vfio_group *group)
+{
+	struct kvmppc_spapr_tce_table *stt = NULL;
+	bool found = false;
+	struct iommu_table *tbl = NULL;
+	struct iommu_table_group *table_group;
+	long i, ret = 0;
+	struct kvmppc_spapr_tce_iommu_table *stit;
+	struct fd f;
+	int group_id;
+	struct iommu_group *grp;
+
+	group_id = kvm_vfio_external_user_iommu_id(group);
+	grp = iommu_group_get_by_id(group_id);
+	if (!grp)
+		return -EFAULT;
+
+	f = fdget(tablefd);
+	if (!f.file) {
+		ret = -EBADF;
+		goto put_exit;
+	}
+
+	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
+		if (stt == f.file->private_data) {
+			found = true;
+			break;
+		}
+	}
+
+	fdput(f);
+
+	if (!found) {
+		ret = -ENODEV;
+		goto put_exit;
+	}
+
+	table_group = iommu_group_get_iommudata(grp);
+	if (WARN_ON(!table_group)) {
+		ret = -EFAULT;
+		goto put_exit;
+	}
+
+	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
+		struct iommu_table *tbltmp = table_group->tables[i];
+
+		if (!tbltmp)
+			continue;
+
+		/*
+		 * Make sure hardware table parameters are exactly the same;
+		 * this is used in the TCE handlers where boundary checks
+		 * use only the first attached table.
+		 */
+		if ((tbltmp->it_page_shift == stt->page_shift) &&
+				(tbltmp->it_offset == stt->offset) &&
+				(tbltmp->it_size == stt->size)) {
+			tbl = tbltmp;
+			break;
+		}
+	}
+	if (!tbl) {
+		ret = -ENODEV;
+		goto put_exit;
+	}
+
+	iommu_table_get(tbl);
+
+	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
+	stit->tbl = tbl;
+	stit->group = group;
+
+	list_add_rcu(&stit->next, &stt->iommu_tables);
+
+put_exit:
+	iommu_group_put(grp);
+
+	return ret;
+}
+
 static void release_spapr_tce_table(struct rcu_head *head)
 {
 	struct kvmppc_spapr_tce_table *stt = container_of(head,
@@ -132,6 +283,8 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
 
 	list_del_rcu(&stt->list);
 
+	kvm_spapr_tce_liobn_release_iommu_group(stt, NULL /* release all */);
+
 	kvm_put_kvm(stt->kvm);
 
 	kvmppc_account_memlimit(
@@ -181,6 +334,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 	stt->offset = args->offset;
 	stt->size = size;
 	stt->kvm = kvm;
+	INIT_LIST_HEAD_RCU(&stt->iommu_tables);
 
 	for (i = 0; i < npages; i++) {
 		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
@@ -209,11 +363,94 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 	return ret;
 }
 
+static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	struct mm_iommu_table_group_mem_t *mem = NULL;
+	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
+	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+
+	if (!pua)
+		return H_HARDWARE;
+
+	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
+	if (!mem)
+		return H_TOO_HARD;
+
+	mm_iommu_mapped_dec(mem);
+
+	*pua = 0;
+
+	return H_SUCCESS;
+}
+
+static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	enum dma_data_direction dir = DMA_NONE;
+	unsigned long hpa = 0;
+	long ret;
+
+	if (iommu_tce_xchg(tbl, entry, &hpa, &dir))
+		return H_HARDWARE;
+
+	if (dir == DMA_NONE)
+		return H_SUCCESS;
+
+	ret = kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
+	if (ret != H_SUCCESS)
+		iommu_tce_xchg(tbl, entry, &hpa, &dir);
+
+	return ret;
+}
+
+long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
+		unsigned long entry, unsigned long gpa,
+		enum dma_data_direction dir)
+{
+	long ret;
+	unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+	struct mm_iommu_table_group_mem_t *mem;
+
+	if (!pua)
+		/* it_userspace allocation might be delayed */
+		return H_TOO_HARD;
+
+	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
+		return H_PARAMETER;
+
+	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
+	if (!mem)
+		return H_TOO_HARD;
+
+	if (mm_iommu_ua_to_hpa(mem, ua, &hpa))
+		return H_HARDWARE;
+
+	if (mm_iommu_mapped_inc(mem))
+		return H_HARDWARE;
+
+	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
+	if (ret) {
+		mm_iommu_mapped_dec(mem);
+		return H_TOO_HARD;
+	}
+
+	if (dir != DMA_NONE)
+		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
+
+	*pua = ua;
+
+	return 0;
+}
+
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba, unsigned long tce)
 {
 	struct kvmppc_spapr_tce_table *stt;
-	long ret;
+	long ret, idx;
+	struct kvmppc_spapr_tce_iommu_table *stit;
+	unsigned long entry, gpa;
+	enum dma_data_direction dir;
 
 	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
 	/* 	    liobn, ioba, tce); */
@@ -230,6 +467,36 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (ret != H_SUCCESS)
 		return ret;
 
+	stit = list_first_entry_or_null(&stt->iommu_tables,
+			struct kvmppc_spapr_tce_iommu_table, next);
+	if (stit) {
+		entry = ioba >> stit->tbl->it_page_shift;
+		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+		dir = iommu_tce_direction(tce);
+
+		if (dir == DMA_NONE) {
+			if (iommu_tce_clear_param_check(stit->tbl, ioba, 0, 1))
+				return H_PARAMETER;
+		} else {
+			if (iommu_tce_put_param_check(stit->tbl, ioba, gpa))
+				return H_PARAMETER;
+		}
+
+		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+			if (dir == DMA_NONE) {
+				ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
+						stit->tbl, entry);
+			} else {
+				idx = srcu_read_lock(&vcpu->kvm->srcu);
+				ret = kvmppc_tce_iommu_map(vcpu->kvm, stit->tbl,
+						entry, gpa, dir);
+				srcu_read_unlock(&vcpu->kvm->srcu, idx);
+			}
+			if (ret != H_SUCCESS)
+				return ret;
+		}
+	}
+
 	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
 
 	return H_SUCCESS;
@@ -242,9 +509,10 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret = H_SUCCESS, idx;
-	unsigned long entry, ua = 0;
+	unsigned long entry, gpa, ua = 0;
 	u64 __user *tces;
 	u64 tce;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -272,6 +540,9 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	}
 	tces = (u64 __user *) ua;
 
+	stit = list_first_entry_or_null(&stt->iommu_tables,
+			struct kvmppc_spapr_tce_iommu_table, next);
+
 	for (i = 0; i < npages; ++i) {
 		if (get_user(tce, tces + i)) {
 			ret = H_TOO_HARD;
@@ -282,6 +553,15 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 		ret = kvmppc_tce_validate(stt, tce);
 		if (ret != H_SUCCESS)
 			goto unlock_exit;
+
+		if (stit) {
+			gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+			ret = iommu_tce_put_param_check(stit->tbl,
+					ioba + (i << stit->tbl->it_page_shift),
+					gpa);
+			if (ret != H_SUCCESS)
+				goto unlock_exit;
+		}
 	}
 
 	for (i = 0; i < npages; ++i) {
@@ -291,6 +571,21 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 		}
 		tce = be64_to_cpu(tce);
 
+		if (stit) {
+			for (i = 0; i < npages; ++i) {
+				gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+				list_for_each_entry_lockless(stit,
+						&stt->iommu_tables, next) {
+					ret = kvmppc_tce_iommu_map(vcpu->kvm,
+						stit->tbl, entry + i, gpa,
+						iommu_tce_direction(tce));
+					if (ret != H_SUCCESS)
+						goto unlock_exit;
+				}
+			}
+		}
+
 		kvmppc_tce_put(stt, entry + i, tce);
 	}
 
@@ -307,6 +602,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -320,6 +616,25 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
 		return H_PARAMETER;
 
+	stit = list_first_entry_or_null(&stt->iommu_tables,
+			struct kvmppc_spapr_tce_iommu_table, next);
+	if (stit) {
+		if (iommu_tce_clear_param_check(stit->tbl, ioba,
+					tce_value, npages))
+			return H_PARAMETER;
+
+		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+			unsigned long entry = ioba >> stit->tbl->it_page_shift;
+
+			for (i = 0; i < npages; ++i) {
+				ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
+						stit->tbl, entry + i);
+				if (ret)
+					return ret;
+			}
+		}
+	}
+
 	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
 		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
 
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index dc1c66fda941..018c7d94a575 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -178,11 +178,104 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
 EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
 
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm *kvm,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	struct mm_iommu_table_group_mem_t *mem = NULL;
+	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
+	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+
+	if (!pua)
+		return H_HARDWARE;
+
+	pua = (void *) vmalloc_to_phys(pua);
+	if (!pua)
+		return H_TOO_HARD;
+
+	mem = mm_iommu_lookup_rm(kvm->mm, *pua, pgsize);
+	if (!mem)
+		return H_TOO_HARD;
+
+	mm_iommu_mapped_dec(mem);
+
+	*pua = 0;
+
+	return H_SUCCESS;
+}
+
+static long kvmppc_rm_tce_iommu_unmap(struct kvm *kvm,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	enum dma_data_direction dir = DMA_NONE;
+	unsigned long hpa = 0;
+	long ret;
+
+	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
+		return H_HARDWARE;
+
+	if (dir == DMA_NONE)
+		return H_SUCCESS;
+
+	ret = kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
+	if (ret)
+		iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
+
+	return ret;
+}
+
+long kvmppc_rm_tce_iommu_map(struct kvm_vcpu *vcpu, struct iommu_table *tbl,
+		unsigned long entry, unsigned long gpa,
+		enum dma_data_direction dir)
+{
+	long ret;
+	unsigned long hpa = 0, ua;
+	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+	struct mm_iommu_table_group_mem_t *mem;
+
+	if (!pua)
+		/* it_userspace allocation might be delayed */
+		return H_TOO_HARD;
+
+	if (kvmppc_gpa_to_ua(vcpu->kvm, gpa, &ua, NULL))
+		return H_PARAMETER;
+
+	mem = mm_iommu_lookup_rm(vcpu->kvm->mm, ua, 1ULL << tbl->it_page_shift);
+	if (!mem)
+		return H_TOO_HARD;
+
+	if (mm_iommu_ua_to_hpa_rm(mem, ua, &hpa))
+		return H_HARDWARE;
+
+	pua = (void *) vmalloc_to_phys(pua);
+	if (!pua)
+		return H_HARDWARE;
+
+	if (mm_iommu_mapped_inc(mem))
+		return H_HARDWARE;
+
+	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
+	if (ret) {
+		mm_iommu_mapped_dec(mem);
+		return H_TOO_HARD;
+	}
+
+	if (dir != DMA_NONE)
+		kvmppc_rm_tce_iommu_mapped_dec(vcpu->kvm, tbl, entry);
+
+	*pua = ua;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(kvmppc_rm_tce_iommu_map);
+
 long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		unsigned long ioba, unsigned long tce)
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long ret;
+	struct kvmppc_spapr_tce_iommu_table *stit;
+	unsigned long entry, gpa;
+	enum dma_data_direction dir;
 
 	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
 	/* 	    liobn, ioba, tce); */
@@ -199,6 +292,33 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (ret != H_SUCCESS)
 		return ret;
 
+	stit = list_first_entry_or_null(&stt->iommu_tables,
+			struct kvmppc_spapr_tce_iommu_table, next);
+	if (stit) {
+		entry = ioba >> stit->tbl->it_page_shift;
+		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+		dir = iommu_tce_direction(tce);
+
+		if (dir == DMA_NONE) {
+			if (iommu_tce_clear_param_check(stit->tbl, ioba, 0, 1))
+				return H_PARAMETER;
+		} else {
+			if (iommu_tce_put_param_check(stit->tbl, ioba, gpa))
+				return H_PARAMETER;
+		}
+
+		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+			if (dir == DMA_NONE)
+				ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
+						stit->tbl, entry);
+			else
+				ret = kvmppc_rm_tce_iommu_map(vcpu, stit->tbl,
+						entry, gpa, dir);
+			if (ret != H_SUCCESS)
+				return ret;
+		}
+	}
+
 	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
 
 	return H_SUCCESS;
@@ -237,9 +357,10 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret = H_SUCCESS;
-	unsigned long tces, entry, tce, ua = 0;
+	unsigned long tces, entry, gpa, tce, ua = 0;
 	unsigned long *rmap = NULL;
 	bool prereg = false;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -303,17 +424,45 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 		}
 	}
 
+	stit = list_first_entry_or_null(&stt->iommu_tables,
+			struct kvmppc_spapr_tce_iommu_table, next);
+
 	for (i = 0; i < npages; ++i) {
 		tce = be64_to_cpu(((u64 *)tces)[i]);
 
 		ret = kvmppc_tce_validate(stt, tce);
 		if (ret != H_SUCCESS)
 			goto unlock_exit;
+
+		if (stit) {
+			gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+			ret = iommu_tce_put_param_check(stit->tbl,
+					ioba + (i << stit->tbl->it_page_shift),
+					gpa);
+			if (ret != H_SUCCESS)
+				goto unlock_exit;
+
+		}
 	}
 
 	for (i = 0; i < npages; ++i) {
 		tce = be64_to_cpu(((u64 *)tces)[i]);
 
+		if (stit) {
+			for (i = 0; i < npages; ++i) {
+				gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+				list_for_each_entry_lockless(stit,
+						&stt->iommu_tables, next) {
+					ret = kvmppc_rm_tce_iommu_map(vcpu,
+						stit->tbl, entry + i, gpa,
+						iommu_tce_direction(tce));
+					if (ret != H_SUCCESS)
+						goto unlock_exit;
+				}
+			}
+		}
+
 		kvmppc_tce_put(stt, entry + i, tce);
 	}
 
@@ -330,6 +479,8 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret;
+	struct kvmppc_spapr_tce_iommu_table *stit;
+
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -343,6 +494,25 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
 		return H_PARAMETER;
 
+	stit = list_first_entry_or_null(&stt->iommu_tables,
+			struct kvmppc_spapr_tce_iommu_table, next);
+	if (stit) {
+		if (iommu_tce_clear_param_check(stit->tbl, ioba,
+					tce_value, npages))
+			return H_PARAMETER;
+
+		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+			unsigned long entry = ioba >> stit->tbl->it_page_shift;
+
+			for (i = 0; i < npages; ++i) {
+				ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
+						stit->tbl, entry + i);
+				if (ret)
+					return ret;
+			}
+		}
+	}
+
 	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
 		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
 
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index cd892dec7cb6..f3127dc87912 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -536,6 +536,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 #ifdef CONFIG_PPC_BOOK3S_64
 	case KVM_CAP_SPAPR_TCE:
 	case KVM_CAP_SPAPR_TCE_64:
+		/* fallthrough */
+	case KVM_CAP_SPAPR_TCE_VFIO:
 	case KVM_CAP_PPC_RTAS:
 	case KVM_CAP_PPC_FIXUP_HCALL:
 	case KVM_CAP_PPC_ENABLE_HCALL:
diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
index d32f239eb471..2b7dc22265fe 100644
--- a/virt/kvm/vfio.c
+++ b/virt/kvm/vfio.c
@@ -20,6 +20,10 @@
 #include <linux/vfio.h>
 #include "vfio.h"
 
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+#include <asm/kvm_ppc.h>
+#endif
+
 struct kvm_vfio_group {
 	struct list_head node;
 	struct vfio_group *vfio_group;
@@ -211,6 +215,9 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
 
 		mutex_unlock(&kv->lock);
 
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+		kvm_spapr_tce_release_iommu_group(dev->kvm, vfio_group);
+#endif
 		kvm_vfio_group_set_kvm(vfio_group, NULL);
 
 		kvm_vfio_group_put_external_user(vfio_group);
@@ -218,6 +225,53 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
 		kvm_vfio_update_coherency(dev);
 
 		return ret;
+
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
+		struct kvm_vfio_spapr_tce param;
+		unsigned long minsz;
+		struct kvm_vfio *kv = dev->private;
+		struct vfio_group *vfio_group;
+		struct kvm_vfio_group *kvg;
+		struct fd f;
+
+		minsz = offsetofend(struct kvm_vfio_spapr_tce, tablefd);
+
+		if (copy_from_user(&param, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (param.argsz < minsz || param.flags)
+			return -EINVAL;
+
+		f = fdget(param.groupfd);
+		if (!f.file)
+			return -EBADF;
+
+		vfio_group = kvm_vfio_group_get_external_user(f.file);
+		fdput(f);
+
+		if (IS_ERR(vfio_group))
+			return PTR_ERR(vfio_group);
+
+		ret = -ENOENT;
+
+		mutex_lock(&kv->lock);
+
+		list_for_each_entry(kvg, &kv->group_list, node) {
+			if (kvg->vfio_group != vfio_group)
+				continue;
+
+			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
+					param.tablefd, vfio_group);
+
+			break;
+		}
+
+		mutex_unlock(&kv->lock);
+
+		return ret;
+	}
+#endif /* CONFIG_SPAPR_TCE_IOMMU */
 	}
 
 	return -ENXIO;
@@ -242,6 +296,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
 		switch (attr->attr) {
 		case KVM_DEV_VFIO_GROUP_ADD:
 		case KVM_DEV_VFIO_GROUP_DEL:
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
+#endif
 			return 0;
 		}
 
@@ -257,6 +314,9 @@ static void kvm_vfio_destroy(struct kvm_device *dev)
 	struct kvm_vfio_group *kvg, *tmp;
 
 	list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+		kvm_spapr_tce_release_iommu_group(dev->kvm, kvg->vfio_group);
+#endif
 		kvm_vfio_group_set_kvm(kvg->vfio_group, NULL);
 		kvm_vfio_group_put_external_user(kvg->vfio_group);
 		list_del(&kvg->node);
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH kernel v4 08/10] KVM: PPC: Separate TCE validation from update
  2017-02-07  7:17   ` Alexey Kardashevskiy
  (?)
@ 2017-02-09  3:51     ` David Gibson
  -1 siblings, 0 replies; 49+ messages in thread
From: David Gibson @ 2017-02-09  3:51 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Paul Mackerras, Alex Williamson, linuxppc-dev, kvm, kvm-ppc

[-- Attachment #1: Type: text/plain, Size: 3040 bytes --]

On Tue, Feb 07, 2017 at 06:17:09PM +1100, Alexey Kardashevskiy wrote:
> For the emulated devices it does not matter much if we get a broken TCE
> half way handling a TCE list but for VFIO it will matter as it has
> more chances to fail so we try to do our best and check as much as we
> can before proceeding.
> 
> This separates a guest view table update from validation. No change in
> behavior is expected.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  arch/powerpc/kvm/book3s_64_vio.c    | 8 ++++++++
>  arch/powerpc/kvm/book3s_64_vio_hv.c | 8 ++++++--
>  2 files changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> index 15df8ae627d9..9a7b7fca5e84 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -282,6 +282,14 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  		ret = kvmppc_tce_validate(stt, tce);
>  		if (ret != H_SUCCESS)
>  			goto unlock_exit;
> +	}
> +
> +	for (i = 0; i < npages; ++i) {
> +		if (get_user(tce, tces + i)) {
> +			ret = H_TOO_HARD;
> +			goto unlock_exit;
> +		}
> +		tce = be64_to_cpu(tce);

This doesn't look safe.  The contents of user memory could change
between the two get_user()s, meaning that you're no longer guaranteed
a TCE loaded into kernel has been validated at all.

I think you need to either:

    a) Make sure things are safe against a bad TCE being loaded into a TCE
    table and move all validation to where the TCE is used, rather
    than loaded

or
    b) Copy the whole set of indirect entries to a temporary in-kernel
       buffer, then validate, then load into the actual TCE table.
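
For illustration only, a very rough, untested sketch of option (b)
against the virtual mode handler (the "tcebuf" name is made up here;
the 512-entry limit matches the existing H_PUT_TCE_INDIRECT check, and
a real implementation would probably not put a 4KB buffer on the stack):

	u64 tcebuf[512];	/* H_PUT_TCE_INDIRECT takes at most 512 TCEs */

	if (copy_from_user(tcebuf, tces, npages * sizeof(u64))) {
		ret = H_TOO_HARD;
		goto unlock_exit;
	}

	for (i = 0; i < npages; ++i) {
		ret = kvmppc_tce_validate(stt, be64_to_cpu(tcebuf[i]));
		if (ret != H_SUCCESS)
			goto unlock_exit;
	}

	for (i = 0; i < npages; ++i)
		kvmppc_tce_put(stt, entry + i, be64_to_cpu(tcebuf[i]));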

>  
>  		kvmppc_tce_put(stt, entry + i, tce);
>  	}
> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> index 918af76ab2b6..f8a54b7c788e 100644
> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -237,7 +237,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret = H_SUCCESS;
> -	unsigned long tces, entry, ua = 0;
> +	unsigned long tces, entry, tce, ua = 0;
>  	unsigned long *rmap = NULL;
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> @@ -279,11 +279,15 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	}
>  
>  	for (i = 0; i < npages; ++i) {
> -		unsigned long tce = be64_to_cpu(((u64 *)tces)[i]);
> +		tce = be64_to_cpu(((u64 *)tces)[i]);
>  
>  		ret = kvmppc_tce_validate(stt, tce);
>  		if (ret != H_SUCCESS)
>  			goto unlock_exit;
> +	}
> +
> +	for (i = 0; i < npages; ++i) {
> +		tce = be64_to_cpu(((u64 *)tces)[i]);

Same problem here.

>  
>  		kvmppc_tce_put(stt, entry + i, tce);
>  	}

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH kernel v4 09/10] KVM: PPC: Use preregistered memory API to access TCE list
  2017-02-07  7:17   ` Alexey Kardashevskiy
  (?)
@ 2017-02-09  4:00     ` David Gibson
  -1 siblings, 0 replies; 49+ messages in thread
From: David Gibson @ 2017-02-09  4:00 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Paul Mackerras, Alex Williamson, linuxppc-dev, kvm, kvm-ppc

[-- Attachment #1: Type: text/plain, Size: 4561 bytes --]

On Tue, Feb 07, 2017 at 06:17:10PM +1100, Alexey Kardashevskiy wrote:
> VFIO on sPAPR already implements guest memory pre-registration
> when the entire guest RAM gets pinned. This can be used to translate
> the physical address of a guest page containing the TCE list
> from H_PUT_TCE_INDIRECT.
> 
> This makes use of the pre-registrered memory API to access TCE list
> pages in order to avoid unnecessary locking on the KVM memory
> reverse map as we know that all of guest memory is pinned and
> we have a flat array mapping GPA to HPA which makes it simpler and
> quicker to index into that array (even with looking up the
> kernel page tables in vmalloc_to_phys) than it is to find the memslot,
> lock the rmap entry, look up the user page tables, and unlock the rmap
> entry. Note that the rmap pointer is initialized to NULL
> where declared (not in this patch).
> 
> If a requested chunk of memory has not been preregistered, this will
> fall back to non-preregistered case and lock rmap.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
> Changes:
> v4:
> * removed oneline inlines
> * now falls back to locking rmap if TCE list is not in preregistered memory
> 
> v2:
> * updated the commit log with David's comment
> ---
>  arch/powerpc/kvm/book3s_64_vio_hv.c | 58 +++++++++++++++++++++++++++----------
>  1 file changed, 42 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> index f8a54b7c788e..dc1c66fda941 100644
> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -239,6 +239,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	long i, ret = H_SUCCESS;
>  	unsigned long tces, entry, tce, ua = 0;
>  	unsigned long *rmap = NULL;
> +	bool prereg = false;
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -259,23 +260,47 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	if (ret != H_SUCCESS)
>  		return ret;
>  
> -	if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
> -		return H_TOO_HARD;
> +	if (mm_iommu_preregistered(vcpu->kvm->mm)) {
> +		/*
> +		 * We get here if guest memory was pre-registered which
> +		 * is normally VFIO case and gpa->hpa translation does not
> +		 * depend on hpt.
> +		 */
> +		struct mm_iommu_table_group_mem_t *mem;
>  
> -	rmap = (void *) vmalloc_to_phys(rmap);
> +		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
> +			return H_TOO_HARD;
>  
> -	/*
> -	 * Synchronize with the MMU notifier callbacks in
> -	 * book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
> -	 * While we have the rmap lock, code running on other CPUs
> -	 * cannot finish unmapping the host real page that backs
> -	 * this guest real page, so we are OK to access the host
> -	 * real page.
> -	 */
> -	lock_rmap(rmap);
> -	if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
> -		ret = H_TOO_HARD;
> -		goto unlock_exit;
> +		mem = mm_iommu_lookup_rm(vcpu->kvm->mm, ua, IOMMU_PAGE_SIZE_4K);
> +		if (mem)
> +			prereg = mm_iommu_ua_to_hpa_rm(mem, ua, &tces) == 0;
> +	}
> +
> +	if (!prereg) {
> +		/*
> +		 * This is usually a case of a guest with emulated devices only
> +		 * when TCE list is not in preregistered memory.
> +		 * We do not require memory to be preregistered in this case
> +		 * so lock rmap and do __find_linux_pte_or_hugepte().
> +		 */
> +		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
> +			return H_TOO_HARD;
> +
> +		rmap = (void *) vmalloc_to_phys(rmap);
> +
> +		/*
> +		 * Synchronize with the MMU notifier callbacks in
> +		 * book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
> +		 * While we have the rmap lock, code running on other CPUs
> +		 * cannot finish unmapping the host real page that backs
> +		 * this guest real page, so we are OK to access the host
> +		 * real page.
> +		 */
> +		lock_rmap(rmap);
> +		if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
> +			ret = H_TOO_HARD;
> +			goto unlock_exit;
> +		}
>  	}
>  
>  	for (i = 0; i < npages; ++i) {
> @@ -293,7 +318,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	}
>  
>  unlock_exit:
> -	unlock_rmap(rmap);
> +	if (rmap)
> +		unlock_rmap(rmap);
>  
>  	return ret;
>  }

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH kernel v4 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
  2017-02-07  7:17   ` Alexey Kardashevskiy
  (?)
@ 2017-02-09  6:41     ` David Gibson
  -1 siblings, 0 replies; 49+ messages in thread
From: David Gibson @ 2017-02-09  6:41 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Paul Mackerras, Alex Williamson, linuxppc-dev, kvm, kvm-ppc

[-- Attachment #1: Type: text/plain, Size: 32625 bytes --]

On Tue, Feb 07, 2017 at 06:17:11PM +1100, Alexey Kardashevskiy wrote:
> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO
> without passing them to user space which saves time on switching
> to user space and back.
> 
> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> KVM tries to handle a TCE request in the real mode, if failed
> it passes the request to the virtual mode to complete the operation.
> If it a virtual mode handler fails, the request is passed to
> the user space; this is not expected to happen though.
> 
> To avoid dealing with page use counters (which is tricky in real mode),
> this only accelerates SPAPR TCE IOMMU v2 clients which are required
> to pre-register the userspace memory. The very first TCE request will
> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> of the TCE table (iommu_table::it_userspace) is not allocated till
> the very first mapping happens and we cannot call vmalloc in real mode.
> 
> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> and associates a physical IOMMU table with the SPAPR TCE table (which
> is a guest view of the hardware IOMMU table). The iommu_table object
> is cached and referenced so we do not have to look up for it in real mode.
> 
> This does not implement the UNSET counterpart as there is no use for it -
> once the acceleration is enabled, the existing userspace won't
> disable it unless a VFIO container is destroyed; this adds necessary
> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> 
> As this creates a descriptor per IOMMU table-LIOBN couple (called
> kvmppc_spapr_tce_iommu_table), it is possible to have several
> descriptors with the same iommu_table (hardware IOMMU table) attached
> to the same LIOBN; we do not remove duplicates though as
> iommu_table_ops::exchange not just update a TCE entry (which is
> shared among IOMMU groups) but also invalidates the TCE cache
> (one per IOMMU group).
> 
> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> space.
> 
> This finally makes use of vfio_external_user_iommu_id() which was
> introduced quite some time ago and was considered for removal.
> 
> Tests show that this patch increases transmission speed from 220MB/s
> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v4:
> * added note to the commit log about allowing multiple updates of
> the same IOMMU table;
> * instead of checking for if any memory was preregistered, this
> returns H_TOO_HARD if a specific page was not;
> * fixed comments from v3 about error handling in many places;
> * simplified TCE handlers and merged IOMMU parts inline - for example,
> there used to be kvmppc_h_put_tce_iommu(), now it is merged into
> kvmppc_h_put_tce(); this allows to check IOBA boundaries against
> the first attached table only (makes the code simpler);
> 
> v3:
> * simplified not to use VFIO group notifiers
> * reworked cleanup, should be cleaner/simpler now
> 
> v2:
> * reworked to use new VFIO notifiers
> * now same iommu_table may appear in the list several times, to be fixed later
> ---
> 
> This has separate copies of handlers for real and virtual modes as
> in fact H_PUT_TCE and H_STUFF_TCE could share a lot (common helpers
> would take a "realmode" flag) but H_PUT_TCE_INDIRECT uses get_user()
> in virtual mode and direct access in real mode and having a common
> helper for it would make things uglier imho.
> 
> 
> ---
>  Documentation/virtual/kvm/devices/vfio.txt |  22 +-
>  arch/powerpc/include/asm/kvm_host.h        |   8 +
>  arch/powerpc/include/asm/kvm_ppc.h         |   4 +
>  include/uapi/linux/kvm.h                   |   8 +
>  arch/powerpc/kvm/book3s_64_vio.c           | 319 ++++++++++++++++++++++++++++-
>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 172 +++++++++++++++-
>  arch/powerpc/kvm/powerpc.c                 |   2 +
>  virt/kvm/vfio.c                            |  60 ++++++
>  8 files changed, 590 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> index ef51740c67ca..f95d867168ea 100644
> --- a/Documentation/virtual/kvm/devices/vfio.txt
> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> @@ -16,7 +16,25 @@ Groups:
>  
>  KVM_DEV_VFIO_GROUP attributes:
>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> +	kvm_device_attr.addr points to an int32_t file descriptor
> +	for the VFIO group.
>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
> +	kvm_device_attr.addr points to an int32_t file descriptor
> +	for the VFIO group.
> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
> +	allocated by sPAPR KVM.
> +	kvm_device_attr.addr points to a struct:
>  
> -For each, kvm_device_attr.addr points to an int32_t file descriptor
> -for the VFIO group.
> +	struct kvm_vfio_spapr_tce {
> +		__u32	argsz;
> +		__u32	flags;
> +		__s32	groupfd;
> +		__s32	tablefd;
> +	};
> +
> +	where
> +	@argsz is the size of kvm_vfio_spapr_tce_liobn;
> +	@flags are not supported now, must be zero;
> +	@groupfd is a file descriptor for a VFIO group;
> +	@tablefd is a file descriptor for a TCE table allocated via
> +		KVM_CREATE_SPAPR_TCE.
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index e59b172666cd..a827006941f8 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
>  	atomic_t refcnt;
>  };
>  
> +struct kvmppc_spapr_tce_iommu_table {
> +	struct rcu_head rcu;
> +	struct list_head next;
> +	struct vfio_group *group;
> +	struct iommu_table *tbl;
> +};
> +
>  struct kvmppc_spapr_tce_table {
>  	struct list_head list;
>  	struct kvm *kvm;
> @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
>  	u32 page_shift;
>  	u64 offset;		/* in pages */
>  	u64 size;		/* window size in pages */
> +	struct list_head iommu_tables;
>  	struct page *pages[0];
>  };
>  
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> index 37bc9e7e90ba..da1410bd6b36 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -163,6 +163,10 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
>  extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
>  			struct kvm_memory_slot *memslot, unsigned long porder);
>  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
> +		struct vfio_group *group);
> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
> +		struct vfio_group *group);
>  
>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  				struct kvm_create_spapr_tce_64 *args);
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index a2c9bb5a0ead..cdfa01169bd2 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1076,6 +1076,7 @@ struct kvm_device_attr {
>  #define  KVM_DEV_VFIO_GROUP			1
>  #define   KVM_DEV_VFIO_GROUP_ADD			1
>  #define   KVM_DEV_VFIO_GROUP_DEL			2
> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
>  
>  enum kvm_device_type {
>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
> @@ -1097,6 +1098,13 @@ enum kvm_device_type {
>  	KVM_DEV_TYPE_MAX,
>  };
>  
> +struct kvm_vfio_spapr_tce {
> +	__u32	argsz;
> +	__u32	flags;
> +	__s32	groupfd;
> +	__s32	tablefd;
> +};
> +
>  /*
>   * ioctls for VM fds
>   */
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> index 9a7b7fca5e84..cb0469151e35 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -27,6 +27,10 @@
>  #include <linux/hugetlb.h>
>  #include <linux/list.h>
>  #include <linux/anon_inodes.h>
> +#include <linux/iommu.h>
> +#include <linux/file.h>
> +#include <linux/vfio.h>
> +#include <linux/module.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/kvm_ppc.h>
> @@ -39,6 +43,36 @@
>  #include <asm/udbg.h>
>  #include <asm/iommu.h>
>  #include <asm/tce.h>
> +#include <asm/mmu_context.h>
> +
> +static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
> +{
> +	void (*fn)(struct vfio_group *);
> +
> +	fn = symbol_get(vfio_group_put_external_user);
> +	if (WARN_ON(!fn))
> +		return;
> +
> +	fn(vfio_group);
> +
> +	symbol_put(vfio_group_put_external_user);
> +}
> +
> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> +{
> +	int (*fn)(struct vfio_group *);
> +	int ret = -1;
> +
> +	fn = symbol_get(vfio_external_user_iommu_id);
> +	if (!fn)
> +		return ret;
> +
> +	ret = fn(vfio_group);
> +
> +	symbol_put(vfio_external_user_iommu_id);
> +
> +	return ret;
> +}
>  
>  static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
>  {
> @@ -90,6 +124,123 @@ static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
>  	return ret;
>  }
>  
> +static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
> +{
> +	struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
> +			struct kvmppc_spapr_tce_iommu_table, rcu);
> +
> +	iommu_table_put(stit->tbl);
> +	kvm_vfio_group_put_external_user(stit->group);
> +
> +	kfree(stit);
> +}
> +
> +static void kvm_spapr_tce_liobn_release_iommu_group(
> +		struct kvmppc_spapr_tce_table *stt,
> +		struct vfio_group *group)
> +{
> +	struct kvmppc_spapr_tce_iommu_table *stit, *tmp;
> +
> +	list_for_each_entry_safe(stit, tmp, &stt->iommu_tables, next) {
> +		if (group && (stit->group != group))
> +			continue;
> +
> +		list_del_rcu(&stit->next);
> +
> +		call_rcu(&stit->rcu, kvm_spapr_tce_iommu_table_free);
> +	}
> +}
> +
> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
> +		struct vfio_group *group)
> +{
> +	struct kvmppc_spapr_tce_table *stt;
> +
> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list)
> +		kvm_spapr_tce_liobn_release_iommu_group(stt, group);
> +}
> +
> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
> +		struct vfio_group *group)
> +{
> +	struct kvmppc_spapr_tce_table *stt = NULL;
> +	bool found = false;
> +	struct iommu_table *tbl = NULL;
> +	struct iommu_table_group *table_group;
> +	long i, ret = 0;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
> +	struct fd f;
> +	int group_id;
> +	struct iommu_group *grp;
> +
> +	group_id = kvm_vfio_external_user_iommu_id(group);
> +	grp = iommu_group_get_by_id(group_id);
> +	if (!grp)
> +		return -EFAULT;

EFAULT doesn't look right, that usually means userspace has given us
a bad address.  What does failure to look up the iommu group by id
mean here?


> +
> +	f = fdget(tablefd);
> +	if (!f.file) {
> +		ret = -EBADF;
> +		goto put_exit;
> +	}
> +
> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
> +		if (stt == f.file->private_data) {
> +			found = true;
> +			break;
> +		}
> +	}
> +
> +	fdput(f);
> +
> +	if (!found) {
> +		ret = -ENODEV;

ENODEV doesn't look right either.  That generally means you're trying
to use a device or facility that doesn't exist.  This case just means
you've passed a file handle that either isn't a TCE table at all, or
or one associated with a different VM.  -EINVAL, I guess, overloaded
as it is.

> +		goto put_exit;

Don't you need to put the table fd as well as the iommu group which
you put in that exit path?

> +	}
> +
> +	table_group = iommu_group_get_iommudata(grp);
> +	if (WARN_ON(!table_group)) {
> +		ret = -EFAULT;
> +		goto put_exit;

Again don't you need to put the table fd as well.

> +	}
> +
> +	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
> +		struct iommu_table *tbltmp = table_group->tables[i];
> +
> +		if (!tbltmp)
> +			continue;
> +
> +		/*
> +		 * Make sure hardware table parameters are exactly the same;
> +		 * this is used in the TCE handlers where boundary checks
> +		 * use only the first attached table.
> +		 */
> +		if ((tbltmp->it_page_shift == stt->page_shift) &&
> +				(tbltmp->it_offset == stt->offset) &&
> +				(tbltmp->it_size == stt->size)) {
> +			tbl = tbltmp;
> +			break;
> +		}
> +	}
> +	if (!tbl) {
> +		ret = -ENODEV;

Again, ENODEV doesn't seem right.  Here the problem is that the host
hardware constraints don't match the guest hardware constraints.
Hmm.  EIO?  ENOSPC?

> +		goto put_exit;
> +	}
> +
> +	iommu_table_get(tbl);
> +
> +	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
> +	stit->tbl = tbl;
> +	stit->group = group;
> +
> +	list_add_rcu(&stit->next, &stt->iommu_tables);

So if you add the same group to the same liobn multiple times, you'll
get multiple identical entries in this list.

I guess that's mostly harmless... although.. does it allow the user to
force the allocation of arbitrary amounts of kernel memory in that
list?
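
If you did want to guard against that, a quick scan of the list before
taking the extra references might be enough - an untested sketch, reusing
the existing locals from kvm_spapr_tce_attach_iommu_group() (rcu locking
glossed over):

	/* sketch only: refuse to attach the same (table, group) pair twice */
	list_for_each_entry_rcu(stit, &stt->iommu_tables, next) {
		if (stit->tbl == tbl && stit->group == group) {
			ret = -EBUSY;
			goto put_exit;
		}
	}

That would also put a bound on the kernel memory the list can consume.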

> +put_exit:
> +	iommu_group_put(grp);
> +
> +	return ret;
> +}
> +
>  static void release_spapr_tce_table(struct rcu_head *head)
>  {
>  	struct kvmppc_spapr_tce_table *stt = container_of(head,
> @@ -132,6 +283,8 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
>  
>  	list_del_rcu(&stt->list);
>  
> +	kvm_spapr_tce_liobn_release_iommu_group(stt, NULL /* release all */);
> +
>  	kvm_put_kvm(stt->kvm);
>  
>  	kvmppc_account_memlimit(
> @@ -181,6 +334,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  	stt->offset = args->offset;
>  	stt->size = size;
>  	stt->kvm = kvm;
> +	INIT_LIST_HEAD_RCU(&stt->iommu_tables);
>  
>  	for (i = 0; i < npages; i++) {
>  		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
> @@ -209,11 +363,94 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  	return ret;
>  }
>  
> +static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	struct mm_iommu_table_group_mem_t *mem = NULL;
> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +
> +	if (!pua)
> +		return H_HARDWARE;

What could trigger this error?  Should it be a WARN_ON?
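
If it really can't happen short of a bug elsewhere, perhaps simply
(sketch):

	if (WARN_ON_ONCE(!pua))
		return H_HARDWARE;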

> +	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
> +	if (!mem)
> +		return H_TOO_HARD;
> +
> +	mm_iommu_mapped_dec(mem);
> +
> +	*pua = 0;
> +
> +	return H_SUCCESS;
> +}
> +
> +static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	enum dma_data_direction dir = DMA_NONE;
> +	unsigned long hpa = 0;
> +	long ret;
> +
> +	if (iommu_tce_xchg(tbl, entry, &hpa, &dir))
> +		return H_HARDWARE;
> +
> +	if (dir == DMA_NONE)
> +		return H_SUCCESS;
> +
> +	ret = kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
> +	if (ret != H_SUCCESS)
> +		iommu_tce_xchg(tbl, entry, &hpa, &dir);
> +
> +	return ret;
> +}
> +
> +long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
> +		unsigned long entry, unsigned long gpa,
> +		enum dma_data_direction dir)
> +{
> +	long ret;
> +	unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +	struct mm_iommu_table_group_mem_t *mem;
> +
> +	if (!pua)
> +		/* it_userspace allocation might be delayed */
> +		return H_TOO_HARD;
> +
> +	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
> +		return H_PARAMETER;
> +
> +	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
> +	if (!mem)
> +		return H_TOO_HARD;
> +
> +	if (mm_iommu_ua_to_hpa(mem, ua, &hpa))
> +		return H_HARDWARE;

IIUC this would happen if qemu had failed to preregister all of guest
RAM, making this indeed an H_HARDWARE.

> +	if (mm_iommu_mapped_inc(mem))
> +		return H_HARDWARE;

I'm less clear on when this one would happen.

> +
> +	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
> +	if (ret) {
> +		mm_iommu_mapped_dec(mem);
> +		return H_TOO_HARD;
> +	}
> +
> +	if (dir != DMA_NONE)
> +		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
> +
> +	*pua = ua;
> +
> +	return 0;
> +}
> +
>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		      unsigned long ioba, unsigned long tce)
>  {
>  	struct kvmppc_spapr_tce_table *stt;
> -	long ret;
> +	long ret, idx;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
> +	unsigned long entry, gpa;
> +	enum dma_data_direction dir;
>  
>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>  	/* 	    liobn, ioba, tce); */
> @@ -230,6 +467,36 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  	if (ret != H_SUCCESS)
>  		return ret;
>  
> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> +			struct kvmppc_spapr_tce_iommu_table, next);
> +	if (stit) {
> +		entry = ioba >> stit->tbl->it_page_shift;
> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +		dir = iommu_tce_direction(tce);
> +
> +		if (dir == DMA_NONE) {
> +			if (iommu_tce_clear_param_check(stit->tbl, ioba, 0, 1))
> +				return H_PARAMETER;
> +		} else {
> +			if (iommu_tce_put_param_check(stit->tbl, ioba, gpa))

Any way you could make these param check functions based on stt
instead of stit->tbl?  That would let you do them before checking if
there are any hw tables to update, avoiding the somewhat awkward
	if (at least one)
		for (each one)
construct.
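
Roughly what I have in mind - kvmppc_stt_clear_check() and
kvmppc_stt_put_check() are invented names, this is only a sketch:

	entry = ioba >> stt->page_shift;
	gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
	dir = iommu_tce_direction(tce);

	/* hypothetical stt-based checks replacing iommu_tce_*_param_check() */
	if (dir == DMA_NONE)
		ret = kvmppc_stt_clear_check(stt, ioba, 0, 1);
	else
		ret = kvmppc_stt_put_check(stt, ioba, gpa);
	if (ret != H_SUCCESS)
		return ret;

	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
		/* map or unmap against stit->tbl exactly as you do now */
		...
	}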

> +				return H_PARAMETER;
> +		}
> +
> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +			if (dir == DMA_NONE) {
> +				ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
> +						stit->tbl, entry);
> +			} else {
> +				idx = srcu_read_lock(&vcpu->kvm->srcu);
> +				ret = kvmppc_tce_iommu_map(vcpu->kvm, stit->tbl,
> +						entry, gpa, dir);
> +				srcu_read_unlock(&vcpu->kvm->srcu, idx);
> +			}
> +			if (ret != H_SUCCESS)
> +				return ret;

Doesn't this error path need to clean up for the case where you
managed to update some backing TCE tables, but then failed later ones?
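
Even a crude unwind would help - something along these lines (sketch only,
stit2 is invented, and it assumes clearing the entry is an acceptable way
to roll back):

	if (ret != H_SUCCESS) {
		struct kvmppc_spapr_tce_iommu_table *stit2;

		/* clear the entry in the tables we already updated */
		list_for_each_entry_lockless(stit2, &stt->iommu_tables, next) {
			if (stit2 == stit)
				break;
			kvmppc_tce_iommu_unmap(vcpu->kvm, stit2->tbl, entry);
		}
		return ret;
	}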

> +		}
> +	}
> +
>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
>  
>  	return H_SUCCESS;
> @@ -242,9 +509,10 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret = H_SUCCESS, idx;
> -	unsigned long entry, ua = 0;
> +	unsigned long entry, gpa, ua = 0;
>  	u64 __user *tces;
>  	u64 tce;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -272,6 +540,9 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	}
>  	tces = (u64 __user *) ua;
>  
> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> +			struct kvmppc_spapr_tce_iommu_table, next);
> +
>  	for (i = 0; i < npages; ++i) {
>  		if (get_user(tce, tces + i)) {
>  			ret = H_TOO_HARD;
> @@ -282,6 +553,15 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  		ret = kvmppc_tce_validate(stt, tce);
>  		if (ret != H_SUCCESS)
>  			goto unlock_exit;
> +
> +		if (stit) {
> +			gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +			ret = iommu_tce_put_param_check(stit->tbl,
> +					ioba + (i << stit->tbl->it_page_shift),
> +					gpa);
> +			if (ret != H_SUCCESS)
> +				goto unlock_exit;
> +		}
>  	}
>  
>  	for (i = 0; i < npages; ++i) {
> @@ -291,6 +571,21 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  		}
>  		tce = be64_to_cpu(tce);
>  
> +		if (stit) {
> +			for (i = 0; i < npages; ++i) {
> +				gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +
> +				list_for_each_entry_lockless(stit,
> +						&stt->iommu_tables, next) {
> +					ret = kvmppc_tce_iommu_map(vcpu->kvm,
> +						stit->tbl, entry + i, gpa,
> +						iommu_tce_direction(tce));
> +					if (ret != H_SUCCESS)
> +						goto unlock_exit;
> +				}

Um.. what value will this for_each leave in stit after completion?  I
suspect it will be something bogus, which means re-using stit in the
next 0..npages loop iteration won't be safe (you only initialize stit
with the first entry outside that loop).
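
Using a dedicated iterator for the walk and leaving the outer stit as a
plain "the list is non-empty" hint would avoid that, e.g. (sketch, stit2
invented):

	struct kvmppc_spapr_tce_iommu_table *stit2;

	list_for_each_entry_lockless(stit2, &stt->iommu_tables, next) {
		ret = kvmppc_tce_iommu_map(vcpu->kvm, stit2->tbl,
				entry + i, gpa, iommu_tce_direction(tce));
		if (ret != H_SUCCESS)
			goto unlock_exit;
	}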

> +			}
> +		}
> +
>  		kvmppc_tce_put(stt, entry + i, tce);
>  	}
>  
> @@ -307,6 +602,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -320,6 +616,25 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>  		return H_PARAMETER;
>  
> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> +			struct kvmppc_spapr_tce_iommu_table, next);
> +	if (stit) {
> +		if (iommu_tce_clear_param_check(stit->tbl, ioba,
> +					tce_value, npages))
> +			return H_PARAMETER;
> +
> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +			unsigned long entry = ioba >> stit->tbl->it_page_shift;
> +
> +			for (i = 0; i < npages; ++i) {
> +				ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
> +						stit->tbl, entry + i);
> +				if (ret)
> +					return ret;

Again do you need some sort of cleanup for partial completion?


> +			}
> +		}
> +	}
> +
>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>  
> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> index dc1c66fda941..018c7d94a575 100644
> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -178,11 +178,104 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
>  EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
>  
>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> +static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm *kvm,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	struct mm_iommu_table_group_mem_t *mem = NULL;
> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +
> +	if (!pua)
> +		return H_HARDWARE;
> +
> +	pua = (void *) vmalloc_to_phys(pua);
> +	if (!pua)
> +		return H_TOO_HARD;
> +
> +	mem = mm_iommu_lookup_rm(kvm->mm, *pua, pgsize);
> +	if (!mem)
> +		return H_TOO_HARD;
> +
> +	mm_iommu_mapped_dec(mem);
> +
> +	*pua = 0;
> +
> +	return H_SUCCESS;
> +}
> +
> +static long kvmppc_rm_tce_iommu_unmap(struct kvm *kvm,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	enum dma_data_direction dir = DMA_NONE;
> +	unsigned long hpa = 0;
> +	long ret;
> +
> +	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
> +		return H_HARDWARE;
> +
> +	if (dir == DMA_NONE)
> +		return H_SUCCESS;
> +
> +	ret = kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
> +	if (ret)
> +		iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> +
> +	return ret;
> +}
> +
> +long kvmppc_rm_tce_iommu_map(struct kvm_vcpu *vcpu, struct iommu_table *tbl,
> +		unsigned long entry, unsigned long gpa,
> +		enum dma_data_direction dir)
> +{
> +	long ret;
> +	unsigned long hpa = 0, ua;
> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +	struct mm_iommu_table_group_mem_t *mem;
> +
> +	if (!pua)
> +		/* it_userspace allocation might be delayed */
> +		return H_TOO_HARD;
> +
> +	if (kvmppc_gpa_to_ua(vcpu->kvm, gpa, &ua, NULL))
> +		return H_PARAMETER;
> +
> +	mem = mm_iommu_lookup_rm(vcpu->kvm->mm, ua, 1ULL << tbl->it_page_shift);
> +	if (!mem)
> +		return H_TOO_HARD;
> +
> +	if (mm_iommu_ua_to_hpa_rm(mem, ua, &hpa))
> +		return H_HARDWARE;
> +
> +	pua = (void *) vmalloc_to_phys(pua);
> +	if (!pua)
> +		return H_HARDWARE;

What circumstances can this fail under?  Does it need to be H_TOO_HARD instead?

> +
> +	if (mm_iommu_mapped_inc(mem))
> +		return H_HARDWARE;
> +
> +	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> +	if (ret) {
> +		mm_iommu_mapped_dec(mem);
> +		return H_TOO_HARD;
> +	}
> +
> +	if (dir != DMA_NONE)
> +		kvmppc_rm_tce_iommu_mapped_dec(vcpu->kvm, tbl, entry);
> +
> +	*pua = ua;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(kvmppc_rm_tce_iommu_map);
> +
>  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		unsigned long ioba, unsigned long tce)
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long ret;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
> +	unsigned long entry, gpa;
> +	enum dma_data_direction dir;
>  
>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>  	/* 	    liobn, ioba, tce); */
> @@ -199,6 +292,33 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  	if (ret != H_SUCCESS)
>  		return ret;
>  
> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> +			struct kvmppc_spapr_tce_iommu_table, next);
> +	if (stit) {
> +		entry = ioba >> stit->tbl->it_page_shift;
> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +		dir = iommu_tce_direction(tce);
> +
> +		if (dir == DMA_NONE) {
> +			if (iommu_tce_clear_param_check(stit->tbl, ioba, 0, 1))
> +				return H_PARAMETER;
> +		} else {
> +			if (iommu_tce_put_param_check(stit->tbl, ioba, gpa))
> +				return H_PARAMETER;
> +		}
> +
> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +			if (dir == DMA_NONE)
> +				ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
> +						stit->tbl, entry);
> +			else
> +				ret = kvmppc_rm_tce_iommu_map(vcpu, stit->tbl,
> +						entry, gpa, dir);
> +			if (ret != H_SUCCESS)
> +				return ret;
> +		}
> +	}
> +
>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
>  
>  	return H_SUCCESS;
> @@ -237,9 +357,10 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret = H_SUCCESS;
> -	unsigned long tces, entry, tce, ua = 0;
> +	unsigned long tces, entry, gpa, tce, ua = 0;
>  	unsigned long *rmap = NULL;
>  	bool prereg = false;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -303,17 +424,45 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  		}
>  	}
>  
> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> +			struct kvmppc_spapr_tce_iommu_table, next);
> +
>  	for (i = 0; i < npages; ++i) {
>  		tce = be64_to_cpu(((u64 *)tces)[i]);
>  
>  		ret = kvmppc_tce_validate(stt, tce);
>  		if (ret != H_SUCCESS)
>  			goto unlock_exit;
> +
> +		if (stit) {
> +			gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +			ret = iommu_tce_put_param_check(stit->tbl,
> +					ioba + (i << stit->tbl->it_page_shift),
> +					gpa);
> +			if (ret != H_SUCCESS)
> +				goto unlock_exit;
> +
> +		}
>  	}
>  
>  	for (i = 0; i < npages; ++i) {
>  		tce = be64_to_cpu(((u64 *)tces)[i]);

As noted in the earlier patch this is really dangerous - by reloading
the tce from userspace you've thrown away the verification above.
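
At the very least I'd expect the value to be re-checked after the second
read - a sketch:

	for (i = 0; i < npages; ++i) {
		tce = be64_to_cpu(((u64 *)tces)[i]);

		/* the backing page may have changed since the first pass,
		 * so validate again before touching the hardware tables */
		ret = kvmppc_tce_validate(stt, tce);
		if (ret != H_SUCCESS)
			goto unlock_exit;
		...
	}

though the iommu_tce_put_param_check() result from the first pass is stale
for the same reason, so it would probably need repeating here as well.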

> +		if (stit) {
> +			for (i = 0; i < npages; ++i) {
> +				gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +
> +				list_for_each_entry_lockless(stit,
> +						&stt->iommu_tables, next) {
> +					ret = kvmppc_rm_tce_iommu_map(vcpu,
> +						stit->tbl, entry + i, gpa,
> +						iommu_tce_direction(tce));
> +					if (ret != H_SUCCESS)
> +						goto unlock_exit;
> +				}
> +			}
> +		}
> +
>  		kvmppc_tce_put(stt, entry + i, tce);
>  	}
>  
> @@ -330,6 +479,8 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
> +
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -343,6 +494,25 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>  		return H_PARAMETER;
>  
> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> +			struct kvmppc_spapr_tce_iommu_table, next);
> +	if (stit) {
> +		if (iommu_tce_clear_param_check(stit->tbl, ioba,
> +					tce_value, npages))
> +			return H_PARAMETER;
> +
> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +			unsigned long entry = ioba >> stit->tbl->it_page_shift;
> +
> +			for (i = 0; i < npages; ++i) {
> +				ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
> +						stit->tbl, entry + i);
> +				if (ret)
> +					return ret;
> +			}
> +		}
> +	}
> +
>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>  
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index cd892dec7cb6..f3127dc87912 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -536,6 +536,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  #ifdef CONFIG_PPC_BOOK3S_64
>  	case KVM_CAP_SPAPR_TCE:
>  	case KVM_CAP_SPAPR_TCE_64:
> +		/* fallthrough */

I'm not sure why this one should get a fallthrough comment, when none
of the other cases do.

> +	case KVM_CAP_SPAPR_TCE_VFIO:
>  	case KVM_CAP_PPC_RTAS:
>  	case KVM_CAP_PPC_FIXUP_HCALL:
>  	case KVM_CAP_PPC_ENABLE_HCALL:
> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> index d32f239eb471..2b7dc22265fe 100644
> --- a/virt/kvm/vfio.c
> +++ b/virt/kvm/vfio.c
> @@ -20,6 +20,10 @@
>  #include <linux/vfio.h>
>  #include "vfio.h"
>  
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +#include <asm/kvm_ppc.h>
> +#endif
> +
>  struct kvm_vfio_group {
>  	struct list_head node;
>  	struct vfio_group *vfio_group;
> @@ -211,6 +215,9 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>  
>  		mutex_unlock(&kv->lock);
>  
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +		kvm_spapr_tce_release_iommu_group(dev->kvm, vfio_group);
> +#endif
>  		kvm_vfio_group_set_kvm(vfio_group, NULL);
>  
>  		kvm_vfio_group_put_external_user(vfio_group);
> @@ -218,6 +225,53 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>  		kvm_vfio_update_coherency(dev);
>  
>  		return ret;
> +
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
> +		struct kvm_vfio_spapr_tce param;
> +		unsigned long minsz;
> +		struct kvm_vfio *kv = dev->private;
> +		struct vfio_group *vfio_group;
> +		struct kvm_vfio_group *kvg;
> +		struct fd f;
> +
> +		minsz = offsetofend(struct kvm_vfio_spapr_tce, tablefd);
> +
> +		if (copy_from_user(&param, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (param.argsz < minsz || param.flags)
> +			return -EINVAL;
> +
> +		f = fdget(param.groupfd);
> +		if (!f.file)
> +			return -EBADF;
> +
> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
> +		fdput(f);
> +
> +		if (IS_ERR(vfio_group))
> +			return PTR_ERR(vfio_group);
> +


Is there any particular reason you unwrap the group fd here, but the
table fd inside kvm_spapr_tce_attach_iommu_group()?

> +		ret = -ENOENT;
> +
> +		mutex_lock(&kv->lock);
> +
> +		list_for_each_entry(kvg, &kv->group_list, node) {
> +			if (kvg->vfio_group != vfio_group)
> +				continue;
> +
> +			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
> +					param.tablefd, vfio_group);
> +
> +			break;
> +		}
> +
> +		mutex_unlock(&kv->lock);
> +
> +		return ret;
> +	}
> +#endif /* CONFIG_SPAPR_TCE_IOMMU */
>  	}
>  
>  	return -ENXIO;
> @@ -242,6 +296,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
>  		switch (attr->attr) {
>  		case KVM_DEV_VFIO_GROUP_ADD:
>  		case KVM_DEV_VFIO_GROUP_DEL:
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
> +#endif
>  			return 0;
>  		}
>  
> @@ -257,6 +314,9 @@ static void kvm_vfio_destroy(struct kvm_device *dev)
>  	struct kvm_vfio_group *kvg, *tmp;
>  
>  	list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +		kvm_spapr_tce_release_iommu_group(dev->kvm, kvg->vfio_group);
> +#endif
>  		kvm_vfio_group_set_kvm(kvg->vfio_group, NULL);
>  		kvm_vfio_group_put_external_user(kvg->vfio_group);
>  		list_del(&kvg->node);

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH kernel v4 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
@ 2017-02-09  6:41     ` David Gibson
  0 siblings, 0 replies; 49+ messages in thread
From: David Gibson @ 2017-02-09  6:41 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Alex Williamson, Paul Mackerras, kvm-ppc, kvm

[-- Attachment #1: Type: text/plain, Size: 32625 bytes --]

On Tue, Feb 07, 2017 at 06:17:11PM +1100, Alexey Kardashevskiy wrote:
> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
> without passing them to user space which saves time on switching
> to user space and back.
> 
> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> KVM tries to handle a TCE request in the real mode, if failed
> it passes the request to the virtual mode to complete the operation.
> If it a virtual mode handler fails, the request is passed to
> the user space; this is not expected to happen though.
> 
> To avoid dealing with page use counters (which is tricky in real mode),
> this only accelerates SPAPR TCE IOMMU v2 clients which are required
> to pre-register the userspace memory. The very first TCE request will
> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> of the TCE table (iommu_table::it_userspace) is not allocated till
> the very first mapping happens and we cannot call vmalloc in real mode.
> 
> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> and associates a physical IOMMU table with the SPAPR TCE table (which
> is a guest view of the hardware IOMMU table). The iommu_table object
> is cached and referenced so we do not have to look up for it in real mode.
> 
> This does not implement the UNSET counterpart as there is no use for it -
> once the acceleration is enabled, the existing userspace won't
> disable it unless a VFIO container is destroyed; this adds necessary
> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> 
> As this creates a descriptor per IOMMU table-LIOBN couple (called
> kvmppc_spapr_tce_iommu_table), it is possible to have several
> descriptors with the same iommu_table (hardware IOMMU table) attached
> to the same LIOBN; we do not remove duplicates though as
> iommu_table_ops::exchange does not just update a TCE entry (which is
> shared among IOMMU groups) but also invalidates the TCE cache
> (one per IOMMU group).
> 
> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> space.
> 
> This finally makes use of vfio_external_user_iommu_id() which was
> introduced quite some time ago and was considered for removal.
> 
> Tests show that this patch increases transmission speed from 220MB/s
> to 750..1020MB/s on 10Gb network (Chelsio CXGB3 10Gb ethernet card).
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v4:
> * added note to the commit log about allowing multiple updates of
> the same IOMMU table;
> * instead of checking whether any memory was preregistered, this
> returns H_TOO_HARD if a specific page was not;
> * fixed comments from v3 about error handling in many places;
> * simplified TCE handlers and merged IOMMU parts inline - for example,
> there used to be kvmppc_h_put_tce_iommu(), now it is merged into
> kvmppc_h_put_tce(); this allows checking IOBA boundaries against
> the first attached table only (makes the code simpler);
> 
> v3:
> * simplified not to use VFIO group notifiers
> * reworked cleanup, should be cleaner/simpler now
> 
> v2:
> * reworked to use new VFIO notifiers
> * now same iommu_table may appear in the list several times, to be fixed later
> ---
> 
> This has separate copies of handlers for real and virtual modes as
> in fact H_PUT_TCE and H_STUFF_TCE could share a lot (common helpers
> would take a "realmode" flag) but H_PUT_TCE_INDIRECT uses get_user()
> in virtual mode and direct access in real mode and having a common
> helper for it would make things uglier imho.
> 
> 
> ---
>  Documentation/virtual/kvm/devices/vfio.txt |  22 +-
>  arch/powerpc/include/asm/kvm_host.h        |   8 +
>  arch/powerpc/include/asm/kvm_ppc.h         |   4 +
>  include/uapi/linux/kvm.h                   |   8 +
>  arch/powerpc/kvm/book3s_64_vio.c           | 319 ++++++++++++++++++++++++++++-
>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 172 +++++++++++++++-
>  arch/powerpc/kvm/powerpc.c                 |   2 +
>  virt/kvm/vfio.c                            |  60 ++++++
>  8 files changed, 590 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> index ef51740c67ca..f95d867168ea 100644
> --- a/Documentation/virtual/kvm/devices/vfio.txt
> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> @@ -16,7 +16,25 @@ Groups:
>  
>  KVM_DEV_VFIO_GROUP attributes:
>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> +	kvm_device_attr.addr points to an int32_t file descriptor
> +	for the VFIO group.
>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
> +	kvm_device_attr.addr points to an int32_t file descriptor
> +	for the VFIO group.
> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
> +	allocated by sPAPR KVM.
> +	kvm_device_attr.addr points to a struct:
>  
> -For each, kvm_device_attr.addr points to an int32_t file descriptor
> -for the VFIO group.
> +	struct kvm_vfio_spapr_tce {
> +		__u32	argsz;
> +		__u32	flags;
> +		__s32	groupfd;
> +		__s32	tablefd;
> +	};
> +
> +	where
> +	@argsz is the size of kvm_vfio_spapr_tce_liobn;
> +	@flags are not supported now, must be zero;
> +	@groupfd is a file descriptor for a VFIO group;
> +	@tablefd is a file descriptor for a TCE table allocated via
> +		KVM_CREATE_SPAPR_TCE.
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index e59b172666cd..a827006941f8 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
>  	atomic_t refcnt;
>  };
>  
> +struct kvmppc_spapr_tce_iommu_table {
> +	struct rcu_head rcu;
> +	struct list_head next;
> +	struct vfio_group *group;
> +	struct iommu_table *tbl;
> +};
> +
>  struct kvmppc_spapr_tce_table {
>  	struct list_head list;
>  	struct kvm *kvm;
> @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
>  	u32 page_shift;
>  	u64 offset;		/* in pages */
>  	u64 size;		/* window size in pages */
> +	struct list_head iommu_tables;
>  	struct page *pages[0];
>  };
>  
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> index 37bc9e7e90ba..da1410bd6b36 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -163,6 +163,10 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
>  extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
>  			struct kvm_memory_slot *memslot, unsigned long porder);
>  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
> +		struct vfio_group *group);
> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
> +		struct vfio_group *group);
>  
>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  				struct kvm_create_spapr_tce_64 *args);
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index a2c9bb5a0ead..cdfa01169bd2 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1076,6 +1076,7 @@ struct kvm_device_attr {
>  #define  KVM_DEV_VFIO_GROUP			1
>  #define   KVM_DEV_VFIO_GROUP_ADD			1
>  #define   KVM_DEV_VFIO_GROUP_DEL			2
> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
>  
>  enum kvm_device_type {
>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
> @@ -1097,6 +1098,13 @@ enum kvm_device_type {
>  	KVM_DEV_TYPE_MAX,
>  };
>  
> +struct kvm_vfio_spapr_tce {
> +	__u32	argsz;
> +	__u32	flags;
> +	__s32	groupfd;
> +	__s32	tablefd;
> +};
> +
>  /*
>   * ioctls for VM fds
>   */
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> index 9a7b7fca5e84..cb0469151e35 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -27,6 +27,10 @@
>  #include <linux/hugetlb.h>
>  #include <linux/list.h>
>  #include <linux/anon_inodes.h>
> +#include <linux/iommu.h>
> +#include <linux/file.h>
> +#include <linux/vfio.h>
> +#include <linux/module.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/kvm_ppc.h>
> @@ -39,6 +43,36 @@
>  #include <asm/udbg.h>
>  #include <asm/iommu.h>
>  #include <asm/tce.h>
> +#include <asm/mmu_context.h>
> +
> +static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
> +{
> +	void (*fn)(struct vfio_group *);
> +
> +	fn = symbol_get(vfio_group_put_external_user);
> +	if (WARN_ON(!fn))
> +		return;
> +
> +	fn(vfio_group);
> +
> +	symbol_put(vfio_group_put_external_user);
> +}
> +
> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> +{
> +	int (*fn)(struct vfio_group *);
> +	int ret = -1;
> +
> +	fn = symbol_get(vfio_external_user_iommu_id);
> +	if (!fn)
> +		return ret;
> +
> +	ret = fn(vfio_group);
> +
> +	symbol_put(vfio_external_user_iommu_id);
> +
> +	return ret;
> +}
>  
>  static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
>  {
> @@ -90,6 +124,123 @@ static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
>  	return ret;
>  }
>  
> +static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
> +{
> +	struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
> +			struct kvmppc_spapr_tce_iommu_table, rcu);
> +
> +	iommu_table_put(stit->tbl);
> +	kvm_vfio_group_put_external_user(stit->group);
> +
> +	kfree(stit);
> +}
> +
> +static void kvm_spapr_tce_liobn_release_iommu_group(
> +		struct kvmppc_spapr_tce_table *stt,
> +		struct vfio_group *group)
> +{
> +	struct kvmppc_spapr_tce_iommu_table *stit, *tmp;
> +
> +	list_for_each_entry_safe(stit, tmp, &stt->iommu_tables, next) {
> +		if (group && (stit->group != group))
> +			continue;
> +
> +		list_del_rcu(&stit->next);
> +
> +		call_rcu(&stit->rcu, kvm_spapr_tce_iommu_table_free);
> +	}
> +}
> +
> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
> +		struct vfio_group *group)
> +{
> +	struct kvmppc_spapr_tce_table *stt;
> +
> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list)
> +		kvm_spapr_tce_liobn_release_iommu_group(stt, group);
> +}
> +
> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
> +		struct vfio_group *group)
> +{
> +	struct kvmppc_spapr_tce_table *stt = NULL;
> +	bool found = false;
> +	struct iommu_table *tbl = NULL;
> +	struct iommu_table_group *table_group;
> +	long i, ret = 0;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
> +	struct fd f;
> +	int group_id;
> +	struct iommu_group *grp;
> +
> +	group_id = kvm_vfio_external_user_iommu_id(group);
> +	grp = iommu_group_get_by_id(group_id);
> +	if (!grp)
> +		return -EFAULT;

EFAULT doesn't look right, that usually means userspace has given us
a bad address.  What does failure to look up the iommu group by id
mean here?


> +
> +	f = fdget(tablefd);
> +	if (!f.file) {
> +		ret = -EBADF;
> +		goto put_exit;
> +	}
> +
> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
> +		if (stt == f.file->private_data) {
> +			found = true;
> +			break;
> +		}
> +	}
> +
> +	fdput(f);
> +
> +	if (!found) {
> +		ret = -ENODEV;

ENODEV doesn't look right either.  That generally means you're trying
to use a device or facility that doesn't exist.  This case just means
you've passed a file handle that either isn't a TCE table at all, or
or one associated with a different VM.  -EINVAL, I guess, overloaded
as it is.

> +		goto put_exit;

Don't you need to put the table fd as well as the iommu group which
you put in that exit path?

> +	}
> +
> +	table_group = iommu_group_get_iommudata(grp);
> +	if (WARN_ON(!table_group)) {
> +		ret = -EFAULT;
> +		goto put_exit;

Again don't you need to put the table fd as well.

> +	}
> +
> +	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
> +		struct iommu_table *tbltmp = table_group->tables[i];
> +
> +		if (!tbltmp)
> +			continue;
> +
> +		/*
> +		 * Make sure hardware table parameters are exactly the same;
> +		 * this is used in the TCE handlers where boundary checks
> +		 * use only the first attached table.
> +		 */
> +		if ((tbltmp->it_page_shift == stt->page_shift) &&
> +				(tbltmp->it_offset == stt->offset) &&
> +				(tbltmp->it_size == stt->size)) {
> +			tbl = tbltmp;
> +			break;
> +		}
> +	}
> +	if (!tbl) {
> +		ret = -ENODEV;

Again, ENODEV doesn't seem right.  Here the problem is that the host
hardware constraints don't match the guest hardware constraints.
Hmm.  EIO?  ENOSPC?

> +		goto put_exit;
> +	}
> +
> +	iommu_table_get(tbl);
> +
> +	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
> +	stit->tbl = tbl;
> +	stit->group = group;
> +
> +	list_add_rcu(&stit->next, &stt->iommu_tables);

So if you add the same group to the same liobn multiple times, you'll
get multiple identical entries in this list.

I guess that's mostly harmless... although.. does it allow the user to
force the allocation of arbitrary amounts of kernel memory in that
list?

> +put_exit:
> +	iommu_group_put(grp);
> +
> +	return ret;
> +}
> +
>  static void release_spapr_tce_table(struct rcu_head *head)
>  {
>  	struct kvmppc_spapr_tce_table *stt = container_of(head,
> @@ -132,6 +283,8 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
>  
>  	list_del_rcu(&stt->list);
>  
> +	kvm_spapr_tce_liobn_release_iommu_group(stt, NULL /* release all */);
> +
>  	kvm_put_kvm(stt->kvm);
>  
>  	kvmppc_account_memlimit(
> @@ -181,6 +334,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  	stt->offset = args->offset;
>  	stt->size = size;
>  	stt->kvm = kvm;
> +	INIT_LIST_HEAD_RCU(&stt->iommu_tables);
>  
>  	for (i = 0; i < npages; i++) {
>  		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
> @@ -209,11 +363,94 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  	return ret;
>  }
>  
> +static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	struct mm_iommu_table_group_mem_t *mem = NULL;
> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +
> +	if (!pua)
> +		return H_HARDWARE;

What could trigger this error?  Should it be a WARN_ON?

> +	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
> +	if (!mem)
> +		return H_TOO_HARD;
> +
> +	mm_iommu_mapped_dec(mem);
> +
> +	*pua = 0;
> +
> +	return H_SUCCESS;
> +}
> +
> +static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	enum dma_data_direction dir = DMA_NONE;
> +	unsigned long hpa = 0;
> +	long ret;
> +
> +	if (iommu_tce_xchg(tbl, entry, &hpa, &dir))
> +		return H_HARDWARE;
> +
> +	if (dir == DMA_NONE)
> +		return H_SUCCESS;
> +
> +	ret = kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
> +	if (ret != H_SUCCESS)
> +		iommu_tce_xchg(tbl, entry, &hpa, &dir);
> +
> +	return ret;
> +}
> +
> +long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
> +		unsigned long entry, unsigned long gpa,
> +		enum dma_data_direction dir)
> +{
> +	long ret;
> +	unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +	struct mm_iommu_table_group_mem_t *mem;
> +
> +	if (!pua)
> +		/* it_userspace allocation might be delayed */
> +		return H_TOO_HARD;
> +
> +	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
> +		return H_PARAMETER;
> +
> +	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
> +	if (!mem)
> +		return H_TOO_HARD;
> +
> +	if (mm_iommu_ua_to_hpa(mem, ua, &hpa))
> +		return H_HARDWARE;

IIUC this would happen if qemu had failed to preregister all of guest
RAM, making this indeed an H_HARDWARE.

> +	if (mm_iommu_mapped_inc(mem))
> +		return H_HARDWARE;

I'm less clear on when this one would happen.

> +
> +	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
> +	if (ret) {
> +		mm_iommu_mapped_dec(mem);
> +		return H_TOO_HARD;
> +	}
> +
> +	if (dir != DMA_NONE)
> +		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
> +
> +	*pua = ua;
> +
> +	return 0;
> +}
> +
>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		      unsigned long ioba, unsigned long tce)
>  {
>  	struct kvmppc_spapr_tce_table *stt;
> -	long ret;
> +	long ret, idx;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
> +	unsigned long entry, gpa;
> +	enum dma_data_direction dir;
>  
>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>  	/* 	    liobn, ioba, tce); */
> @@ -230,6 +467,36 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  	if (ret != H_SUCCESS)
>  		return ret;
>  
> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> +			struct kvmppc_spapr_tce_iommu_table, next);
> +	if (stit) {
> +		entry = ioba >> stit->tbl->it_page_shift;
> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +		dir = iommu_tce_direction(tce);
> +
> +		if (dir == DMA_NONE) {
> +			if (iommu_tce_clear_param_check(stit->tbl, ioba, 0, 1))
> +				return H_PARAMETER;
> +		} else {
> +			if (iommu_tce_put_param_check(stit->tbl, ioba, gpa))

Any way you could make these param check functions based on stt
instead of stit->tbl?  That would let you do them before checking if
there are any hw tables to update, avoiding the somewhat awkward
	if (at least one)
		for (each one)
construct.

> +				return H_PARAMETER;
> +		}
> +
> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +			if (dir == DMA_NONE) {
> +				ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
> +						stit->tbl, entry);
> +			} else {
> +				idx = srcu_read_lock(&vcpu->kvm->srcu);
> +				ret = kvmppc_tce_iommu_map(vcpu->kvm, stit->tbl,
> +						entry, gpa, dir);
> +				srcu_read_unlock(&vcpu->kvm->srcu, idx);
> +			}
> +			if (ret != H_SUCCESS)
> +				return ret;

Doesn't this error path need to clean up for the case where you
managed to update some backing TCE tables, but then failed later ones?

> +		}
> +	}
> +
>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
>  
>  	return H_SUCCESS;
> @@ -242,9 +509,10 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret = H_SUCCESS, idx;
> -	unsigned long entry, ua = 0;
> +	unsigned long entry, gpa, ua = 0;
>  	u64 __user *tces;
>  	u64 tce;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -272,6 +540,9 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	}
>  	tces = (u64 __user *) ua;
>  
> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> +			struct kvmppc_spapr_tce_iommu_table, next);
> +
>  	for (i = 0; i < npages; ++i) {
>  		if (get_user(tce, tces + i)) {
>  			ret = H_TOO_HARD;
> @@ -282,6 +553,15 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  		ret = kvmppc_tce_validate(stt, tce);
>  		if (ret != H_SUCCESS)
>  			goto unlock_exit;
> +
> +		if (stit) {
> +			gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +			ret = iommu_tce_put_param_check(stit->tbl,
> +					ioba + (i << stit->tbl->it_page_shift),
> +					gpa);
> +			if (ret != H_SUCCESS)
> +				goto unlock_exit;
> +		}
>  	}
>  
>  	for (i = 0; i < npages; ++i) {
> @@ -291,6 +571,21 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  		}
>  		tce = be64_to_cpu(tce);
>  
> +		if (stit) {
> +			for (i = 0; i < npages; ++i) {
> +				gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +
> +				list_for_each_entry_lockless(stit,
> +						&stt->iommu_tables, next) {
> +					ret = kvmppc_tce_iommu_map(vcpu->kvm,
> +						stit->tbl, entry + i, gpa,
> +						iommu_tce_direction(tce));
> +					if (ret != H_SUCCESS)
> +						goto unlock_exit;
> +				}

Um.. what value will this for_each leave in stit after completion?  I
suspect it will be something bogus, which means re-using stit in the
next 0..npages loop iteration won't be safe (you only initialize stit
with the first entry outside that loop).

> +			}
> +		}
> +
>  		kvmppc_tce_put(stt, entry + i, tce);
>  	}
>  
> @@ -307,6 +602,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -320,6 +616,25 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>  		return H_PARAMETER;
>  
> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> +			struct kvmppc_spapr_tce_iommu_table, next);
> +	if (stit) {
> +		if (iommu_tce_clear_param_check(stit->tbl, ioba,
> +					tce_value, npages))
> +			return H_PARAMETER;
> +
> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +			unsigned long entry = ioba >> stit->tbl->it_page_shift;
> +
> +			for (i = 0; i < npages; ++i) {
> +				ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
> +						stit->tbl, entry + i);
> +				if (ret)
> +					return ret;

Again do you need some sort of cleanup for partial completion?


> +			}
> +		}
> +	}
> +
>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>  
> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> index dc1c66fda941..018c7d94a575 100644
> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -178,11 +178,104 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
>  EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
>  
>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> +static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm *kvm,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	struct mm_iommu_table_group_mem_t *mem = NULL;
> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +
> +	if (!pua)
> +		return H_HARDWARE;
> +
> +	pua = (void *) vmalloc_to_phys(pua);
> +	if (!pua)
> +		return H_TOO_HARD;
> +
> +	mem = mm_iommu_lookup_rm(kvm->mm, *pua, pgsize);
> +	if (!mem)
> +		return H_TOO_HARD;
> +
> +	mm_iommu_mapped_dec(mem);
> +
> +	*pua = 0;
> +
> +	return H_SUCCESS;
> +}
> +
> +static long kvmppc_rm_tce_iommu_unmap(struct kvm *kvm,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	enum dma_data_direction dir = DMA_NONE;
> +	unsigned long hpa = 0;
> +	long ret;
> +
> +	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
> +		return H_HARDWARE;
> +
> +	if (dir == DMA_NONE)
> +		return H_SUCCESS;
> +
> +	ret = kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
> +	if (ret)
> +		iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> +
> +	return ret;
> +}
> +
> +long kvmppc_rm_tce_iommu_map(struct kvm_vcpu *vcpu, struct iommu_table *tbl,
> +		unsigned long entry, unsigned long gpa,
> +		enum dma_data_direction dir)
> +{
> +	long ret;
> +	unsigned long hpa = 0, ua;
> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +	struct mm_iommu_table_group_mem_t *mem;
> +
> +	if (!pua)
> +		/* it_userspace allocation might be delayed */
> +		return H_TOO_HARD;
> +
> +	if (kvmppc_gpa_to_ua(vcpu->kvm, gpa, &ua, NULL))
> +		return H_PARAMETER;
> +
> +	mem = mm_iommu_lookup_rm(vcpu->kvm->mm, ua, 1ULL << tbl->it_page_shift);
> +	if (!mem)
> +		return H_TOO_HARD;
> +
> +	if (mm_iommu_ua_to_hpa_rm(mem, ua, &hpa))
> +		return H_HARDWARE;
> +
> +	pua = (void *) vmalloc_to_phys(pua);
> +	if (!pua)
> +		return H_HARDWARE;

What circumstances can this fail under?  Does it need to be H_TOO_HARD instead?

> +
> +	if (mm_iommu_mapped_inc(mem))
> +		return H_HARDWARE;
> +
> +	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> +	if (ret) {
> +		mm_iommu_mapped_dec(mem);
> +		return H_TOO_HARD;
> +	}
> +
> +	if (dir != DMA_NONE)
> +		kvmppc_rm_tce_iommu_mapped_dec(vcpu->kvm, tbl, entry);
> +
> +	*pua = ua;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(kvmppc_rm_tce_iommu_map);
> +
>  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		unsigned long ioba, unsigned long tce)
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long ret;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
> +	unsigned long entry, gpa;
> +	enum dma_data_direction dir;
>  
>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>  	/* 	    liobn, ioba, tce); */
> @@ -199,6 +292,33 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  	if (ret != H_SUCCESS)
>  		return ret;
>  
> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> +			struct kvmppc_spapr_tce_iommu_table, next);
> +	if (stit) {
> +		entry = ioba >> stit->tbl->it_page_shift;
> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +		dir = iommu_tce_direction(tce);
> +
> +		if (dir == DMA_NONE) {
> +			if (iommu_tce_clear_param_check(stit->tbl, ioba, 0, 1))
> +				return H_PARAMETER;
> +		} else {
> +			if (iommu_tce_put_param_check(stit->tbl, ioba, gpa))
> +				return H_PARAMETER;
> +		}
> +
> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +			if (dir == DMA_NONE)
> +				ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
> +						stit->tbl, entry);
> +			else
> +				ret = kvmppc_rm_tce_iommu_map(vcpu, stit->tbl,
> +						entry, gpa, dir);
> +			if (ret != H_SUCCESS)
> +				return ret;
> +		}
> +	}
> +
>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
>  
>  	return H_SUCCESS;
> @@ -237,9 +357,10 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret = H_SUCCESS;
> -	unsigned long tces, entry, tce, ua = 0;
> +	unsigned long tces, entry, gpa, tce, ua = 0;
>  	unsigned long *rmap = NULL;
>  	bool prereg = false;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -303,17 +424,45 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  		}
>  	}
>  
> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> +			struct kvmppc_spapr_tce_iommu_table, next);
> +
>  	for (i = 0; i < npages; ++i) {
>  		tce = be64_to_cpu(((u64 *)tces)[i]);
>  
>  		ret = kvmppc_tce_validate(stt, tce);
>  		if (ret != H_SUCCESS)
>  			goto unlock_exit;
> +
> +		if (stit) {
> +			gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +			ret = iommu_tce_put_param_check(stit->tbl,
> +					ioba + (i << stit->tbl->it_page_shift),
> +					gpa);
> +			if (ret != H_SUCCESS)
> +				goto unlock_exit;
> +
> +		}
>  	}
>  
>  	for (i = 0; i < npages; ++i) {
>  		tce = be64_to_cpu(((u64 *)tces)[i]);

As noted in the earlier patch this is really dangerous - by reloading
the tce from userspace you've thrown away the verification above.

> +		if (stit) {
> +			for (i = 0; i < npages; ++i) {
> +				gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +
> +				list_for_each_entry_lockless(stit,
> +						&stt->iommu_tables, next) {
> +					ret = kvmppc_rm_tce_iommu_map(vcpu,
> +						stit->tbl, entry + i, gpa,
> +						iommu_tce_direction(tce));
> +					if (ret != H_SUCCESS)
> +						goto unlock_exit;
> +				}
> +			}
> +		}
> +
>  		kvmppc_tce_put(stt, entry + i, tce);
>  	}
>  
> @@ -330,6 +479,8 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
> +
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -343,6 +494,25 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>  		return H_PARAMETER;
>  
> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> +			struct kvmppc_spapr_tce_iommu_table, next);
> +	if (stit) {
> +		if (iommu_tce_clear_param_check(stit->tbl, ioba,
> +					tce_value, npages))
> +			return H_PARAMETER;
> +
> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +			unsigned long entry = ioba >> stit->tbl->it_page_shift;
> +
> +			for (i = 0; i < npages; ++i) {
> +				ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
> +						stit->tbl, entry + i);
> +				if (ret)
> +					return ret;
> +			}
> +		}
> +	}
> +
>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>  
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index cd892dec7cb6..f3127dc87912 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -536,6 +536,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  #ifdef CONFIG_PPC_BOOK3S_64
>  	case KVM_CAP_SPAPR_TCE:
>  	case KVM_CAP_SPAPR_TCE_64:
> +		/* fallthrough */

I'm not sure why this one should get a fallthrough comment, when none
of the other cases do.

> +	case KVM_CAP_SPAPR_TCE_VFIO:
>  	case KVM_CAP_PPC_RTAS:
>  	case KVM_CAP_PPC_FIXUP_HCALL:
>  	case KVM_CAP_PPC_ENABLE_HCALL:
> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> index d32f239eb471..2b7dc22265fe 100644
> --- a/virt/kvm/vfio.c
> +++ b/virt/kvm/vfio.c
> @@ -20,6 +20,10 @@
>  #include <linux/vfio.h>
>  #include "vfio.h"
>  
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +#include <asm/kvm_ppc.h>
> +#endif
> +
>  struct kvm_vfio_group {
>  	struct list_head node;
>  	struct vfio_group *vfio_group;
> @@ -211,6 +215,9 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>  
>  		mutex_unlock(&kv->lock);
>  
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +		kvm_spapr_tce_release_iommu_group(dev->kvm, vfio_group);
> +#endif
>  		kvm_vfio_group_set_kvm(vfio_group, NULL);
>  
>  		kvm_vfio_group_put_external_user(vfio_group);
> @@ -218,6 +225,53 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>  		kvm_vfio_update_coherency(dev);
>  
>  		return ret;
> +
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
> +		struct kvm_vfio_spapr_tce param;
> +		unsigned long minsz;
> +		struct kvm_vfio *kv = dev->private;
> +		struct vfio_group *vfio_group;
> +		struct kvm_vfio_group *kvg;
> +		struct fd f;
> +
> +		minsz = offsetofend(struct kvm_vfio_spapr_tce, tablefd);
> +
> +		if (copy_from_user(&param, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (param.argsz < minsz || param.flags)
> +			return -EINVAL;
> +
> +		f = fdget(param.groupfd);
> +		if (!f.file)
> +			return -EBADF;
> +
> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
> +		fdput(f);
> +
> +		if (IS_ERR(vfio_group))
> +			return PTR_ERR(vfio_group);
> +


Is there any particular reason you unwrap the group fd here, but the
table fd inside kvm_spapr_tce_attach_iommu_group()?

> +		ret = -ENOENT;
> +
> +		mutex_lock(&kv->lock);
> +
> +		list_for_each_entry(kvg, &kv->group_list, node) {
> +			if (kvg->vfio_group != vfio_group)
> +				continue;
> +
> +			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
> +					param.tablefd, vfio_group);
> +
> +			break;
> +		}
> +
> +		mutex_unlock(&kv->lock);
> +
> +		return ret;
> +	}
> +#endif /* CONFIG_SPAPR_TCE_IOMMU */
>  	}
>  
>  	return -ENXIO;
> @@ -242,6 +296,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
>  		switch (attr->attr) {
>  		case KVM_DEV_VFIO_GROUP_ADD:
>  		case KVM_DEV_VFIO_GROUP_DEL:
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
> +#endif
>  			return 0;
>  		}
>  
> @@ -257,6 +314,9 @@ static void kvm_vfio_destroy(struct kvm_device *dev)
>  	struct kvm_vfio_group *kvg, *tmp;
>  
>  	list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +		kvm_spapr_tce_release_iommu_group(dev->kvm, kvg->vfio_group);
> +#endif
>  		kvm_vfio_group_set_kvm(kvg->vfio_group, NULL);
>  		kvm_vfio_group_put_external_user(kvg->vfio_group);
>  		list_del(&kvg->node);

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH kernel v4 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
@ 2017-02-09  6:41     ` David Gibson
  0 siblings, 0 replies; 49+ messages in thread
From: David Gibson @ 2017-02-09  6:41 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Paul Mackerras, Alex Williamson, linuxppc-dev, kvm, kvm-ppc

[-- Attachment #1: Type: text/plain, Size: 32625 bytes --]

On Tue, Feb 07, 2017 at 06:17:11PM +1100, Alexey Kardashevskiy wrote:
> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
> without passing them to user space, which saves time on switching
> to user space and back.
> 
> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> KVM tries to handle a TCE request in real mode; if that fails,
> it passes the request to virtual mode to complete the operation.
> If the virtual mode handler fails, the request is passed to
> user space; this is not expected to happen though.
> 
> To avoid dealing with page use counters (which is tricky in real mode),
> this only accelerates SPAPR TCE IOMMU v2 clients which are required
> to pre-register the userspace memory. The very first TCE request will
> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> of the TCE table (iommu_table::it_userspace) is not allocated till
> the very first mapping happens and we cannot call vmalloc in real mode.
> 
> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> and associates a physical IOMMU table with the SPAPR TCE table (which
> is a guest view of the hardware IOMMU table). The iommu_table object
> is cached and referenced so we do not have to look up for it in real mode.
> 
> This does not implement the UNSET counterpart as there is no use for it -
> once the acceleration is enabled, the existing userspace won't
> disable it unless a VFIO container is destroyed; this adds necessary
> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> 
> As this creates a descriptor per IOMMU table-LIOBN couple (called
> kvmppc_spapr_tce_iommu_table), it is possible to have several
> descriptors with the same iommu_table (hardware IOMMU table) attached
> to the same LIOBN; we do not remove duplicates though as
> iommu_table_ops::exchange does not just update a TCE entry (which is
> shared among IOMMU groups) but also invalidates the TCE cache
> (one per IOMMU group).
> 
> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> space.
> 
> This finally makes use of vfio_external_user_iommu_id() which was
> introduced quite some time ago and was considered for removal.
> 
> Tests show that this patch increases transmission speed from 220MB/s
> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v4:
> * added note to the commit log about allowing multiple updates of
> the same IOMMU table;
> * instead of checking for if any memory was preregistered, this
> returns H_TOO_HARD if a specific page was not;
> * fixed comments from v3 about error handling in many places;
> * simplified TCE handlers and merged IOMMU parts inline - for example,
> there used to be kvmppc_h_put_tce_iommu(), now it is merged into
> kvmppc_h_put_tce(); this allows to check IOBA boundaries against
> the first attached table only (makes the code simpler);
> 
> v3:
> * simplified not to use VFIO group notifiers
> * reworked cleanup, should be cleaner/simpler now
> 
> v2:
> * reworked to use new VFIO notifiers
> * now same iommu_table may appear in the list several times, to be fixed later
> ---
> 
> This has separate copies of handlers for real and virtual modes as
> in fact H_PUT_TCE and H_STUFF_TCE could share a lot (common helpers
> would take a "realmode" flag) but H_PUT_TCE_INDIRECT uses get_user()
> in virtual mode and direct access in real mode and having a common
> helper for it would make things uglier imho.
> 
> 
> ---
>  Documentation/virtual/kvm/devices/vfio.txt |  22 +-
>  arch/powerpc/include/asm/kvm_host.h        |   8 +
>  arch/powerpc/include/asm/kvm_ppc.h         |   4 +
>  include/uapi/linux/kvm.h                   |   8 +
>  arch/powerpc/kvm/book3s_64_vio.c           | 319 ++++++++++++++++++++++++++++-
>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 172 +++++++++++++++-
>  arch/powerpc/kvm/powerpc.c                 |   2 +
>  virt/kvm/vfio.c                            |  60 ++++++
>  8 files changed, 590 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> index ef51740c67ca..f95d867168ea 100644
> --- a/Documentation/virtual/kvm/devices/vfio.txt
> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> @@ -16,7 +16,25 @@ Groups:
>  
>  KVM_DEV_VFIO_GROUP attributes:
>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> +	kvm_device_attr.addr points to an int32_t file descriptor
> +	for the VFIO group.
>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
> +	kvm_device_attr.addr points to an int32_t file descriptor
> +	for the VFIO group.
> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
> +	allocated by sPAPR KVM.
> +	kvm_device_attr.addr points to a struct:
>  
> -For each, kvm_device_attr.addr points to an int32_t file descriptor
> -for the VFIO group.
> +	struct kvm_vfio_spapr_tce {
> +		__u32	argsz;
> +		__u32	flags;
> +		__s32	groupfd;
> +		__s32	tablefd;
> +	};
> +
> +	where
> +	@argsz is the size of kvm_vfio_spapr_tce_liobn;
> +	@flags are not supported now, must be zero;
> +	@groupfd is a file descriptor for a VFIO group;
> +	@tablefd is a file descriptor for a TCE table allocated via
> +		KVM_CREATE_SPAPR_TCE.
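
As an aside for anyone reading along, the way userspace is expected to
drive this is roughly the following - my sketch, not from the patch; the
three fds are assumed to come from KVM_CREATE_DEVICE(KVM_DEV_TYPE_VFIO),
an open VFIO group and KVM_CREATE_SPAPR_TCE_64 respectively, and the
usual <sys/ioctl.h>, <linux/kvm.h> and <err.h> includes are implied:

	struct kvm_vfio_spapr_tce param = {
		.argsz   = sizeof(param),
		.flags   = 0,
		.groupfd = group_fd,
		.tablefd = table_fd,
	};
	struct kvm_device_attr attr = {
		.group = KVM_DEV_VFIO_GROUP,
		.attr  = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE,
		.addr  = (__u64)(unsigned long)&param,
	};

	/* vfio_kvm_device_fd is the fd returned by KVM_CREATE_DEVICE */
	if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr))
		err(1, "KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE");
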
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index e59b172666cd..a827006941f8 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
>  	atomic_t refcnt;
>  };
>  
> +struct kvmppc_spapr_tce_iommu_table {
> +	struct rcu_head rcu;
> +	struct list_head next;
> +	struct vfio_group *group;
> +	struct iommu_table *tbl;
> +};
> +
>  struct kvmppc_spapr_tce_table {
>  	struct list_head list;
>  	struct kvm *kvm;
> @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
>  	u32 page_shift;
>  	u64 offset;		/* in pages */
>  	u64 size;		/* window size in pages */
> +	struct list_head iommu_tables;
>  	struct page *pages[0];
>  };
>  
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> index 37bc9e7e90ba..da1410bd6b36 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -163,6 +163,10 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
>  extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
>  			struct kvm_memory_slot *memslot, unsigned long porder);
>  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
> +		struct vfio_group *group);
> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
> +		struct vfio_group *group);
>  
>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  				struct kvm_create_spapr_tce_64 *args);
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index a2c9bb5a0ead..cdfa01169bd2 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1076,6 +1076,7 @@ struct kvm_device_attr {
>  #define  KVM_DEV_VFIO_GROUP			1
>  #define   KVM_DEV_VFIO_GROUP_ADD			1
>  #define   KVM_DEV_VFIO_GROUP_DEL			2
> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
>  
>  enum kvm_device_type {
>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
> @@ -1097,6 +1098,13 @@ enum kvm_device_type {
>  	KVM_DEV_TYPE_MAX,
>  };
>  
> +struct kvm_vfio_spapr_tce {
> +	__u32	argsz;
> +	__u32	flags;
> +	__s32	groupfd;
> +	__s32	tablefd;
> +};
> +
>  /*
>   * ioctls for VM fds
>   */
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> index 9a7b7fca5e84..cb0469151e35 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -27,6 +27,10 @@
>  #include <linux/hugetlb.h>
>  #include <linux/list.h>
>  #include <linux/anon_inodes.h>
> +#include <linux/iommu.h>
> +#include <linux/file.h>
> +#include <linux/vfio.h>
> +#include <linux/module.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/kvm_ppc.h>
> @@ -39,6 +43,36 @@
>  #include <asm/udbg.h>
>  #include <asm/iommu.h>
>  #include <asm/tce.h>
> +#include <asm/mmu_context.h>
> +
> +static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
> +{
> +	void (*fn)(struct vfio_group *);
> +
> +	fn = symbol_get(vfio_group_put_external_user);
> +	if (WARN_ON(!fn))
> +		return;
> +
> +	fn(vfio_group);
> +
> +	symbol_put(vfio_group_put_external_user);
> +}
> +
> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> +{
> +	int (*fn)(struct vfio_group *);
> +	int ret = -1;
> +
> +	fn = symbol_get(vfio_external_user_iommu_id);
> +	if (!fn)
> +		return ret;
> +
> +	ret = fn(vfio_group);
> +
> +	symbol_put(vfio_external_user_iommu_id);
> +
> +	return ret;
> +}
>  
>  static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
>  {
> @@ -90,6 +124,123 @@ static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
>  	return ret;
>  }
>  
> +static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
> +{
> +	struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
> +			struct kvmppc_spapr_tce_iommu_table, rcu);
> +
> +	iommu_table_put(stit->tbl);
> +	kvm_vfio_group_put_external_user(stit->group);
> +
> +	kfree(stit);
> +}
> +
> +static void kvm_spapr_tce_liobn_release_iommu_group(
> +		struct kvmppc_spapr_tce_table *stt,
> +		struct vfio_group *group)
> +{
> +	struct kvmppc_spapr_tce_iommu_table *stit, *tmp;
> +
> +	list_for_each_entry_safe(stit, tmp, &stt->iommu_tables, next) {
> +		if (group && (stit->group != group))
> +			continue;
> +
> +		list_del_rcu(&stit->next);
> +
> +		call_rcu(&stit->rcu, kvm_spapr_tce_iommu_table_free);
> +	}
> +}
> +
> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
> +		struct vfio_group *group)
> +{
> +	struct kvmppc_spapr_tce_table *stt;
> +
> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list)
> +		kvm_spapr_tce_liobn_release_iommu_group(stt, group);
> +}
> +
> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
> +		struct vfio_group *group)
> +{
> +	struct kvmppc_spapr_tce_table *stt = NULL;
> +	bool found = false;
> +	struct iommu_table *tbl = NULL;
> +	struct iommu_table_group *table_group;
> +	long i, ret = 0;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
> +	struct fd f;
> +	int group_id;
> +	struct iommu_group *grp;
> +
> +	group_id = kvm_vfio_external_user_iommu_id(group);
> +	grp = iommu_group_get_by_id(group_id);
> +	if (!grp)
> +		return -EFAULT;

EFAULT doesn't look right, that usually means userspace has given us
a bad address.  What does failure to look up the iommu group by id
mean here?


> +
> +	f = fdget(tablefd);
> +	if (!f.file) {
> +		ret = -EBADF;
> +		goto put_exit;
> +	}
> +
> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
> +		if (stt == f.file->private_data) {
> +			found = true;
> +			break;
> +		}
> +	}
> +
> +	fdput(f);
> +
> +	if (!found) {
> +		ret = -ENODEV;

ENODEV doesn't look right either.  That generally means you're trying
to use a device or facility that doesn't exist.  This case just means
you've passed a file handle that either isn't a TCE table at all, or
os one associated with a different VM.  -EINVAL, I guess, overloaded
as it is.

> +		goto put_exit;

Don't you need to put the table fd as well as the iommu group which
you put in that exit path?

> +	}
> +
> +	table_group = iommu_group_get_iommudata(grp);
> +	if (WARN_ON(!table_group)) {
> +		ret = -EFAULT;
> +		goto put_exit;

Again, don't you need to put the table fd as well?

> +	}
> +
> +	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
> +		struct iommu_table *tbltmp = table_group->tables[i];
> +
> +		if (!tbltmp)
> +			continue;
> +
> +		/*
> +		 * Make sure hardware table parameters are exactly the same;
> +		 * this is used in the TCE handlers where boundary checks
> +		 * use only the first attached table.
> +		 */
> +		if ((tbltmp->it_page_shift == stt->page_shift) &&
> +				(tbltmp->it_offset == stt->offset) &&
> +				(tbltmp->it_size == stt->size)) {
> +			tbl = tbltmp;
> +			break;
> +		}
> +	}
> +	if (!tbl) {
> +		ret = -ENODEV;

Again, ENODEV doesn't seem right.  Here the problem is that the host
hardware constraints don't match the guest hardware constraints.
Hmm.  EIO?  ENOSPC?

> +		goto put_exit;
> +	}
> +
> +	iommu_table_get(tbl);
> +
> +	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
> +	stit->tbl = tbl;
> +	stit->group = group;
> +
> +	list_add_rcu(&stit->next, &stt->iommu_tables);

So if you add the same group to the same liobn multiple times, you'll
get multiple identical entries in this list.

I guess that's mostly harmless... although.. does it allow the user to
force the allocation of arbitrary amounts of kernel memory in that
list?

> +put_exit:
> +	iommu_group_put(grp);
> +
> +	return ret;
> +}
> +
>  static void release_spapr_tce_table(struct rcu_head *head)
>  {
>  	struct kvmppc_spapr_tce_table *stt = container_of(head,
> @@ -132,6 +283,8 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
>  
>  	list_del_rcu(&stt->list);
>  
> +	kvm_spapr_tce_liobn_release_iommu_group(stt, NULL /* release all */);
> +
>  	kvm_put_kvm(stt->kvm);
>  
>  	kvmppc_account_memlimit(
> @@ -181,6 +334,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  	stt->offset = args->offset;
>  	stt->size = size;
>  	stt->kvm = kvm;
> +	INIT_LIST_HEAD_RCU(&stt->iommu_tables);
>  
>  	for (i = 0; i < npages; i++) {
>  		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
> @@ -209,11 +363,94 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  	return ret;
>  }
>  
> +static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	struct mm_iommu_table_group_mem_t *mem = NULL;
> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +
> +	if (!pua)
> +		return H_HARDWARE;

What could trigger this error?  Should it be a WARN_ON?

> +	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
> +	if (!mem)
> +		return H_TOO_HARD;
> +
> +	mm_iommu_mapped_dec(mem);
> +
> +	*pua = 0;
> +
> +	return H_SUCCESS;
> +}
> +
> +static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	enum dma_data_direction dir = DMA_NONE;
> +	unsigned long hpa = 0;
> +	long ret;
> +
> +	if (iommu_tce_xchg(tbl, entry, &hpa, &dir))
> +		return H_HARDWARE;
> +
> +	if (dir == DMA_NONE)
> +		return H_SUCCESS;
> +
> +	ret = kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
> +	if (ret != H_SUCCESS)
> +		iommu_tce_xchg(tbl, entry, &hpa, &dir);
> +
> +	return ret;
> +}
> +
> +long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
> +		unsigned long entry, unsigned long gpa,
> +		enum dma_data_direction dir)
> +{
> +	long ret;
> +	unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +	struct mm_iommu_table_group_mem_t *mem;
> +
> +	if (!pua)
> +		/* it_userspace allocation might be delayed */
> +		return H_TOO_HARD;
> +
> +	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
> +		return H_PARAMETER;
> +
> +	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
> +	if (!mem)
> +		return H_TOO_HARD;
> +
> +	if (mm_iommu_ua_to_hpa(mem, ua, &hpa))
> +		return H_HARDWARE;

IIUC this would happen if qemu had failed to preregister all of guest
RAM, making this indeed an H_HARDWARE.

> +	if (mm_iommu_mapped_inc(mem))
> +		return H_HARDWARE;

I'm less clear on when this one would happen.

> +
> +	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
> +	if (ret) {
> +		mm_iommu_mapped_dec(mem);
> +		return H_TOO_HARD;
> +	}
> +
> +	if (dir != DMA_NONE)
> +		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
> +
> +	*pua = ua;
> +
> +	return 0;
> +}
> +
>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		      unsigned long ioba, unsigned long tce)
>  {
>  	struct kvmppc_spapr_tce_table *stt;
> -	long ret;
> +	long ret, idx;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
> +	unsigned long entry, gpa;
> +	enum dma_data_direction dir;
>  
>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>  	/* 	    liobn, ioba, tce); */
> @@ -230,6 +467,36 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  	if (ret != H_SUCCESS)
>  		return ret;
>  
> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> +			struct kvmppc_spapr_tce_iommu_table, next);
> +	if (stit) {
> +		entry = ioba >> stit->tbl->it_page_shift;
> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +		dir = iommu_tce_direction(tce);
> +
> +		if (dir == DMA_NONE) {
> +			if (iommu_tce_clear_param_check(stit->tbl, ioba, 0, 1))
> +				return H_PARAMETER;
> +		} else {
> +			if (iommu_tce_put_param_check(stit->tbl, ioba, gpa))

Any way you could make these param check functions based on stt
instead of stit->tbl?  That would let you do them before checking if
there are any hw tables to update, avoiding the somewhat awkward
	if (at least one)
		for (each one)
construct.
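
Something like this untested sketch is what I have in mind (the helper
name is made up and it only relies on the stt fields, which mirror the
guest view):

static long kvmppc_stt_put_param_check(struct kvmppc_spapr_tce_table *stt,
		unsigned long ioba, unsigned long gpa)
{
	unsigned long mask = (1ULL << stt->page_shift) - 1;
	unsigned long idx = ioba >> stt->page_shift;

	/*
	 * Both the bus address and the guest physical address must be
	 * aligned to the IOMMU page size.
	 */
	if ((ioba & mask) || (gpa & mask))
		return H_PARAMETER;

	/* The page index must fall within the guest DMA window */
	if ((idx < stt->offset) || (idx >= (stt->offset + stt->size)))
		return H_PARAMETER;

	return H_SUCCESS;
}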

> +				return H_PARAMETER;
> +		}
> +
> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +			if (dir == DMA_NONE) {
> +				ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
> +						stit->tbl, entry);
> +			} else {
> +				idx = srcu_read_lock(&vcpu->kvm->srcu);
> +				ret = kvmppc_tce_iommu_map(vcpu->kvm, stit->tbl,
> +						entry, gpa, dir);
> +				srcu_read_unlock(&vcpu->kvm->srcu, idx);
> +			}
> +			if (ret != H_SUCCESS)
> +				return ret;

Doesn't this error path need to clean up for the case where you
managed to update some backing TCE tables, but then failed on later ones?

> +		}
> +	}
> +
>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
>  
>  	return H_SUCCESS;
> @@ -242,9 +509,10 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret = H_SUCCESS, idx;
> -	unsigned long entry, ua = 0;
> +	unsigned long entry, gpa, ua = 0;
>  	u64 __user *tces;
>  	u64 tce;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -272,6 +540,9 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	}
>  	tces = (u64 __user *) ua;
>  
> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> +			struct kvmppc_spapr_tce_iommu_table, next);
> +
>  	for (i = 0; i < npages; ++i) {
>  		if (get_user(tce, tces + i)) {
>  			ret = H_TOO_HARD;
> @@ -282,6 +553,15 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  		ret = kvmppc_tce_validate(stt, tce);
>  		if (ret != H_SUCCESS)
>  			goto unlock_exit;
> +
> +		if (stit) {
> +			gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +			ret = iommu_tce_put_param_check(stit->tbl,
> +					ioba + (i << stit->tbl->it_page_shift),
> +					gpa);
> +			if (ret != H_SUCCESS)
> +				goto unlock_exit;
> +		}
>  	}
>  
>  	for (i = 0; i < npages; ++i) {
> @@ -291,6 +571,21 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  		}
>  		tce = be64_to_cpu(tce);
>  
> +		if (stit) {
> +			for (i = 0; i < npages; ++i) {
> +				gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +
> +				list_for_each_entry_lockless(stit,
> +						&stt->iommu_tables, next) {
> +					ret = kvmppc_tce_iommu_map(vcpu->kvm,
> +						stit->tbl, entry + i, gpa,
> +						iommu_tce_direction(tce));
> +					if (ret != H_SUCCESS)
> +						goto unlock_exit;
> +				}

Um.. what value will this for_each leave in stit after completion?  I
suspect it will be something bogus, which means re-using stit in the
next 0..npages loop iteration won't be safe (you only initialize stit
with the first entry outside that loop).
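
I'd keep the "is anything attached at all" pointer separate from the
iteration cursor, roughly like below (sketch only; stit_first is what
list_first_entry_or_null() returned before the loop, and this also drops
the inner copy of the 'i' loop):

		if (stit_first) {
			gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);

			list_for_each_entry_lockless(stit, &stt->iommu_tables,
					next) {
				ret = kvmppc_tce_iommu_map(vcpu->kvm,
						stit->tbl, entry + i, gpa,
						iommu_tce_direction(tce));
				if (ret != H_SUCCESS)
					goto unlock_exit;
			}
		}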

> +			}
> +		}
> +
>  		kvmppc_tce_put(stt, entry + i, tce);
>  	}
>  
> @@ -307,6 +602,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -320,6 +616,25 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>  		return H_PARAMETER;
>  
> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> +			struct kvmppc_spapr_tce_iommu_table, next);
> +	if (stit) {
> +		if (iommu_tce_clear_param_check(stit->tbl, ioba,
> +					tce_value, npages))
> +			return H_PARAMETER;
> +
> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +			unsigned long entry = ioba >> stit->tbl->it_page_shift;
> +
> +			for (i = 0; i < npages; ++i) {
> +				ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
> +						stit->tbl, entry + i);
> +				if (ret)
> +					return ret;

Again, do you need some sort of cleanup for partial completion?


> +			}
> +		}
> +	}
> +
>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>  
> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> index dc1c66fda941..018c7d94a575 100644
> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -178,11 +178,104 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
>  EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
>  
>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> +static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm *kvm,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	struct mm_iommu_table_group_mem_t *mem = NULL;
> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +
> +	if (!pua)
> +		return H_HARDWARE;
> +
> +	pua = (void *) vmalloc_to_phys(pua);
> +	if (!pua)
> +		return H_TOO_HARD;
> +
> +	mem = mm_iommu_lookup_rm(kvm->mm, *pua, pgsize);
> +	if (!mem)
> +		return H_TOO_HARD;
> +
> +	mm_iommu_mapped_dec(mem);
> +
> +	*pua = 0;
> +
> +	return H_SUCCESS;
> +}
> +
> +static long kvmppc_rm_tce_iommu_unmap(struct kvm *kvm,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	enum dma_data_direction dir = DMA_NONE;
> +	unsigned long hpa = 0;
> +	long ret;
> +
> +	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
> +		return H_HARDWARE;
> +
> +	if (dir == DMA_NONE)
> +		return H_SUCCESS;
> +
> +	ret = kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
> +	if (ret)
> +		iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> +
> +	return ret;
> +}
> +
> +long kvmppc_rm_tce_iommu_map(struct kvm_vcpu *vcpu, struct iommu_table *tbl,
> +		unsigned long entry, unsigned long gpa,
> +		enum dma_data_direction dir)
> +{
> +	long ret;
> +	unsigned long hpa = 0, ua;
> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +	struct mm_iommu_table_group_mem_t *mem;
> +
> +	if (!pua)
> +		/* it_userspace allocation might be delayed */
> +		return H_TOO_HARD;
> +
> +	if (kvmppc_gpa_to_ua(vcpu->kvm, gpa, &ua, NULL))
> +		return H_PARAMETER;
> +
> +	mem = mm_iommu_lookup_rm(vcpu->kvm->mm, ua, 1ULL << tbl->it_page_shift);
> +	if (!mem)
> +		return H_TOO_HARD;
> +
> +	if (mm_iommu_ua_to_hpa_rm(mem, ua, &hpa))
> +		return H_HARDWARE;
> +
> +	pua = (void *) vmalloc_to_phys(pua);
> +	if (!pua)
> +		return H_HARDWARE;

What circumstances can this fail under?  Does it need to be H_TOO_HARD instead?

> +
> +	if (mm_iommu_mapped_inc(mem))
> +		return H_HARDWARE;
> +
> +	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> +	if (ret) {
> +		mm_iommu_mapped_dec(mem);
> +		return H_TOO_HARD;
> +	}
> +
> +	if (dir != DMA_NONE)
> +		kvmppc_rm_tce_iommu_mapped_dec(vcpu->kvm, tbl, entry);
> +
> +	*pua = ua;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(kvmppc_rm_tce_iommu_map);
> +
>  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		unsigned long ioba, unsigned long tce)
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long ret;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
> +	unsigned long entry, gpa;
> +	enum dma_data_direction dir;
>  
>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>  	/* 	    liobn, ioba, tce); */
> @@ -199,6 +292,33 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  	if (ret != H_SUCCESS)
>  		return ret;
>  
> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> +			struct kvmppc_spapr_tce_iommu_table, next);
> +	if (stit) {
> +		entry = ioba >> stit->tbl->it_page_shift;
> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +		dir = iommu_tce_direction(tce);
> +
> +		if (dir == DMA_NONE) {
> +			if (iommu_tce_clear_param_check(stit->tbl, ioba, 0, 1))
> +				return H_PARAMETER;
> +		} else {
> +			if (iommu_tce_put_param_check(stit->tbl, ioba, gpa))
> +				return H_PARAMETER;
> +		}
> +
> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +			if (dir == DMA_NONE)
> +				ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
> +						stit->tbl, entry);
> +			else
> +				ret = kvmppc_rm_tce_iommu_map(vcpu, stit->tbl,
> +						entry, gpa, dir);
> +			if (ret != H_SUCCESS)
> +				return ret;
> +		}
> +	}
> +
>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
>  
>  	return H_SUCCESS;
> @@ -237,9 +357,10 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret = H_SUCCESS;
> -	unsigned long tces, entry, tce, ua = 0;
> +	unsigned long tces, entry, gpa, tce, ua = 0;
>  	unsigned long *rmap = NULL;
>  	bool prereg = false;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -303,17 +424,45 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  		}
>  	}
>  
> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> +			struct kvmppc_spapr_tce_iommu_table, next);
> +
>  	for (i = 0; i < npages; ++i) {
>  		tce = be64_to_cpu(((u64 *)tces)[i]);
>  
>  		ret = kvmppc_tce_validate(stt, tce);
>  		if (ret != H_SUCCESS)
>  			goto unlock_exit;
> +
> +		if (stit) {
> +			gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +			ret = iommu_tce_put_param_check(stit->tbl,
> +					ioba + (i << stit->tbl->it_page_shift),
> +					gpa);
> +			if (ret != H_SUCCESS)
> +				goto unlock_exit;
> +
> +		}
>  	}
>  
>  	for (i = 0; i < npages; ++i) {
>  		tce = be64_to_cpu(((u64 *)tces)[i]);

As noted in the earlier patch this is really dangerous - by reloading
the tce from userspace you've thrown away the verification above.

> +		if (stit) {
> +			for (i = 0; i < npages; ++i) {
> +				gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +
> +				list_for_each_entry_lockless(stit,
> +						&stt->iommu_tables, next) {
> +					ret = kvmppc_rm_tce_iommu_map(vcpu,
> +						stit->tbl, entry + i, gpa,
> +						iommu_tce_direction(tce));
> +					if (ret != H_SUCCESS)
> +						goto unlock_exit;
> +				}
> +			}
> +		}
> +
>  		kvmppc_tce_put(stt, entry + i, tce);
>  	}
>  
> @@ -330,6 +479,8 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
> +
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -343,6 +494,25 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>  		return H_PARAMETER;
>  
> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> +			struct kvmppc_spapr_tce_iommu_table, next);
> +	if (stit) {
> +		if (iommu_tce_clear_param_check(stit->tbl, ioba,
> +					tce_value, npages))
> +			return H_PARAMETER;
> +
> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +			unsigned long entry = ioba >> stit->tbl->it_page_shift;
> +
> +			for (i = 0; i < npages; ++i) {
> +				ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
> +						stit->tbl, entry + i);
> +				if (ret)
> +					return ret;
> +			}
> +		}
> +	}
> +
>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>  
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index cd892dec7cb6..f3127dc87912 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -536,6 +536,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  #ifdef CONFIG_PPC_BOOK3S_64
>  	case KVM_CAP_SPAPR_TCE:
>  	case KVM_CAP_SPAPR_TCE_64:
> +		/* fallthrough */

I'm not sure why this one should get a fallthrough comment, when none
of the other cases do.

> +	case KVM_CAP_SPAPR_TCE_VFIO:
>  	case KVM_CAP_PPC_RTAS:
>  	case KVM_CAP_PPC_FIXUP_HCALL:
>  	case KVM_CAP_PPC_ENABLE_HCALL:
> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> index d32f239eb471..2b7dc22265fe 100644
> --- a/virt/kvm/vfio.c
> +++ b/virt/kvm/vfio.c
> @@ -20,6 +20,10 @@
>  #include <linux/vfio.h>
>  #include "vfio.h"
>  
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +#include <asm/kvm_ppc.h>
> +#endif
> +
>  struct kvm_vfio_group {
>  	struct list_head node;
>  	struct vfio_group *vfio_group;
> @@ -211,6 +215,9 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>  
>  		mutex_unlock(&kv->lock);
>  
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +		kvm_spapr_tce_release_iommu_group(dev->kvm, vfio_group);
> +#endif
>  		kvm_vfio_group_set_kvm(vfio_group, NULL);
>  
>  		kvm_vfio_group_put_external_user(vfio_group);
> @@ -218,6 +225,53 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>  		kvm_vfio_update_coherency(dev);
>  
>  		return ret;
> +
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
> +		struct kvm_vfio_spapr_tce param;
> +		unsigned long minsz;
> +		struct kvm_vfio *kv = dev->private;
> +		struct vfio_group *vfio_group;
> +		struct kvm_vfio_group *kvg;
> +		struct fd f;
> +
> +		minsz = offsetofend(struct kvm_vfio_spapr_tce, tablefd);
> +
> +		if (copy_from_user(&param, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (param.argsz < minsz || param.flags)
> +			return -EINVAL;
> +
> +		f = fdget(param.groupfd);
> +		if (!f.file)
> +			return -EBADF;
> +
> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
> +		fdput(f);
> +
> +		if (IS_ERR(vfio_group))
> +			return PTR_ERR(vfio_group);
> +


Is there any particular reason you unwrap the group fd here, but the
table fd inside kvm_spapr_tce_attach_iommu_group()?

> +		ret = -ENOENT;
> +
> +		mutex_lock(&kv->lock);
> +
> +		list_for_each_entry(kvg, &kv->group_list, node) {
> +			if (kvg->vfio_group != vfio_group)
> +				continue;
> +
> +			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
> +					param.tablefd, vfio_group);
> +
> +			break;
> +		}
> +
> +		mutex_unlock(&kv->lock);
> +
> +		return ret;
> +	}
> +#endif /* CONFIG_SPAPR_TCE_IOMMU */
>  	}
>  
>  	return -ENXIO;
> @@ -242,6 +296,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
>  		switch (attr->attr) {
>  		case KVM_DEV_VFIO_GROUP_ADD:
>  		case KVM_DEV_VFIO_GROUP_DEL:
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
> +#endif
>  			return 0;
>  		}
>  
> @@ -257,6 +314,9 @@ static void kvm_vfio_destroy(struct kvm_device *dev)
>  	struct kvm_vfio_group *kvg, *tmp;
>  
>  	list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +		kvm_spapr_tce_release_iommu_group(dev->kvm, kvg->vfio_group);
> +#endif
>  		kvm_vfio_group_set_kvm(kvg->vfio_group, NULL);
>  		kvm_vfio_group_put_external_user(kvg->vfio_group);
>  		list_del(&kvg->node);

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH kernel v4 08/10] KVM: PPC: Separate TCE validation from update
  2017-02-09  3:51     ` David Gibson
@ 2017-02-09  8:20       ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 49+ messages in thread
From: Alexey Kardashevskiy @ 2017-02-09  8:20 UTC (permalink / raw)
  To: David Gibson; +Cc: linuxppc-dev, Alex Williamson, Paul Mackerras, kvm-ppc, kvm


[-- Attachment #1.1: Type: text/plain, Size: 5068 bytes --]

On 09/02/17 14:51, David Gibson wrote:
> On Tue, Feb 07, 2017 at 06:17:09PM +1100, Alexey Kardashevskiy wrote:
>> For the emulated devices it does not matter much if we get a broken TCE
>> half way handling a TCE list but for VFIO it will matter as it has
>> more chances to fail so we try to do our best and check as much as we
>> can before proceeding.
>>
>> This separates a guest view table update from validation. No change in
>> behavior is expected.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>  arch/powerpc/kvm/book3s_64_vio.c    | 8 ++++++++
>>  arch/powerpc/kvm/book3s_64_vio_hv.c | 8 ++++++--
>>  2 files changed, 14 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
>> index 15df8ae627d9..9a7b7fca5e84 100644
>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>> @@ -282,6 +282,14 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>  		ret = kvmppc_tce_validate(stt, tce);
>>  		if (ret != H_SUCCESS)
>>  			goto unlock_exit;
>> +	}
>> +
>> +	for (i = 0; i < npages; ++i) {
>> +		if (get_user(tce, tces + i)) {
>> +			ret = H_TOO_HARD;
>> +			goto unlock_exit;
>> +		}
>> +		tce = be64_to_cpu(tce);
> 
> This doesn't look safe.  The contents of user memory could change
> between the two get_user()s, meaning that you're no longer guaranteed
> a TCE loaded into kernel has been validated at all.
> 
> I think you need to either:
> 
>     a) Make sure things safe against a bad TCE being loaded into a TCE
>     table and move all validation to where the TCE is used, rather
>     than loaded
> 
> or
>     b) Copy the whole set of indirect entries to a temporary in-kernel
>        buffer, then validate, then load into the actual TCE table.


Correct :( The problem is I do not know how far I want to go in reverting
the state to what it was when I started handling H_PUT_TCE_INDIRECT.

For example: 1 container, 2 IOMMU groups with shared tables disabled, so -
2 tables, a 512-TCE request, and TCE#100 does not translate to a host
physical address.


To do a) I'll need to remember the old content of each hardware table
entry: when I reach TCE#100, I'll need to revert to the initial state,
which means writing back the old TCEs to all affected hardware tables and
updating the reference counters of all affected preregistered areas. Well,
the actual tables must not have different addresses (BUG_ON? is it worth
testing, while writing to the hardware tables, that the values I am
replacing are the same in all tables?) so I can keep just a single array
of old TCEs from the hardware tables in the vcpu.


To do b) I'll need:

1. a copy of the TCEs from the guest in the vcpu, populated via
get_user() to make sure they won't change;
2. an array of userspace addresses translated from the given TCEs; and in
order to make sure these addresses won't go away, I'll need to reference
each preregistered memory area via mm_iommu_mapped_inc().

When I reach TCE#100, I'll have to revert the change, i.e. call
mm_iommu_mapped_dec().

So I will end up having 2 arrays in a vcpu and simpler reverting code.


Or I can do a simpler version of b) which would store guest TCEs in
kvm_vcpu_arch::tces[512] and use them after checking. If a malicious guest
does something bad and I return from H_PUT_TCE_INDIRECT in the middle of a
request, some preregistered regions will stay referenced till the guest is
killed or rebooted (and this will prevent the memory from unregistering) -
but this means no harm to the host; and with preregistered RAM, there is
no valid reason for H_PUT_TCE_INDIRECT to fail for a well-behaved guest.
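
A rough sketch of that simpler b) for the virtual mode handler (just to
illustrate the idea; the vcpu->arch.tces[] field does not exist yet and
512 is the H_PUT_TCE_INDIRECT limit on npages):

	/* struct kvm_vcpu_arch would get: u64 tces[512]; */

	for (i = 0; i < npages; ++i) {
		if (get_user(tce, tces + i)) {
			ret = H_TOO_HARD;
			goto unlock_exit;
		}
		tce = be64_to_cpu(tce);

		ret = kvmppc_tce_validate(stt, tce);
		if (ret != H_SUCCESS)
			goto unlock_exit;

		vcpu->arch.tces[i] = tce;
	}

	/* from here on only the in-kernel copy is used */
	for (i = 0; i < npages; ++i)
		kvmppc_tce_put(stt, entry + i, vcpu->arch.tces[i]);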



Which approach to pick?


LoPAPR says:
===
If the TCE parameter represents the logical page address of a page that is
not valid for the calling partition, return
H_Parameter.
===



>>  
>>  		kvmppc_tce_put(stt, entry + i, tce);
>>  	}
>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
>> index 918af76ab2b6..f8a54b7c788e 100644
>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
>> @@ -237,7 +237,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>  {
>>  	struct kvmppc_spapr_tce_table *stt;
>>  	long i, ret = H_SUCCESS;
>> -	unsigned long tces, entry, ua = 0;
>> +	unsigned long tces, entry, tce, ua = 0;
>>  	unsigned long *rmap = NULL;
>>  
>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>> @@ -279,11 +279,15 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>  	}
>>  
>>  	for (i = 0; i < npages; ++i) {
>> -		unsigned long tce = be64_to_cpu(((u64 *)tces)[i]);
>> +		tce = be64_to_cpu(((u64 *)tces)[i]);
>>  
>>  		ret = kvmppc_tce_validate(stt, tce);
>>  		if (ret != H_SUCCESS)
>>  			goto unlock_exit;
>> +	}
>> +
>> +	for (i = 0; i < npages; ++i) {
>> +		tce = be64_to_cpu(((u64 *)tces)[i]);
> 
> Same problem here.
> 
>>  
>>  		kvmppc_tce_put(stt, entry + i, tce);
>>  	}
> 


-- 
Alexey


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 839 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH kernel v4 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
  2017-02-09  6:41     ` David Gibson
  (?)
@ 2017-02-10  2:50       ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 49+ messages in thread
From: Alexey Kardashevskiy @ 2017-02-10  2:50 UTC (permalink / raw)
  To: David Gibson; +Cc: Paul Mackerras, Alex Williamson, linuxppc-dev, kvm, kvm-ppc


[-- Attachment #1.1: Type: text/plain, Size: 36911 bytes --]

On 09/02/17 17:41, David Gibson wrote:
> On Tue, Feb 07, 2017 at 06:17:11PM +1100, Alexey Kardashevskiy wrote:
>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
>> and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
>> without passing them to user space, which saves time on switching
>> to user space and back.
>>
>> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
>> KVM tries to handle a TCE request in real mode; if that fails,
>> it passes the request to virtual mode to complete the operation.
>> If the virtual mode handler fails, the request is passed to
>> user space; this is not expected to happen though.
>>
>> To avoid dealing with page use counters (which is tricky in real mode),
>> this only accelerates SPAPR TCE IOMMU v2 clients which are required
>> to pre-register the userspace memory. The very first TCE request will
>> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
>> of the TCE table (iommu_table::it_userspace) is not allocated till
>> the very first mapping happens and we cannot call vmalloc in real mode.
>>
>> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
>> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
>> and associates a physical IOMMU table with the SPAPR TCE table (which
>> is a guest view of the hardware IOMMU table). The iommu_table object
>> is cached and referenced so we do not have to look up for it in real mode.
>>
>> This does not implement the UNSET counterpart as there is no use for it -
>> once the acceleration is enabled, the existing userspace won't
>> disable it unless a VFIO container is destroyed; this adds necessary
>> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
>>
>> As this creates a descriptor per IOMMU table-LIOBN couple (called
>> kvmppc_spapr_tce_iommu_table), it is possible to have several
>> descriptors with the same iommu_table (hardware IOMMU table) attached
>> to the same LIOBN; we do not remove duplicates though as
>> iommu_table_ops::exchange does not just update a TCE entry (which is
>> shared among IOMMU groups) but also invalidates the TCE cache
>> (one per IOMMU group).
>>
>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
>> space.
>>
>> This finally makes use of vfio_external_user_iommu_id() which was
>> introduced quite some time ago and was considered for removal.
>>
>> Tests show that this patch increases transmission speed from 220MB/s
>> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>> Changes:
>> v4:
>> * added note to the commit log about allowing multiple updates of
>> the same IOMMU table;
>> * instead of checking for if any memory was preregistered, this
>> returns H_TOO_HARD if a specific page was not;
>> * fixed comments from v3 about error handling in many places;
>> * simplified TCE handlers and merged IOMMU parts inline - for example,
>> there used to be kvmppc_h_put_tce_iommu(), now it is merged into
>> kvmppc_h_put_tce(); this allows to check IOBA boundaries against
>> the first attached table only (makes the code simpler);
>>
>> v3:
>> * simplified not to use VFIO group notifiers
>> * reworked cleanup, should be cleaner/simpler now
>>
>> v2:
>> * reworked to use new VFIO notifiers
>> * now same iommu_table may appear in the list several times, to be fixed later
>> ---
>>
>> This has separate copies of handlers for real and virtual modes as
>> in fact H_PUT_TCE and H_STUFF_TCE could share a lot (common helpers
>> would take a "realmode" flag) but H_PUT_TCE_INDIRECT uses get_user()
>> in virtual mode and direct access in real mode and having a common
>> helper for it would make things uglier imho.
>>
>>
>> ---
>>  Documentation/virtual/kvm/devices/vfio.txt |  22 +-
>>  arch/powerpc/include/asm/kvm_host.h        |   8 +
>>  arch/powerpc/include/asm/kvm_ppc.h         |   4 +
>>  include/uapi/linux/kvm.h                   |   8 +
>>  arch/powerpc/kvm/book3s_64_vio.c           | 319 ++++++++++++++++++++++++++++-
>>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 172 +++++++++++++++-
>>  arch/powerpc/kvm/powerpc.c                 |   2 +
>>  virt/kvm/vfio.c                            |  60 ++++++
>>  8 files changed, 590 insertions(+), 5 deletions(-)
>>
>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
>> index ef51740c67ca..f95d867168ea 100644
>> --- a/Documentation/virtual/kvm/devices/vfio.txt
>> +++ b/Documentation/virtual/kvm/devices/vfio.txt
>> @@ -16,7 +16,25 @@ Groups:
>>  
>>  KVM_DEV_VFIO_GROUP attributes:
>>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
>> +	kvm_device_attr.addr points to an int32_t file descriptor
>> +	for the VFIO group.
>>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
>> +	kvm_device_attr.addr points to an int32_t file descriptor
>> +	for the VFIO group.
>> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
>> +	allocated by sPAPR KVM.
>> +	kvm_device_attr.addr points to a struct:
>>  
>> -For each, kvm_device_attr.addr points to an int32_t file descriptor
>> -for the VFIO group.
>> +	struct kvm_vfio_spapr_tce {
>> +		__u32	argsz;
>> +		__u32	flags;
>> +		__s32	groupfd;
>> +		__s32	tablefd;
>> +	};
>> +
>> +	where
>> +	@argsz is the size of kvm_vfio_spapr_tce_liobn;
>> +	@flags are not supported now, must be zero;
>> +	@groupfd is a file descriptor for a VFIO group;
>> +	@tablefd is a file descriptor for a TCE table allocated via
>> +		KVM_CREATE_SPAPR_TCE.
>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
>> index e59b172666cd..a827006941f8 100644
>> --- a/arch/powerpc/include/asm/kvm_host.h
>> +++ b/arch/powerpc/include/asm/kvm_host.h
>> @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
>>  	atomic_t refcnt;
>>  };
>>  
>> +struct kvmppc_spapr_tce_iommu_table {
>> +	struct rcu_head rcu;
>> +	struct list_head next;
>> +	struct vfio_group *group;
>> +	struct iommu_table *tbl;
>> +};
>> +
>>  struct kvmppc_spapr_tce_table {
>>  	struct list_head list;
>>  	struct kvm *kvm;
>> @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
>>  	u32 page_shift;
>>  	u64 offset;		/* in pages */
>>  	u64 size;		/* window size in pages */
>> +	struct list_head iommu_tables;
>>  	struct page *pages[0];
>>  };
>>  
>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
>> index 37bc9e7e90ba..da1410bd6b36 100644
>> --- a/arch/powerpc/include/asm/kvm_ppc.h
>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
>> @@ -163,6 +163,10 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
>>  extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
>>  			struct kvm_memory_slot *memslot, unsigned long porder);
>>  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
>> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
>> +		struct vfio_group *group);
>> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
>> +		struct vfio_group *group);
>>  
>>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>  				struct kvm_create_spapr_tce_64 *args);
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index a2c9bb5a0ead..cdfa01169bd2 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -1076,6 +1076,7 @@ struct kvm_device_attr {
>>  #define  KVM_DEV_VFIO_GROUP			1
>>  #define   KVM_DEV_VFIO_GROUP_ADD			1
>>  #define   KVM_DEV_VFIO_GROUP_DEL			2
>> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
>>  
>>  enum kvm_device_type {
>>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
>> @@ -1097,6 +1098,13 @@ enum kvm_device_type {
>>  	KVM_DEV_TYPE_MAX,
>>  };
>>  
>> +struct kvm_vfio_spapr_tce {
>> +	__u32	argsz;
>> +	__u32	flags;
>> +	__s32	groupfd;
>> +	__s32	tablefd;
>> +};
>> +
>>  /*
>>   * ioctls for VM fds
>>   */
>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
>> index 9a7b7fca5e84..cb0469151e35 100644
>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>> @@ -27,6 +27,10 @@
>>  #include <linux/hugetlb.h>
>>  #include <linux/list.h>
>>  #include <linux/anon_inodes.h>
>> +#include <linux/iommu.h>
>> +#include <linux/file.h>
>> +#include <linux/vfio.h>
>> +#include <linux/module.h>
>>  
>>  #include <asm/tlbflush.h>
>>  #include <asm/kvm_ppc.h>
>> @@ -39,6 +43,36 @@
>>  #include <asm/udbg.h>
>>  #include <asm/iommu.h>
>>  #include <asm/tce.h>
>> +#include <asm/mmu_context.h>
>> +
>> +static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
>> +{
>> +	void (*fn)(struct vfio_group *);
>> +
>> +	fn = symbol_get(vfio_group_put_external_user);
>> +	if (WARN_ON(!fn))
>> +		return;
>> +
>> +	fn(vfio_group);
>> +
>> +	symbol_put(vfio_group_put_external_user);
>> +}
>> +
>> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
>> +{
>> +	int (*fn)(struct vfio_group *);
>> +	int ret = -1;
>> +
>> +	fn = symbol_get(vfio_external_user_iommu_id);
>> +	if (!fn)
>> +		return ret;
>> +
>> +	ret = fn(vfio_group);
>> +
>> +	symbol_put(vfio_external_user_iommu_id);
>> +
>> +	return ret;
>> +}
>>  
>>  static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
>>  {
>> @@ -90,6 +124,123 @@ static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
>>  	return ret;
>>  }
>>  
>> +static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
>> +{
>> +	struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
>> +			struct kvmppc_spapr_tce_iommu_table, rcu);
>> +
>> +	iommu_table_put(stit->tbl);
>> +	kvm_vfio_group_put_external_user(stit->group);
>> +
>> +	kfree(stit);
>> +}
>> +
>> +static void kvm_spapr_tce_liobn_release_iommu_group(
>> +		struct kvmppc_spapr_tce_table *stt,
>> +		struct vfio_group *group)
>> +{
>> +	struct kvmppc_spapr_tce_iommu_table *stit, *tmp;
>> +
>> +	list_for_each_entry_safe(stit, tmp, &stt->iommu_tables, next) {
>> +		if (group && (stit->group != group))
>> +			continue;
>> +
>> +		list_del_rcu(&stit->next);
>> +
>> +		call_rcu(&stit->rcu, kvm_spapr_tce_iommu_table_free);
>> +	}
>> +}
>> +
>> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
>> +		struct vfio_group *group)
>> +{
>> +	struct kvmppc_spapr_tce_table *stt;
>> +
>> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list)
>> +		kvm_spapr_tce_liobn_release_iommu_group(stt, group);
>> +}
>> +
>> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
>> +		struct vfio_group *group)
>> +{
>> +	struct kvmppc_spapr_tce_table *stt = NULL;
>> +	bool found = false;
>> +	struct iommu_table *tbl = NULL;
>> +	struct iommu_table_group *table_group;
>> +	long i, ret = 0;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>> +	struct fd f;
>> +	int group_id;
>> +	struct iommu_group *grp;
>> +
>> +	group_id = kvm_vfio_external_user_iommu_id(group);
>> +	grp = iommu_group_get_by_id(group_id);
>> +	if (!grp)
>> +		return -EFAULT;
> 
> EFAULT doesn't look right, that usually means userspace has given us
> a bad address.  What does failure to look up the iommu group by id
> mean here?


iommu_group_get_by_id() can fail:
1. if "something went very wrong" - group ids are allocated when devices
are discovered, so they are pretty static;
2. if there is a racy SR-IOV disable or host PCI hot unplug;
3. if kvm_vfio_external_user_iommu_id() returned an invalid group id, which
means a device was unbound from the vfio-pci driver; but the caller holds
a reference to the vfio_group so this should not happen.


> 
>> +
>> +	f = fdget(tablefd);
>> +	if (!f.file) {
>> +		ret = -EBADF;
>> +		goto put_exit;
>> +	}
>> +
>> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
>> +		if (stt == f.file->private_data) {
>> +			found = true;
>> +			break;
>> +		}
>> +	}
>> +
>> +	fdput(f);
>> +
>> +	if (!found) {
>> +		ret = -ENODEV;
> 
> ENODEV doesn't look right either.  That generally means you're trying
> to use a device or facility that doesn't exist.  This case just means
> you've passed a file handle that either isn't a TCE table at all, or
> is one associated with a different VM.  -EINVAL, I guess, overloaded
> as it is.

Ok.



> 
>> +		goto put_exit;
> 
> Don't you need to put the table fd as well as the iommu group which
> you put in that exit path?


It is put a few lines above.


>> +	}
>> +
>> +	table_group = iommu_group_get_iommudata(grp);
>> +	if (WARN_ON(!table_group)) {
>> +		ret = -EFAULT;
>> +		goto put_exit;
> 
> Again, don't you need to put the table fd as well?

It is put a few lines above; I do not keep it open longer than needed.


> 
>> +	}
>> +
>> +	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
>> +		struct iommu_table *tbltmp = table_group->tables[i];
>> +
>> +		if (!tbltmp)
>> +			continue;
>> +
>> +		/*
>> +		 * Make sure hardware table parameters are exactly the same;
>> +		 * this is used in the TCE handlers where boundary checks
>> +		 * use only the first attached table.
>> +		 */
>> +		if ((tbltmp->it_page_shift == stt->page_shift) &&
>> +				(tbltmp->it_offset == stt->offset) &&
>> +				(tbltmp->it_size == stt->size)) {
>> +			tbl = tbltmp;
>> +			break;
>> +		}
>> +	}
>> +	if (!tbl) {
>> +		ret = -ENODEV;
> 
> Again, ENODEV doesn't seem right.  Here the problem is that the host
> hardware constraints don't match the guest hardware constraints.
> Hmm.  EIO?  ENOSPC?


Neither is very appealing to me... EINVAL?
When I use "ENODEV", I am thinking of "there is no device with
expected/requested characteristics" but this is probably wrong.



> 
>> +		goto put_exit;
>> +	}
>> +
>> +	iommu_table_get(tbl);
>> +
>> +	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
>> +	stit->tbl = tbl;
>> +	stit->group = group;
>> +
>> +	list_add_rcu(&stit->next, &stt->iommu_tables);
> 
> So if you add the same group to the same liobn multiple times, you'll
> get multiple identical entries in this list.
> 
> I guess that's mostly harmless... although.. does it allow the user to
> force the allocation of arbitrary amounts of kernel memory in that
> list?


Oh. No, I'll add a check to avoid duplicates; they do not make sense here.
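
Something like this before the iommu_table_get(tbl), just a sketch and not
tested yet:

	/* Skip the pair if this group/table combination is already listed */
	list_for_each_entry_rcu(stit, &stt->iommu_tables, next) {
		if ((stit->tbl == tbl) && (stit->group == group)) {
			ret = 0;
			goto put_exit;
		}
	}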


> 
>> +put_exit:
>> +	iommu_group_put(grp);
>> +
>> +	return ret;
>> +}
>> +
>>  static void release_spapr_tce_table(struct rcu_head *head)
>>  {
>>  	struct kvmppc_spapr_tce_table *stt = container_of(head,
>> @@ -132,6 +283,8 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
>>  
>>  	list_del_rcu(&stt->list);
>>  
>> +	kvm_spapr_tce_liobn_release_iommu_group(stt, NULL /* release all */);
>> +
>>  	kvm_put_kvm(stt->kvm);
>>  
>>  	kvmppc_account_memlimit(
>> @@ -181,6 +334,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>  	stt->offset = args->offset;
>>  	stt->size = size;
>>  	stt->kvm = kvm;
>> +	INIT_LIST_HEAD_RCU(&stt->iommu_tables);
>>  
>>  	for (i = 0; i < npages; i++) {
>>  		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
>> @@ -209,11 +363,94 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>  	return ret;
>>  }
>>  
>> +static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
>> +		struct iommu_table *tbl, unsigned long entry)
>> +{
>> +	struct mm_iommu_table_group_mem_t *mem = NULL;
>> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
>> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>> +
>> +	if (!pua)
>> +		return H_HARDWARE;
> 
> What could trigger this error?  Should it be a WARN_ON?

Nothing should, so yes, it can be a WARN_ON.


> 
>> +	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
>> +	if (!mem)
>> +		return H_TOO_HARD;
>> +
>> +	mm_iommu_mapped_dec(mem);
>> +
>> +	*pua = 0;
>> +
>> +	return H_SUCCESS;
>> +}
>> +
>> +static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
>> +		struct iommu_table *tbl, unsigned long entry)
>> +{
>> +	enum dma_data_direction dir = DMA_NONE;
>> +	unsigned long hpa = 0;
>> +	long ret;
>> +
>> +	if (iommu_tce_xchg(tbl, entry, &hpa, &dir))
>> +		return H_HARDWARE;
>> +
>> +	if (dir == DMA_NONE)
>> +		return H_SUCCESS;
>> +
>> +	ret = kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
>> +	if (ret != H_SUCCESS)
>> +		iommu_tce_xchg(tbl, entry, &hpa, &dir);
>> +
>> +	return ret;
>> +}
>> +
>> +long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
>> +		unsigned long entry, unsigned long gpa,
>> +		enum dma_data_direction dir)
>> +{
>> +	long ret;
>> +	unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>> +	struct mm_iommu_table_group_mem_t *mem;
>> +
>> +	if (!pua)
>> +		/* it_userspace allocation might be delayed */
>> +		return H_TOO_HARD;
>> +
>> +	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
>> +		return H_PARAMETER;
>> +
>> +	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
>> +	if (!mem)
>> +		return H_TOO_HARD;
>> +
>> +	if (mm_iommu_ua_to_hpa(mem, ua, &hpa))
>> +		return H_HARDWARE;
> 
> IIUC this would happen if qemu had failed to preregister all of guest
> RAM, making this indeed an H_HARDWARE.


If QEMU failed to preregister, then mm_iommu_lookup() fails and it is
TOO_HARD. mm_iommu_ua_to_hpa() in this context cannot possibly fail (unless
memory is broken) as it only returns an error when the address is out of
bounds, and mm_iommu_lookup() has already ensured that it is not.
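
For reference, mm_iommu_ua_to_hpa() is essentially just a bounds-checked
lookup in the array of already pinned pages, roughly (simplified):

long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
		unsigned long ua, unsigned long *hpa)
{
	const long entry = (ua - mem->ua) >> PAGE_SHIFT;

	if (entry >= mem->entries)
		return -EFAULT;

	*hpa = mem->hpas[entry] | (ua & ~PAGE_MASK);

	return 0;
}

and mm_iommu_lookup() only returns @mem when [ua, ua + size) is inside
[mem->ua, mem->ua + (mem->entries << PAGE_SHIFT)), so the bounds check
above cannot trigger here.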



> 
>> +	if (mm_iommu_mapped_inc(mem))
>> +		return H_HARDWARE;
> 
> I'm less clear on when this one would happen.


This may happen when there is a race with mm_iommu_put().


> 
>> +
>> +	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
>> +	if (ret) {
>> +		mm_iommu_mapped_dec(mem);
>> +		return H_TOO_HARD;
>> +	}
>> +
>> +	if (dir != DMA_NONE)
>> +		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
>> +
>> +	*pua = ua;
>> +
>> +	return 0;
>> +}
>> +
>>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>  		      unsigned long ioba, unsigned long tce)
>>  {
>>  	struct kvmppc_spapr_tce_table *stt;
>> -	long ret;
>> +	long ret, idx;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>> +	unsigned long entry, gpa;
>> +	enum dma_data_direction dir;
>>  
>>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>>  	/* 	    liobn, ioba, tce); */
>> @@ -230,6 +467,36 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>  	if (ret != H_SUCCESS)
>>  		return ret;
>>  
>> +	stit = list_first_entry_or_null(&stt->iommu_tables,
>> +			struct kvmppc_spapr_tce_iommu_table, next);
>> +	if (stit) {
>> +		entry = ioba >> stit->tbl->it_page_shift;
>> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>> +		dir = iommu_tce_direction(tce);
>> +
>> +		if (dir == DMA_NONE) {
>> +			if (iommu_tce_clear_param_check(stit->tbl, ioba, 0, 1))
>> +				return H_PARAMETER;
>> +		} else {
>> +			if (iommu_tce_put_param_check(stit->tbl, ioba, gpa))
> 
> Any way you could make these param check functions based on stt
> instead of stit->tbl?  That would let you do them before checking if
> there are any hw tables to update, avoiding the somewhat awkward
> 	if (at least one)
> 		for (each one)
> construct.

I could:
1. change iommu_tce_put_param_check() to take shift, offset, size and drop
the use of IOMMU_PAGE_MASK(tbl) (and change all callers in
vfio_iommu_spapr_tce.c);
2. make a copy of iommu_tce_put_param_check() which would take stt (see the
sketch below).

And yet this code operates on tbl anyway, so it is awkward either way imho...
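
Option 2 could look roughly like this (hypothetical helper, untested; it
relies on the attach-time check that every hardware table has exactly the
same page_shift/offset/size as stt):

static long kvmppc_tce_put_param_check_stt(struct kvmppc_spapr_tce_table *stt,
		unsigned long ioba, unsigned long gpa)
{
	unsigned long mask = (1ULL << stt->page_shift) - 1;
	unsigned long idx = ioba >> stt->page_shift;

	/* ioba and gpa must be aligned to the IOMMU page size */
	if ((ioba & mask) || (gpa & mask))
		return H_PARAMETER;

	/* ioba must fall within the window */
	if ((idx < stt->offset) || (idx >= (stt->offset + stt->size)))
		return H_PARAMETER;

	return H_SUCCESS;
}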



> 
>> +				return H_PARAMETER;
>> +		}
>> +
>> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>> +			if (dir == DMA_NONE) {
>> +				ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
>> +						stit->tbl, entry);
>> +			} else {
>> +				idx = srcu_read_lock(&vcpu->kvm->srcu);
>> +				ret = kvmppc_tce_iommu_map(vcpu->kvm, stit->tbl,
>> +						entry, gpa, dir);
>> +				srcu_read_unlock(&vcpu->kvm->srcu, idx);
>> +			}
>> +			if (ret != H_SUCCESS)
>> +				return ret;
> 
> Doesn't this error path need to clean up for the case where you
> managed to update some backing TCE tables, but then failed later ones?

Probably.

This is what I asked in:
Re: [PATCH kernel v4 08/10] KVM: PPC: Separate TCE validation from update

Failure to update a hardware TCE table means we are in deep trouble; I
cannot think of any valid reason why we could get this far without failing
earlier, only to fail now.
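
If we do decide to unwind anyway, the best-effort variant I can think of is
clearing the entries already updated in the failed pass, roughly (stitmp is
just a scratch iterator here, and this does not restore whatever mapping was
there before):

	list_for_each_entry_lockless(stitmp, &stt->iommu_tables, next) {
		if (stitmp == stit)
			break;
		kvmppc_tce_iommu_unmap(vcpu->kvm, stitmp->tbl, entry);
	}
	return ret;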


> 
>> +		}
>> +	}
>> +
>>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
>>  
>>  	return H_SUCCESS;
>> @@ -242,9 +509,10 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>  {
>>  	struct kvmppc_spapr_tce_table *stt;
>>  	long i, ret = H_SUCCESS, idx;
>> -	unsigned long entry, ua = 0;
>> +	unsigned long entry, gpa, ua = 0;
>>  	u64 __user *tces;
>>  	u64 tce;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>  
>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>>  	if (!stt)
>> @@ -272,6 +540,9 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>  	}
>>  	tces = (u64 __user *) ua;
>>  
>> +	stit = list_first_entry_or_null(&stt->iommu_tables,
>> +			struct kvmppc_spapr_tce_iommu_table, next);
>> +
>>  	for (i = 0; i < npages; ++i) {
>>  		if (get_user(tce, tces + i)) {
>>  			ret = H_TOO_HARD;
>> @@ -282,6 +553,15 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>  		ret = kvmppc_tce_validate(stt, tce);
>>  		if (ret != H_SUCCESS)
>>  			goto unlock_exit;
>> +
>> +		if (stit) {
>> +			gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>> +			ret = iommu_tce_put_param_check(stit->tbl,
>> +					ioba + (i << stit->tbl->it_page_shift),
>> +					gpa);
>> +			if (ret != H_SUCCESS)
>> +				goto unlock_exit;
>> +		}
>>  	}
>>  
>>  	for (i = 0; i < npages; ++i) {
>> @@ -291,6 +571,21 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>  		}
>>  		tce = be64_to_cpu(tce);
>>  
>> +		if (stit) {
>> +			for (i = 0; i < npages; ++i) {
>> +				gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>> +
>> +				list_for_each_entry_lockless(stit,
>> +						&stt->iommu_tables, next) {
>> +					ret = kvmppc_tce_iommu_map(vcpu->kvm,
>> +						stit->tbl, entry + i, gpa,
>> +						iommu_tce_direction(tce));
>> +					if (ret != H_SUCCESS)
>> +						goto unlock_exit;
>> +				}
> 
> Um.. what value will this for_each leave in stit after completion?  I
> suspect it will be something bogus, which means re-using stit in the
> next 0..npages loop iteration won't be safe (you only initialize stit
> with the first entry outside that loop).


#define list_for_each_entry_lockless(pos, head, member) \
  for (pos = list_entry_lockless((head)->next, typeof(*pos), member); \
     &pos->member != (head); \
     pos = list_entry_lockless(pos->member.next, typeof(*pos), member))

stit is "pos" which is reset every time the loop is called.


> 
>> +			}
>> +		}
>> +
>>  		kvmppc_tce_put(stt, entry + i, tce);
>>  	}
>>  
>> @@ -307,6 +602,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>>  {
>>  	struct kvmppc_spapr_tce_table *stt;
>>  	long i, ret;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>  
>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>>  	if (!stt)
>> @@ -320,6 +616,25 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>>  		return H_PARAMETER;
>>  
>> +	stit = list_first_entry_or_null(&stt->iommu_tables,
>> +			struct kvmppc_spapr_tce_iommu_table, next);
>> +	if (stit) {
>> +		if (iommu_tce_clear_param_check(stit->tbl, ioba,
>> +					tce_value, npages))
>> +			return H_PARAMETER;
>> +
>> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>> +			unsigned long entry = ioba >> stit->tbl->it_page_shift;
>> +
>> +			for (i = 0; i < npages; ++i) {
>> +				ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
>> +						stit->tbl, entry + i);
>> +				if (ret)
>> +					return ret;
> 
> Again do you need some sort of cleanup for partial completion?

Again,
Re: [PATCH kernel v4 08/10] KVM: PPC: Separate TCE validation from update

This is an unexpected failure which should not happen, so what kind of
cleanup would make sense here? Re-map what was mapped before H_STUFF_TCE
was called?


> 
> 
>> +			}
>> +		}
>> +	}
>> +
>>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>>  
>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
>> index dc1c66fda941..018c7d94a575 100644
>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
>> @@ -178,11 +178,104 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
>>  EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
>>  
>>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
>> +static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm *kvm,
>> +		struct iommu_table *tbl, unsigned long entry)
>> +{
>> +	struct mm_iommu_table_group_mem_t *mem = NULL;
>> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
>> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>> +
>> +	if (!pua)
>> +		return H_HARDWARE;
>> +
>> +	pua = (void *) vmalloc_to_phys(pua);
>> +	if (!pua)
>> +		return H_TOO_HARD;
>> +
>> +	mem = mm_iommu_lookup_rm(kvm->mm, *pua, pgsize);
>> +	if (!mem)
>> +		return H_TOO_HARD;
>> +
>> +	mm_iommu_mapped_dec(mem);
>> +
>> +	*pua = 0;
>> +
>> +	return H_SUCCESS;
>> +}
>> +
>> +static long kvmppc_rm_tce_iommu_unmap(struct kvm *kvm,
>> +		struct iommu_table *tbl, unsigned long entry)
>> +{
>> +	enum dma_data_direction dir = DMA_NONE;
>> +	unsigned long hpa = 0;
>> +	long ret;
>> +
>> +	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
>> +		return H_HARDWARE;
>> +
>> +	if (dir == DMA_NONE)
>> +		return H_SUCCESS;
>> +
>> +	ret = kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
>> +	if (ret)
>> +		iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
>> +
>> +	return ret;
>> +}
>> +
>> +long kvmppc_rm_tce_iommu_map(struct kvm_vcpu *vcpu, struct iommu_table *tbl,
>> +		unsigned long entry, unsigned long gpa,
>> +		enum dma_data_direction dir)
>> +{
>> +	long ret;
>> +	unsigned long hpa = 0, ua;
>> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>> +	struct mm_iommu_table_group_mem_t *mem;
>> +
>> +	if (!pua)
>> +		/* it_userspace allocation might be delayed */
>> +		return H_TOO_HARD;
>> +
>> +	if (kvmppc_gpa_to_ua(vcpu->kvm, gpa, &ua, NULL))
>> +		return H_PARAMETER;
>> +
>> +	mem = mm_iommu_lookup_rm(vcpu->kvm->mm, ua, 1ULL << tbl->it_page_shift);
>> +	if (!mem)
>> +		return H_TOO_HARD;
>> +
>> +	if (mm_iommu_ua_to_hpa_rm(mem, ua, &hpa))
>> +		return H_HARDWARE;
>> +
>> +	pua = (void *) vmalloc_to_phys(pua);
>> +	if (!pua)
>> +		return H_HARDWARE;
> 
> What circumstances can this fail under?  Does it need to be H_TOO_HARD instead?


When kernel memory gets corrupted and vmalloc_to_page() won't be able to
find a page which was allocated with vmalloc.


>> +
>> +	if (mm_iommu_mapped_inc(mem))
>> +		return H_HARDWARE;
>> +
>> +	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
>> +	if (ret) {
>> +		mm_iommu_mapped_dec(mem);
>> +		return H_TOO_HARD;
>> +	}
>> +
>> +	if (dir != DMA_NONE)
>> +		kvmppc_rm_tce_iommu_mapped_dec(vcpu->kvm, tbl, entry);
>> +
>> +	*pua = ua;
>> +
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(kvmppc_rm_tce_iommu_map);
>> +
>>  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>  		unsigned long ioba, unsigned long tce)
>>  {
>>  	struct kvmppc_spapr_tce_table *stt;
>>  	long ret;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>> +	unsigned long entry, gpa;
>> +	enum dma_data_direction dir;
>>  
>>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>>  	/* 	    liobn, ioba, tce); */
>> @@ -199,6 +292,33 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>  	if (ret != H_SUCCESS)
>>  		return ret;
>>  
>> +	stit = list_first_entry_or_null(&stt->iommu_tables,
>> +			struct kvmppc_spapr_tce_iommu_table, next);
>> +	if (stit) {
>> +		entry = ioba >> stit->tbl->it_page_shift;
>> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>> +		dir = iommu_tce_direction(tce);
>> +
>> +		if (dir == DMA_NONE) {
>> +			if (iommu_tce_clear_param_check(stit->tbl, ioba, 0, 1))
>> +				return H_PARAMETER;
>> +		} else {
>> +			if (iommu_tce_put_param_check(stit->tbl, ioba, gpa))
>> +				return H_PARAMETER;
>> +		}
>> +
>> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>> +			if (dir == DMA_NONE)
>> +				ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
>> +						stit->tbl, entry);
>> +			else
>> +				ret = kvmppc_rm_tce_iommu_map(vcpu, stit->tbl,
>> +						entry, gpa, dir);
>> +			if (ret != H_SUCCESS)
>> +				return ret;
>> +		}
>> +	}
>> +
>>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
>>  
>>  	return H_SUCCESS;
>> @@ -237,9 +357,10 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>  {
>>  	struct kvmppc_spapr_tce_table *stt;
>>  	long i, ret = H_SUCCESS;
>> -	unsigned long tces, entry, tce, ua = 0;
>> +	unsigned long tces, entry, gpa, tce, ua = 0;
>>  	unsigned long *rmap = NULL;
>>  	bool prereg = false;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>  
>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>>  	if (!stt)
>> @@ -303,17 +424,45 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>  		}
>>  	}
>>  
>> +	stit = list_first_entry_or_null(&stt->iommu_tables,
>> +			struct kvmppc_spapr_tce_iommu_table, next);
>> +
>>  	for (i = 0; i < npages; ++i) {
>>  		tce = be64_to_cpu(((u64 *)tces)[i]);
>>  
>>  		ret = kvmppc_tce_validate(stt, tce);
>>  		if (ret != H_SUCCESS)
>>  			goto unlock_exit;
>> +
>> +		if (stit) {
>> +			gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>> +			ret = iommu_tce_put_param_check(stit->tbl,
>> +					ioba + (i << stit->tbl->it_page_shift),
>> +					gpa);
>> +			if (ret != H_SUCCESS)
>> +				goto unlock_exit;
>> +
>> +		}
>>  	}
>>  
>>  	for (i = 0; i < npages; ++i) {
>>  		tce = be64_to_cpu(((u64 *)tces)[i]);
> 
> As noted in the earlier patch this is really dangerous - by reloading
> the tce from userspace you've thrown away the verification above.


Sure, I am adding a cache of the TCE list to kvm_vcpu.
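
Roughly like this (the field name and size are made up here and the final
version may differ; H_PUT_TCE_INDIRECT is limited to 512 TCEs anyway):

	/* in struct kvm_vcpu_arch (hypothetical) */
	u64 tce_cache[512];

	/* the validation pass stores what it has actually checked */
	for (i = 0; i < npages; ++i) {
		tce = be64_to_cpu(((u64 *)tces)[i]);

		ret = kvmppc_tce_validate(stt, tce);
		if (ret != H_SUCCESS)
			goto unlock_exit;

		vcpu->arch.tce_cache[i] = tce;
	}

	/* the update pass only uses the cached values and never re-reads
	 * the guest memory */
	for (i = 0; i < npages; ++i) {
		tce = vcpu->arch.tce_cache[i];
		/* ... map into the hardware tables and kvmppc_tce_put() as before */
	}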


>> +		if (stit) {
>> +			for (i = 0; i < npages; ++i) {
>> +				gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>> +
>> +				list_for_each_entry_lockless(stit,
>> +						&stt->iommu_tables, next) {
>> +					ret = kvmppc_rm_tce_iommu_map(vcpu,
>> +						stit->tbl, entry + i, gpa,
>> +						iommu_tce_direction(tce));
>> +					if (ret != H_SUCCESS)
>> +						goto unlock_exit;
>> +				}
>> +			}
>> +		}
>> +
>>  		kvmppc_tce_put(stt, entry + i, tce);
>>  	}
>>  
>> @@ -330,6 +479,8 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>>  {
>>  	struct kvmppc_spapr_tce_table *stt;
>>  	long i, ret;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>> +
>>  
>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>>  	if (!stt)
>> @@ -343,6 +494,25 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>>  		return H_PARAMETER;
>>  
>> +	stit = list_first_entry_or_null(&stt->iommu_tables,
>> +			struct kvmppc_spapr_tce_iommu_table, next);
>> +	if (stit) {
>> +		if (iommu_tce_clear_param_check(stit->tbl, ioba,
>> +					tce_value, npages))
>> +			return H_PARAMETER;
>> +
>> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>> +			unsigned long entry = ioba >> stit->tbl->it_page_shift;
>> +
>> +			for (i = 0; i < npages; ++i) {
>> +				ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
>> +						stit->tbl, entry + i);
>> +				if (ret)
>> +					return ret;
>> +			}
>> +		}
>> +	}
>> +
>>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>>  
>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>> index cd892dec7cb6..f3127dc87912 100644
>> --- a/arch/powerpc/kvm/powerpc.c
>> +++ b/arch/powerpc/kvm/powerpc.c
>> @@ -536,6 +536,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>>  #ifdef CONFIG_PPC_BOOK3S_64
>>  	case KVM_CAP_SPAPR_TCE:
>>  	case KVM_CAP_SPAPR_TCE_64:
>> +		/* fallthrough */
> 
> I'm not sure why this one should get a fallthrough comment, when none
> of the other cases do.


I believe it was either ignored then or checkpatch.pl did not warn about
this at the time.


> 
>> +	case KVM_CAP_SPAPR_TCE_VFIO:
>>  	case KVM_CAP_PPC_RTAS:
>>  	case KVM_CAP_PPC_FIXUP_HCALL:
>>  	case KVM_CAP_PPC_ENABLE_HCALL:
>> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
>> index d32f239eb471..2b7dc22265fe 100644
>> --- a/virt/kvm/vfio.c
>> +++ b/virt/kvm/vfio.c
>> @@ -20,6 +20,10 @@
>>  #include <linux/vfio.h>
>>  #include "vfio.h"
>>  
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +#include <asm/kvm_ppc.h>
>> +#endif
>> +
>>  struct kvm_vfio_group {
>>  	struct list_head node;
>>  	struct vfio_group *vfio_group;
>> @@ -211,6 +215,9 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>>  
>>  		mutex_unlock(&kv->lock);
>>  
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +		kvm_spapr_tce_release_iommu_group(dev->kvm, vfio_group);
>> +#endif
>>  		kvm_vfio_group_set_kvm(vfio_group, NULL);
>>  
>>  		kvm_vfio_group_put_external_user(vfio_group);
>> @@ -218,6 +225,53 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>>  		kvm_vfio_update_coherency(dev);
>>  
>>  		return ret;
>> +
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
>> +		struct kvm_vfio_spapr_tce param;
>> +		unsigned long minsz;
>> +		struct kvm_vfio *kv = dev->private;
>> +		struct vfio_group *vfio_group;
>> +		struct kvm_vfio_group *kvg;
>> +		struct fd f;
>> +
>> +		minsz = offsetofend(struct kvm_vfio_spapr_tce, tablefd);
>> +
>> +		if (copy_from_user(&param, (void __user *)arg, minsz))
>> +			return -EFAULT;
>> +
>> +		if (param.argsz < minsz || param.flags)
>> +			return -EINVAL;
>> +
>> +		f = fdget(param.groupfd);
>> +		if (!f.file)
>> +			return -EBADF;
>> +
>> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
>> +		fdput(f);
>> +
>> +		if (IS_ERR(vfio_group))
>> +			return PTR_ERR(vfio_group);
>> +
> 
> 
> Is there any particular reason you unwrap the group fd here, but the
> table fd inside kvm_spapr_tce_attach_iommu_group()?

No particular reason, just an intention not to spread too much spapr into
the KVM VFIO device and vfio_group into POWER KVM.

I only unwrap table_fd to see if it is in the kvm->arch.spapr_tce_tables
list; I am trying to keep spapr_tce_tables and kvmppc_spapr_tce_iommu_table
local to arch/powerpc/kvm/book3s_64_vio*.c.

Unwrapping groupfd in arch/powerpc/kvm/book3s_64_vio*.c would mean
duplicating all the kvm_vfio_group_get_external_user()/etc stubs in
arch/powerpc/kvm/book3s_64_vio.c, and I did not want to duplicate them.
I could, but since I already have vfio_group unwrapped here, it seems
pointless to unwrap it over again in arch/powerpc/kvm/book3s_64_vio.c -
should I?



> 
>> +		ret = -ENOENT;
>> +
>> +		mutex_lock(&kv->lock);
>> +
>> +		list_for_each_entry(kvg, &kv->group_list, node) {
>> +			if (kvg->vfio_group != vfio_group)
>> +				continue;
>> +
>> +			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
>> +					param.tablefd, vfio_group);
>> +
>> +			break;
>> +		}
>> +
>> +		mutex_unlock(&kv->lock);
>> +
>> +		return ret;
>> +	}
>> +#endif /* CONFIG_SPAPR_TCE_IOMMU */
>>  	}
>>  
>>  	return -ENXIO;
>> @@ -242,6 +296,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
>>  		switch (attr->attr) {
>>  		case KVM_DEV_VFIO_GROUP_ADD:
>>  		case KVM_DEV_VFIO_GROUP_DEL:
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
>> +#endif
>>  			return 0;
>>  		}
>>  
>> @@ -257,6 +314,9 @@ static void kvm_vfio_destroy(struct kvm_device *dev)
>>  	struct kvm_vfio_group *kvg, *tmp;
>>  
>>  	list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +		kvm_spapr_tce_release_iommu_group(dev->kvm, kvg->vfio_group);
>> +#endif
>>  		kvm_vfio_group_set_kvm(kvg->vfio_group, NULL);
>>  		kvm_vfio_group_put_external_user(kvg->vfio_group);
>>  		list_del(&kvg->node);
> 


-- 
Alexey


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 839 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH kernel v4 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
@ 2017-02-10  2:50       ` Alexey Kardashevskiy
  0 siblings, 0 replies; 49+ messages in thread
From: Alexey Kardashevskiy @ 2017-02-10  2:50 UTC (permalink / raw)
  To: David Gibson; +Cc: linuxppc-dev, Alex Williamson, Paul Mackerras, kvm-ppc, kvm


[-- Attachment #1.1: Type: text/plain, Size: 36911 bytes --]

On 09/02/17 17:41, David Gibson wrote:
> On Tue, Feb 07, 2017 at 06:17:11PM +1100, Alexey Kardashevskiy wrote:
>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
>> and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO
>> without passing them to user space which saves time on switching
>> to user space and back.
>>
>> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
>> KVM tries to handle a TCE request in the real mode, if failed
>> it passes the request to the virtual mode to complete the operation.
>> If it a virtual mode handler fails, the request is passed to
>> the user space; this is not expected to happen though.
>>
>> To avoid dealing with page use counters (which is tricky in real mode),
>> this only accelerates SPAPR TCE IOMMU v2 clients which are required
>> to pre-register the userspace memory. The very first TCE request will
>> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
>> of the TCE table (iommu_table::it_userspace) is not allocated till
>> the very first mapping happens and we cannot call vmalloc in real mode.
>>
>> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
>> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
>> and associates a physical IOMMU table with the SPAPR TCE table (which
>> is a guest view of the hardware IOMMU table). The iommu_table object
>> is cached and referenced so we do not have to look up for it in real mode.
>>
>> This does not implement the UNSET counterpart as there is no use for it -
>> once the acceleration is enabled, the existing userspace won't
>> disable it unless a VFIO container is destroyed; this adds necessary
>> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
>>
>> As this creates a descriptor per IOMMU table-LIOBN couple (called
>> kvmppc_spapr_tce_iommu_table), it is possible to have several
>> descriptors with the same iommu_table (hardware IOMMU table) attached
>> to the same LIOBN; we do not remove duplicates though as
>> iommu_table_ops::exchange not just update a TCE entry (which is
>> shared among IOMMU groups) but also invalidates the TCE cache
>> (one per IOMMU group).
>>
>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
>> space.
>>
>> This finally makes use of vfio_external_user_iommu_id() which was
>> introduced quite some time ago and was considered for removal.
>>
>> Tests show that this patch increases transmission speed from 220MB/s
>> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>> Changes:
>> v4:
>> * added note to the commit log about allowing multiple updates of
>> the same IOMMU table;
>> * instead of checking for if any memory was preregistered, this
>> returns H_TOO_HARD if a specific page was not;
>> * fixed comments from v3 about error handling in many places;
>> * simplified TCE handlers and merged IOMMU parts inline - for example,
>> there used to be kvmppc_h_put_tce_iommu(), now it is merged into
>> kvmppc_h_put_tce(); this allows to check IOBA boundaries against
>> the first attached table only (makes the code simpler);
>>
>> v3:
>> * simplified not to use VFIO group notifiers
>> * reworked cleanup, should be cleaner/simpler now
>>
>> v2:
>> * reworked to use new VFIO notifiers
>> * now same iommu_table may appear in the list several times, to be fixed later
>> ---
>>
>> This has separate copies of handlers for real and virtual modes as
>> in fact H_PUT_TCE and H_STUFF_TCE could share a lot (common helpers
>> would take a "realmode" flag) but H_PUT_TCE_INDIRECT uses get_user()
>> in virtual mode and direct access in real mode and having a common
>> helper for it would make things uglier imho.
>>
>>
>> ---
>>  Documentation/virtual/kvm/devices/vfio.txt |  22 +-
>>  arch/powerpc/include/asm/kvm_host.h        |   8 +
>>  arch/powerpc/include/asm/kvm_ppc.h         |   4 +
>>  include/uapi/linux/kvm.h                   |   8 +
>>  arch/powerpc/kvm/book3s_64_vio.c           | 319 ++++++++++++++++++++++++++++-
>>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 172 +++++++++++++++-
>>  arch/powerpc/kvm/powerpc.c                 |   2 +
>>  virt/kvm/vfio.c                            |  60 ++++++
>>  8 files changed, 590 insertions(+), 5 deletions(-)
>>
>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
>> index ef51740c67ca..f95d867168ea 100644
>> --- a/Documentation/virtual/kvm/devices/vfio.txt
>> +++ b/Documentation/virtual/kvm/devices/vfio.txt
>> @@ -16,7 +16,25 @@ Groups:
>>  
>>  KVM_DEV_VFIO_GROUP attributes:
>>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
>> +	kvm_device_attr.addr points to an int32_t file descriptor
>> +	for the VFIO group.
>>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
>> +	kvm_device_attr.addr points to an int32_t file descriptor
>> +	for the VFIO group.
>> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
>> +	allocated by sPAPR KVM.
>> +	kvm_device_attr.addr points to a struct:
>>  
>> -For each, kvm_device_attr.addr points to an int32_t file descriptor
>> -for the VFIO group.
>> +	struct kvm_vfio_spapr_tce {
>> +		__u32	argsz;
>> +		__u32	flags;
>> +		__s32	groupfd;
>> +		__s32	tablefd;
>> +	};
>> +
>> +	where
>> +	@argsz is the size of kvm_vfio_spapr_tce_liobn;
>> +	@flags are not supported now, must be zero;
>> +	@groupfd is a file descriptor for a VFIO group;
>> +	@tablefd is a file descriptor for a TCE table allocated via
>> +		KVM_CREATE_SPAPR_TCE.
>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
>> index e59b172666cd..a827006941f8 100644
>> --- a/arch/powerpc/include/asm/kvm_host.h
>> +++ b/arch/powerpc/include/asm/kvm_host.h
>> @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
>>  	atomic_t refcnt;
>>  };
>>  
>> +struct kvmppc_spapr_tce_iommu_table {
>> +	struct rcu_head rcu;
>> +	struct list_head next;
>> +	struct vfio_group *group;
>> +	struct iommu_table *tbl;
>> +};
>> +
>>  struct kvmppc_spapr_tce_table {
>>  	struct list_head list;
>>  	struct kvm *kvm;
>> @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
>>  	u32 page_shift;
>>  	u64 offset;		/* in pages */
>>  	u64 size;		/* window size in pages */
>> +	struct list_head iommu_tables;
>>  	struct page *pages[0];
>>  };
>>  
>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
>> index 37bc9e7e90ba..da1410bd6b36 100644
>> --- a/arch/powerpc/include/asm/kvm_ppc.h
>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
>> @@ -163,6 +163,10 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
>>  extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
>>  			struct kvm_memory_slot *memslot, unsigned long porder);
>>  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
>> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
>> +		struct vfio_group *group);
>> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
>> +		struct vfio_group *group);
>>  
>>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>  				struct kvm_create_spapr_tce_64 *args);
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index a2c9bb5a0ead..cdfa01169bd2 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -1076,6 +1076,7 @@ struct kvm_device_attr {
>>  #define  KVM_DEV_VFIO_GROUP			1
>>  #define   KVM_DEV_VFIO_GROUP_ADD			1
>>  #define   KVM_DEV_VFIO_GROUP_DEL			2
>> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
>>  
>>  enum kvm_device_type {
>>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
>> @@ -1097,6 +1098,13 @@ enum kvm_device_type {
>>  	KVM_DEV_TYPE_MAX,
>>  };
>>  
>> +struct kvm_vfio_spapr_tce {
>> +	__u32	argsz;
>> +	__u32	flags;
>> +	__s32	groupfd;
>> +	__s32	tablefd;
>> +};
>> +
>>  /*
>>   * ioctls for VM fds
>>   */
>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
>> index 9a7b7fca5e84..cb0469151e35 100644
>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>> @@ -27,6 +27,10 @@
>>  #include <linux/hugetlb.h>
>>  #include <linux/list.h>
>>  #include <linux/anon_inodes.h>
>> +#include <linux/iommu.h>
>> +#include <linux/file.h>
>> +#include <linux/vfio.h>
>> +#include <linux/module.h>
>>  
>>  #include <asm/tlbflush.h>
>>  #include <asm/kvm_ppc.h>
>> @@ -39,6 +43,36 @@
>>  #include <asm/udbg.h>
>>  #include <asm/iommu.h>
>>  #include <asm/tce.h>
>> +#include <asm/mmu_context.h>
>> +
>> +static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
>> +{
>> +	void (*fn)(struct vfio_group *);
>> +
>> +	fn = symbol_get(vfio_group_put_external_user);
>> +	if (WARN_ON(!fn))
>> +		return;
>> +
>> +	fn(vfio_group);
>> +
>> +	symbol_put(vfio_group_put_external_user);
>> +}
>> +
>> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
>> +{
>> +	int (*fn)(struct vfio_group *);
>> +	int ret = -1;
>> +
>> +	fn = symbol_get(vfio_external_user_iommu_id);
>> +	if (!fn)
>> +		return ret;
>> +
>> +	ret = fn(vfio_group);
>> +
>> +	symbol_put(vfio_external_user_iommu_id);
>> +
>> +	return ret;
>> +}
>>  
>>  static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
>>  {
>> @@ -90,6 +124,123 @@ static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
>>  	return ret;
>>  }
>>  
>> +static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
>> +{
>> +	struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
>> +			struct kvmppc_spapr_tce_iommu_table, rcu);
>> +
>> +	iommu_table_put(stit->tbl);
>> +	kvm_vfio_group_put_external_user(stit->group);
>> +
>> +	kfree(stit);
>> +}
>> +
>> +static void kvm_spapr_tce_liobn_release_iommu_group(
>> +		struct kvmppc_spapr_tce_table *stt,
>> +		struct vfio_group *group)
>> +{
>> +	struct kvmppc_spapr_tce_iommu_table *stit, *tmp;
>> +
>> +	list_for_each_entry_safe(stit, tmp, &stt->iommu_tables, next) {
>> +		if (group && (stit->group != group))
>> +			continue;
>> +
>> +		list_del_rcu(&stit->next);
>> +
>> +		call_rcu(&stit->rcu, kvm_spapr_tce_iommu_table_free);
>> +	}
>> +}
>> +
>> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
>> +		struct vfio_group *group)
>> +{
>> +	struct kvmppc_spapr_tce_table *stt;
>> +
>> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list)
>> +		kvm_spapr_tce_liobn_release_iommu_group(stt, group);
>> +}
>> +
>> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
>> +		struct vfio_group *group)
>> +{
>> +	struct kvmppc_spapr_tce_table *stt = NULL;
>> +	bool found = false;
>> +	struct iommu_table *tbl = NULL;
>> +	struct iommu_table_group *table_group;
>> +	long i, ret = 0;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>> +	struct fd f;
>> +	int group_id;
>> +	struct iommu_group *grp;
>> +
>> +	group_id = kvm_vfio_external_user_iommu_id(group);
>> +	grp = iommu_group_get_by_id(group_id);
>> +	if (!grp)
>> +		return -EFAULT;
> 
> EFAULT doesn't look right, that's usually means userspace has give us
> a bad address.  What does failure to look up the iommu group by id
> mean here?


iommu_group_get_by_id() can fail -
1. if "something went very wrong" - as group ids are allocated when devices
are discovered so they are pretty static;
2. there is some racy sriov disable or host pci hotunplug;
3. kvm_vfio_external_user_iommu_id() returned invalid group id which means
that a device was unbound from the vfio-pci driver but the caller holds a
reference to vfio_group so this should not happen.


> 
>> +
>> +	f = fdget(tablefd);
>> +	if (!f.file) {
>> +		ret = -EBADF;
>> +		goto put_exit;
>> +	}
>> +
>> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
>> +		if (stt == f.file->private_data) {
>> +			found = true;
>> +			break;
>> +		}
>> +	}
>> +
>> +	fdput(f);
>> +
>> +	if (!found) {
>> +		ret = -ENODEV;
> 
> ENODEV doesn't look right either.  That generally means you're trying
> to use a device or facility that doesn't exist.  This case just means
> you've passed a file handle that either isn't a TCE table at all, or
> os one associated with a different VM.  -EINVAL, I guess, overloaded
> as it is.

Ok.



> 
>> +		goto put_exit;
> 
> Don't you need to put the table fd as well as the iommu group which
> you put in that exit path?


It is put few lines above.


>> +	}
>> +
>> +	table_group = iommu_group_get_iommudata(grp);
>> +	if (WARN_ON(!table_group)) {
>> +		ret = -EFAULT;
>> +		goto put_exit;
> 
> Again don't you need to put the table fd as well.

It is put few lines above, I do not keep it open longer than needed.


> 
>> +	}
>> +
>> +	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
>> +		struct iommu_table *tbltmp = table_group->tables[i];
>> +
>> +		if (!tbltmp)
>> +			continue;
>> +
>> +		/*
>> +		 * Make sure hardware table parameters are exactly the same;
>> +		 * this is used in the TCE handlers where boundary checks
>> +		 * use only the first attached table.
>> +		 */
>> +		if ((tbltmp->it_page_shift == stt->page_shift) &&
>> +				(tbltmp->it_offset == stt->offset) &&
>> +				(tbltmp->it_size == stt->size)) {
>> +			tbl = tbltmp;
>> +			break;
>> +		}
>> +	}
>> +	if (!tbl) {
>> +		ret = -ENODEV;
> 
> Again, ENODEV doesn't seem right.  Here the problem is that the host
> hardware constraints don't match the guest hardware constraints.
> Hmm.  EIO?  ENOSPC?


Neither is very appealing to me... EINVAL?
When I use "ENODEV", I am thinking of "there is no device with
expected/requested characteristics" but this is probably wrong.



> 
>> +		goto put_exit;
>> +	}
>> +
>> +	iommu_table_get(tbl);
>> +
>> +	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
>> +	stit->tbl = tbl;
>> +	stit->group = group;
>> +
>> +	list_add_rcu(&stit->next, &stt->iommu_tables);
> 
> So if you add the same group to the same liobn multiple times, you'll
> get multiple identical entries in this list.
> 
> I guess that's mostly harmless... although.. does it allow the user to
> force the allocation of arbitrary amounts of kernel memory in that
> list?


Oh. No, I'll add a check to avoid duplicates, they do not make sense here.


> 
>> +put_exit:
>> +	iommu_group_put(grp);
>> +
>> +	return ret;
>> +}
>> +
>>  static void release_spapr_tce_table(struct rcu_head *head)
>>  {
>>  	struct kvmppc_spapr_tce_table *stt = container_of(head,
>> @@ -132,6 +283,8 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
>>  
>>  	list_del_rcu(&stt->list);
>>  
>> +	kvm_spapr_tce_liobn_release_iommu_group(stt, NULL /* release all */);
>> +
>>  	kvm_put_kvm(stt->kvm);
>>  
>>  	kvmppc_account_memlimit(
>> @@ -181,6 +334,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>  	stt->offset = args->offset;
>>  	stt->size = size;
>>  	stt->kvm = kvm;
>> +	INIT_LIST_HEAD_RCU(&stt->iommu_tables);
>>  
>>  	for (i = 0; i < npages; i++) {
>>  		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
>> @@ -209,11 +363,94 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>  	return ret;
>>  }
>>  
>> +static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
>> +		struct iommu_table *tbl, unsigned long entry)
>> +{
>> +	struct mm_iommu_table_group_mem_t *mem = NULL;
>> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
>> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>> +
>> +	if (!pua)
>> +		return H_HARDWARE;
> 
> What could trigger this error?  Should it be a WARN_ON?

Nothing should so yes, it can be WARN_ON.


> 
>> +	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
>> +	if (!mem)
>> +		return H_TOO_HARD;
>> +
>> +	mm_iommu_mapped_dec(mem);
>> +
>> +	*pua = 0;
>> +
>> +	return H_SUCCESS;
>> +}
>> +
>> +static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
>> +		struct iommu_table *tbl, unsigned long entry)
>> +{
>> +	enum dma_data_direction dir = DMA_NONE;
>> +	unsigned long hpa = 0;
>> +	long ret;
>> +
>> +	if (iommu_tce_xchg(tbl, entry, &hpa, &dir))
>> +		return H_HARDWARE;
>> +
>> +	if (dir == DMA_NONE)
>> +		return H_SUCCESS;
>> +
>> +	ret = kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
>> +	if (ret != H_SUCCESS)
>> +		iommu_tce_xchg(tbl, entry, &hpa, &dir);
>> +
>> +	return ret;
>> +}
>> +
>> +long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
>> +		unsigned long entry, unsigned long gpa,
>> +		enum dma_data_direction dir)
>> +{
>> +	long ret;
>> +	unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>> +	struct mm_iommu_table_group_mem_t *mem;
>> +
>> +	if (!pua)
>> +		/* it_userspace allocation might be delayed */
>> +		return H_TOO_HARD;
>> +
>> +	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
>> +		return H_PARAMETER;
>> +
>> +	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
>> +	if (!mem)
>> +		return H_TOO_HARD;
>> +
>> +	if (mm_iommu_ua_to_hpa(mem, ua, &hpa))
>> +		return H_HARDWARE;
> 
> IIUC this would happen if qemu had failed to preregister all of guest
> RAM, making this indeed an H_HARDWARE.


If QEMU failed to preregister, then mm_iommu_lookup() fails and it is
TOO_HARD. mm_iommu_ua_to_hpa() in this context cannot possibly fail (unless
broken memory) as it only returns error when out of bounds but
mm_iommu_lookup() ensures this.



> 
>> +	if (mm_iommu_mapped_inc(mem))
>> +		return H_HARDWARE;
> 
> I'm less clear on when this one would happen.


This may happen when there is a race with mm_iommu_put().


> 
>> +
>> +	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
>> +	if (ret) {
>> +		mm_iommu_mapped_dec(mem);
>> +		return H_TOO_HARD;
>> +	}
>> +
>> +	if (dir != DMA_NONE)
>> +		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
>> +
>> +	*pua = ua;
>> +
>> +	return 0;
>> +}
>> +
>>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>  		      unsigned long ioba, unsigned long tce)
>>  {
>>  	struct kvmppc_spapr_tce_table *stt;
>> -	long ret;
>> +	long ret, idx;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>> +	unsigned long entry, gpa;
>> +	enum dma_data_direction dir;
>>  
>>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>>  	/* 	    liobn, ioba, tce); */
>> @@ -230,6 +467,36 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>  	if (ret != H_SUCCESS)
>>  		return ret;
>>  
>> +	stit = list_first_entry_or_null(&stt->iommu_tables,
>> +			struct kvmppc_spapr_tce_iommu_table, next);
>> +	if (stit) {
>> +		entry = ioba >> stit->tbl->it_page_shift;
>> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>> +		dir = iommu_tce_direction(tce);
>> +
>> +		if (dir == DMA_NONE) {
>> +			if (iommu_tce_clear_param_check(stit->tbl, ioba, 0, 1))
>> +				return H_PARAMETER;
>> +		} else {
>> +			if (iommu_tce_put_param_check(stit->tbl, ioba, gpa))
> 
> Any way you could make these param check functions based on stt
> instead of stit->tbl?  That would let you do them before checking if
> there are any hw tables to update, avaoiding the somewhat awkward
> 	if (at least one)
> 		for (each one)
> construct.

I could:
1. change iommu_tce_put_param_check() to take shift, offset, size and drop
use of IOMMU_PAGE_MASK(tbl) (and change all callers in vfio_iommu_spapr_tce.c);
2. make a copy of iommu_tce_put_param_check() which would take stt.

And yet this code does operate with tbl anyway, akward either way imho...



> 
>> +				return H_PARAMETER;
>> +		}
>> +
>> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>> +			if (dir == DMA_NONE) {
>> +				ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
>> +						stit->tbl, entry);
>> +			} else {
>> +				idx = srcu_read_lock(&vcpu->kvm->srcu);
>> +				ret = kvmppc_tce_iommu_map(vcpu->kvm, stit->tbl,
>> +						entry, gpa, dir);
>> +				srcu_read_unlock(&vcpu->kvm->srcu, idx);
>> +			}
>> +			if (ret != H_SUCCESS)
>> +				return ret;
> 
> Doesn't this error path need to clean up for the case where you
> managed to update some backing TCE tables, but then failed later ones?

Probably.

This is what I asked in:
Re: [PATCH kernel v4 08/10] KVM: PPC: Separate TCE validation from update

Failure to update a hardware TCE table means we are in deep trouble, I
cannot think of any valid reason how we could get this far and not fail
before but fail now.


> 
>> +		}
>> +	}
>> +
>>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
>>  
>>  	return H_SUCCESS;
>> @@ -242,9 +509,10 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>  {
>>  	struct kvmppc_spapr_tce_table *stt;
>>  	long i, ret = H_SUCCESS, idx;
>> -	unsigned long entry, ua = 0;
>> +	unsigned long entry, gpa, ua = 0;
>>  	u64 __user *tces;
>>  	u64 tce;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>  
>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>>  	if (!stt)
>> @@ -272,6 +540,9 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>  	}
>>  	tces = (u64 __user *) ua;
>>  
>> +	stit = list_first_entry_or_null(&stt->iommu_tables,
>> +			struct kvmppc_spapr_tce_iommu_table, next);
>> +
>>  	for (i = 0; i < npages; ++i) {
>>  		if (get_user(tce, tces + i)) {
>>  			ret = H_TOO_HARD;
>> @@ -282,6 +553,15 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>  		ret = kvmppc_tce_validate(stt, tce);
>>  		if (ret != H_SUCCESS)
>>  			goto unlock_exit;
>> +
>> +		if (stit) {
>> +			gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>> +			ret = iommu_tce_put_param_check(stit->tbl,
>> +					ioba + (i << stit->tbl->it_page_shift),
>> +					gpa);
>> +			if (ret != H_SUCCESS)
>> +				goto unlock_exit;
>> +		}
>>  	}
>>  
>>  	for (i = 0; i < npages; ++i) {
>> @@ -291,6 +571,21 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>  		}
>>  		tce = be64_to_cpu(tce);
>>  
>> +		if (stit) {
>> +			for (i = 0; i < npages; ++i) {
>> +				gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>> +
>> +				list_for_each_entry_lockless(stit,
>> +						&stt->iommu_tables, next) {
>> +					ret = kvmppc_tce_iommu_map(vcpu->kvm,
>> +						stit->tbl, entry + i, gpa,
>> +						iommu_tce_direction(tce));
>> +					if (ret != H_SUCCESS)
>> +						goto unlock_exit;
>> +				}
> 
> Um.. what value will this for_each leave in stit after completion?  I
> suspect it will be something bogus, which means re-using stit in the
> next 0..npages loop iteration won't be safe (you only initialize stit
> with the first entry outside that loop).


#define list_for_each_entry_lockless(pos, head, member) \
  for (pos = list_entry_lockless((head)->next, typeof(*pos), member); \
     &pos->member != (head); \
     pos = list_entry_lockless(pos->member.next, typeof(*pos), member))

stit is "pos" which is reset every time the loop is called.


> 
>> +			}
>> +		}
>> +
>>  		kvmppc_tce_put(stt, entry + i, tce);
>>  	}
>>  
>> @@ -307,6 +602,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>>  {
>>  	struct kvmppc_spapr_tce_table *stt;
>>  	long i, ret;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>  
>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>>  	if (!stt)
>> @@ -320,6 +616,25 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>>  		return H_PARAMETER;
>>  
>> +	stit = list_first_entry_or_null(&stt->iommu_tables,
>> +			struct kvmppc_spapr_tce_iommu_table, next);
>> +	if (stit) {
>> +		if (iommu_tce_clear_param_check(stit->tbl, ioba,
>> +					tce_value, npages))
>> +			return H_PARAMETER;
>> +
>> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>> +			unsigned long entry = ioba >> stit->tbl->it_page_shift;
>> +
>> +			for (i = 0; i < npages; ++i) {
>> +				ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
>> +						stit->tbl, entry + i);
>> +				if (ret)
>> +					return ret;
> 
> Again do you need some sort of cleanup for partial completion?

Again,
Re: [PATCH kernel v4 08/10] KVM: PPC: Separate TCE validation from update

This is an unexpected failure which should not happen, what kind of cleanup
it would make sense to do here? Re-map what was mapped before H_STUFF_TCE
was called?


> 
> 
>> +			}
>> +		}
>> +	}
>> +
>>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>>  
>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
>> index dc1c66fda941..018c7d94a575 100644
>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
>> @@ -178,11 +178,104 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
>>  EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
>>  
>>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
>> +static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm *kvm,
>> +		struct iommu_table *tbl, unsigned long entry)
>> +{
>> +	struct mm_iommu_table_group_mem_t *mem = NULL;
>> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
>> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>> +
>> +	if (!pua)
>> +		return H_HARDWARE;
>> +
>> +	pua = (void *) vmalloc_to_phys(pua);
>> +	if (!pua)
>> +		return H_TOO_HARD;
>> +
>> +	mem = mm_iommu_lookup_rm(kvm->mm, *pua, pgsize);
>> +	if (!mem)
>> +		return H_TOO_HARD;
>> +
>> +	mm_iommu_mapped_dec(mem);
>> +
>> +	*pua = 0;
>> +
>> +	return H_SUCCESS;
>> +}
>> +
>> +static long kvmppc_rm_tce_iommu_unmap(struct kvm *kvm,
>> +		struct iommu_table *tbl, unsigned long entry)
>> +{
>> +	enum dma_data_direction dir = DMA_NONE;
>> +	unsigned long hpa = 0;
>> +	long ret;
>> +
>> +	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
>> +		return H_HARDWARE;
>> +
>> +	if (dir == DMA_NONE)
>> +		return H_SUCCESS;
>> +
>> +	ret = kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
>> +	if (ret)
>> +		iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
>> +
>> +	return ret;
>> +}
>> +
>> +long kvmppc_rm_tce_iommu_map(struct kvm_vcpu *vcpu, struct iommu_table *tbl,
>> +		unsigned long entry, unsigned long gpa,
>> +		enum dma_data_direction dir)
>> +{
>> +	long ret;
>> +	unsigned long hpa = 0, ua;
>> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>> +	struct mm_iommu_table_group_mem_t *mem;
>> +
>> +	if (!pua)
>> +		/* it_userspace allocation might be delayed */
>> +		return H_TOO_HARD;
>> +
>> +	if (kvmppc_gpa_to_ua(vcpu->kvm, gpa, &ua, NULL))
>> +		return H_PARAMETER;
>> +
>> +	mem = mm_iommu_lookup_rm(vcpu->kvm->mm, ua, 1ULL << tbl->it_page_shift);
>> +	if (!mem)
>> +		return H_TOO_HARD;
>> +
>> +	if (mm_iommu_ua_to_hpa_rm(mem, ua, &hpa))
>> +		return H_HARDWARE;
>> +
>> +	pua = (void *) vmalloc_to_phys(pua);
>> +	if (!pua)
>> +		return H_HARDWARE;
> 
> What circumstances can this fail under?  Does it need to be H_TOO_HARD instead?


When kernel memory gets corrupted and vmalloc_to_page() won't be able to
find a page which was allocated with vmalloc.


>> +
>> +	if (mm_iommu_mapped_inc(mem))
>> +		return H_HARDWARE;
>> +
>> +	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
>> +	if (ret) {
>> +		mm_iommu_mapped_dec(mem);
>> +		return H_TOO_HARD;
>> +	}
>> +
>> +	if (dir != DMA_NONE)
>> +		kvmppc_rm_tce_iommu_mapped_dec(vcpu->kvm, tbl, entry);
>> +
>> +	*pua = ua;
>> +
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(kvmppc_rm_tce_iommu_map);
>> +
>>  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>  		unsigned long ioba, unsigned long tce)
>>  {
>>  	struct kvmppc_spapr_tce_table *stt;
>>  	long ret;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>> +	unsigned long entry, gpa;
>> +	enum dma_data_direction dir;
>>  
>>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>>  	/* 	    liobn, ioba, tce); */
>> @@ -199,6 +292,33 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>  	if (ret != H_SUCCESS)
>>  		return ret;
>>  
>> +	stit = list_first_entry_or_null(&stt->iommu_tables,
>> +			struct kvmppc_spapr_tce_iommu_table, next);
>> +	if (stit) {
>> +		entry = ioba >> stit->tbl->it_page_shift;
>> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>> +		dir = iommu_tce_direction(tce);
>> +
>> +		if (dir == DMA_NONE) {
>> +			if (iommu_tce_clear_param_check(stit->tbl, ioba, 0, 1))
>> +				return H_PARAMETER;
>> +		} else {
>> +			if (iommu_tce_put_param_check(stit->tbl, ioba, gpa))
>> +				return H_PARAMETER;
>> +		}
>> +
>> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>> +			if (dir == DMA_NONE)
>> +				ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
>> +						stit->tbl, entry);
>> +			else
>> +				ret = kvmppc_rm_tce_iommu_map(vcpu, stit->tbl,
>> +						entry, gpa, dir);
>> +			if (ret != H_SUCCESS)
>> +				return ret;
>> +		}
>> +	}
>> +
>>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
>>  
>>  	return H_SUCCESS;
>> @@ -237,9 +357,10 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>  {
>>  	struct kvmppc_spapr_tce_table *stt;
>>  	long i, ret = H_SUCCESS;
>> -	unsigned long tces, entry, tce, ua = 0;
>> +	unsigned long tces, entry, gpa, tce, ua = 0;
>>  	unsigned long *rmap = NULL;
>>  	bool prereg = false;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>  
>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>>  	if (!stt)
>> @@ -303,17 +424,45 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>  		}
>>  	}
>>  
>> +	stit = list_first_entry_or_null(&stt->iommu_tables,
>> +			struct kvmppc_spapr_tce_iommu_table, next);
>> +
>>  	for (i = 0; i < npages; ++i) {
>>  		tce = be64_to_cpu(((u64 *)tces)[i]);
>>  
>>  		ret = kvmppc_tce_validate(stt, tce);
>>  		if (ret != H_SUCCESS)
>>  			goto unlock_exit;
>> +
>> +		if (stit) {
>> +			gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>> +			ret = iommu_tce_put_param_check(stit->tbl,
>> +					ioba + (i << stit->tbl->it_page_shift),
>> +					gpa);
>> +			if (ret != H_SUCCESS)
>> +				goto unlock_exit;
>> +
>> +		}
>>  	}
>>  
>>  	for (i = 0; i < npages; ++i) {
>>  		tce = be64_to_cpu(((u64 *)tces)[i]);
> 
> As noted in the earlier patch this is really dangerous - by reloading
> the tce from userspace you've thrown away the verification above.


Sure, I am adding a tces cache to kvm_vcpu.


>> +		if (stit) {
>> +			for (i = 0; i < npages; ++i) {
>> +				gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>> +
>> +				list_for_each_entry_lockless(stit,
>> +						&stt->iommu_tables, next) {
>> +					ret = kvmppc_rm_tce_iommu_map(vcpu,
>> +						stit->tbl, entry + i, gpa,
>> +						iommu_tce_direction(tce));
>> +					if (ret != H_SUCCESS)
>> +						goto unlock_exit;
>> +				}
>> +			}
>> +		}
>> +
>>  		kvmppc_tce_put(stt, entry + i, tce);
>>  	}
>>  
>> @@ -330,6 +479,8 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>>  {
>>  	struct kvmppc_spapr_tce_table *stt;
>>  	long i, ret;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>> +
>>  
>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>>  	if (!stt)
>> @@ -343,6 +494,25 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>>  		return H_PARAMETER;
>>  
>> +	stit = list_first_entry_or_null(&stt->iommu_tables,
>> +			struct kvmppc_spapr_tce_iommu_table, next);
>> +	if (stit) {
>> +		if (iommu_tce_clear_param_check(stit->tbl, ioba,
>> +					tce_value, npages))
>> +			return H_PARAMETER;
>> +
>> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>> +			unsigned long entry = ioba >> stit->tbl->it_page_shift;
>> +
>> +			for (i = 0; i < npages; ++i) {
>> +				ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
>> +						stit->tbl, entry + i);
>> +				if (ret)
>> +					return ret;
>> +			}
>> +		}
>> +	}
>> +
>>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>>  
>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>> index cd892dec7cb6..f3127dc87912 100644
>> --- a/arch/powerpc/kvm/powerpc.c
>> +++ b/arch/powerpc/kvm/powerpc.c
>> @@ -536,6 +536,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>>  #ifdef CONFIG_PPC_BOOK3S_64
>>  	case KVM_CAP_SPAPR_TCE:
>>  	case KVM_CAP_SPAPR_TCE_64:
>> +		/* fallthrough */
> 
> I'm not sure why this one should get a fallthrough comment, when none
> of the other cases do.


I believe it was either ignored then or checkpatch.pl did not warn about
this at the time.


> 
>> +	case KVM_CAP_SPAPR_TCE_VFIO:
>>  	case KVM_CAP_PPC_RTAS:
>>  	case KVM_CAP_PPC_FIXUP_HCALL:
>>  	case KVM_CAP_PPC_ENABLE_HCALL:
>> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
>> index d32f239eb471..2b7dc22265fe 100644
>> --- a/virt/kvm/vfio.c
>> +++ b/virt/kvm/vfio.c
>> @@ -20,6 +20,10 @@
>>  #include <linux/vfio.h>
>>  #include "vfio.h"
>>  
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +#include <asm/kvm_ppc.h>
>> +#endif
>> +
>>  struct kvm_vfio_group {
>>  	struct list_head node;
>>  	struct vfio_group *vfio_group;
>> @@ -211,6 +215,9 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>>  
>>  		mutex_unlock(&kv->lock);
>>  
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +		kvm_spapr_tce_release_iommu_group(dev->kvm, vfio_group);
>> +#endif
>>  		kvm_vfio_group_set_kvm(vfio_group, NULL);
>>  
>>  		kvm_vfio_group_put_external_user(vfio_group);
>> @@ -218,6 +225,53 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>>  		kvm_vfio_update_coherency(dev);
>>  
>>  		return ret;
>> +
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
>> +		struct kvm_vfio_spapr_tce param;
>> +		unsigned long minsz;
>> +		struct kvm_vfio *kv = dev->private;
>> +		struct vfio_group *vfio_group;
>> +		struct kvm_vfio_group *kvg;
>> +		struct fd f;
>> +
>> +		minsz = offsetofend(struct kvm_vfio_spapr_tce, tablefd);
>> +
>> +		if (copy_from_user(&param, (void __user *)arg, minsz))
>> +			return -EFAULT;
>> +
>> +		if (param.argsz < minsz || param.flags)
>> +			return -EINVAL;
>> +
>> +		f = fdget(param.groupfd);
>> +		if (!f.file)
>> +			return -EBADF;
>> +
>> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
>> +		fdput(f);
>> +
>> +		if (IS_ERR(vfio_group))
>> +			return PTR_ERR(vfio_group);
>> +
> 
> 
> Is there any particular reason you unwrap the group fd here, but the
> table fd inside kvm__spapr_tce_attach_iommu_group()?

No particular reason, just an intention not to spread too much spapr to KVM
VFIO device and vfio_group to POWER KVM.

I only unwrapp table_fd to see if it is in the kvm->arch.spapr_tce_tables
list, I am trying to keep spapr_tce_tables and kvmppc_spapr_tce_iommu_table
local to arch/powerpc/kvm/book3s_64_vio*.c

Unwrapping the group fd in arch/powerpc/kvm/book3s_64_vio*.c would mean
duplicating all the kvm_vfio_group_get_external_user()/etc stubs in
arch/powerpc/kvm/book3s_64_vio.c, which I did not want to do.
I could, but since I already have the vfio_group unwrapped here, it seems
pointless to unwrap it over again in arch/powerpc/kvm/book3s_64_vio.c -
should I?



> 
>> +		ret = -ENOENT;
>> +
>> +		mutex_lock(&kv->lock);
>> +
>> +		list_for_each_entry(kvg, &kv->group_list, node) {
>> +			if (kvg->vfio_group != vfio_group)
>> +				continue;
>> +
>> +			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
>> +					param.tablefd, vfio_group);
>> +
>> +			break;
>> +		}
>> +
>> +		mutex_unlock(&kv->lock);
>> +
>> +		return ret;
>> +	}
>> +#endif /* CONFIG_SPAPR_TCE_IOMMU */
>>  	}
>>  
>>  	return -ENXIO;
>> @@ -242,6 +296,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
>>  		switch (attr->attr) {
>>  		case KVM_DEV_VFIO_GROUP_ADD:
>>  		case KVM_DEV_VFIO_GROUP_DEL:
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
>> +#endif
>>  			return 0;
>>  		}
>>  
>> @@ -257,6 +314,9 @@ static void kvm_vfio_destroy(struct kvm_device *dev)
>>  	struct kvm_vfio_group *kvg, *tmp;
>>  
>>  	list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +		kvm_spapr_tce_release_iommu_group(dev->kvm, kvg->vfio_group);
>> +#endif
>>  		kvm_vfio_group_set_kvm(kvg->vfio_group, NULL);
>>  		kvm_vfio_group_put_external_user(kvg->vfio_group);
>>  		list_del(&kvg->node);
> 


-- 
Alexey


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 839 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH kernel v4 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
@ 2017-02-10  2:50       ` Alexey Kardashevskiy
  0 siblings, 0 replies; 49+ messages in thread
From: Alexey Kardashevskiy @ 2017-02-10  2:50 UTC (permalink / raw)
  To: David Gibson; +Cc: Paul Mackerras, Alex Williamson, linuxppc-dev, kvm, kvm-ppc


[-- Attachment #1.1: Type: text/plain, Size: 36911 bytes --]

On 09/02/17 17:41, David Gibson wrote:
> On Tue, Feb 07, 2017 at 06:17:11PM +1100, Alexey Kardashevskiy wrote:
>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
>> and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
>> without passing them to user space which saves time on switching
>> to user space and back.
>>
>> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
>> KVM tries to handle a TCE request in real mode; if that fails,
>> it passes the request to the virtual mode handler to complete the operation.
>> If the virtual mode handler fails as well, the request is passed to
>> the user space; this is not expected to happen though.
>>
>> To avoid dealing with page use counters (which is tricky in real mode),
>> this only accelerates SPAPR TCE IOMMU v2 clients which are required
>> to pre-register the userspace memory. The very first TCE request will
>> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
>> of the TCE table (iommu_table::it_userspace) is not allocated till
>> the very first mapping happens and we cannot call vmalloc in real mode.
>>
>> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
>> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
>> and associates a physical IOMMU table with the SPAPR TCE table (which
>> is a guest view of the hardware IOMMU table). The iommu_table object
>> is cached and referenced so we do not have to look up for it in real mode.
>>
>> This does not implement the UNSET counterpart as there is no use for it -
>> once the acceleration is enabled, the existing userspace won't
>> disable it unless a VFIO container is destroyed; this adds necessary
>> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
>>
>> As this creates a descriptor per IOMMU table-LIOBN couple (called
>> kvmppc_spapr_tce_iommu_table), it is possible to have several
>> descriptors with the same iommu_table (hardware IOMMU table) attached
>> to the same LIOBN; we do not remove duplicates though as
>> iommu_table_ops::exchange() does not just update a TCE entry (which is
>> shared among IOMMU groups) but also invalidates the TCE cache
>> (one per IOMMU group).
>>
>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
>> space.
>>
>> This finally makes use of vfio_external_user_iommu_id() which was
>> introduced quite some time ago and was considered for removal.
>>
>> Tests show that this patch increases transmission speed from 220MB/s
>> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>> Changes:
>> v4:
>> * added note to the commit log about allowing multiple updates of
>> the same IOMMU table;
>> * instead of checking for if any memory was preregistered, this
>> returns H_TOO_HARD if a specific page was not;
>> * fixed comments from v3 about error handling in many places;
>> * simplified TCE handlers and merged IOMMU parts inline - for example,
>> there used to be kvmppc_h_put_tce_iommu(), now it is merged into
>> kvmppc_h_put_tce(); this allows to check IOBA boundaries against
>> the first attached table only (makes the code simpler);
>>
>> v3:
>> * simplified not to use VFIO group notifiers
>> * reworked cleanup, should be cleaner/simpler now
>>
>> v2:
>> * reworked to use new VFIO notifiers
>> * now same iommu_table may appear in the list several times, to be fixed later
>> ---
>>
>> This has separate copies of handlers for real and virtual modes as
>> in fact H_PUT_TCE and H_STUFF_TCE could share a lot (common helpers
>> would take a "realmode" flag) but H_PUT_TCE_INDIRECT uses get_user()
>> in virtual mode and direct access in real mode and having a common
>> helper for it would make things uglier imho.
>>
>>
>> ---
>>  Documentation/virtual/kvm/devices/vfio.txt |  22 +-
>>  arch/powerpc/include/asm/kvm_host.h        |   8 +
>>  arch/powerpc/include/asm/kvm_ppc.h         |   4 +
>>  include/uapi/linux/kvm.h                   |   8 +
>>  arch/powerpc/kvm/book3s_64_vio.c           | 319 ++++++++++++++++++++++++++++-
>>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 172 +++++++++++++++-
>>  arch/powerpc/kvm/powerpc.c                 |   2 +
>>  virt/kvm/vfio.c                            |  60 ++++++
>>  8 files changed, 590 insertions(+), 5 deletions(-)
>>
>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
>> index ef51740c67ca..f95d867168ea 100644
>> --- a/Documentation/virtual/kvm/devices/vfio.txt
>> +++ b/Documentation/virtual/kvm/devices/vfio.txt
>> @@ -16,7 +16,25 @@ Groups:
>>  
>>  KVM_DEV_VFIO_GROUP attributes:
>>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
>> +	kvm_device_attr.addr points to an int32_t file descriptor
>> +	for the VFIO group.
>>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
>> +	kvm_device_attr.addr points to an int32_t file descriptor
>> +	for the VFIO group.
>> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
>> +	allocated by sPAPR KVM.
>> +	kvm_device_attr.addr points to a struct:
>>  
>> -For each, kvm_device_attr.addr points to an int32_t file descriptor
>> -for the VFIO group.
>> +	struct kvm_vfio_spapr_tce {
>> +		__u32	argsz;
>> +		__u32	flags;
>> +		__s32	groupfd;
>> +		__s32	tablefd;
>> +	};
>> +
>> +	where
>> +	@argsz is the size of kvm_vfio_spapr_tce_liobn;
>> +	@flags are not supported now, must be zero;
>> +	@groupfd is a file descriptor for a VFIO group;
>> +	@tablefd is a file descriptor for a TCE table allocated via
>> +		KVM_CREATE_SPAPR_TCE.
>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
>> index e59b172666cd..a827006941f8 100644
>> --- a/arch/powerpc/include/asm/kvm_host.h
>> +++ b/arch/powerpc/include/asm/kvm_host.h
>> @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
>>  	atomic_t refcnt;
>>  };
>>  
>> +struct kvmppc_spapr_tce_iommu_table {
>> +	struct rcu_head rcu;
>> +	struct list_head next;
>> +	struct vfio_group *group;
>> +	struct iommu_table *tbl;
>> +};
>> +
>>  struct kvmppc_spapr_tce_table {
>>  	struct list_head list;
>>  	struct kvm *kvm;
>> @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
>>  	u32 page_shift;
>>  	u64 offset;		/* in pages */
>>  	u64 size;		/* window size in pages */
>> +	struct list_head iommu_tables;
>>  	struct page *pages[0];
>>  };
>>  
>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
>> index 37bc9e7e90ba..da1410bd6b36 100644
>> --- a/arch/powerpc/include/asm/kvm_ppc.h
>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
>> @@ -163,6 +163,10 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
>>  extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
>>  			struct kvm_memory_slot *memslot, unsigned long porder);
>>  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
>> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
>> +		struct vfio_group *group);
>> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
>> +		struct vfio_group *group);
>>  
>>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>  				struct kvm_create_spapr_tce_64 *args);
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index a2c9bb5a0ead..cdfa01169bd2 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -1076,6 +1076,7 @@ struct kvm_device_attr {
>>  #define  KVM_DEV_VFIO_GROUP			1
>>  #define   KVM_DEV_VFIO_GROUP_ADD			1
>>  #define   KVM_DEV_VFIO_GROUP_DEL			2
>> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
>>  
>>  enum kvm_device_type {
>>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
>> @@ -1097,6 +1098,13 @@ enum kvm_device_type {
>>  	KVM_DEV_TYPE_MAX,
>>  };
>>  
>> +struct kvm_vfio_spapr_tce {
>> +	__u32	argsz;
>> +	__u32	flags;
>> +	__s32	groupfd;
>> +	__s32	tablefd;
>> +};
>> +
>>  /*
>>   * ioctls for VM fds
>>   */
>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
>> index 9a7b7fca5e84..cb0469151e35 100644
>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>> @@ -27,6 +27,10 @@
>>  #include <linux/hugetlb.h>
>>  #include <linux/list.h>
>>  #include <linux/anon_inodes.h>
>> +#include <linux/iommu.h>
>> +#include <linux/file.h>
>> +#include <linux/vfio.h>
>> +#include <linux/module.h>
>>  
>>  #include <asm/tlbflush.h>
>>  #include <asm/kvm_ppc.h>
>> @@ -39,6 +43,36 @@
>>  #include <asm/udbg.h>
>>  #include <asm/iommu.h>
>>  #include <asm/tce.h>
>> +#include <asm/mmu_context.h>
>> +
>> +static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
>> +{
>> +	void (*fn)(struct vfio_group *);
>> +
>> +	fn = symbol_get(vfio_group_put_external_user);
>> +	if (WARN_ON(!fn))
>> +		return;
>> +
>> +	fn(vfio_group);
>> +
>> +	symbol_put(vfio_group_put_external_user);
>> +}
>> +
>> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
>> +{
>> +	int (*fn)(struct vfio_group *);
>> +	int ret = -1;
>> +
>> +	fn = symbol_get(vfio_external_user_iommu_id);
>> +	if (!fn)
>> +		return ret;
>> +
>> +	ret = fn(vfio_group);
>> +
>> +	symbol_put(vfio_external_user_iommu_id);
>> +
>> +	return ret;
>> +}
>>  
>>  static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
>>  {
>> @@ -90,6 +124,123 @@ static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
>>  	return ret;
>>  }
>>  
>> +static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
>> +{
>> +	struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
>> +			struct kvmppc_spapr_tce_iommu_table, rcu);
>> +
>> +	iommu_table_put(stit->tbl);
>> +	kvm_vfio_group_put_external_user(stit->group);
>> +
>> +	kfree(stit);
>> +}
>> +
>> +static void kvm_spapr_tce_liobn_release_iommu_group(
>> +		struct kvmppc_spapr_tce_table *stt,
>> +		struct vfio_group *group)
>> +{
>> +	struct kvmppc_spapr_tce_iommu_table *stit, *tmp;
>> +
>> +	list_for_each_entry_safe(stit, tmp, &stt->iommu_tables, next) {
>> +		if (group && (stit->group != group))
>> +			continue;
>> +
>> +		list_del_rcu(&stit->next);
>> +
>> +		call_rcu(&stit->rcu, kvm_spapr_tce_iommu_table_free);
>> +	}
>> +}
>> +
>> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
>> +		struct vfio_group *group)
>> +{
>> +	struct kvmppc_spapr_tce_table *stt;
>> +
>> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list)
>> +		kvm_spapr_tce_liobn_release_iommu_group(stt, group);
>> +}
>> +
>> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
>> +		struct vfio_group *group)
>> +{
>> +	struct kvmppc_spapr_tce_table *stt = NULL;
>> +	bool found = false;
>> +	struct iommu_table *tbl = NULL;
>> +	struct iommu_table_group *table_group;
>> +	long i, ret = 0;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>> +	struct fd f;
>> +	int group_id;
>> +	struct iommu_group *grp;
>> +
>> +	group_id = kvm_vfio_external_user_iommu_id(group);
>> +	grp = iommu_group_get_by_id(group_id);
>> +	if (!grp)
>> +		return -EFAULT;
> 
> EFAULT doesn't look right, that's usually means userspace has give us
> a bad address.  What does failure to look up the iommu group by id
> mean here?


iommu_group_get_by_id() can fail if:
1. "something went very wrong" - group ids are allocated when devices
are discovered, so they are pretty static;
2. there is a race with SR-IOV disable or host PCI hot unplug;
3. kvm_vfio_external_user_iommu_id() returned an invalid group id, which means
a device was unbound from the vfio-pci driver; but the caller holds a
reference to the vfio_group, so this should not happen.


> 
>> +
>> +	f = fdget(tablefd);
>> +	if (!f.file) {
>> +		ret = -EBADF;
>> +		goto put_exit;
>> +	}
>> +
>> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
>> +		if (stt == f.file->private_data) {
>> +			found = true;
>> +			break;
>> +		}
>> +	}
>> +
>> +	fdput(f);
>> +
>> +	if (!found) {
>> +		ret = -ENODEV;
> 
> ENODEV doesn't look right either.  That generally means you're trying
> to use a device or facility that doesn't exist.  This case just means
> you've passed a file handle that either isn't a TCE table at all, or
> is one associated with a different VM.  -EINVAL, I guess, overloaded
> as it is.

Ok.



> 
>> +		goto put_exit;
> 
> Don't you need to put the table fd as well as the iommu group which
> you put in that exit path?


It is put a few lines above.


>> +	}
>> +
>> +	table_group = iommu_group_get_iommudata(grp);
>> +	if (WARN_ON(!table_group)) {
>> +		ret = -EFAULT;
>> +		goto put_exit;
> 
> Again don't you need to put the table fd as well.

It is put a few lines above; I do not keep it open longer than needed.


> 
>> +	}
>> +
>> +	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
>> +		struct iommu_table *tbltmp = table_group->tables[i];
>> +
>> +		if (!tbltmp)
>> +			continue;
>> +
>> +		/*
>> +		 * Make sure hardware table parameters are exactly the same;
>> +		 * this is used in the TCE handlers where boundary checks
>> +		 * use only the first attached table.
>> +		 */
>> +		if ((tbltmp->it_page_shift == stt->page_shift) &&
>> +				(tbltmp->it_offset == stt->offset) &&
>> +				(tbltmp->it_size == stt->size)) {
>> +			tbl = tbltmp;
>> +			break;
>> +		}
>> +	}
>> +	if (!tbl) {
>> +		ret = -ENODEV;
> 
> Again, ENODEV doesn't seem right.  Here the problem is that the host
> hardware constraints don't match the guest hardware constraints.
> Hmm.  EIO?  ENOSPC?


Neither is very appealing to me... EINVAL?
When I use "ENODEV", I am thinking of "there is no device with
expected/requested characteristics" but this is probably wrong.



> 
>> +		goto put_exit;
>> +	}
>> +
>> +	iommu_table_get(tbl);
>> +
>> +	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
>> +	stit->tbl = tbl;
>> +	stit->group = group;
>> +
>> +	list_add_rcu(&stit->next, &stt->iommu_tables);
> 
> So if you add the same group to the same liobn multiple times, you'll
> get multiple identical entries in this list.
> 
> I guess that's mostly harmless... although.. does it allow the user to
> force the allocation of arbitrary amounts of kernel memory in that
> list?


Oh. No, I'll add a check to avoid duplicates; they do not make sense here.
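
Roughly something like this, before the iommu_table_get()/kzalloc() in
kvm_spapr_tce_attach_iommu_group() (an untested sketch, not part of this
version of the patch):

	/* Hypothetical duplicate check against the existing descriptors */
	list_for_each_entry_rcu(stit, &stt->iommu_tables, next) {
		if ((stit->tbl == tbl) && (stit->group == group)) {
			/* This LIOBN/group/table triple is already attached */
			ret = -EBUSY;
			goto put_exit;
		}
	}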


> 
>> +put_exit:
>> +	iommu_group_put(grp);
>> +
>> +	return ret;
>> +}
>> +
>>  static void release_spapr_tce_table(struct rcu_head *head)
>>  {
>>  	struct kvmppc_spapr_tce_table *stt = container_of(head,
>> @@ -132,6 +283,8 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
>>  
>>  	list_del_rcu(&stt->list);
>>  
>> +	kvm_spapr_tce_liobn_release_iommu_group(stt, NULL /* release all */);
>> +
>>  	kvm_put_kvm(stt->kvm);
>>  
>>  	kvmppc_account_memlimit(
>> @@ -181,6 +334,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>  	stt->offset = args->offset;
>>  	stt->size = size;
>>  	stt->kvm = kvm;
>> +	INIT_LIST_HEAD_RCU(&stt->iommu_tables);
>>  
>>  	for (i = 0; i < npages; i++) {
>>  		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
>> @@ -209,11 +363,94 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>  	return ret;
>>  }
>>  
>> +static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
>> +		struct iommu_table *tbl, unsigned long entry)
>> +{
>> +	struct mm_iommu_table_group_mem_t *mem = NULL;
>> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
>> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>> +
>> +	if (!pua)
>> +		return H_HARDWARE;
> 
> What could trigger this error?  Should it be a WARN_ON?

Nothing should trigger it, so yes, it can be a WARN_ON.
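
I.e. something like this (just a sketch of the change):

	/* A missing it_userspace entry here is a bug, not a guest error */
	if (WARN_ON_ONCE(!pua))
		return H_HARDWARE;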


> 
>> +	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
>> +	if (!mem)
>> +		return H_TOO_HARD;
>> +
>> +	mm_iommu_mapped_dec(mem);
>> +
>> +	*pua = 0;
>> +
>> +	return H_SUCCESS;
>> +}
>> +
>> +static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
>> +		struct iommu_table *tbl, unsigned long entry)
>> +{
>> +	enum dma_data_direction dir = DMA_NONE;
>> +	unsigned long hpa = 0;
>> +	long ret;
>> +
>> +	if (iommu_tce_xchg(tbl, entry, &hpa, &dir))
>> +		return H_HARDWARE;
>> +
>> +	if (dir == DMA_NONE)
>> +		return H_SUCCESS;
>> +
>> +	ret = kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
>> +	if (ret != H_SUCCESS)
>> +		iommu_tce_xchg(tbl, entry, &hpa, &dir);
>> +
>> +	return ret;
>> +}
>> +
>> +long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
>> +		unsigned long entry, unsigned long gpa,
>> +		enum dma_data_direction dir)
>> +{
>> +	long ret;
>> +	unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>> +	struct mm_iommu_table_group_mem_t *mem;
>> +
>> +	if (!pua)
>> +		/* it_userspace allocation might be delayed */
>> +		return H_TOO_HARD;
>> +
>> +	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
>> +		return H_PARAMETER;
>> +
>> +	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
>> +	if (!mem)
>> +		return H_TOO_HARD;
>> +
>> +	if (mm_iommu_ua_to_hpa(mem, ua, &hpa))
>> +		return H_HARDWARE;
> 
> IIUC this would happen if qemu had failed to preregister all of guest
> RAM, making this indeed an H_HARDWARE.


If QEMU failed to preregister, then mm_iommu_lookup() fails and it is
TOO_HARD. mm_iommu_ua_to_hpa() in this context cannot possibly fail (unless
memory is corrupted) as it only returns an error when the address is out of
bounds, and mm_iommu_lookup() has already ensured it is not.



> 
>> +	if (mm_iommu_mapped_inc(mem))
>> +		return H_HARDWARE;
> 
> I'm less clear on when this one would happen.


This may happen when there is a race with mm_iommu_put().


> 
>> +
>> +	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
>> +	if (ret) {
>> +		mm_iommu_mapped_dec(mem);
>> +		return H_TOO_HARD;
>> +	}
>> +
>> +	if (dir != DMA_NONE)
>> +		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
>> +
>> +	*pua = ua;
>> +
>> +	return 0;
>> +}
>> +
>>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>  		      unsigned long ioba, unsigned long tce)
>>  {
>>  	struct kvmppc_spapr_tce_table *stt;
>> -	long ret;
>> +	long ret, idx;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>> +	unsigned long entry, gpa;
>> +	enum dma_data_direction dir;
>>  
>>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>>  	/* 	    liobn, ioba, tce); */
>> @@ -230,6 +467,36 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>  	if (ret != H_SUCCESS)
>>  		return ret;
>>  
>> +	stit = list_first_entry_or_null(&stt->iommu_tables,
>> +			struct kvmppc_spapr_tce_iommu_table, next);
>> +	if (stit) {
>> +		entry = ioba >> stit->tbl->it_page_shift;
>> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>> +		dir = iommu_tce_direction(tce);
>> +
>> +		if (dir == DMA_NONE) {
>> +			if (iommu_tce_clear_param_check(stit->tbl, ioba, 0, 1))
>> +				return H_PARAMETER;
>> +		} else {
>> +			if (iommu_tce_put_param_check(stit->tbl, ioba, gpa))
> 
> Any way you could make these param check functions based on stt
> instead of stit->tbl?  That would let you do them before checking if
> there are any hw tables to update, avoiding the somewhat awkward
> 	if (at least one)
> 		for (each one)
> construct.

I could:
1. change iommu_tce_put_param_check() to take shift, offset and size, and drop
the use of IOMMU_PAGE_MASK(tbl) (and change all callers in vfio_iommu_spapr_tce.c);
2. make a copy of iommu_tce_put_param_check() which would take stt (see
the sketch below).

And yet this code does operate with tbl anyway, so it is awkward either way imho...
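
For reference, option 2 could look roughly like this (untested sketch; the
function name is made up, and the window bounds are assumed to have been
checked already by kvmppc_ioba_validate()):

static long kvmppc_stt_put_param_check(struct kvmppc_spapr_tce_table *stt,
		unsigned long ioba, unsigned long gpa)
{
	unsigned long mask = (1ULL << stt->page_shift) - 1;

	/* Same alignment checks as iommu_tce_put_param_check(),
	 * but against the guest view of the table */
	if ((ioba & mask) || (gpa & mask))
		return H_PARAMETER;

	return H_SUCCESS;
}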



> 
>> +				return H_PARAMETER;
>> +		}
>> +
>> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>> +			if (dir == DMA_NONE) {
>> +				ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
>> +						stit->tbl, entry);
>> +			} else {
>> +				idx = srcu_read_lock(&vcpu->kvm->srcu);
>> +				ret = kvmppc_tce_iommu_map(vcpu->kvm, stit->tbl,
>> +						entry, gpa, dir);
>> +				srcu_read_unlock(&vcpu->kvm->srcu, idx);
>> +			}
>> +			if (ret != H_SUCCESS)
>> +				return ret;
> 
> Doesn't this error path need to clean up for the case where you
> managed to update some backing TCE tables, but then failed later ones?

Probably.

This is what I asked in:
Re: [PATCH kernel v4 08/10] KVM: PPC: Separate TCE validation from update

Failure to update a hardware TCE table means we are in deep trouble; I
cannot think of any valid reason why we could get this far and not fail
earlier but fail now.


> 
>> +		}
>> +	}
>> +
>>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
>>  
>>  	return H_SUCCESS;
>> @@ -242,9 +509,10 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>  {
>>  	struct kvmppc_spapr_tce_table *stt;
>>  	long i, ret = H_SUCCESS, idx;
>> -	unsigned long entry, ua = 0;
>> +	unsigned long entry, gpa, ua = 0;
>>  	u64 __user *tces;
>>  	u64 tce;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>  
>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>>  	if (!stt)
>> @@ -272,6 +540,9 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>  	}
>>  	tces = (u64 __user *) ua;
>>  
>> +	stit = list_first_entry_or_null(&stt->iommu_tables,
>> +			struct kvmppc_spapr_tce_iommu_table, next);
>> +
>>  	for (i = 0; i < npages; ++i) {
>>  		if (get_user(tce, tces + i)) {
>>  			ret = H_TOO_HARD;
>> @@ -282,6 +553,15 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>  		ret = kvmppc_tce_validate(stt, tce);
>>  		if (ret != H_SUCCESS)
>>  			goto unlock_exit;
>> +
>> +		if (stit) {
>> +			gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>> +			ret = iommu_tce_put_param_check(stit->tbl,
>> +					ioba + (i << stit->tbl->it_page_shift),
>> +					gpa);
>> +			if (ret != H_SUCCESS)
>> +				goto unlock_exit;
>> +		}
>>  	}
>>  
>>  	for (i = 0; i < npages; ++i) {
>> @@ -291,6 +571,21 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>  		}
>>  		tce = be64_to_cpu(tce);
>>  
>> +		if (stit) {
>> +			for (i = 0; i < npages; ++i) {
>> +				gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>> +
>> +				list_for_each_entry_lockless(stit,
>> +						&stt->iommu_tables, next) {
>> +					ret = kvmppc_tce_iommu_map(vcpu->kvm,
>> +						stit->tbl, entry + i, gpa,
>> +						iommu_tce_direction(tce));
>> +					if (ret != H_SUCCESS)
>> +						goto unlock_exit;
>> +				}
> 
> Um.. what value will this for_each leave in stit after completion?  I
> suspect it will be something bogus, which means re-using stit in the
> next 0..npages loop iteration won't be safe (you only initialize stit
> with the first entry outside that loop).


#define list_for_each_entry_lockless(pos, head, member) \
  for (pos = list_entry_lockless((head)->next, typeof(*pos), member); \
     &pos->member != (head); \
     pos = list_entry_lockless(pos->member.next, typeof(*pos), member))

stit is "pos" which is reset every time the loop is called.


> 
>> +			}
>> +		}
>> +
>>  		kvmppc_tce_put(stt, entry + i, tce);
>>  	}
>>  
>> @@ -307,6 +602,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>>  {
>>  	struct kvmppc_spapr_tce_table *stt;
>>  	long i, ret;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>  
>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>>  	if (!stt)
>> @@ -320,6 +616,25 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>>  		return H_PARAMETER;
>>  
>> +	stit = list_first_entry_or_null(&stt->iommu_tables,
>> +			struct kvmppc_spapr_tce_iommu_table, next);
>> +	if (stit) {
>> +		if (iommu_tce_clear_param_check(stit->tbl, ioba,
>> +					tce_value, npages))
>> +			return H_PARAMETER;
>> +
>> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>> +			unsigned long entry = ioba >> stit->tbl->it_page_shift;
>> +
>> +			for (i = 0; i < npages; ++i) {
>> +				ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
>> +						stit->tbl, entry + i);
>> +				if (ret)
>> +					return ret;
> 
> Again do you need some sort of cleanup for partial completion?

Again,
Re: [PATCH kernel v4 08/10] KVM: PPC: Separate TCE validation from update

This is an unexpected failure which should not happen; what kind of cleanup
would it make sense to do here? Re-map what was mapped before H_STUFF_TCE
was called?


> 
> 
>> +			}
>> +		}
>> +	}
>> +
>>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>>  
>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
>> index dc1c66fda941..018c7d94a575 100644
>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
>> @@ -178,11 +178,104 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
>>  EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
>>  
>>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
>> +static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm *kvm,
>> +		struct iommu_table *tbl, unsigned long entry)
>> +{
>> +	struct mm_iommu_table_group_mem_t *mem = NULL;
>> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
>> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>> +
>> +	if (!pua)
>> +		return H_HARDWARE;
>> +
>> +	pua = (void *) vmalloc_to_phys(pua);
>> +	if (!pua)
>> +		return H_TOO_HARD;
>> +
>> +	mem = mm_iommu_lookup_rm(kvm->mm, *pua, pgsize);
>> +	if (!mem)
>> +		return H_TOO_HARD;
>> +
>> +	mm_iommu_mapped_dec(mem);
>> +
>> +	*pua = 0;
>> +
>> +	return H_SUCCESS;
>> +}
>> +
>> +static long kvmppc_rm_tce_iommu_unmap(struct kvm *kvm,
>> +		struct iommu_table *tbl, unsigned long entry)
>> +{
>> +	enum dma_data_direction dir = DMA_NONE;
>> +	unsigned long hpa = 0;
>> +	long ret;
>> +
>> +	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
>> +		return H_HARDWARE;
>> +
>> +	if (dir == DMA_NONE)
>> +		return H_SUCCESS;
>> +
>> +	ret = kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
>> +	if (ret)
>> +		iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
>> +
>> +	return ret;
>> +}
>> +
>> +long kvmppc_rm_tce_iommu_map(struct kvm_vcpu *vcpu, struct iommu_table *tbl,
>> +		unsigned long entry, unsigned long gpa,
>> +		enum dma_data_direction dir)
>> +{
>> +	long ret;
>> +	unsigned long hpa = 0, ua;
>> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>> +	struct mm_iommu_table_group_mem_t *mem;
>> +
>> +	if (!pua)
>> +		/* it_userspace allocation might be delayed */
>> +		return H_TOO_HARD;
>> +
>> +	if (kvmppc_gpa_to_ua(vcpu->kvm, gpa, &ua, NULL))
>> +		return H_PARAMETER;
>> +
>> +	mem = mm_iommu_lookup_rm(vcpu->kvm->mm, ua, 1ULL << tbl->it_page_shift);
>> +	if (!mem)
>> +		return H_TOO_HARD;
>> +
>> +	if (mm_iommu_ua_to_hpa_rm(mem, ua, &hpa))
>> +		return H_HARDWARE;
>> +
>> +	pua = (void *) vmalloc_to_phys(pua);
>> +	if (!pua)
>> +		return H_HARDWARE;
> 
> What circumstances can this fail under?  Does it need to be H_TOO_HARD instead?


When kernel memory gets corrupted and vmalloc_to_page() won't be able to
find a page which was allocated with vmalloc.


>> +
>> +	if (mm_iommu_mapped_inc(mem))
>> +		return H_HARDWARE;
>> +
>> +	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
>> +	if (ret) {
>> +		mm_iommu_mapped_dec(mem);
>> +		return H_TOO_HARD;
>> +	}
>> +
>> +	if (dir != DMA_NONE)
>> +		kvmppc_rm_tce_iommu_mapped_dec(vcpu->kvm, tbl, entry);
>> +
>> +	*pua = ua;
>> +
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(kvmppc_rm_tce_iommu_map);
>> +
>>  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>  		unsigned long ioba, unsigned long tce)
>>  {
>>  	struct kvmppc_spapr_tce_table *stt;
>>  	long ret;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>> +	unsigned long entry, gpa;
>> +	enum dma_data_direction dir;
>>  
>>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>>  	/* 	    liobn, ioba, tce); */
>> @@ -199,6 +292,33 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>  	if (ret != H_SUCCESS)
>>  		return ret;
>>  
>> +	stit = list_first_entry_or_null(&stt->iommu_tables,
>> +			struct kvmppc_spapr_tce_iommu_table, next);
>> +	if (stit) {
>> +		entry = ioba >> stit->tbl->it_page_shift;
>> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>> +		dir = iommu_tce_direction(tce);
>> +
>> +		if (dir == DMA_NONE) {
>> +			if (iommu_tce_clear_param_check(stit->tbl, ioba, 0, 1))
>> +				return H_PARAMETER;
>> +		} else {
>> +			if (iommu_tce_put_param_check(stit->tbl, ioba, gpa))
>> +				return H_PARAMETER;
>> +		}
>> +
>> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>> +			if (dir == DMA_NONE)
>> +				ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
>> +						stit->tbl, entry);
>> +			else
>> +				ret = kvmppc_rm_tce_iommu_map(vcpu, stit->tbl,
>> +						entry, gpa, dir);
>> +			if (ret != H_SUCCESS)
>> +				return ret;
>> +		}
>> +	}
>> +
>>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
>>  
>>  	return H_SUCCESS;
>> @@ -237,9 +357,10 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>  {
>>  	struct kvmppc_spapr_tce_table *stt;
>>  	long i, ret = H_SUCCESS;
>> -	unsigned long tces, entry, tce, ua = 0;
>> +	unsigned long tces, entry, gpa, tce, ua = 0;
>>  	unsigned long *rmap = NULL;
>>  	bool prereg = false;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>  
>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>>  	if (!stt)
>> @@ -303,17 +424,45 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>  		}
>>  	}
>>  
>> +	stit = list_first_entry_or_null(&stt->iommu_tables,
>> +			struct kvmppc_spapr_tce_iommu_table, next);
>> +
>>  	for (i = 0; i < npages; ++i) {
>>  		tce = be64_to_cpu(((u64 *)tces)[i]);
>>  
>>  		ret = kvmppc_tce_validate(stt, tce);
>>  		if (ret != H_SUCCESS)
>>  			goto unlock_exit;
>> +
>> +		if (stit) {
>> +			gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>> +			ret = iommu_tce_put_param_check(stit->tbl,
>> +					ioba + (i << stit->tbl->it_page_shift),
>> +					gpa);
>> +			if (ret != H_SUCCESS)
>> +				goto unlock_exit;
>> +
>> +		}
>>  	}
>>  
>>  	for (i = 0; i < npages; ++i) {
>>  		tce = be64_to_cpu(((u64 *)tces)[i]);
> 
> As noted in the earlier patch this is really dangerous - by reloading
> the tce from userspace you've thrown away the verification above.


Sure, I am adding a tces cache to kvm_vcpu.
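
Roughly like this (sketch only; the vcpu field does not exist yet and its
name is made up):

	/* Copy the guest TCE list once into a per-vcpu cache ... */
	for (i = 0; i < npages; ++i)
		vcpu->arch.tces[i] = be64_to_cpu(((u64 *)tces)[i]);

	/* ... then validate and map using only vcpu->arch.tces[i], so that
	 * userspace cannot change a TCE between validation and use */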


>> +		if (stit) {
>> +			for (i = 0; i < npages; ++i) {
>> +				gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>> +
>> +				list_for_each_entry_lockless(stit,
>> +						&stt->iommu_tables, next) {
>> +					ret = kvmppc_rm_tce_iommu_map(vcpu,
>> +						stit->tbl, entry + i, gpa,
>> +						iommu_tce_direction(tce));
>> +					if (ret != H_SUCCESS)
>> +						goto unlock_exit;
>> +				}
>> +			}
>> +		}
>> +
>>  		kvmppc_tce_put(stt, entry + i, tce);
>>  	}
>>  
>> @@ -330,6 +479,8 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>>  {
>>  	struct kvmppc_spapr_tce_table *stt;
>>  	long i, ret;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>> +
>>  
>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>>  	if (!stt)
>> @@ -343,6 +494,25 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>>  		return H_PARAMETER;
>>  
>> +	stit = list_first_entry_or_null(&stt->iommu_tables,
>> +			struct kvmppc_spapr_tce_iommu_table, next);
>> +	if (stit) {
>> +		if (iommu_tce_clear_param_check(stit->tbl, ioba,
>> +					tce_value, npages))
>> +			return H_PARAMETER;
>> +
>> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>> +			unsigned long entry = ioba >> stit->tbl->it_page_shift;
>> +
>> +			for (i = 0; i < npages; ++i) {
>> +				ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
>> +						stit->tbl, entry + i);
>> +				if (ret)
>> +					return ret;
>> +			}
>> +		}
>> +	}
>> +
>>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>>  
>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>> index cd892dec7cb6..f3127dc87912 100644
>> --- a/arch/powerpc/kvm/powerpc.c
>> +++ b/arch/powerpc/kvm/powerpc.c
>> @@ -536,6 +536,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>>  #ifdef CONFIG_PPC_BOOK3S_64
>>  	case KVM_CAP_SPAPR_TCE:
>>  	case KVM_CAP_SPAPR_TCE_64:
>> +		/* fallthrough */
> 
> I'm not sure why this one should get a fallthrough comment, when none
> of the other cases do.


I believe it was either ignored then or checkpatch.pl did not warn about
this at the time.


> 
>> +	case KVM_CAP_SPAPR_TCE_VFIO:
>>  	case KVM_CAP_PPC_RTAS:
>>  	case KVM_CAP_PPC_FIXUP_HCALL:
>>  	case KVM_CAP_PPC_ENABLE_HCALL:
>> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
>> index d32f239eb471..2b7dc22265fe 100644
>> --- a/virt/kvm/vfio.c
>> +++ b/virt/kvm/vfio.c
>> @@ -20,6 +20,10 @@
>>  #include <linux/vfio.h>
>>  #include "vfio.h"
>>  
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +#include <asm/kvm_ppc.h>
>> +#endif
>> +
>>  struct kvm_vfio_group {
>>  	struct list_head node;
>>  	struct vfio_group *vfio_group;
>> @@ -211,6 +215,9 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>>  
>>  		mutex_unlock(&kv->lock);
>>  
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +		kvm_spapr_tce_release_iommu_group(dev->kvm, vfio_group);
>> +#endif
>>  		kvm_vfio_group_set_kvm(vfio_group, NULL);
>>  
>>  		kvm_vfio_group_put_external_user(vfio_group);
>> @@ -218,6 +225,53 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>>  		kvm_vfio_update_coherency(dev);
>>  
>>  		return ret;
>> +
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
>> +		struct kvm_vfio_spapr_tce param;
>> +		unsigned long minsz;
>> +		struct kvm_vfio *kv = dev->private;
>> +		struct vfio_group *vfio_group;
>> +		struct kvm_vfio_group *kvg;
>> +		struct fd f;
>> +
>> +		minsz = offsetofend(struct kvm_vfio_spapr_tce, tablefd);
>> +
>> +		if (copy_from_user(&param, (void __user *)arg, minsz))
>> +			return -EFAULT;
>> +
>> +		if (param.argsz < minsz || param.flags)
>> +			return -EINVAL;
>> +
>> +		f = fdget(param.groupfd);
>> +		if (!f.file)
>> +			return -EBADF;
>> +
>> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
>> +		fdput(f);
>> +
>> +		if (IS_ERR(vfio_group))
>> +			return PTR_ERR(vfio_group);
>> +
> 
> 
> Is there any particular reason you unwrap the group fd here, but the
> table fd inside kvm__spapr_tce_attach_iommu_group()?

No particular reason, just an intention not to spread too much sPAPR code into
the KVM VFIO device and too much vfio_group handling into POWER KVM.

I only unwrap the table fd to see if it is in the kvm->arch.spapr_tce_tables
list; I am trying to keep spapr_tce_tables and kvmppc_spapr_tce_iommu_table
local to arch/powerpc/kvm/book3s_64_vio*.c.

Unwrapping the group fd in arch/powerpc/kvm/book3s_64_vio*.c would mean
duplicating all the kvm_vfio_group_get_external_user()/etc stubs in
arch/powerpc/kvm/book3s_64_vio.c, which I did not want to do.
I could, but since I already have the vfio_group unwrapped here, it seems
pointless to unwrap it over again in arch/powerpc/kvm/book3s_64_vio.c -
should I?



> 
>> +		ret = -ENOENT;
>> +
>> +		mutex_lock(&kv->lock);
>> +
>> +		list_for_each_entry(kvg, &kv->group_list, node) {
>> +			if (kvg->vfio_group != vfio_group)
>> +				continue;
>> +
>> +			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
>> +					param.tablefd, vfio_group);
>> +
>> +			break;
>> +		}
>> +
>> +		mutex_unlock(&kv->lock);
>> +
>> +		return ret;
>> +	}
>> +#endif /* CONFIG_SPAPR_TCE_IOMMU */
>>  	}
>>  
>>  	return -ENXIO;
>> @@ -242,6 +296,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
>>  		switch (attr->attr) {
>>  		case KVM_DEV_VFIO_GROUP_ADD:
>>  		case KVM_DEV_VFIO_GROUP_DEL:
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
>> +#endif
>>  			return 0;
>>  		}
>>  
>> @@ -257,6 +314,9 @@ static void kvm_vfio_destroy(struct kvm_device *dev)
>>  	struct kvm_vfio_group *kvg, *tmp;
>>  
>>  	list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +		kvm_spapr_tce_release_iommu_group(dev->kvm, kvg->vfio_group);
>> +#endif
>>  		kvm_vfio_group_set_kvm(kvg->vfio_group, NULL);
>>  		kvm_vfio_group_put_external_user(kvg->vfio_group);
>>  		list_del(&kvg->node);
> 


-- 
Alexey


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 839 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH kernel v4 08/10] KVM: PPC: Separate TCE validation from update
  2017-02-09  8:20       ` Alexey Kardashevskiy
@ 2017-02-10  3:07         ` David Gibson
  -1 siblings, 0 replies; 49+ messages in thread
From: David Gibson @ 2017-02-10  3:07 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Alex Williamson, Paul Mackerras, kvm-ppc, kvm

[-- Attachment #1: Type: text/plain, Size: 7810 bytes --]

On Thu, Feb 09, 2017 at 07:20:11PM +1100, Alexey Kardashevskiy wrote:
> On 09/02/17 14:51, David Gibson wrote:
> > On Tue, Feb 07, 2017 at 06:17:09PM +1100, Alexey Kardashevskiy wrote:
> >> For the emulated devices it does not matter much if we get a broken TCE
> >> half way handling a TCE list but for VFIO it will matter as it has
> >> more chances to fail so we try to do our best and check as much as we
> >> can before proceeding.
> >>
> >> This separates a guest view table update from validation. No change in
> >> behavior is expected.
> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> ---
> >>  arch/powerpc/kvm/book3s_64_vio.c    | 8 ++++++++
> >>  arch/powerpc/kvm/book3s_64_vio_hv.c | 8 ++++++--
> >>  2 files changed, 14 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> >> index 15df8ae627d9..9a7b7fca5e84 100644
> >> --- a/arch/powerpc/kvm/book3s_64_vio.c
> >> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> >> @@ -282,6 +282,14 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>  		ret = kvmppc_tce_validate(stt, tce);
> >>  		if (ret != H_SUCCESS)
> >>  			goto unlock_exit;
> >> +	}
> >> +
> >> +	for (i = 0; i < npages; ++i) {
> >> +		if (get_user(tce, tces + i)) {
> >> +			ret = H_TOO_HARD;
> >> +			goto unlock_exit;
> >> +		}
> >> +		tce = be64_to_cpu(tce);
> > 
> > This doesn't look safe.  The contents of user memory could change
> > between the two get_user()s, meaning that you're no longer guaranteed
> > a TCE loaded into kernel has been validated at all.
> > 
> > I think you need to either:
> > 
> >     a) Make sure things safe against a bad TCE being loaded into a TCE
> >     table and move all validation to where the TCE is used, rather
> >     than loaded
> > 
> > or
> >     b) Copy the whole set of indirect entries to a temporary in-kernel
> >        buffer, then validate, then load into the actual TCE table.
> 
> 
> Correct :( The problem is I do not know how far I want to go in reverting
> the state as it was when I started handling H_PUT_TCE_INDIRECT.
> 
> For example, 1 container, 2 IOMMU groups with disabled shared tables, so -
> 2 tables, 512 TCEs request and TCE#100 does not translate to host physical
> address.
> 
> 
> To do a) I'll need to remember old content of each hardware table entry as
> when I reach TCE#100, I'll need to revert to the initial state which means
> I need to write back old TCEs to all affected hardware tables and update
> reference counters of all affected preregistered areas. Well, the actual
> tables must not have different addresses (BUG_ON? is it worth testing while
> writing to hardware tables that values I am replacing are the same in all
> tables?) so I can have just a single array of old TCEs from hardware tables
> in vcpu.

I thought you said shared tables were disabled, so the two tables
would have different addresses?

Hmm.  Now I'm trying to remember, will the gpa->hpa translation fail
only if the guest/qemu does something wrong, or can it fail for other
reasons?  What about in real mode vs. virtual mode?

I think the key to this approach will be to think carefully about what
semantics you guarantee for mappings shadowed into the hardware
tables.  For example, it might work to specify that the host mappings
only match the GPA mappings if those GPA mappings are valid in the
first place.  So, H_PUT_TCE etc. would succeed as long as they're able
to update the view of the table in terms of GPA.  But when you shadow
those into the HPA tables, any entries which can't be translated you
just replace with a cleared entry. That should be enough to protect
the host.  Obviously you can expect the device to fail when you
actually attempt to DMA there, but that's the guest's (or qemu's) own
fault for putting bad addresses in the TCE table.
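
In other words, the shadowing step could look roughly like this (only a
sketch of the idea, reusing the helpers from your patch, not a concrete
proposal):

	/* Always update the guest (GPA) view of the table ... */
	kvmppc_tce_put(stt, entry, tce);

	/* ... but if the gpa->hpa translation fails, shadow a cleared entry
	 * into the hardware table instead of failing the hcall */
	if (kvmppc_tce_iommu_map(kvm, tbl, entry, gpa, dir) != H_SUCCESS)
		kvmppc_tce_iommu_unmap(kvm, tbl, entry);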

Obviously that might not be great for debugging, since mappings will
appear to succeed, but then not work later on.

This does have the nice property that it's reasonably obvious what to
do if you have some GPA mappings for emulated devices, then hotplug a
VFIO device and at that point hit a gpa->hpa translation error.
There's no hcall in this case, so there's no obvious way to return an
error to the guest.

> To do b) I'll need:
> 
> 1. to have a copy of TCEs from the guest in vcpu,

I don't quite understand this.  You need a temporary copy, yes, but I
don't see why it needs to be attached to the vcpu.

> I populate it via
> get_user() to make sure they won't change;
> 2. an array of userspace addresses translated from given TCEs; and in order
> to make sure these addresses won't go away, I'll need to reference each
> preregistered memory area via mm_iommu_mapped_inc().
> 
> When I reach TCE#100, I'll have to revert the change, i.e. call
> mm_iommu_mapped_dec().

Ugh.. yeah, I think to do this sanely, what you'd have to do is copy
the updated translations into a temp buffer.  Then you'd need to make more
temp buffers to store the UA and HPA translations (although maybe you
could overwrite/reuse the original temp buffer if you're careful).
Then only if all of those succeed do you copy them into the real
hardware tables.
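
Schematically, something like this (the buffers and helpers below are
invented names, just to illustrate the three phases):

	/* Phase 1: snapshot the guest TCE list so it cannot change under us */
	for (i = 0; i < npages; ++i)
		tces_tmp[i] = be64_to_cpu(((u64 *)tces)[i]);

	/* Phase 2: translate and take references, remembering how far we got */
	for (i = 0; i < npages; ++i) {
		ret = translate_and_get_ref(stt, tces_tmp[i], &hpa_tmp[i]);
		if (ret != H_SUCCESS)
			goto undo_refs;	/* mm_iommu_mapped_dec() entries 0..i-1 */
	}

	/* Phase 3: nothing can fail any more - update the hardware tables
	 * and the guest view from the snapshots */
	for (i = 0; i < npages; ++i)
		write_hw_and_guest_view(stt, entry + i, tces_tmp[i], hpa_tmp[i]);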

Which sounds like it might be kinda messy, at least in real mode.

> So I will end up having 2 arrays in a vcpu and simpler reverting code.
> 
> 
> Or I can do simpler version of b) which would store guest TCEs in
> kvm_vcpu_arch::tces[512] and use them after checking. If a malicious guest
> does something bad and I return from H_PUT_TCE_INDIRECT in a middle of
> request, some preregistered regions will stay referenced till the guest is
> killed or rebooted (and this will prevent memory from unregistering) - but
> this means no harm to the host;

Hrm.. that's not really true.  It's not the worst thing that can
happen, but allowing the guest to permanently lock extra chunks of
memory is a form of harm to the host.

> and with preregistered RAM, there is no
> valid reason for H_PUT_TCE_INDIRECT to fail for a good guest.
> 
> 
> 
> Which approach to pick?
> 
> 
> LoPAPR says:
> ===
> If the TCE parameter represents the logical page address of a page that is
> not valid for the calling partition, return
> H_Parameter.
> ===
> 
> 
> 
> >>  
> >>  		kvmppc_tce_put(stt, entry + i, tce);
> >>  	}
> >> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> >> index 918af76ab2b6..f8a54b7c788e 100644
> >> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> >> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> >> @@ -237,7 +237,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>  {
> >>  	struct kvmppc_spapr_tce_table *stt;
> >>  	long i, ret = H_SUCCESS;
> >> -	unsigned long tces, entry, ua = 0;
> >> +	unsigned long tces, entry, tce, ua = 0;
> >>  	unsigned long *rmap = NULL;
> >>  
> >>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >> @@ -279,11 +279,15 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>  	}
> >>  
> >>  	for (i = 0; i < npages; ++i) {
> >> -		unsigned long tce = be64_to_cpu(((u64 *)tces)[i]);
> >> +		tce = be64_to_cpu(((u64 *)tces)[i]);
> >>  
> >>  		ret = kvmppc_tce_validate(stt, tce);
> >>  		if (ret != H_SUCCESS)
> >>  			goto unlock_exit;
> >> +	}
> >> +
> >> +	for (i = 0; i < npages; ++i) {
> >> +		tce = be64_to_cpu(((u64 *)tces)[i]);
> > 
> > Same problem here.
> > 
> >>  
> >>  		kvmppc_tce_put(stt, entry + i, tce);
> >>  	}
> > 
> 
> 




-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH kernel v4 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
  2017-02-10  2:50       ` Alexey Kardashevskiy
  (?)
@ 2017-02-10  4:02         ` David Gibson
  -1 siblings, 0 replies; 49+ messages in thread
From: David Gibson @ 2017-02-10  4:02 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Paul Mackerras, Alex Williamson, linuxppc-dev, kvm, kvm-ppc

[-- Attachment #1: Type: text/plain, Size: 41765 bytes --]

On Fri, Feb 10, 2017 at 01:50:31PM +1100, Alexey Kardashevskiy wrote:
> On 09/02/17 17:41, David Gibson wrote:
> > On Tue, Feb 07, 2017 at 06:17:11PM +1100, Alexey Kardashevskiy wrote:
> >> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> >> and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
> >> without passing them to user space which saves time on switching
> >> to user space and back.
> >>
> >> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> >> KVM tries to handle a TCE request in real mode; if that fails,
> >> it passes the request to the virtual mode handler to complete the operation.
> >> If the virtual mode handler fails as well, the request is passed to
> >> the user space; this is not expected to happen though.
> >>
> >> To avoid dealing with page use counters (which is tricky in real mode),
> >> this only accelerates SPAPR TCE IOMMU v2 clients which are required
> >> to pre-register the userspace memory. The very first TCE request will
> >> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> >> of the TCE table (iommu_table::it_userspace) is not allocated till
> >> the very first mapping happens and we cannot call vmalloc in real mode.
> >>
> >> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> >> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> >> and associates a physical IOMMU table with the SPAPR TCE table (which
> >> is a guest view of the hardware IOMMU table). The iommu_table object
> >> is cached and referenced so we do not have to look up for it in real mode.
> >>
> >> This does not implement the UNSET counterpart as there is no use for it -
> >> once the acceleration is enabled, the existing userspace won't
> >> disable it unless a VFIO container is destroyed; this adds necessary
> >> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> >>
> >> As this creates a descriptor per IOMMU table-LIOBN couple (called
> >> kvmppc_spapr_tce_iommu_table), it is possible to have several
> >> descriptors with the same iommu_table (hardware IOMMU table) attached
> >> to the same LIOBN; we do not remove duplicates though as
> >> iommu_table_ops::exchange does not just update a TCE entry (which is
> >> shared among IOMMU groups) but also invalidates the TCE cache
> >> (one per IOMMU group).
> >>
> >> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> >> space.
> >>
> >> This finally makes use of vfio_external_user_iommu_id() which was
> >> introduced quite some time ago and was considered for removal.
> >>
> >> Tests show that this patch increases transmission speed from 220MB/s
> >> to 750..1020MB/s on 10Gb network (Chelsio CXGB3 10Gb ethernet card).
> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> ---
> >> Changes:
> >> v4:
> >> * added note to the commit log about allowing multiple updates of
> >> the same IOMMU table;
> >> * instead of checking whether any memory was preregistered, this
> >> returns H_TOO_HARD if a specific page was not;
> >> * fixed comments from v3 about error handling in many places;
> >> * simplified TCE handlers and merged IOMMU parts inline - for example,
> >> there used to be kvmppc_h_put_tce_iommu(), now it is merged into
> >> kvmppc_h_put_tce(); this allows checking IOBA boundaries against
> >> the first attached table only (makes the code simpler);
> >>
> >> v3:
> >> * simplified not to use VFIO group notifiers
> >> * reworked cleanup, should be cleaner/simpler now
> >>
> >> v2:
> >> * reworked to use new VFIO notifiers
> >> * now same iommu_table may appear in the list several times, to be fixed later
> >> ---
> >>
> >> This has separate copies of handlers for real and virtual modes as
> >> in fact H_PUT_TCE and H_STUFF_TCE could share a lot (common helpers
> >> would take a "realmode" flag) but H_PUT_TCE_INDIRECT uses get_user()
> >> in virtual mode and direct access in real mode and having a common
> >> helper for it would make things uglier imho.
> >>
> >>
> >> ---
> >>  Documentation/virtual/kvm/devices/vfio.txt |  22 +-
> >>  arch/powerpc/include/asm/kvm_host.h        |   8 +
> >>  arch/powerpc/include/asm/kvm_ppc.h         |   4 +
> >>  include/uapi/linux/kvm.h                   |   8 +
> >>  arch/powerpc/kvm/book3s_64_vio.c           | 319 ++++++++++++++++++++++++++++-
> >>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 172 +++++++++++++++-
> >>  arch/powerpc/kvm/powerpc.c                 |   2 +
> >>  virt/kvm/vfio.c                            |  60 ++++++
> >>  8 files changed, 590 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> >> index ef51740c67ca..f95d867168ea 100644
> >> --- a/Documentation/virtual/kvm/devices/vfio.txt
> >> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> >> @@ -16,7 +16,25 @@ Groups:
> >>  
> >>  KVM_DEV_VFIO_GROUP attributes:
> >>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> >> +	kvm_device_attr.addr points to an int32_t file descriptor
> >> +	for the VFIO group.
> >>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
> >> +	kvm_device_attr.addr points to an int32_t file descriptor
> >> +	for the VFIO group.
> >> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
> >> +	allocated by sPAPR KVM.
> >> +	kvm_device_attr.addr points to a struct:
> >>  
> >> -For each, kvm_device_attr.addr points to an int32_t file descriptor
> >> -for the VFIO group.
> >> +	struct kvm_vfio_spapr_tce {
> >> +		__u32	argsz;
> >> +		__u32	flags;
> >> +		__s32	groupfd;
> >> +		__s32	tablefd;
> >> +	};
> >> +
> >> +	where
> >> +	@argsz is the size of kvm_vfio_spapr_tce_liobn;
> >> +	@flags are not supported now, must be zero;
> >> +	@groupfd is a file descriptor for a VFIO group;
> >> +	@tablefd is a file descriptor for a TCE table allocated via
> >> +		KVM_CREATE_SPAPR_TCE.
> >> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> >> index e59b172666cd..a827006941f8 100644
> >> --- a/arch/powerpc/include/asm/kvm_host.h
> >> +++ b/arch/powerpc/include/asm/kvm_host.h
> >> @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
> >>  	atomic_t refcnt;
> >>  };
> >>  
> >> +struct kvmppc_spapr_tce_iommu_table {
> >> +	struct rcu_head rcu;
> >> +	struct list_head next;
> >> +	struct vfio_group *group;
> >> +	struct iommu_table *tbl;
> >> +};
> >> +
> >>  struct kvmppc_spapr_tce_table {
> >>  	struct list_head list;
> >>  	struct kvm *kvm;
> >> @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
> >>  	u32 page_shift;
> >>  	u64 offset;		/* in pages */
> >>  	u64 size;		/* window size in pages */
> >> +	struct list_head iommu_tables;
> >>  	struct page *pages[0];
> >>  };
> >>  
> >> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> >> index 37bc9e7e90ba..da1410bd6b36 100644
> >> --- a/arch/powerpc/include/asm/kvm_ppc.h
> >> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> >> @@ -163,6 +163,10 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
> >>  extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
> >>  			struct kvm_memory_slot *memslot, unsigned long porder);
> >>  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
> >> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
> >> +		struct vfio_group *group);
> >> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
> >> +		struct vfio_group *group);
> >>  
> >>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> >>  				struct kvm_create_spapr_tce_64 *args);
> >> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >> index a2c9bb5a0ead..cdfa01169bd2 100644
> >> --- a/include/uapi/linux/kvm.h
> >> +++ b/include/uapi/linux/kvm.h
> >> @@ -1076,6 +1076,7 @@ struct kvm_device_attr {
> >>  #define  KVM_DEV_VFIO_GROUP			1
> >>  #define   KVM_DEV_VFIO_GROUP_ADD			1
> >>  #define   KVM_DEV_VFIO_GROUP_DEL			2
> >> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
> >>  
> >>  enum kvm_device_type {
> >>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
> >> @@ -1097,6 +1098,13 @@ enum kvm_device_type {
> >>  	KVM_DEV_TYPE_MAX,
> >>  };
> >>  
> >> +struct kvm_vfio_spapr_tce {
> >> +	__u32	argsz;
> >> +	__u32	flags;
> >> +	__s32	groupfd;
> >> +	__s32	tablefd;
> >> +};
> >> +
> >>  /*
> >>   * ioctls for VM fds
> >>   */
> >> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> >> index 9a7b7fca5e84..cb0469151e35 100644
> >> --- a/arch/powerpc/kvm/book3s_64_vio.c
> >> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> >> @@ -27,6 +27,10 @@
> >>  #include <linux/hugetlb.h>
> >>  #include <linux/list.h>
> >>  #include <linux/anon_inodes.h>
> >> +#include <linux/iommu.h>
> >> +#include <linux/file.h>
> >> +#include <linux/vfio.h>
> >> +#include <linux/module.h>
> >>  
> >>  #include <asm/tlbflush.h>
> >>  #include <asm/kvm_ppc.h>
> >> @@ -39,6 +43,36 @@
> >>  #include <asm/udbg.h>
> >>  #include <asm/iommu.h>
> >>  #include <asm/tce.h>
> >> +#include <asm/mmu_context.h>
> >> +
> >> +static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
> >> +{
> >> +	void (*fn)(struct vfio_group *);
> >> +
> >> +	fn = symbol_get(vfio_group_put_external_user);
> >> +	if (WARN_ON(!fn))
> >> +		return;
> >> +
> >> +	fn(vfio_group);
> >> +
> >> +	symbol_put(vfio_group_put_external_user);
> >> +}
> >> +
> >> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> >> +{
> >> +	int (*fn)(struct vfio_group *);
> >> +	int ret = -1;
> >> +
> >> +	fn = symbol_get(vfio_external_user_iommu_id);
> >> +	if (!fn)
> >> +		return ret;
> >> +
> >> +	ret = fn(vfio_group);
> >> +
> >> +	symbol_put(vfio_external_user_iommu_id);
> >> +
> >> +	return ret;
> >> +}
> >>  
> >>  static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
> >>  {
> >> @@ -90,6 +124,123 @@ static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
> >>  	return ret;
> >>  }
> >>  
> >> +static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
> >> +{
> >> +	struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
> >> +			struct kvmppc_spapr_tce_iommu_table, rcu);
> >> +
> >> +	iommu_table_put(stit->tbl);
> >> +	kvm_vfio_group_put_external_user(stit->group);
> >> +
> >> +	kfree(stit);
> >> +}
> >> +
> >> +static void kvm_spapr_tce_liobn_release_iommu_group(
> >> +		struct kvmppc_spapr_tce_table *stt,
> >> +		struct vfio_group *group)
> >> +{
> >> +	struct kvmppc_spapr_tce_iommu_table *stit, *tmp;
> >> +
> >> +	list_for_each_entry_safe(stit, tmp, &stt->iommu_tables, next) {
> >> +		if (group && (stit->group != group))
> >> +			continue;
> >> +
> >> +		list_del_rcu(&stit->next);
> >> +
> >> +		call_rcu(&stit->rcu, kvm_spapr_tce_iommu_table_free);
> >> +	}
> >> +}
> >> +
> >> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
> >> +		struct vfio_group *group)
> >> +{
> >> +	struct kvmppc_spapr_tce_table *stt;
> >> +
> >> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list)
> >> +		kvm_spapr_tce_liobn_release_iommu_group(stt, group);
> >> +}
> >> +
> >> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
> >> +		struct vfio_group *group)
> >> +{
> >> +	struct kvmppc_spapr_tce_table *stt = NULL;
> >> +	bool found = false;
> >> +	struct iommu_table *tbl = NULL;
> >> +	struct iommu_table_group *table_group;
> >> +	long i, ret = 0;
> >> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >> +	struct fd f;
> >> +	int group_id;
> >> +	struct iommu_group *grp;
> >> +
> >> +	group_id = kvm_vfio_external_user_iommu_id(group);
> >> +	grp = iommu_group_get_by_id(group_id);
> >> +	if (!grp)
> >> +		return -EFAULT;
> > 
> > EFAULT doesn't look right, that's usually means userspace has give us
> > a bad address.  What does failure to look up the iommu group by id
> > mean here?
> 
> 
> iommu_group_get_by_id() can fail -
> 1. if "something went very wrong" - as group ids are allocated when devices
> are discovered so they are pretty static;
> 2. if there is some racy sriov disable or host pci hotunplug;

Ok, sounds like it should be a WARN_ON() plus.. hmm EIO, I guess?

> 3. kvm_vfio_external_user_iommu_id() returned an invalid group id which means
> that a device was unbound from the vfio-pci driver but the caller holds a
> reference to vfio_group so this should not happen.

Ok this case you can distinguish with a check on the previous line.
So you can turn that into a WARN_ON() and EIO.
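
Something like this, say (untested sketch, reusing the names from the
patch; the exact error values are of course up for debate):

	group_id = kvm_vfio_external_user_iommu_id(group);
	if (WARN_ON(group_id < 0))	/* group already unbound from vfio-pci? */
		return -EIO;

	grp = iommu_group_get_by_id(group_id);
	if (WARN_ON(!grp))		/* racy sriov disable / pci hotunplug */
		return -EIO;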

> 
> 
> > 
> >> +
> >> +	f = fdget(tablefd);
> >> +	if (!f.file) {
> >> +		ret = -EBADF;
> >> +		goto put_exit;
> >> +	}
> >> +
> >> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
> >> +		if (stt == f.file->private_data) {
> >> +			found = true;
> >> +			break;
> >> +		}
> >> +	}
> >> +
> >> +	fdput(f);
> >> +
> >> +	if (!found) {
> >> +		ret = -ENODEV;
> > 
> > ENODEV doesn't look right either.  That generally means you're trying
> > to use a device or facility that doesn't exist.  This case just means
> > you've passed a file handle that either isn't a TCE table at all, or
> > or one associated with a different VM.  -EINVAL, I guess, overloaded
> > as it is.
> 
> Ok.
> 
> 
> 
> > 
> >> +		goto put_exit;
> > 
> > Don't you need to put the table fd as well as the iommu group which
> > you put in that exit path?
> 
> 
> It is put a few lines above.

Oh, yes, sorry.

> 
> 
> >> +	}
> >> +
> >> +	table_group = iommu_group_get_iommudata(grp);
> >> +	if (WARN_ON(!table_group)) {
> >> +		ret = -EFAULT;
> >> +		goto put_exit;
> > 
> > Again don't you need to put the table fd as well.
> 
> It is put a few lines above; I do not keep it open longer than needed.
> 
> 
> > 
> >> +	}
> >> +
> >> +	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
> >> +		struct iommu_table *tbltmp = table_group->tables[i];
> >> +
> >> +		if (!tbltmp)
> >> +			continue;
> >> +
> >> +		/*
> >> +		 * Make sure hardware table parameters are exactly the same;
> >> +		 * this is used in the TCE handlers where boundary checks
> >> +		 * use only the first attached table.
> >> +		 */
> >> +		if ((tbltmp->it_page_shift == stt->page_shift) &&
> >> +				(tbltmp->it_offset == stt->offset) &&
> >> +				(tbltmp->it_size == stt->size)) {
> >> +			tbl = tbltmp;
> >> +			break;
> >> +		}
> >> +	}
> >> +	if (!tbl) {
> >> +		ret = -ENODEV;
> > 
> > Again, ENODEV doesn't seem right.  Here the problem is that the host
> > hardware constraints don't match the guest hardware constraints.
> > Hmm.  EIO?  ENOSPC?
> 
> 
> Neither is very appealing to me... EINVAL?
> When I use "ENODEV", I am thinking of "there is no device with
> expected/requested characteristics" but this is probably wrong.

Yeah, generally ENODEV means no device at all - for example if you
mknod a device file with bogus numbers and then try to access it, that's
what you'll get.

EINVAL is correct, I guess, though I try to avoid it if there's any
excuse to do so, since it's so common.  I'll grant ENOSPC is an odd
suggestion: my rationale is that ENOSPC in its usual sense clearly
doesn't apply here, so it's not ambiguous with that.  Then, it's
vaguely thematically appropriate - you can't find space in the host
mapping windows to accommodate the guest mapping windows.  Bit of a
stretch, maybe.

> >> +		goto put_exit;
> >> +	}
> >> +
> >> +	iommu_table_get(tbl);
> >> +
> >> +	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
> >> +	stit->tbl = tbl;
> >> +	stit->group = group;
> >> +
> >> +	list_add_rcu(&stit->next, &stt->iommu_tables);
> > 
> > So if you add the same group to the same liobn multiple times, you'll
> > get multiple identical entries in this list.
> > 
> > I guess that's mostly harmless... although.. does it allow the user to
> > force the allocation of arbitrary amounts of kernel memory in that
> > list?
> 
> 
> Oh. No, I'll add a check to avoid duplicates; they do not make sense here.
> 
> 
> > 
> >> +put_exit:
> >> +	iommu_group_put(grp);
> >> +
> >> +	return ret;
> >> +}
> >> +
> >>  static void release_spapr_tce_table(struct rcu_head *head)
> >>  {
> >>  	struct kvmppc_spapr_tce_table *stt = container_of(head,
> >> @@ -132,6 +283,8 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
> >>  
> >>  	list_del_rcu(&stt->list);
> >>  
> >> +	kvm_spapr_tce_liobn_release_iommu_group(stt, NULL /* release all */);
> >> +
> >>  	kvm_put_kvm(stt->kvm);
> >>  
> >>  	kvmppc_account_memlimit(
> >> @@ -181,6 +334,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> >>  	stt->offset = args->offset;
> >>  	stt->size = size;
> >>  	stt->kvm = kvm;
> >> +	INIT_LIST_HEAD_RCU(&stt->iommu_tables);
> >>  
> >>  	for (i = 0; i < npages; i++) {
> >>  		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
> >> @@ -209,11 +363,94 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> >>  	return ret;
> >>  }
> >>  
> >> +static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
> >> +		struct iommu_table *tbl, unsigned long entry)
> >> +{
> >> +	struct mm_iommu_table_group_mem_t *mem = NULL;
> >> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> >> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> >> +
> >> +	if (!pua)
> >> +		return H_HARDWARE;
> > 
> > What could trigger this error?  Should it be a WARN_ON?
> 
> Nothing should, so yes, it can be a WARN_ON.

Ok.

> 
> 
> > 
> >> +	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
> >> +	if (!mem)
> >> +		return H_TOO_HARD;
> >> +
> >> +	mm_iommu_mapped_dec(mem);
> >> +
> >> +	*pua = 0;
> >> +
> >> +	return H_SUCCESS;
> >> +}
> >> +
> >> +static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
> >> +		struct iommu_table *tbl, unsigned long entry)
> >> +{
> >> +	enum dma_data_direction dir = DMA_NONE;
> >> +	unsigned long hpa = 0;
> >> +	long ret;
> >> +
> >> +	if (iommu_tce_xchg(tbl, entry, &hpa, &dir))
> >> +		return H_HARDWARE;
> >> +
> >> +	if (dir == DMA_NONE)
> >> +		return H_SUCCESS;
> >> +
> >> +	ret = kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
> >> +	if (ret != H_SUCCESS)
> >> +		iommu_tce_xchg(tbl, entry, &hpa, &dir);
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
> >> +		unsigned long entry, unsigned long gpa,
> >> +		enum dma_data_direction dir)
> >> +{
> >> +	long ret;
> >> +	unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> >> +	struct mm_iommu_table_group_mem_t *mem;
> >> +
> >> +	if (!pua)
> >> +		/* it_userspace allocation might be delayed */
> >> +		return H_TOO_HARD;
> >> +
> >> +	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
> >> +		return H_PARAMETER;
> >> +
> >> +	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
> >> +	if (!mem)
> >> +		return H_TOO_HARD;
> >> +
> >> +	if (mm_iommu_ua_to_hpa(mem, ua, &hpa))
> >> +		return H_HARDWARE;
> > 
> > IIUC this would happen if qemu had failed to preregister all of guest
> > RAM, making this indeed an H_HARDWARE.
> 
> 
> If QEMU failed to preregister, then mm_iommu_lookup() fails and it is
> TOO_HARD. mm_iommu_ua_to_hpa() in this context cannot possibly fail (unless
> memory is corrupted) as it only returns an error when out of bounds, and
> mm_iommu_lookup() has already ensured the bounds.

Ah, ok so it should be a WARN_ON + H_HARDWARE.

> 
> 
> 
> > 
> >> +	if (mm_iommu_mapped_inc(mem))
> >> +		return H_HARDWARE;
> > 
> > I'm less clear on when this one would happen.
> 
> 
> This may happen when there is a race with mm_iommu_put().

Ah, so I guess H_CLOSED could make sense here?

> 
> 
> > 
> >> +
> >> +	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
> >> +	if (ret) {
> >> +		mm_iommu_mapped_dec(mem);
> >> +		return H_TOO_HARD;
> >> +	}
> >> +
> >> +	if (dir != DMA_NONE)
> >> +		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
> >> +
> >> +	*pua = ua;
> >> +
> >> +	return 0;
> >> +}
> >> +
> >>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> >>  		      unsigned long ioba, unsigned long tce)
> >>  {
> >>  	struct kvmppc_spapr_tce_table *stt;
> >> -	long ret;
> >> +	long ret, idx;
> >> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >> +	unsigned long entry, gpa;
> >> +	enum dma_data_direction dir;
> >>  
> >>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
> >>  	/* 	    liobn, ioba, tce); */
> >> @@ -230,6 +467,36 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> >>  	if (ret != H_SUCCESS)
> >>  		return ret;
> >>  
> >> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> >> +			struct kvmppc_spapr_tce_iommu_table, next);
> >> +	if (stit) {
> >> +		entry = ioba >> stit->tbl->it_page_shift;
> >> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >> +		dir = iommu_tce_direction(tce);
> >> +
> >> +		if (dir == DMA_NONE) {
> >> +			if (iommu_tce_clear_param_check(stit->tbl, ioba, 0, 1))
> >> +				return H_PARAMETER;
> >> +		} else {
> >> +			if (iommu_tce_put_param_check(stit->tbl, ioba, gpa))
> > 
> > Any way you could make these param check functions based on stt
> > instead of stit->tbl?  That would let you do them before checking if
> > there are any hw tables to update, avoiding the somewhat awkward
> > 	if (at least one)
> > 		for (each one)
> > construct.
> 
> I could:
> 1. change iommu_tce_put_param_check() to take shift, offset, size and drop
> use of IOMMU_PAGE_MASK(tbl) (and change all callers in vfio_iommu_spapr_tce.c);
> 2. make a copy of iommu_tce_put_param_check() which would take stt.

I'd suggest doing (1) but giving the full version a new name, then
define both a tbl and stt version as trivial wrappers on that.  Makes
this a bit neater without having to change all the non-KVM related callers.
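
Roughly (just a sketch; the _param_check_shift name and the exact
argument list are made up for illustration):

	extern long iommu_tce_put_param_check_shift(unsigned long offset,
			unsigned long size, unsigned int page_shift,
			unsigned long ioba, unsigned long gpa);

	static inline long iommu_tce_put_param_check(struct iommu_table *tbl,
			unsigned long ioba, unsigned long gpa)
	{
		return iommu_tce_put_param_check_shift(tbl->it_offset,
				tbl->it_size, tbl->it_page_shift, ioba, gpa);
	}

	static inline long kvmppc_tce_put_param_check(
			struct kvmppc_spapr_tce_table *stt,
			unsigned long ioba, unsigned long gpa)
	{
		return iommu_tce_put_param_check_shift(stt->offset, stt->size,
				stt->page_shift, ioba, gpa);
	}

Then kvmppc_h_put_tce() could do the check on stt before even looking at
stt->iommu_tables.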

> And yet this code does operate with tbl anyway, awkward either way imho...
> 
> 
> 
> > 
> >> +				return H_PARAMETER;
> >> +		}
> >> +
> >> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> >> +			if (dir == DMA_NONE) {
> >> +				ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
> >> +						stit->tbl, entry);
> >> +			} else {
> >> +				idx = srcu_read_lock(&vcpu->kvm->srcu);
> >> +				ret = kvmppc_tce_iommu_map(vcpu->kvm, stit->tbl,
> >> +						entry, gpa, dir);
> >> +				srcu_read_unlock(&vcpu->kvm->srcu, idx);
> >> +			}
> >> +			if (ret != H_SUCCESS)
> >> +				return ret;
> > 
> > Doesn't this error path need to clean up for the case where you
> > managed to update some backing TCE tables, but then failed later ones?
> 
> Probably.
> 
> This is what I asked in:
> Re: [PATCH kernel v4 08/10] KVM: PPC: Separate TCE validation from update
> 
> Failure to update a hardware TCE table means we are in deep trouble; I
> cannot think of any valid reason why we could get this far and not fail
> earlier, only to fail now.

Ok, I've made some suggestions about that in reply to that patch.
> 
> 
> > 
> >> +		}
> >> +	}
> >> +
> >>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
> >>  
> >>  	return H_SUCCESS;
> >> @@ -242,9 +509,10 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>  {
> >>  	struct kvmppc_spapr_tce_table *stt;
> >>  	long i, ret = H_SUCCESS, idx;
> >> -	unsigned long entry, ua = 0;
> >> +	unsigned long entry, gpa, ua = 0;
> >>  	u64 __user *tces;
> >>  	u64 tce;
> >> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >>  
> >>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >>  	if (!stt)
> >> @@ -272,6 +540,9 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>  	}
> >>  	tces = (u64 __user *) ua;
> >>  
> >> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> >> +			struct kvmppc_spapr_tce_iommu_table, next);
> >> +
> >>  	for (i = 0; i < npages; ++i) {
> >>  		if (get_user(tce, tces + i)) {
> >>  			ret = H_TOO_HARD;
> >> @@ -282,6 +553,15 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>  		ret = kvmppc_tce_validate(stt, tce);
> >>  		if (ret != H_SUCCESS)
> >>  			goto unlock_exit;
> >> +
> >> +		if (stit) {
> >> +			gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >> +			ret = iommu_tce_put_param_check(stit->tbl,
> >> +					ioba + (i << stit->tbl->it_page_shift),
> >> +					gpa);
> >> +			if (ret != H_SUCCESS)
> >> +				goto unlock_exit;
> >> +		}
> >>  	}
> >>  
> >>  	for (i = 0; i < npages; ++i) {
> >> @@ -291,6 +571,21 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>  		}
> >>  		tce = be64_to_cpu(tce);
> >>  
> >> +		if (stit) {
> >> +			for (i = 0; i < npages; ++i) {
> >> +				gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >> +
> >> +				list_for_each_entry_lockless(stit,
> >> +						&stt->iommu_tables, next) {
> >> +					ret = kvmppc_tce_iommu_map(vcpu->kvm,
> >> +						stit->tbl, entry + i, gpa,
> >> +						iommu_tce_direction(tce));
> >> +					if (ret != H_SUCCESS)
> >> +						goto unlock_exit;
> >> +				}
> > 
> > Um.. what value will this for_each leave in stit after completion?  I
> > suspect it will be something bogus, which means re-using stit in the
> > next 0..npages loop iteration won't be safe (you only initialize stit
> > with the first entry outside that loop).
> 
> 
> #define list_for_each_entry_lockless(pos, head, member) \
>   for (pos = list_entry_lockless((head)->next, typeof(*pos), member); \
>      &pos->member != (head); \
>      pos = list_entry_lockless(pos->member.next, typeof(*pos), member))
> 
> stit is "pos" which is reset every time the loop is called.

Um.. I'm not concerned about the access to stit within the
list_for_each().  It's the 'if (stit)' a few lines above I'm worried
about.

On the first iteration of the *outer* loop (for i=0..npages) stit has
been set correctly to list_first_entry_or_null().  But on subsequent
iterations of that outer loop, it has whatever value it has after the
completion of the list_for_each() in the previous iteration of the
outer loop.  I don't think it's wise to rely on what that value will
be.

Simplest fix would be to introduce a stit2 as the counter for the
inner loop.
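
i.e. something along these lines (untested sketch, keeping the names
from the patch):

	struct kvmppc_spapr_tce_iommu_table *stit2;

	for (i = 0; i < npages; ++i) {
		if (get_user(tce, tces + i)) {
			ret = H_TOO_HARD;
			goto unlock_exit;
		}
		tce = be64_to_cpu(tce);
		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);

		/* Does nothing if the list is empty, so no if (stit) needed */
		list_for_each_entry_lockless(stit2, &stt->iommu_tables, next) {
			ret = kvmppc_tce_iommu_map(vcpu->kvm, stit2->tbl,
					entry + i, gpa,
					iommu_tce_direction(tce));
			if (ret != H_SUCCESS)
				goto unlock_exit;
		}

		kvmppc_tce_put(stt, entry + i, tce);
	}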

> 
> 
> > 
> >> +			}
> >> +		}
> >> +
> >>  		kvmppc_tce_put(stt, entry + i, tce);
> >>  	}
> >>  
> >> @@ -307,6 +602,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
> >>  {
> >>  	struct kvmppc_spapr_tce_table *stt;
> >>  	long i, ret;
> >> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >>  
> >>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >>  	if (!stt)
> >> @@ -320,6 +616,25 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
> >>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
> >>  		return H_PARAMETER;
> >>  
> >> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> >> +			struct kvmppc_spapr_tce_iommu_table, next);
> >> +	if (stit) {
> >> +		if (iommu_tce_clear_param_check(stit->tbl, ioba,
> >> +					tce_value, npages))
> >> +			return H_PARAMETER;
> >> +
> >> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> >> +			unsigned long entry = ioba >> stit->tbl->it_page_shift;
> >> +
> >> +			for (i = 0; i < npages; ++i) {
> >> +				ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
> >> +						stit->tbl, entry + i);
> >> +				if (ret)
> >> +					return ret;
> > 
> > Again do you need some sort of cleanup for partial completion?
> 
> Again,
> Re: [PATCH kernel v4 08/10] KVM: PPC: Separate TCE validation from update
> 
> This is an unexpected failure which should not happen; what kind of cleanup
> would make sense to do here? Re-map what was mapped before H_STUFF_TCE
> was called?

Ok, documenting to me the fact that it's a "can't happen" is one of
the reasons I like to see WARN_ON()s in those cases.
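
Even just (sketch; the real mode paths would want a printk-safe
equivalent rather than a plain WARN_ON):

	ret = kvmppc_tce_iommu_unmap(vcpu->kvm, stit->tbl, entry + i);
	if (WARN_ON_ONCE(ret != H_SUCCESS))
		return ret;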

> 
> > 
> > 
> >> +			}
> >> +		}
> >> +	}
> >> +
> >>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
> >>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
> >>  
> >> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> >> index dc1c66fda941..018c7d94a575 100644
> >> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> >> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> >> @@ -178,11 +178,104 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
> >>  EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
> >>  
> >>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> >> +static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm *kvm,
> >> +		struct iommu_table *tbl, unsigned long entry)
> >> +{
> >> +	struct mm_iommu_table_group_mem_t *mem = NULL;
> >> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> >> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> >> +
> >> +	if (!pua)
> >> +		return H_HARDWARE;
> >> +
> >> +	pua = (void *) vmalloc_to_phys(pua);
> >> +	if (!pua)
> >> +		return H_TOO_HARD;
> >> +
> >> +	mem = mm_iommu_lookup_rm(kvm->mm, *pua, pgsize);
> >> +	if (!mem)
> >> +		return H_TOO_HARD;
> >> +
> >> +	mm_iommu_mapped_dec(mem);
> >> +
> >> +	*pua = 0;
> >> +
> >> +	return H_SUCCESS;
> >> +}
> >> +
> >> +static long kvmppc_rm_tce_iommu_unmap(struct kvm *kvm,
> >> +		struct iommu_table *tbl, unsigned long entry)
> >> +{
> >> +	enum dma_data_direction dir = DMA_NONE;
> >> +	unsigned long hpa = 0;
> >> +	long ret;
> >> +
> >> +	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
> >> +		return H_HARDWARE;
> >> +
> >> +	if (dir == DMA_NONE)
> >> +		return H_SUCCESS;
> >> +
> >> +	ret = kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
> >> +	if (ret)
> >> +		iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +long kvmppc_rm_tce_iommu_map(struct kvm_vcpu *vcpu, struct iommu_table *tbl,
> >> +		unsigned long entry, unsigned long gpa,
> >> +		enum dma_data_direction dir)
> >> +{
> >> +	long ret;
> >> +	unsigned long hpa = 0, ua;
> >> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> >> +	struct mm_iommu_table_group_mem_t *mem;
> >> +
> >> +	if (!pua)
> >> +		/* it_userspace allocation might be delayed */
> >> +		return H_TOO_HARD;
> >> +
> >> +	if (kvmppc_gpa_to_ua(vcpu->kvm, gpa, &ua, NULL))
> >> +		return H_PARAMETER;
> >> +
> >> +	mem = mm_iommu_lookup_rm(vcpu->kvm->mm, ua, 1ULL << tbl->it_page_shift);
> >> +	if (!mem)
> >> +		return H_TOO_HARD;
> >> +
> >> +	if (mm_iommu_ua_to_hpa_rm(mem, ua, &hpa))
> >> +		return H_HARDWARE;
> >> +
> >> +	pua = (void *) vmalloc_to_phys(pua);
> >> +	if (!pua)
> >> +		return H_HARDWARE;
> > 
> > What circumstances can this fail under?  Does it need to be H_TOO_HARD instead?
> 
> 
> When kernel memory gets corrupted and vmalloc_to_page() won't be able to
> find a page which was allocated with vmalloc.

Ok, so again there should be a WARN_ON().

> 
> 
> >> +
> >> +	if (mm_iommu_mapped_inc(mem))
> >> +		return H_HARDWARE;
> >> +
> >> +	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> >> +	if (ret) {
> >> +		mm_iommu_mapped_dec(mem);
> >> +		return H_TOO_HARD;
> >> +	}
> >> +
> >> +	if (dir != DMA_NONE)
> >> +		kvmppc_rm_tce_iommu_mapped_dec(vcpu->kvm, tbl, entry);
> >> +
> >> +	*pua = ua;
> >> +
> >> +	return 0;
> >> +}
> >> +EXPORT_SYMBOL_GPL(kvmppc_rm_tce_iommu_map);
> >> +
> >>  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> >>  		unsigned long ioba, unsigned long tce)
> >>  {
> >>  	struct kvmppc_spapr_tce_table *stt;
> >>  	long ret;
> >> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >> +	unsigned long entry, gpa;
> >> +	enum dma_data_direction dir;
> >>  
> >>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
> >>  	/* 	    liobn, ioba, tce); */
> >> @@ -199,6 +292,33 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> >>  	if (ret != H_SUCCESS)
> >>  		return ret;
> >>  
> >> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> >> +			struct kvmppc_spapr_tce_iommu_table, next);
> >> +	if (stit) {
> >> +		entry = ioba >> stit->tbl->it_page_shift;
> >> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >> +		dir = iommu_tce_direction(tce);
> >> +
> >> +		if (dir == DMA_NONE) {
> >> +			if (iommu_tce_clear_param_check(stit->tbl, ioba, 0, 1))
> >> +				return H_PARAMETER;
> >> +		} else {
> >> +			if (iommu_tce_put_param_check(stit->tbl, ioba, gpa))
> >> +				return H_PARAMETER;
> >> +		}
> >> +
> >> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> >> +			if (dir == DMA_NONE)
> >> +				ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
> >> +						stit->tbl, entry);
> >> +			else
> >> +				ret = kvmppc_rm_tce_iommu_map(vcpu, stit->tbl,
> >> +						entry, gpa, dir);
> >> +			if (ret != H_SUCCESS)
> >> +				return ret;
> >> +		}
> >> +	}
> >> +
> >>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
> >>  
> >>  	return H_SUCCESS;
> >> @@ -237,9 +357,10 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>  {
> >>  	struct kvmppc_spapr_tce_table *stt;
> >>  	long i, ret = H_SUCCESS;
> >> -	unsigned long tces, entry, tce, ua = 0;
> >> +	unsigned long tces, entry, gpa, tce, ua = 0;
> >>  	unsigned long *rmap = NULL;
> >>  	bool prereg = false;
> >> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >>  
> >>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >>  	if (!stt)
> >> @@ -303,17 +424,45 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>  		}
> >>  	}
> >>  
> >> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> >> +			struct kvmppc_spapr_tce_iommu_table, next);
> >> +
> >>  	for (i = 0; i < npages; ++i) {
> >>  		tce = be64_to_cpu(((u64 *)tces)[i]);
> >>  
> >>  		ret = kvmppc_tce_validate(stt, tce);
> >>  		if (ret != H_SUCCESS)
> >>  			goto unlock_exit;
> >> +
> >> +		if (stit) {
> >> +			gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >> +			ret = iommu_tce_put_param_check(stit->tbl,
> >> +					ioba + (i << stit->tbl->it_page_shift),
> >> +					gpa);
> >> +			if (ret != H_SUCCESS)
> >> +				goto unlock_exit;
> >> +
> >> +		}
> >>  	}
> >>  
> >>  	for (i = 0; i < npages; ++i) {
> >>  		tce = be64_to_cpu(((u64 *)tces)[i]);
> > 
> > As noted in the earlier patch this is really dangerous - by reloading
> > the tce from userspace you've thrown away the verification above.
> 
> 
> Sure, I am adding a tces cache to kvm_vcpu.

> 
> 
> >> +		if (stit) {
> >> +			for (i = 0; i < npages; ++i) {
> >> +				gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >> +
> >> +				list_for_each_entry_lockless(stit,
> >> +						&stt->iommu_tables, next) {
> >> +					ret = kvmppc_rm_tce_iommu_map(vcpu,
> >> +						stit->tbl, entry + i, gpa,
> >> +						iommu_tce_direction(tce));
> >> +					if (ret != H_SUCCESS)
> >> +						goto unlock_exit;
> >> +				}
> >> +			}
> >> +		}
> >> +
> >>  		kvmppc_tce_put(stt, entry + i, tce);
> >>  	}
> >>  
> >> @@ -330,6 +479,8 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
> >>  {
> >>  	struct kvmppc_spapr_tce_table *stt;
> >>  	long i, ret;
> >> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >> +
> >>  
> >>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >>  	if (!stt)
> >> @@ -343,6 +494,25 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
> >>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
> >>  		return H_PARAMETER;
> >>  
> >> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> >> +			struct kvmppc_spapr_tce_iommu_table, next);
> >> +	if (stit) {
> >> +		if (iommu_tce_clear_param_check(stit->tbl, ioba,
> >> +					tce_value, npages))
> >> +			return H_PARAMETER;
> >> +
> >> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> >> +			unsigned long entry = ioba >> stit->tbl->it_page_shift;
> >> +
> >> +			for (i = 0; i < npages; ++i) {
> >> +				ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
> >> +						stit->tbl, entry + i);
> >> +				if (ret)
> >> +					return ret;
> >> +			}
> >> +		}
> >> +	}
> >> +
> >>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
> >>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
> >>  
> >> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> >> index cd892dec7cb6..f3127dc87912 100644
> >> --- a/arch/powerpc/kvm/powerpc.c
> >> +++ b/arch/powerpc/kvm/powerpc.c
> >> @@ -536,6 +536,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> >>  #ifdef CONFIG_PPC_BOOK3S_64
> >>  	case KVM_CAP_SPAPR_TCE:
> >>  	case KVM_CAP_SPAPR_TCE_64:
> >> +		/* fallthrough */
> > 
> > I'm not sure why this one should get a fallthrough comment, when none
> > of the other cases do.
> 
> 
> I believe it was either ignored then or checkpatch.pl did not warn about
> this at the time.

Hm. Sounds like a bug in checkpatch.pl TBH.  Fall through after
executing code for one case definitely requires a comment IMO;
fallthrough from an empty label - i.e. where there's just a bunch of
different labels sharing the same code block - doesn't require one, I feel.
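
That is, no comment needed for:

	case KVM_CAP_SPAPR_TCE:
	case KVM_CAP_SPAPR_TCE_64:
	case KVM_CAP_SPAPR_TCE_VFIO:
		r = 1;
		break;

but one is wanted for something like (made-up example, KVM_CAP_FOO/BAR
are not real capabilities):

	case KVM_CAP_FOO:
		r = do_extra_check();
		/* fallthrough */
	case KVM_CAP_BAR:
		r = 1;
		break;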

> 
> 
> > 
> >> +	case KVM_CAP_SPAPR_TCE_VFIO:
> >>  	case KVM_CAP_PPC_RTAS:
> >>  	case KVM_CAP_PPC_FIXUP_HCALL:
> >>  	case KVM_CAP_PPC_ENABLE_HCALL:
> >> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> >> index d32f239eb471..2b7dc22265fe 100644
> >> --- a/virt/kvm/vfio.c
> >> +++ b/virt/kvm/vfio.c
> >> @@ -20,6 +20,10 @@
> >>  #include <linux/vfio.h>
> >>  #include "vfio.h"
> >>  
> >> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >> +#include <asm/kvm_ppc.h>
> >> +#endif
> >> +
> >>  struct kvm_vfio_group {
> >>  	struct list_head node;
> >>  	struct vfio_group *vfio_group;
> >> @@ -211,6 +215,9 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >>  
> >>  		mutex_unlock(&kv->lock);
> >>  
> >> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >> +		kvm_spapr_tce_release_iommu_group(dev->kvm, vfio_group);
> >> +#endif
> >>  		kvm_vfio_group_set_kvm(vfio_group, NULL);
> >>  
> >>  		kvm_vfio_group_put_external_user(vfio_group);
> >> @@ -218,6 +225,53 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >>  		kvm_vfio_update_coherency(dev);
> >>  
> >>  		return ret;
> >> +
> >> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
> >> +		struct kvm_vfio_spapr_tce param;
> >> +		unsigned long minsz;
> >> +		struct kvm_vfio *kv = dev->private;
> >> +		struct vfio_group *vfio_group;
> >> +		struct kvm_vfio_group *kvg;
> >> +		struct fd f;
> >> +
> >> +		minsz = offsetofend(struct kvm_vfio_spapr_tce, tablefd);
> >> +
> >> +		if (copy_from_user(&param, (void __user *)arg, minsz))
> >> +			return -EFAULT;
> >> +
> >> +		if (param.argsz < minsz || param.flags)
> >> +			return -EINVAL;
> >> +
> >> +		f = fdget(param.groupfd);
> >> +		if (!f.file)
> >> +			return -EBADF;
> >> +
> >> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
> >> +		fdput(f);
> >> +
> >> +		if (IS_ERR(vfio_group))
> >> +			return PTR_ERR(vfio_group);
> >> +
> > 
> > 
> > Is there any particular reason you unwrap the group fd here, but the
> > table fd inside kvm__spapr_tce_attach_iommu_group()?
> 
> No particular reason, just an intention not to spread too much spapr into
> the KVM VFIO device or vfio_group into POWER KVM.
>
> I only unwrap table_fd to see if it is in the kvm->arch.spapr_tce_tables
> list, I am trying to keep spapr_tce_tables and kvmppc_spapr_tce_iommu_table
> local to arch/powerpc/kvm/book3s_64_vio*.c
> 
> Unwrapping groupfd in arch/powerpc/kvm/book3s_64_vio*.c would mean
> duplicating all kvm_vfio_group_get_external_user()/etc stubs in
> arch/powerpc/kvm/book3s_64_vio.c, I did not want to duplicate these stubs.
> I could but since I already have vfio_group unwrapped here, it seems
> pointless to unwrap it over again in arch/powerpc/kvm/book3s_64_vio.c,
> should I?

Ok, that seems like an adequate reason to do it this way.

> 
> 
> 
> > 
> >> +		ret = -ENOENT;
> >> +
> >> +		mutex_lock(&kv->lock);
> >> +
> >> +		list_for_each_entry(kvg, &kv->group_list, node) {
> >> +			if (kvg->vfio_group != vfio_group)
> >> +				continue;
> >> +
> >> +			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
> >> +					param.tablefd, vfio_group);
> >> +
> >> +			break;
> >> +		}
> >> +
> >> +		mutex_unlock(&kv->lock);
> >> +
> >> +		return ret;
> >> +	}
> >> +#endif /* CONFIG_SPAPR_TCE_IOMMU */
> >>  	}
> >>  
> >>  	return -ENXIO;
> >> @@ -242,6 +296,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
> >>  		switch (attr->attr) {
> >>  		case KVM_DEV_VFIO_GROUP_ADD:
> >>  		case KVM_DEV_VFIO_GROUP_DEL:
> >> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >> +		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
> >> +#endif
> >>  			return 0;
> >>  		}
> >>  
> >> @@ -257,6 +314,9 @@ static void kvm_vfio_destroy(struct kvm_device *dev)
> >>  	struct kvm_vfio_group *kvg, *tmp;
> >>  
> >>  	list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
> >> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >> +		kvm_spapr_tce_release_iommu_group(dev->kvm, kvg->vfio_group);
> >> +#endif
> >>  		kvm_vfio_group_set_kvm(kvg->vfio_group, NULL);
> >>  		kvm_vfio_group_put_external_user(kvg->vfio_group);
> >>  		list_del(&kvg->node);
> > 
> 
> 




-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH kernel v4 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
@ 2017-02-10  4:02         ` David Gibson
  0 siblings, 0 replies; 49+ messages in thread
From: David Gibson @ 2017-02-10  4:02 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Alex Williamson, Paul Mackerras, kvm-ppc, kvm

[-- Attachment #1: Type: text/plain, Size: 41765 bytes --]

On Fri, Feb 10, 2017 at 01:50:31PM +1100, Alexey Kardashevskiy wrote:
> On 09/02/17 17:41, David Gibson wrote:
> > On Tue, Feb 07, 2017 at 06:17:11PM +1100, Alexey Kardashevskiy wrote:
> >> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> >> and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO
> >> without passing them to user space which saves time on switching
> >> to user space and back.
> >>
> >> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> >> KVM tries to handle a TCE request in the real mode, if failed
> >> it passes the request to the virtual mode to complete the operation.
> >> If it a virtual mode handler fails, the request is passed to
> >> the user space; this is not expected to happen though.
> >>
> >> To avoid dealing with page use counters (which is tricky in real mode),
> >> this only accelerates SPAPR TCE IOMMU v2 clients which are required
> >> to pre-register the userspace memory. The very first TCE request will
> >> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> >> of the TCE table (iommu_table::it_userspace) is not allocated till
> >> the very first mapping happens and we cannot call vmalloc in real mode.
> >>
> >> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> >> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> >> and associates a physical IOMMU table with the SPAPR TCE table (which
> >> is a guest view of the hardware IOMMU table). The iommu_table object
> >> is cached and referenced so we do not have to look up for it in real mode.
> >>
> >> This does not implement the UNSET counterpart as there is no use for it -
> >> once the acceleration is enabled, the existing userspace won't
> >> disable it unless a VFIO container is destroyed; this adds necessary
> >> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> >>
> >> As this creates a descriptor per IOMMU table-LIOBN couple (called
> >> kvmppc_spapr_tce_iommu_table), it is possible to have several
> >> descriptors with the same iommu_table (hardware IOMMU table) attached
> >> to the same LIOBN; we do not remove duplicates though as
> >> iommu_table_ops::exchange not just update a TCE entry (which is
> >> shared among IOMMU groups) but also invalidates the TCE cache
> >> (one per IOMMU group).
> >>
> >> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> >> space.
> >>
> >> This finally makes use of vfio_external_user_iommu_id() which was
> >> introduced quite some time ago and was considered for removal.
> >>
> >> Tests show that this patch increases transmission speed from 220MB/s
> >> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> ---
> >> Changes:
> >> v4:
> >> * added note to the commit log about allowing multiple updates of
> >> the same IOMMU table;
> >> * instead of checking for if any memory was preregistered, this
> >> returns H_TOO_HARD if a specific page was not;
> >> * fixed comments from v3 about error handling in many places;
> >> * simplified TCE handlers and merged IOMMU parts inline - for example,
> >> there used to be kvmppc_h_put_tce_iommu(), now it is merged into
> >> kvmppc_h_put_tce(); this allows to check IOBA boundaries against
> >> the first attached table only (makes the code simpler);
> >>
> >> v3:
> >> * simplified not to use VFIO group notifiers
> >> * reworked cleanup, should be cleaner/simpler now
> >>
> >> v2:
> >> * reworked to use new VFIO notifiers
> >> * now same iommu_table may appear in the list several times, to be fixed later
> >> ---
> >>
> >> This has separate copies of handlers for real and virtual modes as
> >> in fact H_PUT_TCE and H_STUFF_TCE could share a lot (common helpers
> >> would take a "realmode" flag) but H_PUT_TCE_INDIRECT uses get_user()
> >> in virtual mode and direct access in real mode and having a common
> >> helper for it would make things uglier imho.
> >>
> >>
> >> ---
> >>  Documentation/virtual/kvm/devices/vfio.txt |  22 +-
> >>  arch/powerpc/include/asm/kvm_host.h        |   8 +
> >>  arch/powerpc/include/asm/kvm_ppc.h         |   4 +
> >>  include/uapi/linux/kvm.h                   |   8 +
> >>  arch/powerpc/kvm/book3s_64_vio.c           | 319 ++++++++++++++++++++++++++++-
> >>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 172 +++++++++++++++-
> >>  arch/powerpc/kvm/powerpc.c                 |   2 +
> >>  virt/kvm/vfio.c                            |  60 ++++++
> >>  8 files changed, 590 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> >> index ef51740c67ca..f95d867168ea 100644
> >> --- a/Documentation/virtual/kvm/devices/vfio.txt
> >> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> >> @@ -16,7 +16,25 @@ Groups:
> >>  
> >>  KVM_DEV_VFIO_GROUP attributes:
> >>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> >> +	kvm_device_attr.addr points to an int32_t file descriptor
> >> +	for the VFIO group.
> >>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
> >> +	kvm_device_attr.addr points to an int32_t file descriptor
> >> +	for the VFIO group.
> >> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
> >> +	allocated by sPAPR KVM.
> >> +	kvm_device_attr.addr points to a struct:
> >>  
> >> -For each, kvm_device_attr.addr points to an int32_t file descriptor
> >> -for the VFIO group.
> >> +	struct kvm_vfio_spapr_tce {
> >> +		__u32	argsz;
> >> +		__u32	flags;
> >> +		__s32	groupfd;
> >> +		__s32	tablefd;
> >> +	};
> >> +
> >> +	where
> >> +	@argsz is the size of kvm_vfio_spapr_tce_liobn;
> >> +	@flags are not supported now, must be zero;
> >> +	@groupfd is a file descriptor for a VFIO group;
> >> +	@tablefd is a file descriptor for a TCE table allocated via
> >> +		KVM_CREATE_SPAPR_TCE.
> >> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> >> index e59b172666cd..a827006941f8 100644
> >> --- a/arch/powerpc/include/asm/kvm_host.h
> >> +++ b/arch/powerpc/include/asm/kvm_host.h
> >> @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
> >>  	atomic_t refcnt;
> >>  };
> >>  
> >> +struct kvmppc_spapr_tce_iommu_table {
> >> +	struct rcu_head rcu;
> >> +	struct list_head next;
> >> +	struct vfio_group *group;
> >> +	struct iommu_table *tbl;
> >> +};
> >> +
> >>  struct kvmppc_spapr_tce_table {
> >>  	struct list_head list;
> >>  	struct kvm *kvm;
> >> @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
> >>  	u32 page_shift;
> >>  	u64 offset;		/* in pages */
> >>  	u64 size;		/* window size in pages */
> >> +	struct list_head iommu_tables;
> >>  	struct page *pages[0];
> >>  };
> >>  
> >> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> >> index 37bc9e7e90ba..da1410bd6b36 100644
> >> --- a/arch/powerpc/include/asm/kvm_ppc.h
> >> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> >> @@ -163,6 +163,10 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
> >>  extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
> >>  			struct kvm_memory_slot *memslot, unsigned long porder);
> >>  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
> >> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
> >> +		struct vfio_group *group);
> >> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
> >> +		struct vfio_group *group);
> >>  
> >>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> >>  				struct kvm_create_spapr_tce_64 *args);
> >> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >> index a2c9bb5a0ead..cdfa01169bd2 100644
> >> --- a/include/uapi/linux/kvm.h
> >> +++ b/include/uapi/linux/kvm.h
> >> @@ -1076,6 +1076,7 @@ struct kvm_device_attr {
> >>  #define  KVM_DEV_VFIO_GROUP			1
> >>  #define   KVM_DEV_VFIO_GROUP_ADD			1
> >>  #define   KVM_DEV_VFIO_GROUP_DEL			2
> >> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
> >>  
> >>  enum kvm_device_type {
> >>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
> >> @@ -1097,6 +1098,13 @@ enum kvm_device_type {
> >>  	KVM_DEV_TYPE_MAX,
> >>  };
> >>  
> >> +struct kvm_vfio_spapr_tce {
> >> +	__u32	argsz;
> >> +	__u32	flags;
> >> +	__s32	groupfd;
> >> +	__s32	tablefd;
> >> +};
> >> +
> >>  /*
> >>   * ioctls for VM fds
> >>   */
> >> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> >> index 9a7b7fca5e84..cb0469151e35 100644
> >> --- a/arch/powerpc/kvm/book3s_64_vio.c
> >> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> >> @@ -27,6 +27,10 @@
> >>  #include <linux/hugetlb.h>
> >>  #include <linux/list.h>
> >>  #include <linux/anon_inodes.h>
> >> +#include <linux/iommu.h>
> >> +#include <linux/file.h>
> >> +#include <linux/vfio.h>
> >> +#include <linux/module.h>
> >>  
> >>  #include <asm/tlbflush.h>
> >>  #include <asm/kvm_ppc.h>
> >> @@ -39,6 +43,36 @@
> >>  #include <asm/udbg.h>
> >>  #include <asm/iommu.h>
> >>  #include <asm/tce.h>
> >> +#include <asm/mmu_context.h>
> >> +
> >> +static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
> >> +{
> >> +	void (*fn)(struct vfio_group *);
> >> +
> >> +	fn = symbol_get(vfio_group_put_external_user);
> >> +	if (WARN_ON(!fn))
> >> +		return;
> >> +
> >> +	fn(vfio_group);
> >> +
> >> +	symbol_put(vfio_group_put_external_user);
> >> +}
> >> +
> >> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> >> +{
> >> +	int (*fn)(struct vfio_group *);
> >> +	int ret = -1;
> >> +
> >> +	fn = symbol_get(vfio_external_user_iommu_id);
> >> +	if (!fn)
> >> +		return ret;
> >> +
> >> +	ret = fn(vfio_group);
> >> +
> >> +	symbol_put(vfio_external_user_iommu_id);
> >> +
> >> +	return ret;
> >> +}
> >>  
> >>  static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
> >>  {
> >> @@ -90,6 +124,123 @@ static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
> >>  	return ret;
> >>  }
> >>  
> >> +static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
> >> +{
> >> +	struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
> >> +			struct kvmppc_spapr_tce_iommu_table, rcu);
> >> +
> >> +	iommu_table_put(stit->tbl);
> >> +	kvm_vfio_group_put_external_user(stit->group);
> >> +
> >> +	kfree(stit);
> >> +}
> >> +
> >> +static void kvm_spapr_tce_liobn_release_iommu_group(
> >> +		struct kvmppc_spapr_tce_table *stt,
> >> +		struct vfio_group *group)
> >> +{
> >> +	struct kvmppc_spapr_tce_iommu_table *stit, *tmp;
> >> +
> >> +	list_for_each_entry_safe(stit, tmp, &stt->iommu_tables, next) {
> >> +		if (group && (stit->group != group))
> >> +			continue;
> >> +
> >> +		list_del_rcu(&stit->next);
> >> +
> >> +		call_rcu(&stit->rcu, kvm_spapr_tce_iommu_table_free);
> >> +	}
> >> +}
> >> +
> >> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
> >> +		struct vfio_group *group)
> >> +{
> >> +	struct kvmppc_spapr_tce_table *stt;
> >> +
> >> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list)
> >> +		kvm_spapr_tce_liobn_release_iommu_group(stt, group);
> >> +}
> >> +
> >> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
> >> +		struct vfio_group *group)
> >> +{
> >> +	struct kvmppc_spapr_tce_table *stt = NULL;
> >> +	bool found = false;
> >> +	struct iommu_table *tbl = NULL;
> >> +	struct iommu_table_group *table_group;
> >> +	long i, ret = 0;
> >> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >> +	struct fd f;
> >> +	int group_id;
> >> +	struct iommu_group *grp;
> >> +
> >> +	group_id = kvm_vfio_external_user_iommu_id(group);
> >> +	grp = iommu_group_get_by_id(group_id);
> >> +	if (!grp)
> >> +		return -EFAULT;
> > 
> > EFAULT doesn't look right, that's usually means userspace has give us
> > a bad address.  What does failure to look up the iommu group by id
> > mean here?
> 
> 
> iommu_group_get_by_id() can fail -
> 1. if "something went very wrong" - as group ids are allocated when devices
> are discovered so they are pretty static;
> 2. there is some racy sriov disable or host pci hotunplug;

Ok, sounds like it should be a WARN_ON() plus.. hmm EIO, I guess?

> 3. kvm_vfio_external_user_iommu_id() returned invalid group id which means
> that a device was unbound from the vfio-pci driver but the caller holds a
> reference to vfio_group so this should not happen.

Ok this case you can distinguish with a check on the previous line.
So you can turn that into a WARN_ON() and EIO.

> 
> 
> > 
> >> +
> >> +	f = fdget(tablefd);
> >> +	if (!f.file) {
> >> +		ret = -EBADF;
> >> +		goto put_exit;
> >> +	}
> >> +
> >> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
> >> +		if (stt == f.file->private_data) {
> >> +			found = true;
> >> +			break;
> >> +		}
> >> +	}
> >> +
> >> +	fdput(f);
> >> +
> >> +	if (!found) {
> >> +		ret = -ENODEV;
> > 
> > ENODEV doesn't look right either.  That generally means you're trying
> > to use a device or facility that doesn't exist.  This case just means
> > you've passed a file handle that either isn't a TCE table at all, or
> > os one associated with a different VM.  -EINVAL, I guess, overloaded
> > as it is.
> 
> Ok.
> 
> 
> 
> > 
> >> +		goto put_exit;
> > 
> > Don't you need to put the table fd as well as the iommu group which
> > you put in that exit path?
> 
> 
> It is put few lines above.

Oh, yes, sorry.

> 
> 
> >> +	}
> >> +
> >> +	table_group = iommu_group_get_iommudata(grp);
> >> +	if (WARN_ON(!table_group)) {
> >> +		ret = -EFAULT;
> >> +		goto put_exit;
> > 
> > Again don't you need to put the table fd as well.
> 
> It is put few lines above, I do not keep it open longer than needed.
> 
> 
> > 
> >> +	}
> >> +
> >> +	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
> >> +		struct iommu_table *tbltmp = table_group->tables[i];
> >> +
> >> +		if (!tbltmp)
> >> +			continue;
> >> +
> >> +		/*
> >> +		 * Make sure hardware table parameters are exactly the same;
> >> +		 * this is used in the TCE handlers where boundary checks
> >> +		 * use only the first attached table.
> >> +		 */
> >> +		if ((tbltmp->it_page_shift == stt->page_shift) &&
> >> +				(tbltmp->it_offset == stt->offset) &&
> >> +				(tbltmp->it_size == stt->size)) {
> >> +			tbl = tbltmp;
> >> +			break;
> >> +		}
> >> +	}
> >> +	if (!tbl) {
> >> +		ret = -ENODEV;
> > 
> > Again, ENODEV doesn't seem right.  Here the problem is that the host
> > hardware constraints don't match the guest hardware constraints.
> > Hmm.  EIO?  ENOSPC?
> 
> 
> Neither is very appealing to me... EINVAL?
> When I use "ENODEV", I am thinking of "there is no device with
> expected/requested characteristics" but this is probably wrong.

Yeah, generally ENODEV means no device at all - for example if you
mknod a device file with bogus numbers then try to access it that's
what you'll get.

EINVAL is correct, I guess, though I try to avoid it if there's any
excuse to do so, since it's so common.  I'll grant ENOSPC is an odd
suggestion: my rationale is that ENOSPC in its usual sense clearly
doesn't apply here, so it's not ambiguous with that.  Then, it's
vaguely thematically appropriate - you can't find space in the host
mapping windows to accommodate the guest mapping windows.  Bit of a
stretch, maybe.

> >> +		goto put_exit;
> >> +	}
> >> +
> >> +	iommu_table_get(tbl);
> >> +
> >> +	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
> >> +	stit->tbl = tbl;
> >> +	stit->group = group;
> >> +
> >> +	list_add_rcu(&stit->next, &stt->iommu_tables);
> > 
> > So if you add the same group to the same liobn multiple times, you'll
> > get multiple identical entries in this list.
> > 
> > I guess that's mostly harmless... although.. does it allow the user to
> > force the allocation of arbitrary amounts of kernel memory in that
> > list?
> 
> 
> Oh. No, I'll add a check to avoid duplicates, they do not make sense here.
> 
> 
> > 
> >> +put_exit:
> >> +	iommu_group_put(grp);
> >> +
> >> +	return ret;
> >> +}
> >> +
> >>  static void release_spapr_tce_table(struct rcu_head *head)
> >>  {
> >>  	struct kvmppc_spapr_tce_table *stt = container_of(head,
> >> @@ -132,6 +283,8 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
> >>  
> >>  	list_del_rcu(&stt->list);
> >>  
> >> +	kvm_spapr_tce_liobn_release_iommu_group(stt, NULL /* release all */);
> >> +
> >>  	kvm_put_kvm(stt->kvm);
> >>  
> >>  	kvmppc_account_memlimit(
> >> @@ -181,6 +334,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> >>  	stt->offset = args->offset;
> >>  	stt->size = size;
> >>  	stt->kvm = kvm;
> >> +	INIT_LIST_HEAD_RCU(&stt->iommu_tables);
> >>  
> >>  	for (i = 0; i < npages; i++) {
> >>  		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
> >> @@ -209,11 +363,94 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> >>  	return ret;
> >>  }
> >>  
> >> +static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
> >> +		struct iommu_table *tbl, unsigned long entry)
> >> +{
> >> +	struct mm_iommu_table_group_mem_t *mem = NULL;
> >> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> >> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> >> +
> >> +	if (!pua)
> >> +		return H_HARDWARE;
> > 
> > What could trigger this error?  Should it be a WARN_ON?
> 
> Nothing should so yes, it can be WARN_ON.

Ok.

> 
> 
> > 
> >> +	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
> >> +	if (!mem)
> >> +		return H_TOO_HARD;
> >> +
> >> +	mm_iommu_mapped_dec(mem);
> >> +
> >> +	*pua = 0;
> >> +
> >> +	return H_SUCCESS;
> >> +}
> >> +
> >> +static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
> >> +		struct iommu_table *tbl, unsigned long entry)
> >> +{
> >> +	enum dma_data_direction dir = DMA_NONE;
> >> +	unsigned long hpa = 0;
> >> +	long ret;
> >> +
> >> +	if (iommu_tce_xchg(tbl, entry, &hpa, &dir))
> >> +		return H_HARDWARE;
> >> +
> >> +	if (dir == DMA_NONE)
> >> +		return H_SUCCESS;
> >> +
> >> +	ret = kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
> >> +	if (ret != H_SUCCESS)
> >> +		iommu_tce_xchg(tbl, entry, &hpa, &dir);
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
> >> +		unsigned long entry, unsigned long gpa,
> >> +		enum dma_data_direction dir)
> >> +{
> >> +	long ret;
> >> +	unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> >> +	struct mm_iommu_table_group_mem_t *mem;
> >> +
> >> +	if (!pua)
> >> +		/* it_userspace allocation might be delayed */
> >> +		return H_TOO_HARD;
> >> +
> >> +	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
> >> +		return H_PARAMETER;
> >> +
> >> +	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
> >> +	if (!mem)
> >> +		return H_TOO_HARD;
> >> +
> >> +	if (mm_iommu_ua_to_hpa(mem, ua, &hpa))
> >> +		return H_HARDWARE;
> > 
> > IIUC this would happen if qemu had failed to preregister all of guest
> > RAM, making this indeed an H_HARDWARE.
> 
> 
> If QEMU failed to preregister, then mm_iommu_lookup() fails and it is
> TOO_HARD. mm_iommu_ua_to_hpa() in this context cannot possibly fail (unless
> broken memory) as it only returns error when out of bounds but
> mm_iommu_lookup() ensures this.

Ah, ok so it should be a WARN_ON + H_HARDWARE.
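
i.e. something along the lines of (sketch only):

	if (WARN_ON_ONCE(mm_iommu_ua_to_hpa(mem, ua, &hpa)))
		return H_HARDWARE;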

> 
> 
> 
> > 
> >> +	if (mm_iommu_mapped_inc(mem))
> >> +		return H_HARDWARE;
> > 
> > I'm less clear on when this one would happen.
> 
> 
> This may happen when there is a race with mm_iommu_put().

Ah, so I guess H_CLOSED could make sense here?
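
i.e. (sketch, if you go with H_CLOSED):

	if (mm_iommu_mapped_inc(mem))
		return H_CLOSED;	/* raced with mm_iommu_put() */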

> 
> 
> > 
> >> +
> >> +	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
> >> +	if (ret) {
> >> +		mm_iommu_mapped_dec(mem);
> >> +		return H_TOO_HARD;
> >> +	}
> >> +
> >> +	if (dir != DMA_NONE)
> >> +		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
> >> +
> >> +	*pua = ua;
> >> +
> >> +	return 0;
> >> +}
> >> +
> >>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> >>  		      unsigned long ioba, unsigned long tce)
> >>  {
> >>  	struct kvmppc_spapr_tce_table *stt;
> >> -	long ret;
> >> +	long ret, idx;
> >> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >> +	unsigned long entry, gpa;
> >> +	enum dma_data_direction dir;
> >>  
> >>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
> >>  	/* 	    liobn, ioba, tce); */
> >> @@ -230,6 +467,36 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> >>  	if (ret != H_SUCCESS)
> >>  		return ret;
> >>  
> >> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> >> +			struct kvmppc_spapr_tce_iommu_table, next);
> >> +	if (stit) {
> >> +		entry = ioba >> stit->tbl->it_page_shift;
> >> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >> +		dir = iommu_tce_direction(tce);
> >> +
> >> +		if (dir == DMA_NONE) {
> >> +			if (iommu_tce_clear_param_check(stit->tbl, ioba, 0, 1))
> >> +				return H_PARAMETER;
> >> +		} else {
> >> +			if (iommu_tce_put_param_check(stit->tbl, ioba, gpa))
> > 
> > Any way you could make these param check functions based on stt
> > instead of stit->tbl?  That would let you do them before checking if
> > there are any hw tables to update, avoiding the somewhat awkward
> > 	if (at least one)
> > 		for (each one)
> > construct.
> 
> I could:
> 1. change iommu_tce_put_param_check() to take shift, offset, size and drop
> use of IOMMU_PAGE_MASK(tbl) (and change all callers in vfio_iommu_spapr_tce.c);
> 2. make a copy of iommu_tce_put_param_check() which would take stt.

I'd suggest doing (1) but giving the full version a new name, then
define both a tbl and stt version as trivial wrappers on that.  Makes
this a bit neater without having to change all the non-KVM related callers.
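
Roughly this shape - the names here are entirely made up, just to show
what I mean:

	/* "full" checker taking raw window parameters */
	extern long iommu_tce_put_param_check_raw(unsigned long page_shift,
			unsigned long offset, unsigned long size,
			unsigned long ioba, unsigned long gpa);

	static inline long iommu_tce_put_param_check(struct iommu_table *tbl,
			unsigned long ioba, unsigned long gpa)
	{
		return iommu_tce_put_param_check_raw(tbl->it_page_shift,
				tbl->it_offset, tbl->it_size, ioba, gpa);
	}

	static inline long kvmppc_tce_put_param_check(
			struct kvmppc_spapr_tce_table *stt,
			unsigned long ioba, unsigned long gpa)
	{
		return iommu_tce_put_param_check_raw(stt->page_shift,
				stt->offset, stt->size, ioba, gpa);
	}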

> And yet this code does operate with tbl anyway, awkward either way imho...
> 
> 
> 
> > 
> >> +				return H_PARAMETER;
> >> +		}
> >> +
> >> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> >> +			if (dir == DMA_NONE) {
> >> +				ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
> >> +						stit->tbl, entry);
> >> +			} else {
> >> +				idx = srcu_read_lock(&vcpu->kvm->srcu);
> >> +				ret = kvmppc_tce_iommu_map(vcpu->kvm, stit->tbl,
> >> +						entry, gpa, dir);
> >> +				srcu_read_unlock(&vcpu->kvm->srcu, idx);
> >> +			}
> >> +			if (ret != H_SUCCESS)
> >> +				return ret;
> > 
> > Doesn't this error path need to clean up for the case where you
> > managed to update some backing TCE tables, but then failed later ones?
> 
> Probably.
> 
> This is what I asked in:
> Re: [PATCH kernel v4 08/10] KVM: PPC: Separate TCE validation from update
> 
> Failure to update a hardware TCE table means we are in deep trouble, I
> cannot think of any valid reason how we could get this far and not fail
> before but fail now.

Ok, I've made some suggestions about that in reply to that patch.
> 
> 
> > 
> >> +		}
> >> +	}
> >> +
> >>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
> >>  
> >>  	return H_SUCCESS;
> >> @@ -242,9 +509,10 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>  {
> >>  	struct kvmppc_spapr_tce_table *stt;
> >>  	long i, ret = H_SUCCESS, idx;
> >> -	unsigned long entry, ua = 0;
> >> +	unsigned long entry, gpa, ua = 0;
> >>  	u64 __user *tces;
> >>  	u64 tce;
> >> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >>  
> >>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >>  	if (!stt)
> >> @@ -272,6 +540,9 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>  	}
> >>  	tces = (u64 __user *) ua;
> >>  
> >> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> >> +			struct kvmppc_spapr_tce_iommu_table, next);
> >> +
> >>  	for (i = 0; i < npages; ++i) {
> >>  		if (get_user(tce, tces + i)) {
> >>  			ret = H_TOO_HARD;
> >> @@ -282,6 +553,15 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>  		ret = kvmppc_tce_validate(stt, tce);
> >>  		if (ret != H_SUCCESS)
> >>  			goto unlock_exit;
> >> +
> >> +		if (stit) {
> >> +			gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >> +			ret = iommu_tce_put_param_check(stit->tbl,
> >> +					ioba + (i << stit->tbl->it_page_shift),
> >> +					gpa);
> >> +			if (ret != H_SUCCESS)
> >> +				goto unlock_exit;
> >> +		}
> >>  	}
> >>  
> >>  	for (i = 0; i < npages; ++i) {
> >> @@ -291,6 +571,21 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>  		}
> >>  		tce = be64_to_cpu(tce);
> >>  
> >> +		if (stit) {
> >> +			for (i = 0; i < npages; ++i) {
> >> +				gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >> +
> >> +				list_for_each_entry_lockless(stit,
> >> +						&stt->iommu_tables, next) {
> >> +					ret = kvmppc_tce_iommu_map(vcpu->kvm,
> >> +						stit->tbl, entry + i, gpa,
> >> +						iommu_tce_direction(tce));
> >> +					if (ret != H_SUCCESS)
> >> +						goto unlock_exit;
> >> +				}
> > 
> > Um.. what value will this for_each leave in stit after completion?  I
> > suspect it will be something bogus, which means re-using stit in the
> > next 0..npages loop iteration won't be safe (you only initialize stit
> > with the first entry outside that loop).
> 
> 
> #define list_for_each_entry_lockless(pos, head, member) \
>   for (pos = list_entry_lockless((head)->next, typeof(*pos), member); \
>      &pos->member != (head); \
>      pos = list_entry_lockless(pos->member.next, typeof(*pos), member))
> 
> stit is "pos" which is reset every time the loop is called.

Um.. I'm not concerned about the access to stit within the
list_for_each().  It's the 'if (stit)' a few lines above I'm worried
about.

On the first iteration of the *outer* loop (for i=0..npages) stit has
been set correctly to list_first_entry_or_null().  But on subsequent
iterations of that outer loop, it has whatever value it has after the
completion of the list_for_each() in the previous iteration of the
outer loop.  I don't think it's wise to rely on what that value will
be.

Simplest fix would be to introduce a stit2 as the counter for the
inner loop.
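
Something like this for the inner walk (sketch):

	if (stit) {
		struct kvmppc_spapr_tce_iommu_table *stit2;

		list_for_each_entry_lockless(stit2, &stt->iommu_tables, next) {
			ret = kvmppc_tce_iommu_map(vcpu->kvm, stit2->tbl,
					entry + i, gpa,
					iommu_tce_direction(tce));
			if (ret != H_SUCCESS)
				goto unlock_exit;
		}
	}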

> 
> 
> > 
> >> +			}
> >> +		}
> >> +
> >>  		kvmppc_tce_put(stt, entry + i, tce);
> >>  	}
> >>  
> >> @@ -307,6 +602,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
> >>  {
> >>  	struct kvmppc_spapr_tce_table *stt;
> >>  	long i, ret;
> >> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >>  
> >>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >>  	if (!stt)
> >> @@ -320,6 +616,25 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
> >>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
> >>  		return H_PARAMETER;
> >>  
> >> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> >> +			struct kvmppc_spapr_tce_iommu_table, next);
> >> +	if (stit) {
> >> +		if (iommu_tce_clear_param_check(stit->tbl, ioba,
> >> +					tce_value, npages))
> >> +			return H_PARAMETER;
> >> +
> >> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> >> +			unsigned long entry = ioba >> stit->tbl->it_page_shift;
> >> +
> >> +			for (i = 0; i < npages; ++i) {
> >> +				ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
> >> +						stit->tbl, entry + i);
> >> +				if (ret)
> >> +					return ret;
> > 
> > Again do you need some sort of cleanup for partial completion?
> 
> Again,
> Re: [PATCH kernel v4 08/10] KVM: PPC: Separate TCE validation from update
> 
> This is an unexpected failure which should not happen, what kind of cleanup
> it would make sense to do here? Re-map what was mapped before H_STUFF_TCE
> was called?

Ok.  Documenting the fact that it's a "can't happen" case is one of
the reasons I like to see WARN_ON()s in those cases.

> 
> > 
> > 
> >> +			}
> >> +		}
> >> +	}
> >> +
> >>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
> >>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
> >>  
> >> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> >> index dc1c66fda941..018c7d94a575 100644
> >> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> >> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> >> @@ -178,11 +178,104 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
> >>  EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
> >>  
> >>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> >> +static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm *kvm,
> >> +		struct iommu_table *tbl, unsigned long entry)
> >> +{
> >> +	struct mm_iommu_table_group_mem_t *mem = NULL;
> >> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> >> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> >> +
> >> +	if (!pua)
> >> +		return H_HARDWARE;
> >> +
> >> +	pua = (void *) vmalloc_to_phys(pua);
> >> +	if (!pua)
> >> +		return H_TOO_HARD;
> >> +
> >> +	mem = mm_iommu_lookup_rm(kvm->mm, *pua, pgsize);
> >> +	if (!mem)
> >> +		return H_TOO_HARD;
> >> +
> >> +	mm_iommu_mapped_dec(mem);
> >> +
> >> +	*pua = 0;
> >> +
> >> +	return H_SUCCESS;
> >> +}
> >> +
> >> +static long kvmppc_rm_tce_iommu_unmap(struct kvm *kvm,
> >> +		struct iommu_table *tbl, unsigned long entry)
> >> +{
> >> +	enum dma_data_direction dir = DMA_NONE;
> >> +	unsigned long hpa = 0;
> >> +	long ret;
> >> +
> >> +	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
> >> +		return H_HARDWARE;
> >> +
> >> +	if (dir == DMA_NONE)
> >> +		return H_SUCCESS;
> >> +
> >> +	ret = kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
> >> +	if (ret)
> >> +		iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +long kvmppc_rm_tce_iommu_map(struct kvm_vcpu *vcpu, struct iommu_table *tbl,
> >> +		unsigned long entry, unsigned long gpa,
> >> +		enum dma_data_direction dir)
> >> +{
> >> +	long ret;
> >> +	unsigned long hpa = 0, ua;
> >> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> >> +	struct mm_iommu_table_group_mem_t *mem;
> >> +
> >> +	if (!pua)
> >> +		/* it_userspace allocation might be delayed */
> >> +		return H_TOO_HARD;
> >> +
> >> +	if (kvmppc_gpa_to_ua(vcpu->kvm, gpa, &ua, NULL))
> >> +		return H_PARAMETER;
> >> +
> >> +	mem = mm_iommu_lookup_rm(vcpu->kvm->mm, ua, 1ULL << tbl->it_page_shift);
> >> +	if (!mem)
> >> +		return H_TOO_HARD;
> >> +
> >> +	if (mm_iommu_ua_to_hpa_rm(mem, ua, &hpa))
> >> +		return H_HARDWARE;
> >> +
> >> +	pua = (void *) vmalloc_to_phys(pua);
> >> +	if (!pua)
> >> +		return H_HARDWARE;
> > 
> > What circumstances can this fail under?  Does it need to be H_TOO_HARD instead?
> 
> 
> When kernel memory gets corrupted and vmalloc_to_page() won't be able to
> find a page which was allocated with vmalloc.

Ok, so again there should be a WARN_ON().

> 
> 
> >> +
> >> +	if (mm_iommu_mapped_inc(mem))
> >> +		return H_HARDWARE;
> >> +
> >> +	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> >> +	if (ret) {
> >> +		mm_iommu_mapped_dec(mem);
> >> +		return H_TOO_HARD;
> >> +	}
> >> +
> >> +	if (dir != DMA_NONE)
> >> +		kvmppc_rm_tce_iommu_mapped_dec(vcpu->kvm, tbl, entry);
> >> +
> >> +	*pua = ua;
> >> +
> >> +	return 0;
> >> +}
> >> +EXPORT_SYMBOL_GPL(kvmppc_rm_tce_iommu_map);
> >> +
> >>  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> >>  		unsigned long ioba, unsigned long tce)
> >>  {
> >>  	struct kvmppc_spapr_tce_table *stt;
> >>  	long ret;
> >> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >> +	unsigned long entry, gpa;
> >> +	enum dma_data_direction dir;
> >>  
> >>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
> >>  	/* 	    liobn, ioba, tce); */
> >> @@ -199,6 +292,33 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> >>  	if (ret != H_SUCCESS)
> >>  		return ret;
> >>  
> >> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> >> +			struct kvmppc_spapr_tce_iommu_table, next);
> >> +	if (stit) {
> >> +		entry = ioba >> stit->tbl->it_page_shift;
> >> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >> +		dir = iommu_tce_direction(tce);
> >> +
> >> +		if (dir == DMA_NONE) {
> >> +			if (iommu_tce_clear_param_check(stit->tbl, ioba, 0, 1))
> >> +				return H_PARAMETER;
> >> +		} else {
> >> +			if (iommu_tce_put_param_check(stit->tbl, ioba, gpa))
> >> +				return H_PARAMETER;
> >> +		}
> >> +
> >> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> >> +			if (dir == DMA_NONE)
> >> +				ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
> >> +						stit->tbl, entry);
> >> +			else
> >> +				ret = kvmppc_rm_tce_iommu_map(vcpu, stit->tbl,
> >> +						entry, gpa, dir);
> >> +			if (ret != H_SUCCESS)
> >> +				return ret;
> >> +		}
> >> +	}
> >> +
> >>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
> >>  
> >>  	return H_SUCCESS;
> >> @@ -237,9 +357,10 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>  {
> >>  	struct kvmppc_spapr_tce_table *stt;
> >>  	long i, ret = H_SUCCESS;
> >> -	unsigned long tces, entry, tce, ua = 0;
> >> +	unsigned long tces, entry, gpa, tce, ua = 0;
> >>  	unsigned long *rmap = NULL;
> >>  	bool prereg = false;
> >> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >>  
> >>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >>  	if (!stt)
> >> @@ -303,17 +424,45 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>  		}
> >>  	}
> >>  
> >> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> >> +			struct kvmppc_spapr_tce_iommu_table, next);
> >> +
> >>  	for (i = 0; i < npages; ++i) {
> >>  		tce = be64_to_cpu(((u64 *)tces)[i]);
> >>  
> >>  		ret = kvmppc_tce_validate(stt, tce);
> >>  		if (ret != H_SUCCESS)
> >>  			goto unlock_exit;
> >> +
> >> +		if (stit) {
> >> +			gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >> +			ret = iommu_tce_put_param_check(stit->tbl,
> >> +					ioba + (i << stit->tbl->it_page_shift),
> >> +					gpa);
> >> +			if (ret != H_SUCCESS)
> >> +				goto unlock_exit;
> >> +
> >> +		}
> >>  	}
> >>  
> >>  	for (i = 0; i < npages; ++i) {
> >>  		tce = be64_to_cpu(((u64 *)tces)[i]);
> > 
> > As noted in the earlier patch this is really dangerous - by reloading
> > the tce from userspace you've thrown away the verification above.
> 
> 
> Sure, I am adding a tces cache to kvm_vcpu.

> 
> 
> >> +		if (stit) {
> >> +			for (i = 0; i < npages; ++i) {
> >> +				gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >> +
> >> +				list_for_each_entry_lockless(stit,
> >> +						&stt->iommu_tables, next) {
> >> +					ret = kvmppc_rm_tce_iommu_map(vcpu,
> >> +						stit->tbl, entry + i, gpa,
> >> +						iommu_tce_direction(tce));
> >> +					if (ret != H_SUCCESS)
> >> +						goto unlock_exit;
> >> +				}
> >> +			}
> >> +		}
> >> +
> >>  		kvmppc_tce_put(stt, entry + i, tce);
> >>  	}
> >>  
> >> @@ -330,6 +479,8 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
> >>  {
> >>  	struct kvmppc_spapr_tce_table *stt;
> >>  	long i, ret;
> >> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >> +
> >>  
> >>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >>  	if (!stt)
> >> @@ -343,6 +494,25 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
> >>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
> >>  		return H_PARAMETER;
> >>  
> >> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> >> +			struct kvmppc_spapr_tce_iommu_table, next);
> >> +	if (stit) {
> >> +		if (iommu_tce_clear_param_check(stit->tbl, ioba,
> >> +					tce_value, npages))
> >> +			return H_PARAMETER;
> >> +
> >> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> >> +			unsigned long entry = ioba >> stit->tbl->it_page_shift;
> >> +
> >> +			for (i = 0; i < npages; ++i) {
> >> +				ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
> >> +						stit->tbl, entry + i);
> >> +				if (ret)
> >> +					return ret;
> >> +			}
> >> +		}
> >> +	}
> >> +
> >>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
> >>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
> >>  
> >> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> >> index cd892dec7cb6..f3127dc87912 100644
> >> --- a/arch/powerpc/kvm/powerpc.c
> >> +++ b/arch/powerpc/kvm/powerpc.c
> >> @@ -536,6 +536,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> >>  #ifdef CONFIG_PPC_BOOK3S_64
> >>  	case KVM_CAP_SPAPR_TCE:
> >>  	case KVM_CAP_SPAPR_TCE_64:
> >> +		/* fallthrough */
> > 
> > I'm not sure why this one should get a fallthrough comment, when none
> > of the other cases do.
> 
> 
> I believe it was either ignored then or checkpatch.pl did not warn about
> this at the time.

Hm. Sounds like a bug in checkpatch.pl TBH.  Falling through after
executing code for one case definitely requires a comment IMO;
falling through from an empty label - i.e. where there's just a bunch
of different labels sharing the same code block - doesn't require one,
I feel.
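
To illustrate with some made-up capability names:

	switch (ext) {
	case KVM_CAP_FOO:		/* empty label: no comment needed */
	case KVM_CAP_BAR:
		r = 1;
		break;
	case KVM_CAP_BAZ:
		pr_debug("BAZ requested\n");
		/* fall through */	/* code ran above: comment needed */
	case KVM_CAP_QUUX:
		r = 1;
		break;
	default:
		r = 0;
		break;
	}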

> 
> 
> > 
> >> +	case KVM_CAP_SPAPR_TCE_VFIO:
> >>  	case KVM_CAP_PPC_RTAS:
> >>  	case KVM_CAP_PPC_FIXUP_HCALL:
> >>  	case KVM_CAP_PPC_ENABLE_HCALL:
> >> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> >> index d32f239eb471..2b7dc22265fe 100644
> >> --- a/virt/kvm/vfio.c
> >> +++ b/virt/kvm/vfio.c
> >> @@ -20,6 +20,10 @@
> >>  #include <linux/vfio.h>
> >>  #include "vfio.h"
> >>  
> >> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >> +#include <asm/kvm_ppc.h>
> >> +#endif
> >> +
> >>  struct kvm_vfio_group {
> >>  	struct list_head node;
> >>  	struct vfio_group *vfio_group;
> >> @@ -211,6 +215,9 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >>  
> >>  		mutex_unlock(&kv->lock);
> >>  
> >> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >> +		kvm_spapr_tce_release_iommu_group(dev->kvm, vfio_group);
> >> +#endif
> >>  		kvm_vfio_group_set_kvm(vfio_group, NULL);
> >>  
> >>  		kvm_vfio_group_put_external_user(vfio_group);
> >> @@ -218,6 +225,53 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >>  		kvm_vfio_update_coherency(dev);
> >>  
> >>  		return ret;
> >> +
> >> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
> >> +		struct kvm_vfio_spapr_tce param;
> >> +		unsigned long minsz;
> >> +		struct kvm_vfio *kv = dev->private;
> >> +		struct vfio_group *vfio_group;
> >> +		struct kvm_vfio_group *kvg;
> >> +		struct fd f;
> >> +
> >> +		minsz = offsetofend(struct kvm_vfio_spapr_tce, tablefd);
> >> +
> >> +		if (copy_from_user(&param, (void __user *)arg, minsz))
> >> +			return -EFAULT;
> >> +
> >> +		if (param.argsz < minsz || param.flags)
> >> +			return -EINVAL;
> >> +
> >> +		f = fdget(param.groupfd);
> >> +		if (!f.file)
> >> +			return -EBADF;
> >> +
> >> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
> >> +		fdput(f);
> >> +
> >> +		if (IS_ERR(vfio_group))
> >> +			return PTR_ERR(vfio_group);
> >> +
> > 
> > 
> > Is there any particular reason you unwrap the group fd here, but the
> > table fd inside kvm__spapr_tce_attach_iommu_group()?
> 
> No particular reason, just an intention not to spread too much spapr to KVM
> VFIO device and vfio_group to POWER KVM.
>
> I only unwrap table_fd to see if it is in the kvm->arch.spapr_tce_tables
> list, I am trying to keep spapr_tce_tables and kvmppc_spapr_tce_iommu_table
> local to arch/powerpc/kvm/book3s_64_vio*.c
> 
> Unwrapping groupfd in arch/powerpc/kvm/book3s_64_vio*.c would mean
> duplicating all kvm_vfio_group_get_external_user()/etc stubs in
> arch/powerpc/kvm/book3s_64_vio.c, I did not want to duplicate these stubs.
> I could but since I already have vfio_group unwrapped here, it seems
> pointless to unwrap it over again in arch/powerpc/kvm/book3s_64_vio.c,
> should I?

Ok, that seems like an adequate reason to do it this way.

> 
> 
> 
> > 
> >> +		ret = -ENOENT;
> >> +
> >> +		mutex_lock(&kv->lock);
> >> +
> >> +		list_for_each_entry(kvg, &kv->group_list, node) {
> >> +			if (kvg->vfio_group != vfio_group)
> >> +				continue;
> >> +
> >> +			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
> >> +					param.tablefd, vfio_group);
> >> +
> >> +			break;
> >> +		}
> >> +
> >> +		mutex_unlock(&kv->lock);
> >> +
> >> +		return ret;
> >> +	}
> >> +#endif /* CONFIG_SPAPR_TCE_IOMMU */
> >>  	}
> >>  
> >>  	return -ENXIO;
> >> @@ -242,6 +296,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
> >>  		switch (attr->attr) {
> >>  		case KVM_DEV_VFIO_GROUP_ADD:
> >>  		case KVM_DEV_VFIO_GROUP_DEL:
> >> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >> +		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
> >> +#endif
> >>  			return 0;
> >>  		}
> >>  
> >> @@ -257,6 +314,9 @@ static void kvm_vfio_destroy(struct kvm_device *dev)
> >>  	struct kvm_vfio_group *kvg, *tmp;
> >>  
> >>  	list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
> >> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >> +		kvm_spapr_tce_release_iommu_group(dev->kvm, kvg->vfio_group);
> >> +#endif
> >>  		kvm_vfio_group_set_kvm(kvg->vfio_group, NULL);
> >>  		kvm_vfio_group_put_external_user(kvg->vfio_group);
> >>  		list_del(&kvg->node);
> > 
> 
> 




-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH kernel v4 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
@ 2017-02-10  4:02         ` David Gibson
  0 siblings, 0 replies; 49+ messages in thread
From: David Gibson @ 2017-02-10  4:02 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Paul Mackerras, Alex Williamson, linuxppc-dev, kvm, kvm-ppc

[-- Attachment #1: Type: text/plain, Size: 41765 bytes --]

On Fri, Feb 10, 2017 at 01:50:31PM +1100, Alexey Kardashevskiy wrote:
> On 09/02/17 17:41, David Gibson wrote:
> > On Tue, Feb 07, 2017 at 06:17:11PM +1100, Alexey Kardashevskiy wrote:
> >> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> >> and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
> >> without passing them to user space which saves time on switching
> >> to user space and back.
> >>
> >> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> >> KVM tries to handle a TCE request in the real mode, if failed
> >> it passes the request to the virtual mode to complete the operation.
> >> If it a virtual mode handler fails, the request is passed to
> >> the user space; this is not expected to happen though.
> >>
> >> To avoid dealing with page use counters (which is tricky in real mode),
> >> this only accelerates SPAPR TCE IOMMU v2 clients which are required
> >> to pre-register the userspace memory. The very first TCE request will
> >> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> >> of the TCE table (iommu_table::it_userspace) is not allocated till
> >> the very first mapping happens and we cannot call vmalloc in real mode.
> >>
> >> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> >> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> >> and associates a physical IOMMU table with the SPAPR TCE table (which
> >> is a guest view of the hardware IOMMU table). The iommu_table object
> >> is cached and referenced so we do not have to look up for it in real mode.
> >>
> >> This does not implement the UNSET counterpart as there is no use for it -
> >> once the acceleration is enabled, the existing userspace won't
> >> disable it unless a VFIO container is destroyed; this adds necessary
> >> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> >>
> >> As this creates a descriptor per IOMMU table-LIOBN couple (called
> >> kvmppc_spapr_tce_iommu_table), it is possible to have several
> >> descriptors with the same iommu_table (hardware IOMMU table) attached
> >> to the same LIOBN; we do not remove duplicates though as
> >> iommu_table_ops::exchange does not just update a TCE entry (which is
> >> shared among IOMMU groups) but also invalidates the TCE cache
> >> (one per IOMMU group).
> >>
> >> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> >> space.
> >>
> >> This finally makes use of vfio_external_user_iommu_id() which was
> >> introduced quite some time ago and was considered for removal.
> >>
> >> Tests show that this patch increases transmission speed from 220MB/s
> >> to 750..1020MB/s on 10Gb network (Chelsio CXGB3 10Gb ethernet card).
> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> ---
> >> Changes:
> >> v4:
> >> * added note to the commit log about allowing multiple updates of
> >> the same IOMMU table;
> >> * instead of checking for if any memory was preregistered, this
> >> returns H_TOO_HARD if a specific page was not;
> >> * fixed comments from v3 about error handling in many places;
> >> * simplified TCE handlers and merged IOMMU parts inline - for example,
> >> there used to be kvmppc_h_put_tce_iommu(), now it is merged into
> >> kvmppc_h_put_tce(); this allows to check IOBA boundaries against
> >> the first attached table only (makes the code simpler);
> >>
> >> v3:
> >> * simplified not to use VFIO group notifiers
> >> * reworked cleanup, should be cleaner/simpler now
> >>
> >> v2:
> >> * reworked to use new VFIO notifiers
> >> * now same iommu_table may appear in the list several times, to be fixed later
> >> ---
> >>
> >> This has separate copies of handlers for real and virtual modes as
> >> in fact H_PUT_TCE and H_STUFF_TCE could share a lot (common helpers
> >> would take a "realmode" flag) but H_PUT_TCE_INDIRECT uses get_user()
> >> in virtual mode and direct access in real mode and having a common
> >> helper for it would make things uglier imho.
> >>
> >>
> >> ---
> >>  Documentation/virtual/kvm/devices/vfio.txt |  22 +-
> >>  arch/powerpc/include/asm/kvm_host.h        |   8 +
> >>  arch/powerpc/include/asm/kvm_ppc.h         |   4 +
> >>  include/uapi/linux/kvm.h                   |   8 +
> >>  arch/powerpc/kvm/book3s_64_vio.c           | 319 ++++++++++++++++++++++++++++-
> >>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 172 +++++++++++++++-
> >>  arch/powerpc/kvm/powerpc.c                 |   2 +
> >>  virt/kvm/vfio.c                            |  60 ++++++
> >>  8 files changed, 590 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> >> index ef51740c67ca..f95d867168ea 100644
> >> --- a/Documentation/virtual/kvm/devices/vfio.txt
> >> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> >> @@ -16,7 +16,25 @@ Groups:
> >>  
> >>  KVM_DEV_VFIO_GROUP attributes:
> >>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> >> +	kvm_device_attr.addr points to an int32_t file descriptor
> >> +	for the VFIO group.
> >>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
> >> +	kvm_device_attr.addr points to an int32_t file descriptor
> >> +	for the VFIO group.
> >> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
> >> +	allocated by sPAPR KVM.
> >> +	kvm_device_attr.addr points to a struct:
> >>  
> >> -For each, kvm_device_attr.addr points to an int32_t file descriptor
> >> -for the VFIO group.
> >> +	struct kvm_vfio_spapr_tce {
> >> +		__u32	argsz;
> >> +		__u32	flags;
> >> +		__s32	groupfd;
> >> +		__s32	tablefd;
> >> +	};
> >> +
> >> +	where
> >> +	@argsz is the size of kvm_vfio_spapr_tce_liobn;
> >> +	@flags are not supported now, must be zero;
> >> +	@groupfd is a file descriptor for a VFIO group;
> >> +	@tablefd is a file descriptor for a TCE table allocated via
> >> +		KVM_CREATE_SPAPR_TCE.
> >> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> >> index e59b172666cd..a827006941f8 100644
> >> --- a/arch/powerpc/include/asm/kvm_host.h
> >> +++ b/arch/powerpc/include/asm/kvm_host.h
> >> @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
> >>  	atomic_t refcnt;
> >>  };
> >>  
> >> +struct kvmppc_spapr_tce_iommu_table {
> >> +	struct rcu_head rcu;
> >> +	struct list_head next;
> >> +	struct vfio_group *group;
> >> +	struct iommu_table *tbl;
> >> +};
> >> +
> >>  struct kvmppc_spapr_tce_table {
> >>  	struct list_head list;
> >>  	struct kvm *kvm;
> >> @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
> >>  	u32 page_shift;
> >>  	u64 offset;		/* in pages */
> >>  	u64 size;		/* window size in pages */
> >> +	struct list_head iommu_tables;
> >>  	struct page *pages[0];
> >>  };
> >>  
> >> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> >> index 37bc9e7e90ba..da1410bd6b36 100644
> >> --- a/arch/powerpc/include/asm/kvm_ppc.h
> >> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> >> @@ -163,6 +163,10 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
> >>  extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
> >>  			struct kvm_memory_slot *memslot, unsigned long porder);
> >>  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
> >> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
> >> +		struct vfio_group *group);
> >> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
> >> +		struct vfio_group *group);
> >>  
> >>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> >>  				struct kvm_create_spapr_tce_64 *args);
> >> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >> index a2c9bb5a0ead..cdfa01169bd2 100644
> >> --- a/include/uapi/linux/kvm.h
> >> +++ b/include/uapi/linux/kvm.h
> >> @@ -1076,6 +1076,7 @@ struct kvm_device_attr {
> >>  #define  KVM_DEV_VFIO_GROUP			1
> >>  #define   KVM_DEV_VFIO_GROUP_ADD			1
> >>  #define   KVM_DEV_VFIO_GROUP_DEL			2
> >> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
> >>  
> >>  enum kvm_device_type {
> >>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
> >> @@ -1097,6 +1098,13 @@ enum kvm_device_type {
> >>  	KVM_DEV_TYPE_MAX,
> >>  };
> >>  
> >> +struct kvm_vfio_spapr_tce {
> >> +	__u32	argsz;
> >> +	__u32	flags;
> >> +	__s32	groupfd;
> >> +	__s32	tablefd;
> >> +};
> >> +
> >>  /*
> >>   * ioctls for VM fds
> >>   */
> >> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> >> index 9a7b7fca5e84..cb0469151e35 100644
> >> --- a/arch/powerpc/kvm/book3s_64_vio.c
> >> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> >> @@ -27,6 +27,10 @@
> >>  #include <linux/hugetlb.h>
> >>  #include <linux/list.h>
> >>  #include <linux/anon_inodes.h>
> >> +#include <linux/iommu.h>
> >> +#include <linux/file.h>
> >> +#include <linux/vfio.h>
> >> +#include <linux/module.h>
> >>  
> >>  #include <asm/tlbflush.h>
> >>  #include <asm/kvm_ppc.h>
> >> @@ -39,6 +43,36 @@
> >>  #include <asm/udbg.h>
> >>  #include <asm/iommu.h>
> >>  #include <asm/tce.h>
> >> +#include <asm/mmu_context.h>
> >> +
> >> +static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
> >> +{
> >> +	void (*fn)(struct vfio_group *);
> >> +
> >> +	fn = symbol_get(vfio_group_put_external_user);
> >> +	if (WARN_ON(!fn))
> >> +		return;
> >> +
> >> +	fn(vfio_group);
> >> +
> >> +	symbol_put(vfio_group_put_external_user);
> >> +}
> >> +
> >> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> >> +{
> >> +	int (*fn)(struct vfio_group *);
> >> +	int ret = -1;
> >> +
> >> +	fn = symbol_get(vfio_external_user_iommu_id);
> >> +	if (!fn)
> >> +		return ret;
> >> +
> >> +	ret = fn(vfio_group);
> >> +
> >> +	symbol_put(vfio_external_user_iommu_id);
> >> +
> >> +	return ret;
> >> +}
> >>  
> >>  static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
> >>  {
> >> @@ -90,6 +124,123 @@ static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
> >>  	return ret;
> >>  }
> >>  
> >> +static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
> >> +{
> >> +	struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
> >> +			struct kvmppc_spapr_tce_iommu_table, rcu);
> >> +
> >> +	iommu_table_put(stit->tbl);
> >> +	kvm_vfio_group_put_external_user(stit->group);
> >> +
> >> +	kfree(stit);
> >> +}
> >> +
> >> +static void kvm_spapr_tce_liobn_release_iommu_group(
> >> +		struct kvmppc_spapr_tce_table *stt,
> >> +		struct vfio_group *group)
> >> +{
> >> +	struct kvmppc_spapr_tce_iommu_table *stit, *tmp;
> >> +
> >> +	list_for_each_entry_safe(stit, tmp, &stt->iommu_tables, next) {
> >> +		if (group && (stit->group != group))
> >> +			continue;
> >> +
> >> +		list_del_rcu(&stit->next);
> >> +
> >> +		call_rcu(&stit->rcu, kvm_spapr_tce_iommu_table_free);
> >> +	}
> >> +}
> >> +
> >> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
> >> +		struct vfio_group *group)
> >> +{
> >> +	struct kvmppc_spapr_tce_table *stt;
> >> +
> >> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list)
> >> +		kvm_spapr_tce_liobn_release_iommu_group(stt, group);
> >> +}
> >> +
> >> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
> >> +		struct vfio_group *group)
> >> +{
> >> +	struct kvmppc_spapr_tce_table *stt = NULL;
> >> +	bool found = false;
> >> +	struct iommu_table *tbl = NULL;
> >> +	struct iommu_table_group *table_group;
> >> +	long i, ret = 0;
> >> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >> +	struct fd f;
> >> +	int group_id;
> >> +	struct iommu_group *grp;
> >> +
> >> +	group_id = kvm_vfio_external_user_iommu_id(group);
> >> +	grp = iommu_group_get_by_id(group_id);
> >> +	if (!grp)
> >> +		return -EFAULT;
> > 
> > EFAULT doesn't look right, that's usually means userspace has give us
> > a bad address.  What does failure to look up the iommu group by id
> > mean here?
> 
> 
> iommu_group_get_by_id() can fail -
> 1. if "something went very wrong" - as group ids are allocated when devices
> are discovered so they are pretty static;
> 2. there is some racy sriov disable or host pci hotunplug;

Ok, sounds like it should be a WARN_ON() plus.. hmm EIO, I guess?

> 3. kvm_vfio_external_user_iommu_id() returned invalid group id which means
> that a device was unbound from the vfio-pci driver but the caller holds a
> reference to vfio_group so this should not happen.

Ok this case you can distinguish with a check on the previous line.
So you can turn that into a WARN_ON() and EIO.
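
i.e. something along these lines (sketch only):

	group_id = kvm_vfio_external_user_iommu_id(group);
	if (WARN_ON(group_id < 0))
		return -EIO;	/* we hold a group reference, shouldn't happen */

	grp = iommu_group_get_by_id(group_id);
	if (WARN_ON(!grp))
		return -EIO;	/* e.g. racy SR-IOV disable / host unplug */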

> 
> 
> > 
> >> +
> >> +	f = fdget(tablefd);
> >> +	if (!f.file) {
> >> +		ret = -EBADF;
> >> +		goto put_exit;
> >> +	}
> >> +
> >> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
> >> +		if (stt == f.file->private_data) {
> >> +			found = true;
> >> +			break;
> >> +		}
> >> +	}
> >> +
> >> +	fdput(f);
> >> +
> >> +	if (!found) {
> >> +		ret = -ENODEV;
> > 
> > ENODEV doesn't look right either.  That generally means you're trying
> > to use a device or facility that doesn't exist.  This case just means
> > you've passed a file handle that either isn't a TCE table at all, or
> > os one associated with a different VM.  -EINVAL, I guess, overloaded
> > as it is.
> 
> Ok.
> 
> 
> 
> > 
> >> +		goto put_exit;
> > 
> > Don't you need to put the table fd as well as the iommu group which
> > you put in that exit path?
> 
> 
> It is put a few lines above.

Oh, yes, sorry.

> 
> 
> >> +	}
> >> +
> >> +	table_group = iommu_group_get_iommudata(grp);
> >> +	if (WARN_ON(!table_group)) {
> >> +		ret = -EFAULT;
> >> +		goto put_exit;
> > 
> > Again don't you need to put the table fd as well.
> 
> It is put a few lines above; I do not keep it open longer than needed.
> 
> 
> > 
> >> +	}
> >> +
> >> +	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
> >> +		struct iommu_table *tbltmp = table_group->tables[i];
> >> +
> >> +		if (!tbltmp)
> >> +			continue;
> >> +
> >> +		/*
> >> +		 * Make sure hardware table parameters are exactly the same;
> >> +		 * this is used in the TCE handlers where boundary checks
> >> +		 * use only the first attached table.
> >> +		 */
> >> +		if ((tbltmp->it_page_shift == stt->page_shift) &&
> >> +				(tbltmp->it_offset == stt->offset) &&
> >> +				(tbltmp->it_size == stt->size)) {
> >> +			tbl = tbltmp;
> >> +			break;
> >> +		}
> >> +	}
> >> +	if (!tbl) {
> >> +		ret = -ENODEV;
> > 
> > Again, ENODEV doesn't seem right.  Here the problem is that the host
> > hardware constraints don't match the guest hardware constraints.
> > Hmm.  EIO?  ENOSPC?
> 
> 
> Neither is very appealing to me... EINVAL?
> When I use "ENODEV", I am thinking of "there is no device with
> expected/requested characteristics" but this is probably wrong.

Yeah, generally ENODEV means no device at all - for example if you
mknod a device file with bogus numbers then try to access it that's
what you'll get.

EINVAL is correct, I guess, though I try to avoid it if there's any
excuse to do so, since it's so common.  I'll grant ENOSPC is an odd
suggestion: my rationale is that ENOSPC in its usual sense clearly
doesn't apply here, so it's not ambiguous with that.  Then, it's
vaguely thematically appropriate - you can't find space in the host
mapping windows to accommodate the guest mapping windows.  Bit of a
stretch, maybe.

> >> +		goto put_exit;
> >> +	}
> >> +
> >> +	iommu_table_get(tbl);
> >> +
> >> +	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
> >> +	stit->tbl = tbl;
> >> +	stit->group = group;
> >> +
> >> +	list_add_rcu(&stit->next, &stt->iommu_tables);
> > 
> > So if you add the same group to the same liobn multiple times, you'll
> > get multiple identical entries in this list.
> > 
> > I guess that's mostly harmless... although.. does it allow the user to
> > force the allocation of arbitrary amounts of kernel memory in that
> > list?
> 
> 
> Oh. No, I'll add a check to avoid duplicates, they do not make sense here.
> 
> 
> > 
> >> +put_exit:
> >> +	iommu_group_put(grp);
> >> +
> >> +	return ret;
> >> +}
> >> +
> >>  static void release_spapr_tce_table(struct rcu_head *head)
> >>  {
> >>  	struct kvmppc_spapr_tce_table *stt = container_of(head,
> >> @@ -132,6 +283,8 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
> >>  
> >>  	list_del_rcu(&stt->list);
> >>  
> >> +	kvm_spapr_tce_liobn_release_iommu_group(stt, NULL /* release all */);
> >> +
> >>  	kvm_put_kvm(stt->kvm);
> >>  
> >>  	kvmppc_account_memlimit(
> >> @@ -181,6 +334,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> >>  	stt->offset = args->offset;
> >>  	stt->size = size;
> >>  	stt->kvm = kvm;
> >> +	INIT_LIST_HEAD_RCU(&stt->iommu_tables);
> >>  
> >>  	for (i = 0; i < npages; i++) {
> >>  		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
> >> @@ -209,11 +363,94 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> >>  	return ret;
> >>  }
> >>  
> >> +static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
> >> +		struct iommu_table *tbl, unsigned long entry)
> >> +{
> >> +	struct mm_iommu_table_group_mem_t *mem = NULL;
> >> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> >> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> >> +
> >> +	if (!pua)
> >> +		return H_HARDWARE;
> > 
> > What could trigger this error?  Should it be a WARN_ON?
> 
> Nothing should so yes, it can be WARN_ON.

Ok.

> 
> 
> > 
> >> +	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
> >> +	if (!mem)
> >> +		return H_TOO_HARD;
> >> +
> >> +	mm_iommu_mapped_dec(mem);
> >> +
> >> +	*pua = 0;
> >> +
> >> +	return H_SUCCESS;
> >> +}
> >> +
> >> +static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
> >> +		struct iommu_table *tbl, unsigned long entry)
> >> +{
> >> +	enum dma_data_direction dir = DMA_NONE;
> >> +	unsigned long hpa = 0;
> >> +	long ret;
> >> +
> >> +	if (iommu_tce_xchg(tbl, entry, &hpa, &dir))
> >> +		return H_HARDWARE;
> >> +
> >> +	if (dir == DMA_NONE)
> >> +		return H_SUCCESS;
> >> +
> >> +	ret = kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
> >> +	if (ret != H_SUCCESS)
> >> +		iommu_tce_xchg(tbl, entry, &hpa, &dir);
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
> >> +		unsigned long entry, unsigned long gpa,
> >> +		enum dma_data_direction dir)
> >> +{
> >> +	long ret;
> >> +	unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> >> +	struct mm_iommu_table_group_mem_t *mem;
> >> +
> >> +	if (!pua)
> >> +		/* it_userspace allocation might be delayed */
> >> +		return H_TOO_HARD;
> >> +
> >> +	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
> >> +		return H_PARAMETER;
> >> +
> >> +	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
> >> +	if (!mem)
> >> +		return H_TOO_HARD;
> >> +
> >> +	if (mm_iommu_ua_to_hpa(mem, ua, &hpa))
> >> +		return H_HARDWARE;
> > 
> > IIUC this would happen if qemu had failed to preregister all of guest
> > RAM, making this indeed an H_HARDWARE.
> 
> 
> If QEMU failed to preregister, then mm_iommu_lookup() fails and it is
> TOO_HARD. mm_iommu_ua_to_hpa() in this context cannot possibly fail (unless
> broken memory) as it only returns error when out of bounds but
> mm_iommu_lookup() ensures this.

Ah, ok so it should be a WARN_ON + H_HARDWARE.

> 
> 
> 
> > 
> >> +	if (mm_iommu_mapped_inc(mem))
> >> +		return H_HARDWARE;
> > 
> > I'm less clear on when this one would happen.
> 
> 
> This may happen when there is a race with mm_iommu_put().

Ah, so I guess H_CLOSED could make sense here?

> 
> 
> > 
> >> +
> >> +	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
> >> +	if (ret) {
> >> +		mm_iommu_mapped_dec(mem);
> >> +		return H_TOO_HARD;
> >> +	}
> >> +
> >> +	if (dir != DMA_NONE)
> >> +		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
> >> +
> >> +	*pua = ua;
> >> +
> >> +	return 0;
> >> +}
> >> +
> >>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> >>  		      unsigned long ioba, unsigned long tce)
> >>  {
> >>  	struct kvmppc_spapr_tce_table *stt;
> >> -	long ret;
> >> +	long ret, idx;
> >> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >> +	unsigned long entry, gpa;
> >> +	enum dma_data_direction dir;
> >>  
> >>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
> >>  	/* 	    liobn, ioba, tce); */
> >> @@ -230,6 +467,36 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> >>  	if (ret != H_SUCCESS)
> >>  		return ret;
> >>  
> >> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> >> +			struct kvmppc_spapr_tce_iommu_table, next);
> >> +	if (stit) {
> >> +		entry = ioba >> stit->tbl->it_page_shift;
> >> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >> +		dir = iommu_tce_direction(tce);
> >> +
> >> +		if (dir == DMA_NONE) {
> >> +			if (iommu_tce_clear_param_check(stit->tbl, ioba, 0, 1))
> >> +				return H_PARAMETER;
> >> +		} else {
> >> +			if (iommu_tce_put_param_check(stit->tbl, ioba, gpa))
> > 
> > Any way you could make these param check functions based on stt
> > instead of stit->tbl?  That would let you do them before checking if
> > there are any hw tables to update, avoiding the somewhat awkward
> > 	if (at least one)
> > 		for (each one)
> > construct.
> 
> I could:
> 1. change iommu_tce_put_param_check() to take shift, offset, size and drop
> use of IOMMU_PAGE_MASK(tbl) (and change all callers in vfio_iommu_spapr_tce.c);
> 2. make a copy of iommu_tce_put_param_check() which would take stt.

I'd suggest doing (1) but giving the full version a new name, then
define both a tbl and stt version as trivial wrappers on that.  Makes
this a bit neater without having to change all the non-KVM related callers.

> And yet this code does operate with tbl anyway, awkward either way imho...
> 
> 
> 
> > 
> >> +				return H_PARAMETER;
> >> +		}
> >> +
> >> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> >> +			if (dir == DMA_NONE) {
> >> +				ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
> >> +						stit->tbl, entry);
> >> +			} else {
> >> +				idx = srcu_read_lock(&vcpu->kvm->srcu);
> >> +				ret = kvmppc_tce_iommu_map(vcpu->kvm, stit->tbl,
> >> +						entry, gpa, dir);
> >> +				srcu_read_unlock(&vcpu->kvm->srcu, idx);
> >> +			}
> >> +			if (ret != H_SUCCESS)
> >> +				return ret;
> > 
> > Doesn't this error path need to clean up for the case where you
> > managed to update some backing TCE tables, but then failed later ones?
> 
> Probably.
> 
> This is what I asked in:
> Re: [PATCH kernel v4 08/10] KVM: PPC: Separate TCE validation from update
> 
> Failure to update a hardware TCE table means we are in deep trouble, I
> cannot think of any valid reason how we could get this far and not fail
> before but fail now.

Ok, I've made some suggestions about that in reply to that patch.
> 
> 
> > 
> >> +		}
> >> +	}
> >> +
> >>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
> >>  
> >>  	return H_SUCCESS;
> >> @@ -242,9 +509,10 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>  {
> >>  	struct kvmppc_spapr_tce_table *stt;
> >>  	long i, ret = H_SUCCESS, idx;
> >> -	unsigned long entry, ua = 0;
> >> +	unsigned long entry, gpa, ua = 0;
> >>  	u64 __user *tces;
> >>  	u64 tce;
> >> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >>  
> >>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >>  	if (!stt)
> >> @@ -272,6 +540,9 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>  	}
> >>  	tces = (u64 __user *) ua;
> >>  
> >> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> >> +			struct kvmppc_spapr_tce_iommu_table, next);
> >> +
> >>  	for (i = 0; i < npages; ++i) {
> >>  		if (get_user(tce, tces + i)) {
> >>  			ret = H_TOO_HARD;
> >> @@ -282,6 +553,15 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>  		ret = kvmppc_tce_validate(stt, tce);
> >>  		if (ret != H_SUCCESS)
> >>  			goto unlock_exit;
> >> +
> >> +		if (stit) {
> >> +			gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >> +			ret = iommu_tce_put_param_check(stit->tbl,
> >> +					ioba + (i << stit->tbl->it_page_shift),
> >> +					gpa);
> >> +			if (ret != H_SUCCESS)
> >> +				goto unlock_exit;
> >> +		}
> >>  	}
> >>  
> >>  	for (i = 0; i < npages; ++i) {
> >> @@ -291,6 +571,21 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>  		}
> >>  		tce = be64_to_cpu(tce);
> >>  
> >> +		if (stit) {
> >> +			for (i = 0; i < npages; ++i) {
> >> +				gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >> +
> >> +				list_for_each_entry_lockless(stit,
> >> +						&stt->iommu_tables, next) {
> >> +					ret = kvmppc_tce_iommu_map(vcpu->kvm,
> >> +						stit->tbl, entry + i, gpa,
> >> +						iommu_tce_direction(tce));
> >> +					if (ret != H_SUCCESS)
> >> +						goto unlock_exit;
> >> +				}
> > 
> > Um.. what value will this for_each leave in stit after completion?  I
> > suspect it will be something bogus, which means re-using stit in the
> > next 0..npages loop iteration won't be safe (you only initialize stit
> > with the first entry outside that loop).
> 
> 
> #define list_for_each_entry_lockless(pos, head, member) \
>   for (pos = list_entry_lockless((head)->next, typeof(*pos), member); \
>      &pos->member != (head); \
>      pos = list_entry_lockless(pos->member.next, typeof(*pos), member))
> 
> stit is "pos" which is reset every time the loop is called.

Um.. I'm not concerned about the access to stit within the
list_for_each().  It's the 'if (stit)' a few lines above I'm worried
about.

On the first iteration of the *outer* loop (for i=0..npages) stit has
been set correctly to list_first_entry_or_null().  But on subsequent
iterations of that outer loop, it has whatever value it has after the
completion of the list_for_each() in the previous iteration of the
outer loop.  I don't think it's wise to rely on what that value will
be.

Simplest fix would be to introduce a stit2 as the counter for the
inner loop.

> 
> 
> > 
> >> +			}
> >> +		}
> >> +
> >>  		kvmppc_tce_put(stt, entry + i, tce);
> >>  	}
> >>  
> >> @@ -307,6 +602,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
> >>  {
> >>  	struct kvmppc_spapr_tce_table *stt;
> >>  	long i, ret;
> >> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >>  
> >>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >>  	if (!stt)
> >> @@ -320,6 +616,25 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
> >>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
> >>  		return H_PARAMETER;
> >>  
> >> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> >> +			struct kvmppc_spapr_tce_iommu_table, next);
> >> +	if (stit) {
> >> +		if (iommu_tce_clear_param_check(stit->tbl, ioba,
> >> +					tce_value, npages))
> >> +			return H_PARAMETER;
> >> +
> >> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> >> +			unsigned long entry = ioba >> stit->tbl->it_page_shift;
> >> +
> >> +			for (i = 0; i < npages; ++i) {
> >> +				ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
> >> +						stit->tbl, entry + i);
> >> +				if (ret)
> >> +					return ret;
> > 
> > Again do you need some sort of cleanup for partial completion?
> 
> Again,
> Re: [PATCH kernel v4 08/10] KVM: PPC: Separate TCE validation from update
> 
> This is an unexpected failure which should not happen, what kind of cleanup
> it would make sense to do here? Re-map what was mapped before H_STUFF_TCE
> was called?

Ok.  Documenting the fact that it's a "can't happen" case is one of
the reasons I like to see WARN_ON()s in those cases.

> 
> > 
> > 
> >> +			}
> >> +		}
> >> +	}
> >> +
> >>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
> >>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
> >>  
> >> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> >> index dc1c66fda941..018c7d94a575 100644
> >> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> >> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> >> @@ -178,11 +178,104 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
> >>  EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
> >>  
> >>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> >> +static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm *kvm,
> >> +		struct iommu_table *tbl, unsigned long entry)
> >> +{
> >> +	struct mm_iommu_table_group_mem_t *mem = NULL;
> >> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> >> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> >> +
> >> +	if (!pua)
> >> +		return H_HARDWARE;
> >> +
> >> +	pua = (void *) vmalloc_to_phys(pua);
> >> +	if (!pua)
> >> +		return H_TOO_HARD;
> >> +
> >> +	mem = mm_iommu_lookup_rm(kvm->mm, *pua, pgsize);
> >> +	if (!mem)
> >> +		return H_TOO_HARD;
> >> +
> >> +	mm_iommu_mapped_dec(mem);
> >> +
> >> +	*pua = 0;
> >> +
> >> +	return H_SUCCESS;
> >> +}
> >> +
> >> +static long kvmppc_rm_tce_iommu_unmap(struct kvm *kvm,
> >> +		struct iommu_table *tbl, unsigned long entry)
> >> +{
> >> +	enum dma_data_direction dir = DMA_NONE;
> >> +	unsigned long hpa = 0;
> >> +	long ret;
> >> +
> >> +	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
> >> +		return H_HARDWARE;
> >> +
> >> +	if (dir == DMA_NONE)
> >> +		return H_SUCCESS;
> >> +
> >> +	ret = kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
> >> +	if (ret)
> >> +		iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +long kvmppc_rm_tce_iommu_map(struct kvm_vcpu *vcpu, struct iommu_table *tbl,
> >> +		unsigned long entry, unsigned long gpa,
> >> +		enum dma_data_direction dir)
> >> +{
> >> +	long ret;
> >> +	unsigned long hpa = 0, ua;
> >> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> >> +	struct mm_iommu_table_group_mem_t *mem;
> >> +
> >> +	if (!pua)
> >> +		/* it_userspace allocation might be delayed */
> >> +		return H_TOO_HARD;
> >> +
> >> +	if (kvmppc_gpa_to_ua(vcpu->kvm, gpa, &ua, NULL))
> >> +		return H_PARAMETER;
> >> +
> >> +	mem = mm_iommu_lookup_rm(vcpu->kvm->mm, ua, 1ULL << tbl->it_page_shift);
> >> +	if (!mem)
> >> +		return H_TOO_HARD;
> >> +
> >> +	if (mm_iommu_ua_to_hpa_rm(mem, ua, &hpa))
> >> +		return H_HARDWARE;
> >> +
> >> +	pua = (void *) vmalloc_to_phys(pua);
> >> +	if (!pua)
> >> +		return H_HARDWARE;
> > 
> > What circumstances can this fail under?  Does it need to be H_TOO_HARD instead?
> 
> 
> When kernel memory gets corrupted and vmalloc_to_page() won't be able to
> find a page which was allocated with vmalloc.

Ok, so again there should be a WARN_ON().

> 
> 
> >> +
> >> +	if (mm_iommu_mapped_inc(mem))
> >> +		return H_HARDWARE;
> >> +
> >> +	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> >> +	if (ret) {
> >> +		mm_iommu_mapped_dec(mem);
> >> +		return H_TOO_HARD;
> >> +	}
> >> +
> >> +	if (dir != DMA_NONE)
> >> +		kvmppc_rm_tce_iommu_mapped_dec(vcpu->kvm, tbl, entry);
> >> +
> >> +	*pua = ua;
> >> +
> >> +	return 0;
> >> +}
> >> +EXPORT_SYMBOL_GPL(kvmppc_rm_tce_iommu_map);
> >> +
> >>  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> >>  		unsigned long ioba, unsigned long tce)
> >>  {
> >>  	struct kvmppc_spapr_tce_table *stt;
> >>  	long ret;
> >> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >> +	unsigned long entry, gpa;
> >> +	enum dma_data_direction dir;
> >>  
> >>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
> >>  	/* 	    liobn, ioba, tce); */
> >> @@ -199,6 +292,33 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> >>  	if (ret != H_SUCCESS)
> >>  		return ret;
> >>  
> >> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> >> +			struct kvmppc_spapr_tce_iommu_table, next);
> >> +	if (stit) {
> >> +		entry = ioba >> stit->tbl->it_page_shift;
> >> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >> +		dir = iommu_tce_direction(tce);
> >> +
> >> +		if (dir == DMA_NONE) {
> >> +			if (iommu_tce_clear_param_check(stit->tbl, ioba, 0, 1))
> >> +				return H_PARAMETER;
> >> +		} else {
> >> +			if (iommu_tce_put_param_check(stit->tbl, ioba, gpa))
> >> +				return H_PARAMETER;
> >> +		}
> >> +
> >> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> >> +			if (dir == DMA_NONE)
> >> +				ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
> >> +						stit->tbl, entry);
> >> +			else
> >> +				ret = kvmppc_rm_tce_iommu_map(vcpu, stit->tbl,
> >> +						entry, gpa, dir);
> >> +			if (ret != H_SUCCESS)
> >> +				return ret;
> >> +		}
> >> +	}
> >> +
> >>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
> >>  
> >>  	return H_SUCCESS;
> >> @@ -237,9 +357,10 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>  {
> >>  	struct kvmppc_spapr_tce_table *stt;
> >>  	long i, ret = H_SUCCESS;
> >> -	unsigned long tces, entry, tce, ua = 0;
> >> +	unsigned long tces, entry, gpa, tce, ua = 0;
> >>  	unsigned long *rmap = NULL;
> >>  	bool prereg = false;
> >> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >>  
> >>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >>  	if (!stt)
> >> @@ -303,17 +424,45 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>  		}
> >>  	}
> >>  
> >> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> >> +			struct kvmppc_spapr_tce_iommu_table, next);
> >> +
> >>  	for (i = 0; i < npages; ++i) {
> >>  		tce = be64_to_cpu(((u64 *)tces)[i]);
> >>  
> >>  		ret = kvmppc_tce_validate(stt, tce);
> >>  		if (ret != H_SUCCESS)
> >>  			goto unlock_exit;
> >> +
> >> +		if (stit) {
> >> +			gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >> +			ret = iommu_tce_put_param_check(stit->tbl,
> >> +					ioba + (i << stit->tbl->it_page_shift),
> >> +					gpa);
> >> +			if (ret != H_SUCCESS)
> >> +				goto unlock_exit;
> >> +
> >> +		}
> >>  	}
> >>  
> >>  	for (i = 0; i < npages; ++i) {
> >>  		tce = be64_to_cpu(((u64 *)tces)[i]);
> > 
> > As noted in the earlier patch this is really dangerous - by reloading
> > the tce from userspace you've thrown away the verification above.
> 
> 
> Sure, I am adding a tces cache to kvm_vcpu.

> 
> 
> >> +		if (stit) {
> >> +			for (i = 0; i < npages; ++i) {
> >> +				gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >> +
> >> +				list_for_each_entry_lockless(stit,
> >> +						&stt->iommu_tables, next) {
> >> +					ret = kvmppc_rm_tce_iommu_map(vcpu,
> >> +						stit->tbl, entry + i, gpa,
> >> +						iommu_tce_direction(tce));
> >> +					if (ret != H_SUCCESS)
> >> +						goto unlock_exit;
> >> +				}
> >> +			}
> >> +		}
> >> +
> >>  		kvmppc_tce_put(stt, entry + i, tce);
> >>  	}
> >>  
> >> @@ -330,6 +479,8 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
> >>  {
> >>  	struct kvmppc_spapr_tce_table *stt;
> >>  	long i, ret;
> >> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >> +
> >>  
> >>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >>  	if (!stt)
> >> @@ -343,6 +494,25 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
> >>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
> >>  		return H_PARAMETER;
> >>  
> >> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> >> +			struct kvmppc_spapr_tce_iommu_table, next);
> >> +	if (stit) {
> >> +		if (iommu_tce_clear_param_check(stit->tbl, ioba,
> >> +					tce_value, npages))
> >> +			return H_PARAMETER;
> >> +
> >> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> >> +			unsigned long entry = ioba >> stit->tbl->it_page_shift;
> >> +
> >> +			for (i = 0; i < npages; ++i) {
> >> +				ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
> >> +						stit->tbl, entry + i);
> >> +				if (ret)
> >> +					return ret;
> >> +			}
> >> +		}
> >> +	}
> >> +
> >>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
> >>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
> >>  
> >> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> >> index cd892dec7cb6..f3127dc87912 100644
> >> --- a/arch/powerpc/kvm/powerpc.c
> >> +++ b/arch/powerpc/kvm/powerpc.c
> >> @@ -536,6 +536,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> >>  #ifdef CONFIG_PPC_BOOK3S_64
> >>  	case KVM_CAP_SPAPR_TCE:
> >>  	case KVM_CAP_SPAPR_TCE_64:
> >> +		/* fallthrough */
> > 
> > I'm not sure why this one should get a fallthrough comment, when none
> > of the other cases do.
> 
> 
> I believe it was either ignored then or checkpatch.pl did not warn about
> this at the time.

Hm. Sounds like a bug in checkpatch.pl TBH.  Falling through after
executing code for one case definitely requires a comment IMO;
falling through from an empty label - i.e. where there's just a bunch of
different labels sharing the same code block - doesn't require one, I feel.
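
As a hedged illustration (not from this patchset; the CAP_* numbers and
do_something() below are made up): falling through after executing code needs
the comment, while the stacked empty labels sharing one block do not.

	/* made-up capability numbers and helper, purely for illustration */
	#define CAP_FOO	1
	#define CAP_BAR	2
	#define CAP_BAZ	3
	static void do_something(void) { }

	static int check_ext(long ext)
	{
		int r;

		switch (ext) {
		case CAP_FOO:
			do_something();
			/* fallthrough */	/* code ran above, so a comment is needed */
		case CAP_BAR:			/* empty labels stacked onto the */
		case CAP_BAZ:			/* same block need no comment */
			r = 1;
			break;
		default:
			r = 0;
			break;
		}
		return r;
	}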

> 
> 
> > 
> >> +	case KVM_CAP_SPAPR_TCE_VFIO:
> >>  	case KVM_CAP_PPC_RTAS:
> >>  	case KVM_CAP_PPC_FIXUP_HCALL:
> >>  	case KVM_CAP_PPC_ENABLE_HCALL:
> >> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> >> index d32f239eb471..2b7dc22265fe 100644
> >> --- a/virt/kvm/vfio.c
> >> +++ b/virt/kvm/vfio.c
> >> @@ -20,6 +20,10 @@
> >>  #include <linux/vfio.h>
> >>  #include "vfio.h"
> >>  
> >> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >> +#include <asm/kvm_ppc.h>
> >> +#endif
> >> +
> >>  struct kvm_vfio_group {
> >>  	struct list_head node;
> >>  	struct vfio_group *vfio_group;
> >> @@ -211,6 +215,9 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >>  
> >>  		mutex_unlock(&kv->lock);
> >>  
> >> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >> +		kvm_spapr_tce_release_iommu_group(dev->kvm, vfio_group);
> >> +#endif
> >>  		kvm_vfio_group_set_kvm(vfio_group, NULL);
> >>  
> >>  		kvm_vfio_group_put_external_user(vfio_group);
> >> @@ -218,6 +225,53 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >>  		kvm_vfio_update_coherency(dev);
> >>  
> >>  		return ret;
> >> +
> >> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
> >> +		struct kvm_vfio_spapr_tce param;
> >> +		unsigned long minsz;
> >> +		struct kvm_vfio *kv = dev->private;
> >> +		struct vfio_group *vfio_group;
> >> +		struct kvm_vfio_group *kvg;
> >> +		struct fd f;
> >> +
> >> +		minsz = offsetofend(struct kvm_vfio_spapr_tce, tablefd);
> >> +
> >> +		if (copy_from_user(&param, (void __user *)arg, minsz))
> >> +			return -EFAULT;
> >> +
> >> +		if (param.argsz < minsz || param.flags)
> >> +			return -EINVAL;
> >> +
> >> +		f = fdget(param.groupfd);
> >> +		if (!f.file)
> >> +			return -EBADF;
> >> +
> >> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
> >> +		fdput(f);
> >> +
> >> +		if (IS_ERR(vfio_group))
> >> +			return PTR_ERR(vfio_group);
> >> +
> > 
> > 
> > Is there any particular reason you unwrap the group fd here, but the
> > table fd inside kvm__spapr_tce_attach_iommu_group()?
> 
> No particular reason, just an intention not to spread too much spapr to KVM
> VFIO device and vfio_group to POWER KVM.
>
> I only unwrap table_fd to see if it is in the kvm->arch.spapr_tce_tables
> list; I am trying to keep spapr_tce_tables and kvmppc_spapr_tce_iommu_table
> local to arch/powerpc/kvm/book3s_64_vio*.c.
> 
> Unwrapping groupfd in arch/powerpc/kvm/book3s_64_vio*.c would mean
> duplicating all kvm_vfio_group_get_external_user()/etc stubs in
> arch/powerpc/kvm/book3s_64_vio.c, I did not want to duplicate these stubs.
> I could but since I already have vfio_group unwrapped here, it seems
> pointless to unwrap it over again in arch/powerpc/kvm/book3s_64_vio.c,
> should I?

Ok, that seems like an adequate reason to do it this way.

> 
> 
> 
> > 
> >> +		ret = -ENOENT;
> >> +
> >> +		mutex_lock(&kv->lock);
> >> +
> >> +		list_for_each_entry(kvg, &kv->group_list, node) {
> >> +			if (kvg->vfio_group != vfio_group)
> >> +				continue;
> >> +
> >> +			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
> >> +					param.tablefd, vfio_group);
> >> +
> >> +			break;
> >> +		}
> >> +
> >> +		mutex_unlock(&kv->lock);
> >> +
> >> +		return ret;
> >> +	}
> >> +#endif /* CONFIG_SPAPR_TCE_IOMMU */
> >>  	}
> >>  
> >>  	return -ENXIO;
> >> @@ -242,6 +296,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
> >>  		switch (attr->attr) {
> >>  		case KVM_DEV_VFIO_GROUP_ADD:
> >>  		case KVM_DEV_VFIO_GROUP_DEL:
> >> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >> +		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
> >> +#endif
> >>  			return 0;
> >>  		}
> >>  
> >> @@ -257,6 +314,9 @@ static void kvm_vfio_destroy(struct kvm_device *dev)
> >>  	struct kvm_vfio_group *kvg, *tmp;
> >>  
> >>  	list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
> >> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >> +		kvm_spapr_tce_release_iommu_group(dev->kvm, kvg->vfio_group);
> >> +#endif
> >>  		kvm_vfio_group_set_kvm(kvg->vfio_group, NULL);
> >>  		kvm_vfio_group_put_external_user(kvg->vfio_group);
> >>  		list_del(&kvg->node);
> > 
> 
> 




-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH kernel v4 08/10] KVM: PPC: Separate TCE validation from update
  2017-02-10  3:07         ` David Gibson
@ 2017-02-10  4:09           ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 49+ messages in thread
From: Alexey Kardashevskiy @ 2017-02-10  4:09 UTC (permalink / raw)
  To: David Gibson; +Cc: linuxppc-dev, Alex Williamson, Paul Mackerras, kvm-ppc, kvm


[-- Attachment #1.1: Type: text/plain, Size: 9053 bytes --]

On 10/02/17 14:07, David Gibson wrote:
> On Thu, Feb 09, 2017 at 07:20:11PM +1100, Alexey Kardashevskiy wrote:
>> On 09/02/17 14:51, David Gibson wrote:
>>> On Tue, Feb 07, 2017 at 06:17:09PM +1100, Alexey Kardashevskiy wrote:
>>>> For the emulated devices it does not matter much if we get a broken TCE
>>>> half way handling a TCE list but for VFIO it will matter as it has
>>>> more chances to fail so we try to do our best and check as much as we
>>>> can before proceeding.
>>>>
>>>> This separates a guest view table update from validation. No change in
>>>> behavior is expected.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> ---
>>>>  arch/powerpc/kvm/book3s_64_vio.c    | 8 ++++++++
>>>>  arch/powerpc/kvm/book3s_64_vio_hv.c | 8 ++++++--
>>>>  2 files changed, 14 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
>>>> index 15df8ae627d9..9a7b7fca5e84 100644
>>>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>>>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>>>> @@ -282,6 +282,14 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>>  		ret = kvmppc_tce_validate(stt, tce);
>>>>  		if (ret != H_SUCCESS)
>>>>  			goto unlock_exit;
>>>> +	}
>>>> +
>>>> +	for (i = 0; i < npages; ++i) {
>>>> +		if (get_user(tce, tces + i)) {
>>>> +			ret = H_TOO_HARD;
>>>> +			goto unlock_exit;
>>>> +		}
>>>> +		tce = be64_to_cpu(tce);
>>>
>>> This doesn't look safe.  The contents of user memory could change
>>> between the two get_user()s, meaning that you're no longer guaranteed
>>> a TCE loaded into kernel has been validated at all.
>>>
>>> I think you need to either:
>>>
>>>     a) Make sure things safe against a bad TCE being loaded into a TCE
>>>     table and move all validation to where the TCE is used, rather
>>>     than loaded
>>>
>>> or
>>>     b) Copy the whole set of indirect entries to a temporary in-kernel
>>>        buffer, then validate, then load into the actual TCE table.
>>
>>
>> Correct :( The problem is I do not know how far I want to go in reverting
>> the state as it was when I started handling H_PUT_TCE_INDIRECT.
>>
>> For example, 1 container, 2 IOMMU groups with disabled shared tables, so -
>> 2 tables, 512 TCEs request and TCE#100 does not translate to host physical
>> address.
>>
>>
>> To do a) I'll need to remember old content of each hardware table entry as
>> when I reach TCE#100, I'll need to revert to the initial state which means
>> I need to write back old TCEs to all affected hardware tables and update
>> reference counters of all affected preregistered areas. Well, the actual
>> tables must not have different addresses (BUG_ON? is it worth testing while
>> writing to hardware tables that values I am replacing are the same in all
>> tables?) so I can have just a single array of old TCEs from hardware tables
>> in vcpu.
> 
> I thought you said shared tables were disabled, so the two tables
> would have different addresses?

That would be 2 physically separated tables but the content would be the
same as long as they belong to the same VFIO container.


> 
> Hmm.  Now I'm trying to remember, will the gpa->hpa translation fail
> only if the guest/qemu does something wrong, or can it fail for other
> reasons? 

This should always just work.

> What about in real mode vs. virtual mode?

Real mode is no different in this matter.

Real mode is different from virtual mode in 3 aspects:

1. iommu_table_ops::exchange() vs. exchange_rm(), as real mode uses cache-inhibited
writes to invalidate the "TCE kill" cache;

2. list_for_each_entry_lockless() vs. list_for_each_entry_rcu(), because lockdep
does not work properly in real mode;

3. real mode uses vmalloc_to_phys() while virtual mode can access vmalloc'd
addresses directly. This is not expected to fail.

This is a full list.
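
To illustrate point 1, a hedged sketch only (update_one_tce() is an invented
wrapper; iommu_tce_xchg_rm() is the real mode exchange added earlier in this
series and iommu_tce_xchg() is its existing virtual mode counterpart):

	static long update_one_tce(struct iommu_table *tbl, unsigned long entry,
				   unsigned long *hpa, enum dma_data_direction *dir,
				   bool realmode)
	{
		if (realmode)
			/* uses cache-inhibited writes to invalidate the "TCE kill" cache */
			return iommu_tce_xchg_rm(tbl, entry, hpa, dir);

		return iommu_tce_xchg(tbl, entry, hpa, dir);
	}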


> 
> I think the key to this approach will be to think carefully about what
> semantics you guarantee for mappings shadowed into the hardware
> tables.  For example, it might work to specify that the host mappings
> only match the GPA mappings if those GPA mappings are valid in the
> first place.  So, H_PUT_TCE etc. would succeed as long as they're able
> to update the view of the table in terms of GPA.  But when you shadow
> those into the HPA tables, any entries which can't be translated you
> just replace with a cleared entry.

Literally with zero? Silently? WARN_ON_ONCE?

> That should be enough to protect
> the host.  Obviously you can expect the device to fail when you
> actually attempt to DMA there, but that's the guest's (or qemu's) own
> fault for putting bad addresses in the TCE table.
> 
> Obviously that might not be great for debugging, since mappings will
> appear to succeed, but then not work later on.
> 
> This does have the nice property that it's reasonably obvious what to
> do if you have some GPA mappings for emulated devices, then hotplug a
> VFIO device and at that point hit a gpa->hpa translation error.
> There's no hcall in this case, so there's no obvious way to return an
> error to the guest.

Right. So if I do this, you would probably even ack this? :)


> 
>> To do b) I'll need:
>>
>> 1. to have a copy of TCEs from the guest in vcpu,
> 
> I don't quite understand this.  You need a temporary copy, yes, but I
> don't see why it needs to be attached to the vcpu.


It does not need to be, I just need a safe + static + lock-free place for it:
I do not want to do malloc() in the TCE handlers, and (in theory) multiple
CPUs can do concurrent TCE requests, so I want to avoid locking, especially
in real mode.


>> I populate it via
>> get_user() to make sure they won't change;
>> 2. an array of userspace addresses translated from given TCEs; and in order
>> to make sure these addresses won't go away, I'll need to reference each
>> preregistered memory area via mm_iommu_mapped_inc().
>>
>> When I reach TCE#100, I'll have to revert the change, i.e. call
>> mm_iommu_mapped_dec().
> 
> Ugh.. yeah, I think to do this sanely, what you'd have to do is copy
> the updated translations into a temp buffer.  Then you'd have to make more
> temp buffers to store the UA and HPA translations (although maybe you
> could overwrite/reuse the original temp buffer if you're careful).
> Then only if all of those succeed do you copy them into the real
> hardware tables.
> 
> Which sounds like it might be kinda messy, at least in real mode.

So is it worth it?

> 
>> So I will end up having 2 arrays in a vcpu and simpler reverting code.
>>
>>
>> Or I can do simpler version of b) which would store guest TCEs in
>> kvm_vcpu_arch::tces[512] and use them after checking. If a malicious guest
>> does something bad and I return from H_PUT_TCE_INDIRECT in a middle of
>> request, some preregistered regions will stay referenced till the guest is
>> killed or rebooted (and this will prevent memory from unregistering) - but
>> this means no harm to the host;
> 
> Hrm.. that's not really true.  It's not the worst thing that can
> happen, but allowing the guest to permanently lock extra chunks of
> memory is a form of harm to the host.


These are the same preregistered chunks which are already locked. And the
lock is there till QEMU process is dead. What will not be possible is
memory hotunplug.


> 
>> and with preregistered RAM, there is no
>> valid reason for H_PUT_TCE_INDIRECT to fail for a good guest.
>>
>>
>>
>> Which approach to pick?
>>
>>
>> LoPAPR says:
>> ===
>> If the TCE parameter represents the logical page address of a page that is
>> not valid for the calling partition, return
>> H_Parameter.
>> ===
>>
>>
>>
>>>>  
>>>>  		kvmppc_tce_put(stt, entry + i, tce);
>>>>  	}
>>>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>> index 918af76ab2b6..f8a54b7c788e 100644
>>>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>> @@ -237,7 +237,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>>  {
>>>>  	struct kvmppc_spapr_tce_table *stt;
>>>>  	long i, ret = H_SUCCESS;
>>>> -	unsigned long tces, entry, ua = 0;
>>>> +	unsigned long tces, entry, tce, ua = 0;
>>>>  	unsigned long *rmap = NULL;
>>>>  
>>>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>>>> @@ -279,11 +279,15 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>>  	}
>>>>  
>>>>  	for (i = 0; i < npages; ++i) {
>>>> -		unsigned long tce = be64_to_cpu(((u64 *)tces)[i]);
>>>> +		tce = be64_to_cpu(((u64 *)tces)[i]);
>>>>  
>>>>  		ret = kvmppc_tce_validate(stt, tce);
>>>>  		if (ret != H_SUCCESS)
>>>>  			goto unlock_exit;
>>>> +	}
>>>> +
>>>> +	for (i = 0; i < npages; ++i) {
>>>> +		tce = be64_to_cpu(((u64 *)tces)[i]);
>>>
>>> Same problem here.
>>>
>>>>  
>>>>  		kvmppc_tce_put(stt, entry + i, tce);
>>>>  	}
>>>
>>
>>
> 
> 
> 
> 


-- 
Alexey


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 839 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH kernel v4 08/10] KVM: PPC: Separate TCE validation from update
  2017-02-10  4:09           ` Alexey Kardashevskiy
  (?)
@ 2017-02-10  4:50             ` David Gibson
  -1 siblings, 0 replies; 49+ messages in thread
From: David Gibson @ 2017-02-10  4:50 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Paul Mackerras, Alex Williamson, linuxppc-dev, kvm, kvm-ppc

[-- Attachment #1: Type: text/plain, Size: 10993 bytes --]

On Fri, Feb 10, 2017 at 03:09:30PM +1100, Alexey Kardashevskiy wrote:
> On 10/02/17 14:07, David Gibson wrote:
> > On Thu, Feb 09, 2017 at 07:20:11PM +1100, Alexey Kardashevskiy wrote:
> >> On 09/02/17 14:51, David Gibson wrote:
> >>> On Tue, Feb 07, 2017 at 06:17:09PM +1100, Alexey Kardashevskiy wrote:
> >>>> For the emulated devices it does not matter much if we get a broken TCE
> >>>> half way handling a TCE list but for VFIO it will matter as it has
> >>>> more chances to fail so we try to do our best and check as much as we
> >>>> can before proceeding.
> >>>>
> >>>> This separates a guest view table update from validation. No change in
> >>>> behavior is expected.
> >>>>
> >>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>> ---
> >>>>  arch/powerpc/kvm/book3s_64_vio.c    | 8 ++++++++
> >>>>  arch/powerpc/kvm/book3s_64_vio_hv.c | 8 ++++++--
> >>>>  2 files changed, 14 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> >>>> index 15df8ae627d9..9a7b7fca5e84 100644
> >>>> --- a/arch/powerpc/kvm/book3s_64_vio.c
> >>>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> >>>> @@ -282,6 +282,14 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>>>  		ret = kvmppc_tce_validate(stt, tce);
> >>>>  		if (ret != H_SUCCESS)
> >>>>  			goto unlock_exit;
> >>>> +	}
> >>>> +
> >>>> +	for (i = 0; i < npages; ++i) {
> >>>> +		if (get_user(tce, tces + i)) {
> >>>> +			ret = H_TOO_HARD;
> >>>> +			goto unlock_exit;
> >>>> +		}
> >>>> +		tce = be64_to_cpu(tce);
> >>>
> >>> This doesn't look safe.  The contents of user memory could change
> >>> between the two get_user()s, meaning that you're no longer guaranteed
> >>> a TCE loaded into kernel has been validated at all.
> >>>
> >>> I think you need to either:
> >>>
> >>>     a) Make sure things safe against a bad TCE being loaded into a TCE
> >>>     table and move all validation to where the TCE is used, rather
> >>>     than loaded
> >>>
> >>> or
> >>>     b) Copy the whole set of indirect entries to a temporary in-kernel
> >>>        buffer, then validate, then load into the actual TCE table.
> >>
> >>
> >> Correct :( The problem is I do not know how far I want to go in reverting
> >> the state as it was when I started handling H_PUT_TCE_INDIRECT.
> >>
> >> For example, 1 container, 2 IOMMU groups with disabled shared tables, so -
> >> 2 tables, 512 TCEs request and TCE#100 does not translate to host physical
> >> address.
> >>
> >>
> >> To do a) I'll need to remember old content of each hardware table entry as
> >> when I reach TCE#100, I'll need to revert to the initial state which means
> >> I need to write back old TCEs to all affected hardware tables and update
> >> reference counters of all affected preregistered areas. Well, the actual
> >> tables must not have different addresses (BUG_ON? is it worth testing while
> >> writing to hardware tables that values I am replacing are the same in all
> >> tables?) so I can have just a single array of old TCEs from hardware tables
> >> in vcpu.
> > 
> > I thought you said shared tables were disabled, so the two tables
> > would have different addresses?
> 
> That would be 2 physically separated tables but the content would be the
> same as long as they belong to the same VFIO container.

Ok.  I thought you were talking about the address of the TCE tables
being the same above.  Did you mean the address of an individual page
mapped in the TCE table?

> > Hmm.  Now I'm trying to remember, will the gpa->hpa translation fail
> > only if the guest/qemu does something wrong, or can it fail for other
> > reasons? 
> 
> This should always just work.

Ok, given that, just replacing HPAs we can't translate with a clear
entry seems fine to me.

> > What about in real mode vs. virtual mode?
> 
> Real mode is no different in this matter.
> 
> Real mode is different from virtual mode in 3 aspects:
> 
> 1. iommu_table_ops::exchange() vs. exchange_rm(), as real mode uses cache-inhibited
> writes to invalidate the "TCE kill" cache;
> 
> 2. list_for_each_entry_lockless() vs. list_for_each_entry_rcu(), because lockdep
> does not work properly in real mode;
> 
> 3. real mode uses vmalloc_to_phys() while virtual mode can access vmalloc'd
> addresses directly. This is not expected to fail.
> 
> This is a full list.

Ok.

> > I think the key to this approach will be to think carefully about what
> > semantics you guarantee for mappings shadowed into the hardware
> > tables.  For example, it might work to specify that the host mappings
> > only match the GPA mappings if those GPA mappings are valid in the
> > first place.  So, H_PUT_TCE etc. would succeed as long as they're able
> > to update the view of the table in terms of GPA.  But when you shadow
> > those into the HPA tables, any entries which can't be translated you
> > just replace with a cleared entry.
> 
> Literally with zero? Silently? WARN_ON_ONCE?

Well, with a no-permission TCE, which might as well be zero, yes.

WARN_ON_ONCE() is probably a good idea.
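
Roughly, a sketch of that idea (kvmppc_shadow_tce() and kvmppc_gpa_to_hpa()
are invented names, and iommu_tce_xchg() stands in for the virtual mode
counterpart of the exchange call quoted earlier - this is not the posted
patch):

	static long kvmppc_shadow_tce(struct kvm *kvm, struct iommu_table *tbl,
				      unsigned long entry, unsigned long gpa,
				      enum dma_data_direction dir)
	{
		unsigned long hpa = 0;

		if (dir != DMA_NONE && kvmppc_gpa_to_hpa(kvm, gpa, &hpa)) {
			/*
			 * The guest put an untranslatable address in its table:
			 * shadow a cleared (no-permission) entry instead.
			 */
			WARN_ON_ONCE(1);
			hpa = 0;
			dir = DMA_NONE;
		}

		return iommu_tce_xchg(tbl, entry, &hpa, &dir);
	}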

> > That should be enough to protect
> > the host.  Obviously you can expect the device to fail when you
> > actually attempt to DMA there, but that's the guest's (or qemu's) own
> > fault for putting bad addresses in the TCE table.
> > 
> > Obviously that might not be great for debugging, since mappings will
> > appear to succeed, but then not work later on.
> > 
> > This does have the nice property that it's reasonably obvious what to
> > do if you have some GPA mappings for emulated devices, then hotplug a
> > VFIO device and at that point hit a gpa->hpa translation error.
> > There's no hcall in this case, so there's no obvious way to return an
> > error to the guest.
> 
> Right. So if I do this, you would probably even ack this? :)

Assuming I don't spot some other showstopper...

Oh.. one thing to make sure you think about though: what happens if a
guest makes some mappings, then there's a memory hotplug event which
changes the set of valid GPAs?  In particular what if you hot unplug
some memory which is mapped in a guest TCE table?  You might have to
regenerate the HPA tables from the GPA table on hot unplug (unless you
have a way of locking out an unplug event while that piece of guest
ram is TCE mapped).
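
Purely hypothetical sketch of that (every helper name below is invented; the
point is only that the shadow table can be rebuilt from the guest-view table
when a GPA range disappears):

	static void kvmppc_reshadow_on_unplug(struct kvm *kvm,
					      struct kvmppc_spapr_tce_table *stt,
					      struct iommu_table *tbl,
					      unsigned long gpa_begin,
					      unsigned long gpa_end,
					      unsigned long nentries)
	{
		unsigned long i, gpa;

		for (i = 0; i < nentries; ++i) {
			gpa = kvmppc_tce_get_gpa(stt, i);	/* invented getter */
			if (gpa < gpa_begin || gpa >= gpa_end)
				continue;
			/* the backing memory is gone: clear the hardware entry */
			kvmppc_clear_hw_tce(kvm, tbl, i);	/* invented */
		}
	}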

> >> To do b) I'll need:
> >>
> >> 1. to have a copy of TCEs from the guest in vcpu,
> > 
> > I don't quite understand this.  You need a temporary copy, yes, but I
> > don't see why it needs to be attached to the vcpu.
> 
> It does not need to be, I just need a safe + static + lock-free place for it:
> I do not want to do malloc() in the TCE handlers, and (in theory) multiple
> CPUs can do concurrent TCE requests, so I want to avoid locking, especially
> in real mode.

Ah, right, it's the inability to malloc() that's the difficulty.  You
could put it in the vcpu, or you could use a per-(host)-cpu area - you
can't switch guests while in a realmode handler.
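
A minimal sketch of the per-vcpu variant (the tces[512] field is the one you
mentioned for kvm_vcpu_arch; kvmppc_copy_tce_list() is an invented name and
this is not the posted patch):

	/* assumes a "u64 tces[512];" field has been added to struct kvm_vcpu_arch */
	static long kvmppc_copy_tce_list(struct kvm_vcpu *vcpu,
					 u64 __user *guest_list,
					 unsigned long npages)
	{
		unsigned long i;
		u64 tce;

		if (npages > ARRAY_SIZE(vcpu->arch.tces))
			return H_PARAMETER;

		for (i = 0; i < npages; ++i) {
			if (get_user(tce, guest_list + i))
				return H_TOO_HARD;
			/* validate and use only this in-kernel copy from here on */
			vcpu->arch.tces[i] = be64_to_cpu(tce);
		}

		return H_SUCCESS;
	}
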
> 
> 
> >> I populate it via
> >> get_user() to make sure they won't change;
> >> 2. an array of userspace addresses translated from given TCEs; and in order
> >> to make sure these addresses won't go away, I'll need to reference each
> >> preregistered memory area via mm_iommu_mapped_inc().
> >>
> >> When I reach TCE#100, I'll have to revert the change, i.e. call
> >> mm_iommu_mapped_dec().
> > 
> > Ugh.. yeah, I think to do this sanely, what you'd have to do is copy
> > the updated translations into a temp buffer.  Then you'd have to make more
> > temp buffers to store the UA and HPA translations (although maybe you
> > could overwrite/reuse the original temp buffer if you're careful).
> > Then only if all of those succeed do you copy them into the real
> > hardware tables.
> > 
> > Which sounds like it might be kinda messy, at least in real mode.
> 
> So is it worth it?

Option (a) is certainly looking better to me based on current
information.

> >> So I will end up having 2 arrays in a vcpu and simpler reverting code.
> >>
> >>
> >> Or I can do simpler version of b) which would store guest TCEs in
> >> kvm_vcpu_arch::tces[512] and use them after checking. If a malicious guest
> >> does something bad and I return from H_PUT_TCE_INDIRECT in a middle of
> >> request, some preregistered regions will stay referenced till the guest is
> >> killed or rebooted (and this will prevent memory from unregistering) - but
> >> this means no harm to the host;
> > 
> > Hrm.. that's not really true.  It's not the worst thing that can
> > happen, but allowing the guest to permanently lock extra chunks of
> > memory is a form of harm to the host.
> 
> 
> These are the same preregistered chunks which are already locked. And the
> lock is there till QEMU process is dead. What will not be possible is
> memory hotunplug.

Ah, ok, I see your point.  That's probably sufficient, but option (a)
is still looking better.

> >> and with preregistered RAM, there is no
> >> valid reason for H_PUT_TCE_INDIRECT to fail for a good guest.
> >>
> >>
> >>
> >> Which approach to pick?
> >>
> >>
> >> LoPAPR says:
> >> ===
> >> If the TCE parameter represents the logical page address of a page that is
> >> not valid for the calling partition, return
> >> H_Parameter.
> >> ===
> >>
> >>
> >>
> >>>>  
> >>>>  		kvmppc_tce_put(stt, entry + i, tce);
> >>>>  	}
> >>>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> >>>> index 918af76ab2b6..f8a54b7c788e 100644
> >>>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> >>>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> >>>> @@ -237,7 +237,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>>>  {
> >>>>  	struct kvmppc_spapr_tce_table *stt;
> >>>>  	long i, ret = H_SUCCESS;
> >>>> -	unsigned long tces, entry, ua = 0;
> >>>> +	unsigned long tces, entry, tce, ua = 0;
> >>>>  	unsigned long *rmap = NULL;
> >>>>  
> >>>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >>>> @@ -279,11 +279,15 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>>>  	}
> >>>>  
> >>>>  	for (i = 0; i < npages; ++i) {
> >>>> -		unsigned long tce = be64_to_cpu(((u64 *)tces)[i]);
> >>>> +		tce = be64_to_cpu(((u64 *)tces)[i]);
> >>>>  
> >>>>  		ret = kvmppc_tce_validate(stt, tce);
> >>>>  		if (ret != H_SUCCESS)
> >>>>  			goto unlock_exit;
> >>>> +	}
> >>>> +
> >>>> +	for (i = 0; i < npages; ++i) {
> >>>> +		tce = be64_to_cpu(((u64 *)tces)[i]);
> >>>
> >>> Same problem here.
> >>>
> >>>>  
> >>>>  		kvmppc_tce_put(stt, entry + i, tce);
> >>>>  	}
> >>>
> >>
> >>
> > 
> > 
> > 
> > 
> 
> 




-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH kernel v4 08/10] KVM: PPC: Separate TCE validation from update
@ 2017-02-10  4:50             ` David Gibson
  0 siblings, 0 replies; 49+ messages in thread
From: David Gibson @ 2017-02-10  4:50 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Alex Williamson, Paul Mackerras, kvm-ppc, kvm

[-- Attachment #1: Type: text/plain, Size: 10993 bytes --]

On Fri, Feb 10, 2017 at 03:09:30PM +1100, Alexey Kardashevskiy wrote:
> On 10/02/17 14:07, David Gibson wrote:
> > On Thu, Feb 09, 2017 at 07:20:11PM +1100, Alexey Kardashevskiy wrote:
> >> On 09/02/17 14:51, David Gibson wrote:
> >>> On Tue, Feb 07, 2017 at 06:17:09PM +1100, Alexey Kardashevskiy wrote:
> >>>> For the emulated devices it does not matter much if we get a broken TCE
> >>>> half way handling a TCE list but for VFIO it will matter as it has
> >>>> more chances to fail so we try to do our best and check as much as we
> >>>> can before proceeding.
> >>>>
> >>>> This separates a guest view table update from validation. No change in
> >>>> behavior is expected.
> >>>>
> >>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>> ---
> >>>>  arch/powerpc/kvm/book3s_64_vio.c    | 8 ++++++++
> >>>>  arch/powerpc/kvm/book3s_64_vio_hv.c | 8 ++++++--
> >>>>  2 files changed, 14 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> >>>> index 15df8ae627d9..9a7b7fca5e84 100644
> >>>> --- a/arch/powerpc/kvm/book3s_64_vio.c
> >>>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> >>>> @@ -282,6 +282,14 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>>>  		ret = kvmppc_tce_validate(stt, tce);
> >>>>  		if (ret != H_SUCCESS)
> >>>>  			goto unlock_exit;
> >>>> +	}
> >>>> +
> >>>> +	for (i = 0; i < npages; ++i) {
> >>>> +		if (get_user(tce, tces + i)) {
> >>>> +			ret = H_TOO_HARD;
> >>>> +			goto unlock_exit;
> >>>> +		}
> >>>> +		tce = be64_to_cpu(tce);
> >>>
> >>> This doesn't look safe.  The contents of user memory could change
> >>> between the two get_user()s, meaning that you're no longer guaranteed
> >>> a TCE loaded into kernel has been validated at all.
> >>>
> >>> I think you need to either:
> >>>
> >>>     a) Make sure things safe against a bad TCE being loaded into a TCE
> >>>     table and move all validation to where the TCE is used, rather
> >>>     than loaded
> >>>
> >>> or
> >>>     b) Copy the whole set of indirect entries to a temporary in-kernel
> >>>        buffer, then validate, then load into the actual TCE table.
> >>
> >>
> >> Correct :( The problem is I do not know how far I want to go in reverting
> >> the state as it was when I started handling H_PUT_TCE_INDIRECT.
> >>
> >> For example, 1 container, 2 IOMMU groups with disabled shared tables, so -
> >> 2 tables, 512 TCEs request and TCE#100 does not translate to host physical
> >> address.
> >>
> >>
> >> To do a) I'll need to remember old content of each hardware table entry as
> >> when I reach TCE#100, I'll need to revert to the initial state which means
> >> I need to write back old TCEs to all affected hardware tables and update
> >> reference counters of all affected preregistered areas. Well, the actual
> >> tables must not have different addresses (BUG_ON? is it worth testing while
> >> writing to hardware tables that values I am replacing are the same in all
> >> tables?) so I can have just a single array of old TCEs from hardware tables
> >> in vcpu.
> > 
> > I thought you said shared tables were disabled, so the two tables
> > would have different addresses?
> 
> That would be 2 physically separated tables but the content would be the
> same as long as they belong to the same VFIO container.

Ok.  I thought you were talking about the address of the TCE tables
being the same above.  Did you mean the address of an individual page
mapped in the TCE table?

> > Hmm.  Now I'm trying to remember, will the gpa->hpa translation fail
> > only if the guest/qemu does something wrong, or can it fail for other
> > reasons? 
> 
> This should always just work.

Ok, given that, just replacing HPAs we can't translate with a clear
entry seems fine to me.

> > What about in real mode vs. virtual mode?
> 
> Real mode is no different in this matter.
> 
> Real mode is different from virtual mode in 3 aspects:
> 
> 1. iommu_table_ops::exchange() vs. exchange_rm() as real mode uses cache
> inhibited writes to invalidate "TCE kill" cache;
> 
> 2. list_for_each_entry_lockless() vs. list_for_each_entry_rct() because of
> lockdep does not work in real mode properly;
> 
> 3. real mode uses vmalloc_to_phys() while virtual mode can access vmalloc'd
> addresses directly. Not expected to fail.
> 
> This is a full list.

Ok.

> > I think the key to this approach will be to think carefully about what
> > semantics you guarantee for mappings shadowed into the hardware
> > tables.  For example, it might work to specify that the host mappings
> > only match the GPA mappings if those GPA mapings are valid in the
> > first place.  So, H_PUT_TCE etc. would succeed as long as they're able
> > to update the view of the table in terms of GPA.  But when you shadow
> > those into the HPA tables, any entries which can't be translated you
> > just replace with a cleared entry.
> 
> Literally with zero? Silently? WARN_ON_ONCE?

Well, with a no-permission TCE, which might as well be zero, yes.

WARN_ON_ONCE() is probably a good idea.

> > That should be enough to protect
> > the host.  Obviously you can expect the device to fail when you
> > actually attempt to DMA there, but that's the guest's (or qemu's) own
> > fault for putting bad addresses in the TCE table.
> > 
> > Obviously that might not be great for debugging, since mappings will
> > appear to succeed, but then not work later on.
> > 
> > This does have the nice property that it's reasonably obvious what to
> > do if you have some GPA mappings for emulated devices, then hotplug a
> > VFIO device and at that point hit a gpa->hpa translation error.
> > There's no hcall in this case, so there's no obvious way to return an
> > error to the guest.
> 
> Right. So if I do this, you would probably even ack this? :)

Assuming I don't spot some other showstopper...

Oh.. one thing to make sure you think about though: what happens if a
guest makes some mappings, then there's a memory hotplug event which
changes the set of valid GPAs?  In particular what if you hot unplug
some memory which is mapped in a guest TCE table?  You might have to
regenerate the HPA tables from the GPA table on hot unplug (unless you
have a way of locking out an unplug event while that piece of guest
ram is TCE mapped).

> >> To do b) I'll need:
> >>
> >> 1. to have a copy of TCEs from the guest in vcpu,
> > 
> > I don't quite understand this.  You need a temporary copy, yes, but I
> > don't see why it needs to be attached to the vcpu.
> 
> It does not need, I just need a safe + static + lock-free place for it as I
> do not want to do malloc() in the TCE handlers and (in theory) multiple
> CPUs can do concurrent TCE requests and I want to avoid locking especially
> in realmode.

Ah, right, it's the inability to malloc() that's the difficulty.  You
could put it in the vcpu, or you could use a per-(host)-cpu area - you
can't switch guests while in a realmode handler.
> 
> 
> >> I populate it via
> >> get_user() to make sure they won't change;
> >> 2. an array of userspace addresses translated from given TCEs; and in order
> >> to make sure these addresses won't go away, I'll need to reference each
> >> preregistered memory area via mm_iommu_mapped_inc().
> >>
> >> When I reach TCE#100, I'll have to revert the change, i.e. call
> >> mm_iommu_mapped_dec().
> > 
> > Ugh.. yeah, I think to do this sanely, what you'd have to do is copy
> > the updated translations into a temp buffer.  Then you'd to make more
> > temp buffers to store the UA and HPA translations (although maybe you
> > could overwrite/reuse the original temp buffer if you're careful).
> > Then only if all of those succeed do you copy them into the real
> > hardware tables.
> > 
> > Which sounds like it might be kinda messy, at least in real mode.
> 
> So is it worth it?

Option (a) is certainly looking better to me based on current
information.

> >> So I will end up having 2 arrays in a vcpu and simpler reverting code.
> >>
> >>
> >> Or I can do simpler version of b) which would store guest TCEs in
> >> kvm_vcpu_arch::tces[512] and use them after checking. If a malicious guest
> >> does something bad and I return from H_PUT_TCE_INDIRECT in a middle of
> >> request, some preregistered regions will stay referenced till the guest is
> >> killed or rebooted (and this will prevent memory from unregistering) - but
> >> this means no harm to the host;
> > 
> > Hrm.. that's not really true.  It's not the worst thing that can
> > happen, but allowing the guest to permanently lock extra chunks of
> > memory is a form of harm to the host.
> 
> 
> These are the same preregistered chunks which are already locked. And the
> lock is there till QEMU process is dead. What will not be possible is
> memory hotunplug.

Ah, ok, I see your point.  That's probably sufficient, but option (a)
is still looking better.

> >> and with preregistered RAM, there is no
> >> valid reason for H_PUT_TCE_INDIRECT to fail for a good guest.
> >>
> >>
> >>
> >> Which approach to pick?
> >>
> >>
> >> LoPAPR says:
> >> ===
> >> If the TCE parameter represents the logical page address of a page that is
> >> not valid for the calling partition, return
> >> H_Parameter.
> >> ===
> >>
> >>
> >>
> >>>>  
> >>>>  		kvmppc_tce_put(stt, entry + i, tce);
> >>>>  	}
> >>>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> >>>> index 918af76ab2b6..f8a54b7c788e 100644
> >>>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> >>>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> >>>> @@ -237,7 +237,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>>>  {
> >>>>  	struct kvmppc_spapr_tce_table *stt;
> >>>>  	long i, ret = H_SUCCESS;
> >>>> -	unsigned long tces, entry, ua = 0;
> >>>> +	unsigned long tces, entry, tce, ua = 0;
> >>>>  	unsigned long *rmap = NULL;
> >>>>  
> >>>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >>>> @@ -279,11 +279,15 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>>>  	}
> >>>>  
> >>>>  	for (i = 0; i < npages; ++i) {
> >>>> -		unsigned long tce = be64_to_cpu(((u64 *)tces)[i]);
> >>>> +		tce = be64_to_cpu(((u64 *)tces)[i]);
> >>>>  
> >>>>  		ret = kvmppc_tce_validate(stt, tce);
> >>>>  		if (ret != H_SUCCESS)
> >>>>  			goto unlock_exit;
> >>>> +	}
> >>>> +
> >>>> +	for (i = 0; i < npages; ++i) {
> >>>> +		tce = be64_to_cpu(((u64 *)tces)[i]);
> >>>
> >>> Same problem here.
> >>>
> >>>>  
> >>>>  		kvmppc_tce_put(stt, entry + i, tce);
> >>>>  	}
> >>>
> >>
> >>
> > 
> > 
> > 
> > 
> 
> 




-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH kernel v4 08/10] KVM: PPC: Separate TCE validation from update
@ 2017-02-10  4:50             ` David Gibson
  0 siblings, 0 replies; 49+ messages in thread
From: David Gibson @ 2017-02-10  4:50 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Paul Mackerras, Alex Williamson, linuxppc-dev, kvm, kvm-ppc

[-- Attachment #1: Type: text/plain, Size: 10993 bytes --]

On Fri, Feb 10, 2017 at 03:09:30PM +1100, Alexey Kardashevskiy wrote:
> On 10/02/17 14:07, David Gibson wrote:
> > On Thu, Feb 09, 2017 at 07:20:11PM +1100, Alexey Kardashevskiy wrote:
> >> On 09/02/17 14:51, David Gibson wrote:
> >>> On Tue, Feb 07, 2017 at 06:17:09PM +1100, Alexey Kardashevskiy wrote:
> >>>> For the emulated devices it does not matter much if we get a broken TCE
> >>>> half way handling a TCE list but for VFIO it will matter as it has
> >>>> more chances to fail so we try to do our best and check as much as we
> >>>> can before proceeding.
> >>>>
> >>>> This separates a guest view table update from validation. No change in
> >>>> behavior is expected.
> >>>>
> >>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>> ---
> >>>>  arch/powerpc/kvm/book3s_64_vio.c    | 8 ++++++++
> >>>>  arch/powerpc/kvm/book3s_64_vio_hv.c | 8 ++++++--
> >>>>  2 files changed, 14 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> >>>> index 15df8ae627d9..9a7b7fca5e84 100644
> >>>> --- a/arch/powerpc/kvm/book3s_64_vio.c
> >>>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> >>>> @@ -282,6 +282,14 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>>>  		ret = kvmppc_tce_validate(stt, tce);
> >>>>  		if (ret != H_SUCCESS)
> >>>>  			goto unlock_exit;
> >>>> +	}
> >>>> +
> >>>> +	for (i = 0; i < npages; ++i) {
> >>>> +		if (get_user(tce, tces + i)) {
> >>>> +			ret = H_TOO_HARD;
> >>>> +			goto unlock_exit;
> >>>> +		}
> >>>> +		tce = be64_to_cpu(tce);
> >>>
> >>> This doesn't look safe.  The contents of user memory could change
> >>> between the two get_user()s, meaning that you're no longer guaranteed
> >>> a TCE loaded into kernel has been validated at all.
> >>>
> >>> I think you need to either:
> >>>
> >>>     a) Make sure things safe against a bad TCE being loaded into a TCE
> >>>     table and move all validation to where the TCE is used, rather
> >>>     than loaded
> >>>
> >>> or
> >>>     b) Copy the whole set of indirect entries to a temporary in-kernel
> >>>        buffer, then validate, then load into the actual TCE table.
> >>
> >>
> >> Correct :( The problem is I do not know how far I want to go in reverting
> >> the state as it was when I started handling H_PUT_TCE_INDIRECT.
> >>
> >> For example, 1 container, 2 IOMMU groups with disabled shared tables, so -
> >> 2 tables, 512 TCEs request and TCE#100 does not translate to host physical
> >> address.
> >>
> >>
> >> To do a) I'll need to remember old content of each hardware table entry as
> >> when I reach TCE#100, I'll need to revert to the initial state which means
> >> I need to write back old TCEs to all affected hardware tables and update
> >> reference counters of all affected preregistered areas. Well, the actual
> >> tables must not have different addresses (BUG_ON? is it worth testing while
> >> writing to hardware tables that values I am replacing are the same in all
> >> tables?) so I can have just a single array of old TCEs from hardware tables
> >> in vcpu.
> > 
> > I thought you said shared tables were disabled, so the two tables
> > would have different addresses?
> 
> That would be 2 physically separated tables but the content would be the
> same as long as they belong to the same VFIO container.

Ok.  I thought you were talking about the address of the TCE tables
being the same above.  Did you mean the address of an individual page
mapped in the TCE table?

> > Hmm.  Now I'm trying to remember, will the gpa->hpa translation fail
> > only if the guest/qemu does something wrong, or can it fail for other
> > reasons? 
> 
> This should always just work.

Ok, given that, just replacing HPAs we can't translate with a clear
entry seems fine to me.

> > What about in real mode vs. virtual mode?
> 
> Real mode is no different in this matter.
> 
> Real mode is different from virtual mode in 3 aspects:
> 
> 1. iommu_table_ops::exchange() vs. exchange_rm() as real mode uses cache
> inhibited writes to invalidate "TCE kill" cache;
> 
> 2. list_for_each_entry_lockless() vs. list_for_each_entry_rct() because of
> lockdep does not work in real mode properly;
> 
> 3. real mode uses vmalloc_to_phys() while virtual mode can access vmalloc'd
> addresses directly. Not expected to fail.
> 
> This is a full list.

Ok.

> > I think the key to this approach will be to think carefully about what
> > semantics you guarantee for mappings shadowed into the hardware
> > tables.  For example, it might work to specify that the host mappings
> > only match the GPA mappings if those GPA mapings are valid in the
> > first place.  So, H_PUT_TCE etc. would succeed as long as they're able
> > to update the view of the table in terms of GPA.  But when you shadow
> > those into the HPA tables, any entries which can't be translated you
> > just replace with a cleared entry.
> 
> Literally with zero? Silently? WARN_ON_ONCE?

Well, with a no-permission TCE, which might as well be zero, yes.

WARN_ON_ONCE() is probably a good idea.

> > That should be enough to protect
> > the host.  Obviously you can expect the device to fail when you
> > actually attempt to DMA there, but that's the guest's (or qemu's) own
> > fault for putting bad addresses in the TCE table.
> > 
> > Obviously that might not be great for debugging, since mappings will
> > appear to succeed, but then not work later on.
> > 
> > This does have the nice property that it's reasonably obvious what to
> > do if you have some GPA mappings for emulated devices, then hotplug a
> > VFIO device and at that point hit a gpa->hpa translation error.
> > There's no hcall in this case, so there's no obvious way to return an
> > error to the guest.
> 
> Right. So if I do this, you would probably even ack this? :)

Assuming I don't spot some other showstopper...

Oh.. one thing to make sure you think about though: what happens if a
guest makes some mappings, then there's a memory hotplug event which
changes the set of valid GPAs?  In particular what if you hot unplug
some memory which is mapped in a guest TCE table?  You might have to
regenerate the HPA tables from the GPA table on hot unplug (unless you
have a way of locking out an unplug event while that piece of guest
ram is TCE mapped).

> >> To do b) I'll need:
> >>
> >> 1. to have a copy of TCEs from the guest in vcpu,
> > 
> > I don't quite understand this.  You need a temporary copy, yes, but I
> > don't see why it needs to be attached to the vcpu.
> 
> It does not need, I just need a safe + static + lock-free place for it as I
> do not want to do malloc() in the TCE handlers and (in theory) multiple
> CPUs can do concurrent TCE requests and I want to avoid locking especially
> in realmode.

Ah, right, it's the inability to malloc() that's the difficulty.  You
could put it in the vcpu, or you could use a per-(host)-cpu area - you
can't switch guests while in a realmode handler.
> 
> 
> >> I populate it via
> >> get_user() to make sure they won't change;
> >> 2. an array of userspace addresses translated from given TCEs; and in order
> >> to make sure these addresses won't go away, I'll need to reference each
> >> preregistered memory area via mm_iommu_mapped_inc().
> >>
> >> When I reach TCE#100, I'll have to revert the change, i.e. call
> >> mm_iommu_mapped_dec().
> > 
> > Ugh.. yeah, I think to do this sanely, what you'd have to do is copy
> > the updated translations into a temp buffer.  Then you'd to make more
> > temp buffers to store the UA and HPA translations (although maybe you
> > could overwrite/reuse the original temp buffer if you're careful).
> > Then only if all of those succeed do you copy them into the real
> > hardware tables.
> > 
> > Which sounds like it might be kinda messy, at least in real mode.
> 
> So is it worth it?

Option (a) is certainly looking better to me based on current
information.

> >> So I will end up having 2 arrays in a vcpu and simpler reverting code.
> >>
> >>
> >> Or I can do a simpler version of b) which would store guest TCEs in
> >> kvm_vcpu_arch::tces[512] and use them after checking. If a malicious guest
> >> does something bad and I return from H_PUT_TCE_INDIRECT in the middle of a
> >> request, some preregistered regions will stay referenced till the guest is
> >> killed or rebooted (and this will prevent memory from unregistering) - but
> >> this means no harm to the host;
> > 
> > Hrm.. that's not really true.  It's not the worst thing that can
> > happen, but allowing the guest to permanently lock extra chunks of
> > memory is a form of harm to the host.
> 
> 
> These are the same preregistered chunks which are already locked. And the
> lock is there till the QEMU process is dead. What will not be possible is
> memory hot unplug.

Ah, ok, I see your point.  That's probably sufficient, but option (a)
is still looking better.

> >> and with preregistered RAM, there is no
> >> valid reason for H_PUT_TCE_INDIRECT to fail for a good guest.
> >>
> >>
> >>
> >> Which approach to pick?
> >>
> >>
> >> LoPAPR says:
> >> ===
> >> If the TCE parameter represents the logical page address of a page that is
> >> not valid for the calling partition, return
> >> H_Parameter.
> >> ===
> >>
> >>
> >>
> >>>>  
> >>>>  		kvmppc_tce_put(stt, entry + i, tce);
> >>>>  	}
> >>>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> >>>> index 918af76ab2b6..f8a54b7c788e 100644
> >>>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> >>>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> >>>> @@ -237,7 +237,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>>>  {
> >>>>  	struct kvmppc_spapr_tce_table *stt;
> >>>>  	long i, ret = H_SUCCESS;
> >>>> -	unsigned long tces, entry, ua = 0;
> >>>> +	unsigned long tces, entry, tce, ua = 0;
> >>>>  	unsigned long *rmap = NULL;
> >>>>  
> >>>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >>>> @@ -279,11 +279,15 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>>>  	}
> >>>>  
> >>>>  	for (i = 0; i < npages; ++i) {
> >>>> -		unsigned long tce = be64_to_cpu(((u64 *)tces)[i]);
> >>>> +		tce = be64_to_cpu(((u64 *)tces)[i]);
> >>>>  
> >>>>  		ret = kvmppc_tce_validate(stt, tce);
> >>>>  		if (ret != H_SUCCESS)
> >>>>  			goto unlock_exit;
> >>>> +	}
> >>>> +
> >>>> +	for (i = 0; i < npages; ++i) {
> >>>> +		tce = be64_to_cpu(((u64 *)tces)[i]);
> >>>
> >>> Same problem here.
> >>>
> >>>>  
> >>>>  		kvmppc_tce_put(stt, entry + i, tce);
> >>>>  	}
> >>>
> >>
> >>
> > 
> > 
> > 
> > 
> 
> 




-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH kernel v4 08/10] KVM: PPC: Separate TCE validation from update
  2017-02-10  4:50             ` David Gibson
@ 2017-02-10  7:58               ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 49+ messages in thread
From: Alexey Kardashevskiy @ 2017-02-10  7:58 UTC (permalink / raw)
  To: David Gibson; +Cc: linuxppc-dev, Alex Williamson, Paul Mackerras, kvm-ppc, kvm


[-- Attachment #1.1: Type: text/plain, Size: 11556 bytes --]

On 10/02/17 15:50, David Gibson wrote:
> On Fri, Feb 10, 2017 at 03:09:30PM +1100, Alexey Kardashevskiy wrote:
>> On 10/02/17 14:07, David Gibson wrote:
>>> On Thu, Feb 09, 2017 at 07:20:11PM +1100, Alexey Kardashevskiy wrote:
>>>> On 09/02/17 14:51, David Gibson wrote:
>>>>> On Tue, Feb 07, 2017 at 06:17:09PM +1100, Alexey Kardashevskiy wrote:
>>>>>> For emulated devices it does not matter much if we get a broken TCE
>>>>>> halfway through handling a TCE list, but for VFIO it will matter as it
>>>>>> has more chances to fail, so we try to do our best and check as much as
>>>>>> we can before proceeding.
>>>>>>
>>>>>> This separates a guest view table update from validation. No change in
>>>>>> behavior is expected.
>>>>>>
>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>> ---
>>>>>>  arch/powerpc/kvm/book3s_64_vio.c    | 8 ++++++++
>>>>>>  arch/powerpc/kvm/book3s_64_vio_hv.c | 8 ++++++--
>>>>>>  2 files changed, 14 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
>>>>>> index 15df8ae627d9..9a7b7fca5e84 100644
>>>>>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>>>>>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>>>>>> @@ -282,6 +282,14 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>>>>  		ret = kvmppc_tce_validate(stt, tce);
>>>>>>  		if (ret != H_SUCCESS)
>>>>>>  			goto unlock_exit;
>>>>>> +	}
>>>>>> +
>>>>>> +	for (i = 0; i < npages; ++i) {
>>>>>> +		if (get_user(tce, tces + i)) {
>>>>>> +			ret = H_TOO_HARD;
>>>>>> +			goto unlock_exit;
>>>>>> +		}
>>>>>> +		tce = be64_to_cpu(tce);
>>>>>
>>>>> This doesn't look safe.  The contents of user memory could change
>>>>> between the two get_user()s, meaning that you're no longer guaranteed
>>>>> that a TCE loaded into the kernel has been validated at all.
>>>>>
>>>>> I think you need to either:
>>>>>
>>>>>     a) Make sure things safe against a bad TCE being loaded into a TCE
>>>>>     table and move all validation to where the TCE is used, rather
>>>>>     than loaded
>>>>>
>>>>> or
>>>>>     b) Copy the whole set of indirect entries to a temporary in-kernel
>>>>>        buffer, then validate, then load into the actual TCE table.
>>>>
>>>>
>>>> Correct :( The problem is I do not know how far I want to go in reverting
>>>> the state to what it was when I started handling H_PUT_TCE_INDIRECT.
>>>>
>>>> For example: 1 container, 2 IOMMU groups with shared tables disabled, so
>>>> 2 tables, a 512-TCE request, and TCE#100 does not translate to a host
>>>> physical address.
>>>>
>>>>
>>>> To do a) I'll need to remember the old content of each hardware table
>>>> entry because when I reach TCE#100, I'll need to revert to the initial
>>>> state, which means I need to write back the old TCEs to all affected
>>>> hardware tables and update the reference counters of all affected
>>>> preregistered areas. Well, the actual tables must not have different
>>>> addresses (BUG_ON? is it worth testing, while writing to the hardware
>>>> tables, that the values I am replacing are the same in all tables?), so
>>>> I can have just a single array of old TCEs from hardware tables in vcpu.
>>>
>>> I thought you said shared tables were disabled, so the two tables
>>> would have different addresses?
>>
>> That would be 2 physically separate tables, but the content would be the
>> same as long as they belong to the same VFIO container.
> 
> Ok.  I thought you were talking about the address of the TCE tables
> being the same above.

No, the example uses 2 separate TCE tables.

> Did you mean the address of an individual page
> mapped in the TCE table?

I meant the tables themselves are separate in the host memory but their
content is the same.


>>> Hmm.  Now I'm trying to remember, will the gpa->hpa translation fail
>>> only if the guest/qemu does something wrong, or can it fail for other
>>> reasons? 
>>
>> This should always just work.
> 
> Ok, given that, just replacing HPAs we can't translate with a clear
> entry seems fine to me.


Ok.


>>> What about in real mode vs. virtual mode?
>>
>> Real mode is no different in this matter.
>>
>> Real mode is different from virtual mode in 3 aspects:
>>
>> 1. iommu_table_ops::exchange() vs. exchange_rm() as real mode uses
>> cache-inhibited writes to invalidate the "TCE kill" cache;
>>
>> 2. list_for_each_entry_lockless() vs. list_for_each_entry_rcu() because
>> lockdep does not work properly in real mode;
>>
>> 3. real mode uses vmalloc_to_phys() while virtual mode can access vmalloc'd
>> addresses directly. Not expected to fail.
>>
>> This is a full list.
> 
> Ok.
> 
>>> I think the key to this approach will be to think carefully about what
>>> semantics you guarantee for mappings shadowed into the hardware
>>> tables.  For example, it might work to specify that the host mappings
>>> only match the GPA mappings if those GPA mappings are valid in the
>>> first place.  So, H_PUT_TCE etc. would succeed as long as they're able
>>> to update the view of the table in terms of GPA.  But when you shadow
>>> those into the HPA tables, any entries which can't be translated you
>>> just replace with a cleared entry.
>>
>> Literally with zero? Silently? WARN_ON_ONCE?
> 
> Well, with a no-permission TCE, which might as well be zero, yes.
> 
> WARN_ON_ONCE() is probably a good idea.
> 
>>> That should be enough to protect
>>> the host.  Obviously you can expect the device to fail when you
>>> actually attempt to DMA there, but that's the guest's (or qemu's) own
>>> fault for putting bad addresses in the TCE table.
>>>
>>> Obviously that might not be great for debugging, since mappings will
>>> appear to succeed, but then not work later on.
>>>
>>> This does have the nice property that it's reasonably obvious what to
>>> do if you have some GPA mappings for emulated devices, then hotplug a
>>> VFIO device and at that point hit a gpa->hpa translation error.
>>> There's no hcall in this case, so there's no obvious way to return an
>>> error to the guest.
>>
>> Right. So if I do this, you would probably even ack this? :)
> 
> Assuming I don't spot some other showstopper...
> 
> Oh.. one thing to make sure you think about though: what happens if a
> guest makes some mappings, then there's a memory hotplug event which
> changes the set of valid GPAs?  In particular what if you hot unplug
> some memory which is mapped in a guest TCE table?  You might have to
> regenerate the HPA tables from the GPA table on hot unplug (unless you
> have a way of locking out an unplug event while that piece of guest
> ram is TCE mapped).


The guest is expected to clear the TCE table. Then QEMU will delete the
memory regions, which will trigger unregistration of the previously
registered memory; if the guest failed to clear the TCE table, these
preregistered pages will remain pinned. This is what mm_iommu_mapped_inc/dec
is about.
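
The idea behind that counter, roughly (a sketch with made-up names, not the
real mmu_context_iommu.c code; only the atomic64_* primitives are real):

	struct prereg_mem {
		atomic64_t mapped;	/* 1 == registered but not TCE-mapped */
	};

	/* A TCE starts pointing into the region: take a reference. */
	static bool prereg_mapped_inc(struct prereg_mem *mem)
	{
		/* Fails once unregistration has dropped the counter to 0. */
		return atomic64_inc_not_zero(&mem->mapped);
	}

	/* That TCE is cleared again: drop the reference. */
	static void prereg_mapped_dec(struct prereg_mem *mem)
	{
		/* Never go below 1 here; 1 -> 0 is reserved for unregistration. */
		atomic64_add_unless(&mem->mapped, -1, 1);
	}

	/* Unregistration, e.g. when QEMU deletes the memory region. */
	static long prereg_unregister(struct prereg_mem *mem)
	{
		if (atomic64_cmpxchg(&mem->mapped, 1, 0) != 1)
			return -EBUSY;	/* TCEs still reference the region */
		/* ... unpin and free ... */
		return 0;
	}

So a guest which leaves TCEs behind keeps mapped above 1 and unregistration
keeps failing until those TCEs are cleared or the QEMU process exits.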



>>>> To do b) I'll need:
>>>>
>>>> 1. to have a copy of TCEs from the guest in vcpu,
>>>
>>> I don't quite understand this.  You need a temporary copy, yes, but I
>>> don't see why it needs to be attached to the vcpu.
>>
>> It does not need to be, I just need a safe + static + lock-free place for
>> it as I do not want to do malloc() in the TCE handlers, and (in theory)
>> multiple CPUs can do concurrent TCE requests and I want to avoid locking,
>> especially in real mode.
> 
> Ah, right, it's the inability to malloc() that's the difficulty.  You
> could put it in the vcpu, or you could use a per-(host)-cpu area - you
> can't switch guests while in a realmode handler.


vcpu looks like a safe choice; it is just a bit annoying that each vcpu will
use 4K for something which most likely won't be used, though.
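
A rough sketch of that simpler variant of b), meant to replace the two loops
in the book3s_64_vio.c hunk quoted earlier: a new u64 tces[512] field in
kvm_vcpu_arch (that is the 4K mentioned above) is the assumed addition,
everything else is the existing handler context (tces, npages, entry, stt):

	for (i = 0; i < npages; ++i) {
		if (get_user(vcpu->arch.tces[i], tces + i)) {
			ret = H_TOO_HARD;
			goto unlock_exit;
		}
		vcpu->arch.tces[i] = be64_to_cpu(vcpu->arch.tces[i]);

		ret = kvmppc_tce_validate(stt, vcpu->arch.tces[i]);
		if (ret != H_SUCCESS)
			goto unlock_exit;
	}

	/*
	 * Only the staged copies are used from here on, so the guest cannot
	 * change a TCE between validation and kvmppc_tce_put(); npages cannot
	 * exceed 512 as the TCE list is a single 4K page.
	 */
	for (i = 0; i < npages; ++i)
		kvmppc_tce_put(stt, entry + i, vcpu->arch.tces[i]);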



>>
>>
>>>> I populate it via
>>>> get_user() to make sure they won't change;
>>>> 2. an array of userspace addresses translated from given TCEs; and in order
>>>> to make sure these addresses won't go away, I'll need to reference each
>>>> preregistered memory area via mm_iommu_mapped_inc().
>>>>
>>>> When I reach TCE#100, I'll have to revert the change, i.e. call
>>>> mm_iommu_mapped_dec().
>>>
>>> Ugh.. yeah, I think to do this sanely, what you'd have to do is copy
>>> the updated translations into a temp buffer.  Then you'd have to make more
>>> temp buffers to store the UA and HPA translations (although maybe you
>>> could overwrite/reuse the original temp buffer if you're careful).
>>> Then only if all of those succeed do you copy them into the real
>>> hardware tables.
>>>
>>> Which sounds like it might be kinda messy, at least in real mode.
>>
>> So is it worth it?
> 
> Option (a) is certainly looking better to me based on current
> information.
> 
>>>> So I will end up having 2 arrays in a vcpu and simpler reverting code.
>>>>
>>>>
>>>> Or I can do a simpler version of b) which would store guest TCEs in
>>>> kvm_vcpu_arch::tces[512] and use them after checking. If a malicious guest
>>>> does something bad and I return from H_PUT_TCE_INDIRECT in the middle of a
>>>> request, some preregistered regions will stay referenced till the guest is
>>>> killed or rebooted (and this will prevent memory from unregistering) - but
>>>> this means no harm to the host;
>>>
>>> Hrm.. that's not really true.  It's not the worst thing that can
>>> happen, but allowing the guest to permanently lock extra chunks of
>>> memory is a form of harm to the host.
>>
>>
>> These are the same preregistered chunks which are already locked. And the
>> lock is there till the QEMU process is dead. What will not be possible is
>> memory hot unplug.
> 
> Ah, ok, I see your point.  That's probably sufficient, but option (a)
> is still looking better.
> 
>>>> and with preregistered RAM, there is no
>>>> valid reason for H_PUT_TCE_INDIRECT to fail for a good guest.
>>>>
>>>>
>>>>
>>>> Which approach to pick?
>>>>
>>>>
>>>> LoPAPR says:
>>>> ===
>>>> If the TCE parameter represents the logical page address of a page that is
>>>> not valid for the calling partition, return
>>>> H_Parameter.
>>>> ===
>>>>
>>>>
>>>>
>>>>>>  
>>>>>>  		kvmppc_tce_put(stt, entry + i, tce);
>>>>>>  	}
>>>>>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>>>> index 918af76ab2b6..f8a54b7c788e 100644
>>>>>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>>>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>>>> @@ -237,7 +237,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>>>>  {
>>>>>>  	struct kvmppc_spapr_tce_table *stt;
>>>>>>  	long i, ret = H_SUCCESS;
>>>>>> -	unsigned long tces, entry, ua = 0;
>>>>>> +	unsigned long tces, entry, tce, ua = 0;
>>>>>>  	unsigned long *rmap = NULL;
>>>>>>  
>>>>>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>>>>>> @@ -279,11 +279,15 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>>>>  	}
>>>>>>  
>>>>>>  	for (i = 0; i < npages; ++i) {
>>>>>> -		unsigned long tce = be64_to_cpu(((u64 *)tces)[i]);
>>>>>> +		tce = be64_to_cpu(((u64 *)tces)[i]);
>>>>>>  
>>>>>>  		ret = kvmppc_tce_validate(stt, tce);
>>>>>>  		if (ret != H_SUCCESS)
>>>>>>  			goto unlock_exit;
>>>>>> +	}
>>>>>> +
>>>>>> +	for (i = 0; i < npages; ++i) {
>>>>>> +		tce = be64_to_cpu(((u64 *)tces)[i]);
>>>>>
>>>>> Same problem here.
>>>>>
>>>>>>  
>>>>>>  		kvmppc_tce_put(stt, entry + i, tce);
>>>>>>  	}
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
> 
> 
> 
> 


-- 
Alexey


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 839 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2017-02-10  7:58 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-07  7:17 [PATCH kernel v4 00/10] powerpc/kvm/vfio: Enable in-kernel acceleration Alexey Kardashevskiy
2017-02-07  7:17 ` Alexey Kardashevskiy
2017-02-07  7:17 ` Alexey Kardashevskiy
2017-02-07  7:17 ` [PATCH kernel v4 01/10] powerpc/mmu: Add real mode support for IOMMU preregistered memory Alexey Kardashevskiy
2017-02-07  7:17   ` Alexey Kardashevskiy
2017-02-07  7:17 ` [PATCH kernel v4 02/10] powerpc/powernv/iommu: Add real mode version of iommu_table_ops::exchange() Alexey Kardashevskiy
2017-02-07  7:17   ` Alexey Kardashevskiy
2017-02-07  7:17 ` [PATCH kernel v4 03/10] powerpc/iommu/vfio_spapr_tce: Cleanup iommu_table disposal Alexey Kardashevskiy
2017-02-07  7:17   ` Alexey Kardashevskiy
2017-02-07  7:17 ` [PATCH kernel v4 04/10] powerpc/vfio_spapr_tce: Add reference counting to iommu_table Alexey Kardashevskiy
2017-02-07  7:17   ` Alexey Kardashevskiy
2017-02-07  7:17 ` [PATCH kernel v4 05/10] KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number Alexey Kardashevskiy
2017-02-07  7:17   ` Alexey Kardashevskiy
2017-02-07  7:17 ` [PATCH kernel v4 06/10] KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently Alexey Kardashevskiy
2017-02-07  7:17   ` Alexey Kardashevskiy
2017-02-07  7:17 ` [PATCH kernel v4 07/10] KVM: PPC: Pass kvm* to kvmppc_find_table() Alexey Kardashevskiy
2017-02-07  7:17   ` Alexey Kardashevskiy
2017-02-07  7:17 ` [PATCH kernel v4 08/10] KVM: PPC: Separate TCE validation from update Alexey Kardashevskiy
2017-02-07  7:17   ` Alexey Kardashevskiy
2017-02-09  3:51   ` David Gibson
2017-02-09  3:51     ` David Gibson
2017-02-09  3:51     ` David Gibson
2017-02-09  8:20     ` Alexey Kardashevskiy
2017-02-09  8:20       ` Alexey Kardashevskiy
2017-02-10  3:07       ` David Gibson
2017-02-10  3:07         ` David Gibson
2017-02-10  4:09         ` Alexey Kardashevskiy
2017-02-10  4:09           ` Alexey Kardashevskiy
2017-02-10  4:50           ` David Gibson
2017-02-10  4:50             ` David Gibson
2017-02-10  4:50             ` David Gibson
2017-02-10  7:58             ` Alexey Kardashevskiy
2017-02-10  7:58               ` Alexey Kardashevskiy
2017-02-07  7:17 ` [PATCH kernel v4 09/10] KVM: PPC: Use preregistered memory API to access TCE list Alexey Kardashevskiy
2017-02-07  7:17   ` Alexey Kardashevskiy
2017-02-09  4:00   ` David Gibson
2017-02-09  4:00     ` David Gibson
2017-02-09  4:00     ` David Gibson
2017-02-07  7:17 ` [PATCH kernel v4 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO Alexey Kardashevskiy
2017-02-07  7:17   ` Alexey Kardashevskiy
2017-02-09  6:41   ` David Gibson
2017-02-09  6:41     ` David Gibson
2017-02-09  6:41     ` David Gibson
2017-02-10  2:50     ` Alexey Kardashevskiy
2017-02-10  2:50       ` Alexey Kardashevskiy
2017-02-10  2:50       ` Alexey Kardashevskiy
2017-02-10  4:02       ` David Gibson
2017-02-10  4:02         ` David Gibson
2017-02-10  4:02         ` David Gibson
