* [PATCH kernel 0/9] KVM, PPC, VFIO: Enable in-kernel acceleration
From: Alexey Kardashevskiy @ 2016-03-07  3:41 UTC
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Paul Mackerras, Alex Williamson,
	David Gibson, kvm-ppc, kvm

This enables in-kernel acceleration of the H_PUT_TCE family of hypercalls
for pseries guests using VFIO. As pseries is a para-virtualized environment,
the guest sees and controls IOMMUs via special hypercalls which let it
add and remove mappings in the real hardware IOMMU.
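
For orientation, a minimal sketch of the guest side of such a mapping,
assuming the standard pseries hcall wrapper plpar_hcall_norets() and PAPR
H_PUT_TCE semantics; the wrapper name here is hypothetical:

/* Sketch: map one guest page for DMA through the para-virtualized IOMMU.
 * @liobn names the TCE table, @ioba is the offset into the DMA window and
 * the TCE value is the guest real address ORed with permission bits.
 */
static long example_map_page(unsigned long liobn, unsigned long ioba,
		unsigned long gpa)
{
	return plpar_hcall_norets(H_PUT_TCE, liobn, ioba,
			gpa | TCE_PCI_READ | TCE_PCI_WRITE);
}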

The previous posting was quite a long time ago, so I have reset the version
numbering and this respin is v1. These patches have been used successfully
in the PowerKVM product for quite a while now.

This is based on the "next" branch of
git://git.kernel.org/pub/scm/virt/kvm/kvm.git which already has the
"multi-tce in-kernel acceleration" and "64-bit in-kernel TCE" support.

Please comment. Thanks!


Alexey Kardashevskiy (9):
  KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number
  powerpc/mmu: Add real mode support for IOMMU preregistered memory
  KVM: PPC: Use preregistered memory API to access TCE list
  powerpc/powernv/iommu: Add real mode version of xchg()
  KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently
  KVM: PPC: Associate IOMMU group with guest view of TCE table
  KVM: PPC: Create virtual-mode only TCE table handlers
  KVM: PPC: Add in-kernel handling for VFIO
  KVM: PPC: VFIO device: support SPAPR TCE

 Documentation/virtual/kvm/devices/vfio.txt |  21 +-
 arch/powerpc/include/asm/iommu.h           |   7 +
 arch/powerpc/include/asm/kvm_host.h        |   8 +
 arch/powerpc/include/asm/kvm_ppc.h         |   6 +
 arch/powerpc/include/asm/mmu_context.h     |   6 +-
 arch/powerpc/kernel/iommu.c                |  15 ++
 arch/powerpc/kvm/Kconfig                   |   2 +
 arch/powerpc/kvm/Makefile                  |   5 +-
 arch/powerpc/kvm/book3s_64_vio.c           | 344 +++++++++++++++++++++++++++++
 arch/powerpc/kvm/book3s_64_vio_hv.c        | 280 +++++++++++++++++++++--
 arch/powerpc/kvm/book3s_hv_rmhandlers.S    |   4 +-
 arch/powerpc/kvm/powerpc.c                 |   1 +
 arch/powerpc/mm/mmu_context_iommu.c        |  45 +++-
 arch/powerpc/platforms/powernv/pci-ioda.c  |  28 ++-
 include/uapi/linux/kvm.h                   |  10 +
 virt/kvm/vfio.c                            | 106 +++++++++
 16 files changed, 855 insertions(+), 33 deletions(-)

-- 
2.5.0.rc3


* [PATCH kernel 1/9] KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number
From: Alexey Kardashevskiy @ 2016-03-07  3:41 UTC
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Paul Mackerras, Alex Williamson,
	David Gibson, kvm-ppc, kvm

This adds a capability number for in-kernel support for VFIO on
the sPAPR platform.

The capability tells user space whether the in-kernel H_PUT_TCE handlers
can handle VFIO-targeted requests. If they cannot, user space must not
allocate a TCE table in the host kernel via the KVM_CREATE_SPAPR_TCE
ioctl, because TCE requests would then not be passed to user space,
which is the desired behaviour in that situation.
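
For illustration only (not part of the patch), user space would probe the
capability with KVM_CHECK_EXTENSION before trying KVM_CREATE_SPAPR_TCE;
a minimal sketch, assuming kernel headers that define the new capability:

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Sketch: returns non-zero if the in-kernel H_PUT_TCE handlers can serve
 * VFIO-backed TCE tables, so creating one in the kernel is safe.
 */
int spapr_tce_vfio_supported(void)
{
	int kvm_fd = open("/dev/kvm", O_RDWR);
	int ret;

	if (kvm_fd < 0)
		return 0;

	ret = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_SPAPR_TCE_VFIO);
	close(kvm_fd);

	return ret > 0;
}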

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 include/uapi/linux/kvm.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index c251f06..080ffbf 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -863,6 +863,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_HYPERV_SYNIC 123
 #define KVM_CAP_S390_RI 124
 #define KVM_CAP_SPAPR_TCE_64 125
+#define KVM_CAP_SPAPR_TCE_VFIO 126
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.5.0.rc3


* [PATCH kernel 2/9] powerpc/mmu: Add real mode support for IOMMU preregistered memory
From: Alexey Kardashevskiy @ 2016-03-07  3:41 UTC
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Paul Mackerras, Alex Williamson,
	David Gibson, kvm-ppc, kvm

This makes the IOMMU memory lookup work in real mode by replacing
list_for_each_entry_rcu() (whose debug checks can fail in real mode)
with list_for_each_entry_lockless().

This adds a real mode version of mm_iommu_ua_to_hpa() which performs
an explicit vmalloc'd-to-linear address conversion.
Unlike mm_iommu_ua_to_hpa(), mm_iommu_rm_ua_to_hpa() can fail.

This changes mm_iommu_preregistered() to receive @mm because in real mode
@current does not always hold a correct pointer.

This adds a real mode version of mm_iommu_lookup() which receives @mm
(for the same reason as mm_iommu_preregistered()) and uses the lockless
list walk.
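
A minimal sketch of how a real mode caller is expected to combine the new
helpers (this mirrors their use later in the series; the function name and
the IOMMU_PAGE_SIZE_4K lookup size are illustrative):

/* Sketch: translate a guest user address to a host physical address in
 * real mode; this only works for memory that has been preregistered.
 */
static long example_rm_ua_to_hpa(mm_context_t *mm, unsigned long ua,
		unsigned long *hpa)
{
	struct mm_iommu_table_group_mem_t *mem;

	if (!mm_iommu_preregistered(mm))
		return -EFAULT;

	mem = mm_iommu_lookup_rm(mm, ua, IOMMU_PAGE_SIZE_4K);
	if (!mem)
		return -EFAULT;

	return mm_iommu_rm_ua_to_hpa(mem, ua, hpa);
}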

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/mmu_context.h |  6 ++++-
 arch/powerpc/mm/mmu_context_iommu.c    | 45 ++++++++++++++++++++++++++++++----
 2 files changed, 45 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index 878c277..3ba652a 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -18,7 +18,7 @@ extern void destroy_context(struct mm_struct *mm);
 #ifdef CONFIG_SPAPR_TCE_IOMMU
 struct mm_iommu_table_group_mem_t;
 
-extern bool mm_iommu_preregistered(void);
+extern bool mm_iommu_preregistered(mm_context_t *mm);
 extern long mm_iommu_get(unsigned long ua, unsigned long entries,
 		struct mm_iommu_table_group_mem_t **pmem);
 extern long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem);
@@ -26,10 +26,14 @@ extern void mm_iommu_init(mm_context_t *ctx);
 extern void mm_iommu_cleanup(mm_context_t *ctx);
 extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
 		unsigned long size);
+extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(mm_context_t *mm,
+		unsigned long ua, unsigned long size);
 extern struct mm_iommu_table_group_mem_t *mm_iommu_find(unsigned long ua,
 		unsigned long entries);
 extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
 		unsigned long ua, unsigned long *hpa);
+extern long mm_iommu_rm_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
+		unsigned long ua, unsigned long *hpa);
 extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem);
 extern void mm_iommu_mapped_dec(struct mm_iommu_table_group_mem_t *mem);
 #endif
diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index da6a216..aa1565d 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -63,12 +63,9 @@ static long mm_iommu_adjust_locked_vm(struct mm_struct *mm,
 	return ret;
 }
 
-bool mm_iommu_preregistered(void)
+bool mm_iommu_preregistered(mm_context_t *mm)
 {
-	if (!current || !current->mm)
-		return false;
-
-	return !list_empty(&current->mm->context.iommu_group_mem_list);
+	return !list_empty(&mm->iommu_group_mem_list);
 }
 EXPORT_SYMBOL_GPL(mm_iommu_preregistered);
 
@@ -231,6 +228,24 @@ unlock_exit:
 }
 EXPORT_SYMBOL_GPL(mm_iommu_put);
 
+struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(mm_context_t *mm,
+		unsigned long ua, unsigned long size)
+{
+	struct mm_iommu_table_group_mem_t *mem, *ret = NULL;
+
+	list_for_each_entry_lockless(mem, &mm->iommu_group_mem_list, next) {
+		if ((mem->ua <= ua) &&
+				(ua + size <= mem->ua +
+				 (mem->entries << PAGE_SHIFT))) {
+			ret = mem;
+			break;
+		}
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(mm_iommu_lookup_rm);
+
 struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
 		unsigned long size)
 {
@@ -284,6 +299,26 @@ long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
 }
 EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa);
 
+long mm_iommu_rm_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
+		unsigned long ua, unsigned long *hpa)
+{
+	const long entry = (ua - mem->ua) >> PAGE_SHIFT;
+	void *va = &mem->hpas[entry];
+	unsigned long *ra;
+
+	if (entry >= mem->entries)
+		return -EFAULT;
+
+	ra = (void *) vmalloc_to_phys(va);
+	if (!ra)
+		return -EFAULT;
+
+	*hpa = *ra | (ua & ~PAGE_MASK);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(mm_iommu_rm_ua_to_hpa);
+
 long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem)
 {
 	if (atomic64_inc_not_zero(&mem->mapped))
-- 
2.5.0.rc3


* [PATCH kernel 3/9] KVM: PPC: Use preregistered memory API to access TCE list
From: Alexey Kardashevskiy @ 2016-03-07  3:41 UTC
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Paul Mackerras, Alex Williamson,
	David Gibson, kvm-ppc, kvm

VFIO on sPAPR already implements guest memory pre-registration, during
which the entire guest RAM gets pinned. This can be used to translate
the physical address of the guest page containing the TCE list
passed to H_PUT_TCE_INDIRECT.

This makes use of the pre-registered memory API to access TCE list
pages in order to avoid unnecessary locking on the KVM memory
reverse map.
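
For context, a hedged sketch of the guest side of H_PUT_TCE_INDIRECT; the
TCE list page it passes is what the handler below has to translate. The
wrapper name is hypothetical; per PAPR a 4K list page holds at most 512
64-bit TCEs:

/* Sketch: publish a batch of TCEs with a single hypercall. @tce_list is
 * the guest real address of a 4K page holding @npages TCE values.
 */
static long example_put_tce_indirect(unsigned long liobn, unsigned long ioba,
		unsigned long tce_list, unsigned long npages)
{
	if (npages > 512)
		return H_PARAMETER;

	return plpar_hcall_norets(H_PUT_TCE_INDIRECT, liobn, ioba,
			tce_list, npages);
}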

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/kvm/book3s_64_vio_hv.c | 86 ++++++++++++++++++++++++++++++-------
 1 file changed, 70 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 44be73e..af155f6 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -180,6 +180,38 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
 EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
 
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+static mm_context_t *kvmppc_mm_context(struct kvm_vcpu *vcpu)
+{
+	struct task_struct *task;
+
+	task = vcpu->arch.run_task;
+	if (unlikely(!task || !task->mm))
+		return NULL;
+
+	return &task->mm->context;
+}
+
+static inline bool kvmppc_preregistered(struct kvm_vcpu *vcpu)
+{
+	mm_context_t *mm = kvmppc_mm_context(vcpu);
+
+	if (unlikely(!mm))
+		return false;
+
+	return mm_iommu_preregistered(mm);
+}
+
+static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
+		struct kvm_vcpu *vcpu, unsigned long ua, unsigned long size)
+{
+	mm_context_t *mm = kvmppc_mm_context(vcpu);
+
+	if (unlikely(!mm))
+		return NULL;
+
+	return mm_iommu_lookup_rm(mm, ua, size);
+}
+
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba, unsigned long tce)
 {
@@ -261,23 +293,44 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	if (ret != H_SUCCESS)
 		return ret;
 
-	if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
-		return H_TOO_HARD;
+	if (kvmppc_preregistered(vcpu)) {
+		/*
+		 * We get here if guest memory was pre-registered which
+		 * is normally VFIO case and gpa->hpa translation does not
+		 * depend on hpt.
+		 */
+		struct mm_iommu_table_group_mem_t *mem;
 
-	rmap = (void *) vmalloc_to_phys(rmap);
+		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
+			return H_TOO_HARD;
 
-	/*
-	 * Synchronize with the MMU notifier callbacks in
-	 * book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
-	 * While we have the rmap lock, code running on other CPUs
-	 * cannot finish unmapping the host real page that backs
-	 * this guest real page, so we are OK to access the host
-	 * real page.
-	 */
-	lock_rmap(rmap);
-	if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
-		ret = H_TOO_HARD;
-		goto unlock_exit;
+		mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K);
+		if (!mem || mm_iommu_rm_ua_to_hpa(mem, ua, &tces))
+			return H_TOO_HARD;
+	} else {
+		/*
+		 * This is emulated devices case.
+		 * We do not require memory to be preregistered in this case
+		 * so lock rmap and do __find_linux_pte_or_hugepte().
+		 */
+		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
+			return H_TOO_HARD;
+
+		rmap = (void *) vmalloc_to_phys(rmap);
+
+		/*
+		 * Synchronize with the MMU notifier callbacks in
+		 * book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
+		 * While we have the rmap lock, code running on other CPUs
+		 * cannot finish unmapping the host real page that backs
+		 * this guest real page, so we are OK to access the host
+		 * real page.
+		 */
+		lock_rmap(rmap);
+		if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
+			ret = H_TOO_HARD;
+			goto unlock_exit;
+		}
 	}
 
 	for (i = 0; i < npages; ++i) {
@@ -291,7 +344,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	}
 
 unlock_exit:
-	unlock_rmap(rmap);
+	if (rmap)
+		unlock_rmap(rmap);
 
 	return ret;
 }
-- 
2.5.0.rc3


* [PATCH kernel 4/9] powerpc/powernv/iommu: Add real mode version of xchg()
From: Alexey Kardashevskiy @ 2016-03-07  3:41 UTC
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Paul Mackerras, Alex Williamson,
	David Gibson, kvm-ppc, kvm

In real mode, TCE tables are invalidated using cache-inhibited store
instructions which differ from the ones used in virtual mode.

This defines and implements the exchange_rm() callback. It does not
define set_rm/clear_rm/flush_rm callbacks as there are no users for them -
exchange/exchange_rm are only to be used by KVM for VFIO.

The exchange_rm callback is defined for the IODA1/IODA2 powernv platforms.

This replaces list_for_each_entry_rcu with its lockless version as
from now on pnv_pci_ioda2_tce_invalidate() can be called in
real mode too.
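
A hedged sketch of the intended real mode user (KVM's TCE update path,
added later in the series); the function name is hypothetical:

/* Sketch: in real mode, install a new TCE and retire the old mapping.
 * iommu_tce_xchg_rm() writes the new value and returns the previous
 * @hpa/@dir so the caller can drop its references to the old page.
 */
static long example_rm_set_tce(struct iommu_table *tbl, unsigned long entry,
		unsigned long hpa, enum dma_data_direction dir)
{
	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
		return H_TOO_HARD;

	/* ...unreference the old @hpa here if @dir != DMA_NONE... */
	return H_SUCCESS;
}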

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/iommu.h          |  7 +++++++
 arch/powerpc/kernel/iommu.c               | 15 +++++++++++++++
 arch/powerpc/platforms/powernv/pci-ioda.c | 28 +++++++++++++++++++++++++++-
 3 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 7b87bab..3ca877a 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -64,6 +64,11 @@ struct iommu_table_ops {
 			long index,
 			unsigned long *hpa,
 			enum dma_data_direction *direction);
+	/* Real mode */
+	int (*exchange_rm)(struct iommu_table *tbl,
+			long index,
+			unsigned long *hpa,
+			enum dma_data_direction *direction);
 #endif
 	void (*clear)(struct iommu_table *tbl,
 			long index, long npages);
@@ -208,6 +213,8 @@ extern void iommu_del_device(struct device *dev);
 extern int __init tce_iommu_bus_notifier_init(void);
 extern long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
 		unsigned long *hpa, enum dma_data_direction *direction);
+extern long iommu_tce_xchg_rm(struct iommu_table *tbl, unsigned long entry,
+		unsigned long *hpa, enum dma_data_direction *direction);
 #else
 static inline void iommu_register_group(struct iommu_table_group *table_group,
 					int pci_domain_number,
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index a8e3490..2fcc48b 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1062,6 +1062,21 @@ void iommu_release_ownership(struct iommu_table *tbl)
 }
 EXPORT_SYMBOL_GPL(iommu_release_ownership);
 
+long iommu_tce_xchg_rm(struct iommu_table *tbl, unsigned long entry,
+		unsigned long *hpa, enum dma_data_direction *direction)
+{
+	long ret;
+
+	ret = tbl->it_ops->exchange_rm(tbl, entry, hpa, direction);
+
+	if (!ret && ((*direction == DMA_FROM_DEVICE) ||
+			(*direction == DMA_BIDIRECTIONAL)))
+		SetPageDirty(realmode_pfn_to_page(*hpa >> PAGE_SHIFT));
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_tce_xchg_rm);
+
 int iommu_add_device(struct device *dev)
 {
 	struct iommu_table *tbl;
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index c5baaf3..bed1944 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1791,6 +1791,18 @@ static int pnv_ioda1_tce_xchg(struct iommu_table *tbl, long index,
 
 	return ret;
 }
+
+static int pnv_ioda1_tce_xchg_rm(struct iommu_table *tbl, long index,
+		unsigned long *hpa, enum dma_data_direction *direction)
+{
+	long ret = pnv_tce_xchg(tbl, index, hpa, direction);
+
+	if (!ret && (tbl->it_type &
+			(TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE)))
+		pnv_pci_ioda1_tce_invalidate(tbl, index, 1, true);
+
+	return ret;
+}
 #endif
 
 static void pnv_ioda1_tce_free(struct iommu_table *tbl, long index,
@@ -1806,6 +1818,7 @@ static struct iommu_table_ops pnv_ioda1_iommu_ops = {
 	.set = pnv_ioda1_tce_build,
 #ifdef CONFIG_IOMMU_API
 	.exchange = pnv_ioda1_tce_xchg,
+	.exchange_rm = pnv_ioda1_tce_xchg_rm,
 #endif
 	.clear = pnv_ioda1_tce_free,
 	.get = pnv_tce_get,
@@ -1866,7 +1879,7 @@ static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
 {
 	struct iommu_table_group_link *tgl;
 
-	list_for_each_entry_rcu(tgl, &tbl->it_group_list, next) {
+	list_for_each_entry_lockless(tgl, &tbl->it_group_list, next) {
 		struct pnv_ioda_pe *npe;
 		struct pnv_ioda_pe *pe = container_of(tgl->table_group,
 				struct pnv_ioda_pe, table_group);
@@ -1918,6 +1931,18 @@ static int pnv_ioda2_tce_xchg(struct iommu_table *tbl, long index,
 
 	return ret;
 }
+
+static int pnv_ioda2_tce_xchg_rm(struct iommu_table *tbl, long index,
+		unsigned long *hpa, enum dma_data_direction *direction)
+{
+	long ret = pnv_tce_xchg(tbl, index, hpa, direction);
+
+	if (!ret && (tbl->it_type &
+			(TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE)))
+		pnv_pci_ioda2_tce_invalidate(tbl, index, 1, true);
+
+	return ret;
+}
 #endif
 
 static void pnv_ioda2_tce_free(struct iommu_table *tbl, long index,
@@ -1939,6 +1964,7 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = {
 	.set = pnv_ioda2_tce_build,
 #ifdef CONFIG_IOMMU_API
 	.exchange = pnv_ioda2_tce_xchg,
+	.exchange_rm = pnv_ioda2_tce_xchg_rm,
 #endif
 	.clear = pnv_ioda2_tce_free,
 	.get = pnv_tce_get,
-- 
2.5.0.rc3


* [PATCH kernel 5/9] KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently
From: Alexey Kardashevskiy @ 2016-03-07  3:41 UTC
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Paul Mackerras, Alex Williamson,
	David Gibson, kvm-ppc, kvm

It does not make much sense to have KVM on book3s-64 without the IOMMU
bits for PCI passthrough, as they cost little and allow VFIO to function
on book3s KVM.

Having IOMMU_API always enabled makes it unnecessary to have a lot of
"#ifdef IOMMU_API" in arch/powerpc/kvm/book3s_64_vio*. With those
ifdefs, only user space emulated devices (but not VFIO) could be
accelerated, which does not seem very useful.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/kvm/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index c2024ac..1059846 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -64,6 +64,7 @@ config KVM_BOOK3S_64
 	select KVM_BOOK3S_64_HANDLER
 	select KVM
 	select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
+	select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
 	---help---
 	  Support running unmodified book3s_64 and book3s_32 guest kernels
 	  in virtual machines on book3s_64 host processors.
-- 
2.5.0.rc3


* [PATCH kernel 6/9] KVM: PPC: Associate IOMMU group with guest view of TCE table
From: Alexey Kardashevskiy @ 2016-03-07  3:41 UTC
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Paul Mackerras, Alex Williamson,
	David Gibson, kvm-ppc, kvm

The existing in-kernel TCE table for emulated devices contains
guest physical addresses which are accessed by emulated devices.
Since we need to keep this information for VFIO devices too
in order to implement H_GET_TCE, we are reusing it.

This adds an IOMMU group list to kvmppc_spapr_tce_table. Each group
will have an iommu_table pointer.

This adds the kvm_spapr_tce_attach_iommu_group() helper and its detach
counterpart to manage the lists.

This puts a group reference when:
- the guest copy of the TCE table is destroyed, i.e. when the TCE table
fd is closed;
- kvm_spapr_tce_detach_iommu_group() is called from
the KVM_DEV_VFIO_GROUP_DEL ioctl handler on vfio-pci hot unplug
(added in the following patch).
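
A hedged sketch of the expected caller (the KVM VFIO device side added
later in the series); it assumes @grp was resolved from a VFIO group fd
and that the caller holds a reference on it:

/* Sketch: wire a VFIO IOMMU group into the guest view of a TCE table.
 * On success the reference on @grp is kept on the stt->groups list;
 * on failure the caller still owns the reference and must drop it.
 */
static long example_attach(struct kvm *kvm, unsigned long liobn,
		phys_addr_t start_addr, struct iommu_group *grp)
{
	long ret = kvm_spapr_tce_attach_iommu_group(kvm, liobn,
			start_addr, grp);

	if (ret)
		iommu_group_put(grp);

	return ret;
}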

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/kvm_host.h |   8 +++
 arch/powerpc/include/asm/kvm_ppc.h  |   6 ++
 arch/powerpc/kvm/book3s_64_vio.c    | 108 ++++++++++++++++++++++++++++++++++++
 3 files changed, 122 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 2e7c791..2c5c823 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -178,6 +178,13 @@ struct kvmppc_pginfo {
 	atomic_t refcnt;
 };
 
+struct kvmppc_spapr_tce_group {
+	struct list_head next;
+	struct rcu_head rcu;
+	struct iommu_group *refgrp;/* for reference counting only */
+	struct iommu_table *tbl;
+};
+
 struct kvmppc_spapr_tce_table {
 	struct list_head list;
 	struct kvm *kvm;
@@ -186,6 +193,7 @@ struct kvmppc_spapr_tce_table {
 	u32 page_shift;
 	u64 offset;		/* in pages */
 	u64 size;		/* window size in pages */
+	struct list_head groups;
 	struct page *pages[0];
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 2544eda..d1482dc 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -164,6 +164,12 @@ extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
 			struct kvm_memory_slot *memslot, unsigned long porder);
 extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
 
+extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm,
+				unsigned long liobn,
+				phys_addr_t start_addr,
+				struct iommu_group *grp);
+extern void kvm_spapr_tce_detach_iommu_group(struct kvm *kvm,
+				struct iommu_group *grp);
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce_64 *args);
 extern struct kvmppc_spapr_tce_table *kvmppc_find_table(
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 2c2d103..846d16d 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -27,6 +27,7 @@
 #include <linux/hugetlb.h>
 #include <linux/list.h>
 #include <linux/anon_inodes.h>
+#include <linux/iommu.h>
 
 #include <asm/tlbflush.h>
 #include <asm/kvm_ppc.h>
@@ -95,10 +96,18 @@ static void release_spapr_tce_table(struct rcu_head *head)
 	struct kvmppc_spapr_tce_table *stt = container_of(head,
 			struct kvmppc_spapr_tce_table, rcu);
 	unsigned long i, npages = kvmppc_tce_pages(stt->size);
+	struct kvmppc_spapr_tce_group *kg;
 
 	for (i = 0; i < npages; i++)
 		__free_page(stt->pages[i]);
 
+	while (!list_empty(&stt->groups)) {
+		kg = list_first_entry(&stt->groups,
+				struct kvmppc_spapr_tce_group, next);
+		list_del(&kg->next);
+		kfree(kg);
+	}
+
 	kfree(stt);
 }
 
@@ -129,9 +138,15 @@ static int kvm_spapr_tce_mmap(struct file *file, struct vm_area_struct *vma)
 static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
 {
 	struct kvmppc_spapr_tce_table *stt = filp->private_data;
+	struct kvmppc_spapr_tce_group *kg;
 
 	list_del_rcu(&stt->list);
 
+	list_for_each_entry_rcu(kg, &stt->groups, next)	{
+		iommu_group_put(kg->refgrp);
+		kg->refgrp = NULL;
+	}
+
 	kvm_put_kvm(stt->kvm);
 
 	kvmppc_account_memlimit(
@@ -146,6 +161,98 @@ static const struct file_operations kvm_spapr_tce_fops = {
 	.release	= kvm_spapr_tce_release,
 };
 
+extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm,
+				unsigned long liobn,
+				phys_addr_t start_addr,
+				struct iommu_group *grp)
+{
+	struct kvmppc_spapr_tce_table *stt = NULL;
+	struct iommu_table_group *table_group;
+	long i;
+	bool found = false;
+	struct kvmppc_spapr_tce_group *kg;
+	struct iommu_table *tbltmp;
+
+	/* Check this LIOBN hasn't been previously allocated */
+	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
+		if (stt->liobn == liobn) {
+			if ((stt->offset << stt->page_shift) != start_addr)
+				return -EINVAL;
+
+			found = true;
+			break;
+		}
+	}
+
+	if (!found)
+		return -ENODEV;
+
+	/* Find IOMMU group and table at @start_addr */
+	table_group = iommu_group_get_iommudata(grp);
+	if (!table_group)
+		return -EFAULT;
+
+	tbltmp = NULL;
+	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
+		struct iommu_table *tbl = table_group->tables[i];
+
+		if (!tbl)
+			continue;
+
+		if ((tbl->it_page_shift == stt->page_shift) &&
+				(tbl->it_offset == stt->offset)) {
+			tbltmp = tbl;
+			break;
+		}
+	}
+	if (!tbltmp)
+		return -ENODEV;
+
+	list_for_each_entry_rcu(kg, &stt->groups, next) {
+		if (kg->refgrp == grp)
+			return -EBUSY;
+	}
+
+	kg = kzalloc(sizeof(*kg), GFP_KERNEL);
+	kg->refgrp = grp;
+	kg->tbl = tbltmp;
+	list_add_rcu(&kg->next, &stt->groups);
+
+	return 0;
+}
+
+static void kvm_spapr_tce_put_group(struct rcu_head *head)
+{
+	struct kvmppc_spapr_tce_group *kg = container_of(head,
+			struct kvmppc_spapr_tce_group, rcu);
+
+	iommu_group_put(kg->refgrp);
+	kg->refgrp = NULL;
+	kfree(kg);
+}
+
+extern void kvm_spapr_tce_detach_iommu_group(struct kvm *kvm,
+				struct iommu_group *grp)
+{
+	struct kvmppc_spapr_tce_table *stt;
+	struct iommu_table_group *table_group;
+	struct kvmppc_spapr_tce_group *kg;
+
+	table_group = iommu_group_get_iommudata(grp);
+	if (!table_group)
+		return;
+
+	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
+		list_for_each_entry_rcu(kg, &stt->groups, next) {
+			if (kg->refgrp == grp) {
+				list_del_rcu(&kg->next);
+				call_rcu(&kg->rcu, kvm_spapr_tce_put_group);
+				break;
+			}
+		}
+	}
+}
+
 long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				   struct kvm_create_spapr_tce_64 *args)
 {
@@ -181,6 +288,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 	stt->offset = args->offset;
 	stt->size = size;
 	stt->kvm = kvm;
+	INIT_LIST_HEAD_RCU(&stt->groups);
 
 	for (i = 0; i < npages; i++) {
 		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
-- 
2.5.0.rc3


* [PATCH kernel 7/9] KVM: PPC: Create virtual-mode only TCE table handlers
From: Alexey Kardashevskiy @ 2016-03-07  3:41 UTC
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Paul Mackerras, Alex Williamson,
	David Gibson, kvm-ppc, kvm

In-kernel VFIO acceleration needs different handling in real and virtual
modes, which makes it hard to support both modes in the same handler.

This creates virtual-mode copies of kvmppc_h_stuff_tce and kvmppc_h_put_tce
and renames the real mode handlers to kvmppc_rm_h_stuff_tce and
kvmppc_rm_h_put_tce, matching the existing kvmppc_rm_h_put_tce_indirect.
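
For orientation, the real mode entries stay reachable via hcall_real_table
(updated below), while the virtual mode copies are reached from the C hcall
dispatcher; a hedged sketch of the latter, assuming the usual
kvmppc_pseries_do_hcall() style switch:

/* Sketch: virtual mode dispatch; the real mode table in
 * book3s_hv_rmhandlers.S now points at the _rm_ variants instead.
 */
static long example_dispatch(struct kvm_vcpu *vcpu, unsigned long req)
{
	long ret = H_FUNCTION;

	switch (req) {
	case H_PUT_TCE:
		ret = kvmppc_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
				kvmppc_get_gpr(vcpu, 5),
				kvmppc_get_gpr(vcpu, 6));
		break;
	case H_STUFF_TCE:
		ret = kvmppc_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
				kvmppc_get_gpr(vcpu, 5),
				kvmppc_get_gpr(vcpu, 6),
				kvmppc_get_gpr(vcpu, 7));
		break;
	}

	return ret;
}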

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/kvm/book3s_64_vio.c        | 52 +++++++++++++++++++++++++++++++++
 arch/powerpc/kvm/book3s_64_vio_hv.c     |  8 ++---
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |  4 +--
 3 files changed, 57 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 846d16d..7965fc7 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -317,6 +317,32 @@ fail:
 	return ret;
 }
 
+long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
+		      unsigned long ioba, unsigned long tce)
+{
+	struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
+	long ret;
+
+	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
+	/* 	    liobn, ioba, tce); */
+
+	if (!stt)
+		return H_TOO_HARD;
+
+	ret = kvmppc_ioba_validate(stt, ioba, 1);
+	if (ret != H_SUCCESS)
+		return ret;
+
+	ret = kvmppc_tce_validate(stt, tce);
+	if (ret != H_SUCCESS)
+		return ret;
+
+	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
+
+	return H_SUCCESS;
+}
+EXPORT_SYMBOL_GPL(kvmppc_h_put_tce);
+
 long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 		unsigned long liobn, unsigned long ioba,
 		unsigned long tce_list, unsigned long npages)
@@ -372,3 +398,29 @@ unlock_exit:
 	return ret;
 }
 EXPORT_SYMBOL_GPL(kvmppc_h_put_tce_indirect);
+
+long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	struct kvmppc_spapr_tce_table *stt;
+	long i, ret;
+
+	stt = kvmppc_find_table(vcpu, liobn);
+	if (!stt)
+		return H_TOO_HARD;
+
+	ret = kvmppc_ioba_validate(stt, ioba, npages);
+	if (ret != H_SUCCESS)
+		return ret;
+
+	/* Check permission bits only to allow userspace poison TCE for debug */
+	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
+		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
+
+	return H_SUCCESS;
+}
+EXPORT_SYMBOL_GPL(kvmppc_h_stuff_tce);
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index af155f6..11163ae 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -212,8 +212,8 @@ static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
 	return mm_iommu_lookup_rm(mm, ua, size);
 }
 
-long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
-		      unsigned long ioba, unsigned long tce)
+long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
+		unsigned long ioba, unsigned long tce)
 {
 	struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
 	long ret;
@@ -236,7 +236,6 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 
 	return H_SUCCESS;
 }
-EXPORT_SYMBOL_GPL(kvmppc_h_put_tce);
 
 static long kvmppc_rm_ua_to_hpa(struct kvm_vcpu *vcpu,
 		unsigned long ua, unsigned long *phpa)
@@ -350,7 +349,7 @@ unlock_exit:
 	return ret;
 }
 
-long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
+long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 		unsigned long liobn, unsigned long ioba,
 		unsigned long tce_value, unsigned long npages)
 {
@@ -374,7 +373,6 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 
 	return H_SUCCESS;
 }
-EXPORT_SYMBOL_GPL(kvmppc_h_stuff_tce);
 
 long kvmppc_h_get_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba)
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index ed16182..d6dad2c 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1928,7 +1928,7 @@ hcall_real_table:
 	.long	DOTSYM(kvmppc_h_clear_ref) - hcall_real_table
 	.long	DOTSYM(kvmppc_h_protect) - hcall_real_table
 	.long	DOTSYM(kvmppc_h_get_tce) - hcall_real_table
-	.long	DOTSYM(kvmppc_h_put_tce) - hcall_real_table
+	.long	DOTSYM(kvmppc_rm_h_put_tce) - hcall_real_table
 	.long	0		/* 0x24 - H_SET_SPRG0 */
 	.long	DOTSYM(kvmppc_h_set_dabr) - hcall_real_table
 	.long	0		/* 0x2c */
@@ -2006,7 +2006,7 @@ hcall_real_table:
 	.long	0		/* 0x12c */
 	.long	0		/* 0x130 */
 	.long	DOTSYM(kvmppc_h_set_xdabr) - hcall_real_table
-	.long	DOTSYM(kvmppc_h_stuff_tce) - hcall_real_table
+	.long	DOTSYM(kvmppc_rm_h_stuff_tce) - hcall_real_table
 	.long	DOTSYM(kvmppc_rm_h_put_tce_indirect) - hcall_real_table
 	.long	0		/* 0x140 */
 	.long	0		/* 0x144 */
-- 
2.5.0.rc3


^ permalink raw reply related	[flat|nested] 92+ messages in thread

* [PATCH kernel 8/9] KVM: PPC: Add in-kernel handling for VFIO
  2016-03-07  3:41 ` Alexey Kardashevskiy
@ 2016-03-07  3:41   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 92+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-07  3:41 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Paul Mackerras, Alex Williamson,
	David Gibson, kvm-ppc, kvm

This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
without passing them to user space, which saves the time spent on
switching to user space and back.

Both real and virtual modes are supported. The kernel first tries to
handle a TCE request in real mode; if that fails, it passes the request
to the virtual-mode handler to complete the operation. If the
virtual-mode handler fails as well, the request is passed to user space,
although this is not expected to ever happen.
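
Sketched as control flow (an editorial illustration, not code from this
series; H_TOO_HARD is the existing "cannot handle here" return value):

	long ret;

	ret = kvmppc_rm_h_put_tce(vcpu, liobn, ioba, tce);	/* real mode */
	if (ret == H_TOO_HARD)
		ret = kvmppc_h_put_tce(vcpu, liobn, ioba, tce);	/* virtual mode */
	if (ret == H_TOO_HARD)
		;	/* resume in user space (e.g. QEMU) to emulate the hcall */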

The first user of this is VFIO on POWER. Trampolines to the VFIO external
user API functions are required for this patch.

This uses a VFIO KVM device to associate a logical bus number (LIOBN)
with a VFIO IOMMU group fd and enables in-kernel handling of map/unmap
requests.

To make use of the feature, user space has to create a guest view
of the TCE table via KVM_CAP_SPAPR_TCE/KVM_CAP_SPAPR_TCE_64 and
then associate a LIOBN with this table via the VFIO KVM device, using
the KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN property (which is added in
the next patch).
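
A minimal user space sketch of the first step (the LIOBN and window
geometry are made-up values; KVM_CREATE_SPAPR_TCE_64 and its argument
struct come from the already-merged 64-bit TCE support):

	struct kvm_create_spapr_tce_64 args = {
		.liobn = 0x80000000,		/* made-up LIOBN */
		.page_shift = 12,		/* 4K IOMMU pages */
		.offset = 0,			/* bus offset, in pages */
		.size = 0x40000000 >> 12,	/* 1GB window, in pages */
	};
	int tablefd = ioctl(vmfd, KVM_CREATE_SPAPR_TCE_64, &args);
	/* then associate the LIOBN with a VFIO group via the KVM VFIO
	 * device, see KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN below */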

Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb Ethernet card).

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/kvm/book3s_64_vio.c    | 184 +++++++++++++++++++++++++++++++++++
 arch/powerpc/kvm/book3s_64_vio_hv.c | 186 ++++++++++++++++++++++++++++++++++++
 2 files changed, 370 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 7965fc7..9417d12 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -33,6 +33,7 @@
 #include <asm/kvm_ppc.h>
 #include <asm/kvm_book3s.h>
 #include <asm/mmu-hash64.h>
+#include <asm/mmu_context.h>
 #include <asm/hvcall.h>
 #include <asm/synch.h>
 #include <asm/ppc-opcode.h>
@@ -317,11 +318,161 @@ fail:
 	return ret;
 }
 
+static long kvmppc_tce_iommu_mapped_dec(struct iommu_table *tbl,
+		unsigned long entry)
+{
+	struct mm_iommu_table_group_mem_t *mem = NULL;
+	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
+	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+
+	if (!pua)
+		return H_HARDWARE;
+
+	mem = mm_iommu_lookup(*pua, pgsize);
+	if (!mem)
+		return H_HARDWARE;
+
+	mm_iommu_mapped_dec(mem);
+
+	*pua = 0;
+
+	return H_SUCCESS;
+}
+
+static long kvmppc_tce_iommu_unmap(struct iommu_table *tbl,
+		unsigned long entry)
+{
+	enum dma_data_direction dir = DMA_NONE;
+	unsigned long hpa = 0;
+
+	if (iommu_tce_xchg(tbl, entry, &hpa, &dir))
+		return H_HARDWARE;
+
+	if (dir == DMA_NONE)
+		return H_SUCCESS;
+
+	return kvmppc_tce_iommu_mapped_dec(tbl, entry);
+}
+
+long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
+		unsigned long entry, unsigned long gpa,
+		enum dma_data_direction dir)
+{
+	long ret;
+	unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+	struct mm_iommu_table_group_mem_t *mem;
+
+	if (!pua)
+		return H_HARDWARE;
+
+	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
+		return H_HARDWARE;
+
+	mem = mm_iommu_lookup(ua, 1ULL << tbl->it_page_shift);
+	if (!mem)
+		return H_HARDWARE;
+
+	if (mm_iommu_ua_to_hpa(mem, ua, &hpa))
+		return H_HARDWARE;
+
+	if (mm_iommu_mapped_inc(mem))
+		return H_HARDWARE;
+
+	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
+	if (ret) {
+		mm_iommu_mapped_dec(mem);
+		return H_TOO_HARD;
+	}
+
+	if (dir != DMA_NONE)
+		kvmppc_tce_iommu_mapped_dec(tbl, entry);
+
+	*pua = ua;
+
+	return 0;
+}
+
+long kvmppc_h_put_tce_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce)
+{
+	long idx, ret = H_HARDWARE;
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+	const unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+	const enum dma_data_direction dir = iommu_tce_direction(tce);
+
+	/* Clear TCE */
+	if (dir == DMA_NONE) {
+		if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
+			return H_PARAMETER;
+
+		return kvmppc_tce_iommu_unmap(tbl, entry);
+	}
+
+	/* Put TCE */
+	if (iommu_tce_put_param_check(tbl, ioba, tce))
+		return H_PARAMETER;
+
+	idx = srcu_read_lock(&vcpu->kvm->srcu);
+	ret = kvmppc_tce_iommu_map(vcpu->kvm, tbl, entry, gpa, dir);
+	srcu_read_unlock(&vcpu->kvm->srcu, idx);
+
+	return ret;
+}
+
+static long kvmppc_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long ioba,
+		u64 __user *tces, unsigned long npages)
+{
+	unsigned long i, ret, tce, gpa;
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+
+	for (i = 0; i < npages; ++i) {
+		gpa = be64_to_cpu(tces[i]) & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+		if (iommu_tce_put_param_check(tbl, ioba +
+				(i << tbl->it_page_shift), gpa))
+			return H_PARAMETER;
+	}
+
+	for (i = 0; i < npages; ++i) {
+		tce = be64_to_cpu(tces[i]);
+		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+		ret = kvmppc_tce_iommu_map(vcpu->kvm, tbl, entry + i, gpa,
+				iommu_tce_direction(tce));
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
+	return H_SUCCESS;
+}
+
+long kvmppc_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	unsigned long i;
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+
+	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i)
+		kvmppc_tce_iommu_unmap(tbl, entry + i);
+
+	return H_SUCCESS;
+}
+
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba, unsigned long tce)
 {
 	struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
 	long ret;
+	struct kvmppc_spapr_tce_group *kg;
+	struct iommu_table *tbltmp = NULL;
 
 	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
 	/* 	    liobn, ioba, tce); */
@@ -337,6 +488,15 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (ret != H_SUCCESS)
 		return ret;
 
+	list_for_each_entry_lockless(kg, &stt->groups, next) {
+		if (kg->tbl == tbltmp)
+			continue;
+		tbltmp = kg->tbl;
+		ret = kvmppc_h_put_tce_iommu(vcpu, kg->tbl, liobn, ioba, tce);
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
 	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
 
 	return H_SUCCESS;
@@ -351,6 +511,8 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	long i, ret = H_SUCCESS, idx;
 	unsigned long entry, ua = 0;
 	u64 __user *tces, tce;
+	struct kvmppc_spapr_tce_group *kg;
+	struct iommu_table *tbltmp = NULL;
 
 	stt = kvmppc_find_table(vcpu, liobn);
 	if (!stt)
@@ -378,6 +540,16 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	}
 	tces = (u64 __user *) ua;
 
+	list_for_each_entry_lockless(kg, &stt->groups, next) {
+		if (kg->tbl == tbltmp)
+			continue;
+		tbltmp = kg->tbl;
+		ret = kvmppc_h_put_tce_indirect_iommu(vcpu,
+				kg->tbl, ioba, tces, npages);
+		if (ret != H_SUCCESS)
+			goto unlock_exit;
+	}
+
 	for (i = 0; i < npages; ++i) {
 		if (get_user(tce, tces + i)) {
 			ret = H_TOO_HARD;
@@ -405,6 +577,8 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret;
+	struct kvmppc_spapr_tce_group *kg;
+	struct iommu_table *tbltmp = NULL;
 
 	stt = kvmppc_find_table(vcpu, liobn);
 	if (!stt)
@@ -418,6 +592,16 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
 		return H_PARAMETER;
 
+	list_for_each_entry_lockless(kg, &stt->groups, next) {
+		if (kg->tbl == tbltmp)
+			continue;
+		tbltmp = kg->tbl;
+		ret = kvmppc_h_stuff_tce_iommu(vcpu, kg->tbl, liobn, ioba,
+				tce_value, npages);
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
 	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
 		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
 
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 11163ae..6567d6c 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -26,6 +26,7 @@
 #include <linux/slab.h>
 #include <linux/hugetlb.h>
 #include <linux/list.h>
+#include <linux/iommu.h>
 
 #include <asm/tlbflush.h>
 #include <asm/kvm_ppc.h>
@@ -212,11 +213,162 @@ static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
 	return mm_iommu_lookup_rm(mm, ua, size);
 }
 
+static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	struct mm_iommu_table_group_mem_t *mem = NULL;
+	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
+	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+
+	if (!pua)
+		return H_SUCCESS;
+
+	pua = (void *) vmalloc_to_phys(pua);
+	if (!pua)
+		return H_SUCCESS;
+
+	mem = kvmppc_rm_iommu_lookup(vcpu, *pua, pgsize);
+	if (!mem)
+		return H_HARDWARE;
+
+	mm_iommu_mapped_dec(mem);
+
+	*pua = 0;
+
+	return H_SUCCESS;
+}
+
+static long kvmppc_rm_tce_iommu_unmap(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	enum dma_data_direction dir = DMA_NONE;
+	unsigned long hpa = 0;
+
+	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
+		return H_HARDWARE;
+
+	if (dir == DMA_NONE)
+		return H_SUCCESS;
+
+	return kvmppc_rm_tce_iommu_mapped_dec(vcpu, tbl, entry);
+}
+
+long kvmppc_rm_tce_iommu_map(struct kvm_vcpu *vcpu, struct iommu_table *tbl,
+		unsigned long entry, unsigned long gpa,
+		enum dma_data_direction dir)
+{
+	long ret;
+	unsigned long hpa = 0, ua;
+	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+	struct mm_iommu_table_group_mem_t *mem;
+
+	if (kvmppc_gpa_to_ua(vcpu->kvm, gpa, &ua, NULL))
+		return H_HARDWARE;
+
+	mem = kvmppc_rm_iommu_lookup(vcpu, ua, 1ULL << tbl->it_page_shift);
+	if (!mem)
+		return H_HARDWARE;
+
+	if (mm_iommu_rm_ua_to_hpa(mem, ua, &hpa))
+		return H_HARDWARE;
+
+	pua = (void *) vmalloc_to_phys(pua);
+	if (!pua)
+		return H_HARDWARE;
+
+	if (mm_iommu_mapped_inc(mem))
+		return H_HARDWARE;
+
+	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
+	if (ret) {
+		mm_iommu_mapped_dec(mem);
+		return H_TOO_HARD;
+	}
+
+	if (dir != DMA_NONE)
+		kvmppc_rm_tce_iommu_mapped_dec(vcpu, tbl, entry);
+
+	*pua = ua;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(kvmppc_rm_tce_iommu_map);
+
+static long kvmppc_rm_h_put_tce_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long liobn,
+		unsigned long ioba, unsigned long tce)
+{
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+	const unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+	const enum dma_data_direction dir = iommu_tce_direction(tce);
+
+	/* Clear TCE */
+	if (dir == DMA_NONE) {
+		if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
+			return H_PARAMETER;
+
+		return kvmppc_rm_tce_iommu_unmap(vcpu, tbl, entry);
+	}
+
+	/* Put TCE */
+	if (iommu_tce_put_param_check(tbl, ioba, gpa))
+		return H_PARAMETER;
+
+	return kvmppc_rm_tce_iommu_map(vcpu, tbl, entry, gpa, dir);
+}
+
+static long kvmppc_rm_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long ioba,
+		u64 *tces, unsigned long npages)
+{
+	unsigned long i, ret, tce, gpa;
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+
+	for (i = 0; i < npages; ++i) {
+		gpa = be64_to_cpu(tces[i]) & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+		if (iommu_tce_put_param_check(tbl, ioba +
+				(i << tbl->it_page_shift), gpa))
+			return H_PARAMETER;
+	}
+
+	for (i = 0; i < npages; ++i) {
+		tce = be64_to_cpu(tces[i]);
+		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+		ret = kvmppc_rm_tce_iommu_map(vcpu, tbl, entry + i, gpa,
+				iommu_tce_direction(tce));
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
+	return H_SUCCESS;
+}
+
+static long kvmppc_rm_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	unsigned long i;
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+
+	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i)
+		kvmppc_rm_tce_iommu_unmap(vcpu, tbl, entry + i);
+
+	return H_SUCCESS;
+}
+
 long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		unsigned long ioba, unsigned long tce)
 {
 	struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
 	long ret;
+	struct kvmppc_spapr_tce_group *kg;
+	struct iommu_table *tbltmp = NULL;
 
 	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
 	/* 	    liobn, ioba, tce); */
@@ -232,6 +384,16 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (ret != H_SUCCESS)
 		return ret;
 
+	list_for_each_entry_lockless(kg, &stt->groups, next) {
+		if (kg->tbl == tbltmp)
+			continue;
+		tbltmp = kg->tbl;
+		ret = kvmppc_rm_h_put_tce_iommu(vcpu, kg->tbl,
+				liobn, ioba, tce);
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
 	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
 
 	return H_SUCCESS;
@@ -272,6 +434,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	long i, ret = H_SUCCESS;
 	unsigned long tces, entry, ua = 0;
 	unsigned long *rmap = NULL;
+	struct iommu_table *tbltmp = NULL;
 
 	stt = kvmppc_find_table(vcpu, liobn);
 	if (!stt)
@@ -299,6 +462,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 		 * depend on hpt.
 		 */
 		struct mm_iommu_table_group_mem_t *mem;
+		struct kvmppc_spapr_tce_group *kg;
 
 		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
 			return H_TOO_HARD;
@@ -306,6 +470,16 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 		mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K);
 		if (!mem || mm_iommu_rm_ua_to_hpa(mem, ua, &tces))
 			return H_TOO_HARD;
+
+		list_for_each_entry_lockless(kg, &stt->groups, next) {
+			if (kg->tbl == tbltmp)
+				continue;
+			tbltmp = kg->tbl;
+			ret = kvmppc_rm_h_put_tce_indirect_iommu(vcpu,
+					kg->tbl, ioba, (u64 *)tces, npages);
+			if (ret != H_SUCCESS)
+				return ret;
+		}
 	} else {
 		/*
 		 * This is emulated devices case.
@@ -355,6 +529,8 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret;
+	struct kvmppc_spapr_tce_group *kg;
+	struct iommu_table *tbltmp = NULL;
 
 	stt = kvmppc_find_table(vcpu, liobn);
 	if (!stt)
@@ -368,6 +544,16 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
 		return H_PARAMETER;
 
+	list_for_each_entry_lockless(kg, &stt->groups, next) {
+		if (kg->tbl == tbltmp)
+			continue;
+		tbltmp = kg->tbl;
+		ret = kvmppc_rm_h_stuff_tce_iommu(vcpu, kg->tbl,
+				liobn, ioba, tce_value, npages);
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
 	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
 		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
 
-- 
2.5.0.rc3


^ permalink raw reply related	[flat|nested] 92+ messages in thread

* [PATCH kernel 9/9] KVM: PPC: VFIO device: support SPAPR TCE
  2016-03-07  3:41 ` Alexey Kardashevskiy
@ 2016-03-07  3:41   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 92+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-07  3:41 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Paul Mackerras, Alex Williamson,
	David Gibson, kvm-ppc, kvm

The sPAPR TCE IOMMU is para-virtualized: the guest does map/unmap
via hypercalls which take a logical bus id (LIOBN) as the target IOMMU
identifier. LIOBNs are made up, advertised to guest systems and
linked to IOMMU groups by user space.
In order to enable acceleration of IOMMU operations in KVM, we need
to tell KVM about the LIOBN-to-group mapping.

For that, a new KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN parameter
is added which accepts:
- a VFIO group fd and IO base address to find the actual hardware
TCE table;
- a LIOBN to assign to the found table.

Before notifying KVM about the new link, this checks that the group is
registered with the KVM VFIO device so that the groups can be released
if KVM terminates unexpectedly.

This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to user
space.

While we are here, this also fixes the VFIO KVM device build so that it
can link into the KVM module.
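
For illustration, setting the new attribute from user space would look
roughly like this (a sketch: groupfd, the LIOBN value and the device fd
obtained via KVM_CREATE_DEVICE with KVM_DEV_TYPE_VFIO are placeholders):

	struct kvm_vfio_spapr_tce_liobn param = {
		.argsz = sizeof(param),
		.fd = groupfd,			/* VFIO group fd */
		.liobn = 0x80000000,		/* made-up LIOBN */
		.start_addr = 0,		/* DMA window offset on the bus */
	};
	struct kvm_device_attr attr = {
		.group = KVM_DEV_VFIO_GROUP,
		.attr = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN,
		.addr = (__u64)(unsigned long)&param,
	};
	ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr);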

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 Documentation/virtual/kvm/devices/vfio.txt |  21 +++++-
 arch/powerpc/kvm/Kconfig                   |   1 +
 arch/powerpc/kvm/Makefile                  |   5 +-
 arch/powerpc/kvm/powerpc.c                 |   1 +
 include/uapi/linux/kvm.h                   |   9 +++
 virt/kvm/vfio.c                            | 106 +++++++++++++++++++++++++++++
 6 files changed, 140 insertions(+), 3 deletions(-)

diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
index ef51740..c0d3eb7 100644
--- a/Documentation/virtual/kvm/devices/vfio.txt
+++ b/Documentation/virtual/kvm/devices/vfio.txt
@@ -16,7 +16,24 @@ Groups:
 
 KVM_DEV_VFIO_GROUP attributes:
   KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
+	kvm_device_attr.addr points to an int32_t file descriptor
+	for the VFIO group.
+
   KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
+	kvm_device_attr.addr points to an int32_t file descriptor
+	for the VFIO group.
 
-For each, kvm_device_attr.addr points to an int32_t file descriptor
-for the VFIO group.
+  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: sets a liobn for a VFIO group
+	kvm_device_attr.addr points to a struct:
+		struct kvm_vfio_spapr_tce_liobn {
+			__u32	argsz;
+			__s32	fd;
+			__u32	liobn;
+			__u8	pad[4];
+			__u64	start_addr;
+		};
+		where
+		@argsz is the size of kvm_vfio_spapr_tce_liobn;
+		@fd is a file descriptor for a VFIO group;
+		@liobn is a logical bus id to be associated with the group;
+		@start_addr is a DMA window offset on the IO (PCI) bus
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 1059846..dfa3488 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -65,6 +65,7 @@ config KVM_BOOK3S_64
 	select KVM
 	select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
 	select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
+	select KVM_VFIO if VFIO
 	---help---
 	  Support running unmodified book3s_64 and book3s_32 guest kernels
 	  in virtual machines on book3s_64 host processors.
diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
index 7f7b6d8..71f577c 100644
--- a/arch/powerpc/kvm/Makefile
+++ b/arch/powerpc/kvm/Makefile
@@ -8,7 +8,7 @@ ccflags-y := -Ivirt/kvm -Iarch/powerpc/kvm
 KVM := ../../../virt/kvm
 
 common-objs-y = $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
-		$(KVM)/eventfd.o $(KVM)/vfio.o
+		$(KVM)/eventfd.o
 
 CFLAGS_e500_mmu.o := -I.
 CFLAGS_e500_mmu_host.o := -I.
@@ -87,6 +87,9 @@ endif
 kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \
 	book3s_xics.o
 
+kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) += \
+	$(KVM)/vfio.o \
+
 kvm-book3s_64-module-objs += \
 	$(KVM)/kvm_main.o \
 	$(KVM)/eventfd.o \
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 19aa59b..63f188d 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -521,6 +521,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 #ifdef CONFIG_PPC_BOOK3S_64
 	case KVM_CAP_SPAPR_TCE:
 	case KVM_CAP_SPAPR_TCE_64:
+	case KVM_CAP_SPAPR_TCE_VFIO:
 	case KVM_CAP_PPC_ALLOC_HTAB:
 	case KVM_CAP_PPC_RTAS:
 	case KVM_CAP_PPC_FIXUP_HCALL:
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 080ffbf..f1abbea 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1056,6 +1056,7 @@ struct kvm_device_attr {
 #define  KVM_DEV_VFIO_GROUP			1
 #define   KVM_DEV_VFIO_GROUP_ADD			1
 #define   KVM_DEV_VFIO_GROUP_DEL			2
+#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN	3
 
 enum kvm_device_type {
 	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
@@ -1075,6 +1076,14 @@ enum kvm_device_type {
 	KVM_DEV_TYPE_MAX,
 };
 
+struct kvm_vfio_spapr_tce_liobn {
+	__u32	argsz;
+	__s32	fd;
+	__u32	liobn;
+	__u8	pad[4];
+	__u64	start_addr;
+};
+
 /*
  * ioctls for VM fds
  */
diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
index 1dd087d..87c771e 100644
--- a/virt/kvm/vfio.c
+++ b/virt/kvm/vfio.c
@@ -20,6 +20,10 @@
 #include <linux/vfio.h>
 #include "vfio.h"
 
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+#include <asm/kvm_ppc.h>
+#endif
+
 struct kvm_vfio_group {
 	struct list_head node;
 	struct vfio_group *vfio_group;
@@ -60,6 +64,22 @@ static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
 	symbol_put(vfio_group_put_external_user);
 }
 
+static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
+{
+	int (*fn)(struct vfio_group *);
+	int ret = -1;
+
+	fn = symbol_get(vfio_external_user_iommu_id);
+	if (!fn)
+		return ret;
+
+	ret = fn(vfio_group);
+
+	symbol_put(vfio_external_user_iommu_id);
+
+	return ret;
+}
+
 static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
 {
 	long (*fn)(struct vfio_group *, unsigned long);
@@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct kvm_device *dev)
 	mutex_unlock(&kv->lock);
 }
 
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
+		struct vfio_group *vfio_group)
+{
+	int group_id;
+	struct iommu_group *grp;
+
+	group_id = kvm_vfio_external_user_iommu_id(vfio_group);
+	grp = iommu_group_get_by_id(group_id);
+	if (grp) {
+		kvm_spapr_tce_detach_iommu_group(kvm, grp);
+		iommu_group_put(grp);
+	}
+}
+#endif
+
 static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
 {
 	struct kvm_vfio *kv = dev->private;
@@ -186,6 +222,10 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
 				continue;
 
 			list_del(&kvg->node);
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+			kvm_vfio_spapr_detach_iommu_group(dev->kvm,
+					kvg->vfio_group);
+#endif
 			kvm_vfio_group_put_external_user(kvg->vfio_group);
 			kfree(kvg);
 			ret = 0;
@@ -201,6 +241,69 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
 		kvm_vfio_update_coherency(dev);
 
 		return ret;
+
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: {
+		struct kvm_vfio_spapr_tce_liobn param;
+		unsigned long minsz;
+		struct kvm_vfio *kv = dev->private;
+		struct vfio_group *vfio_group;
+		struct kvm_vfio_group *kvg;
+		struct fd f;
+
+		minsz = offsetofend(struct kvm_vfio_spapr_tce_liobn,
+				start_addr);
+
+		if (copy_from_user(&param, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (param.argsz < minsz)
+			return -EINVAL;
+
+		f = fdget(param.fd);
+		if (!f.file)
+			return -EBADF;
+
+		vfio_group = kvm_vfio_group_get_external_user(f.file);
+		fdput(f);
+
+		if (IS_ERR(vfio_group))
+			return PTR_ERR(vfio_group);
+
+		ret = -ENOENT;
+
+		mutex_lock(&kv->lock);
+
+		list_for_each_entry(kvg, &kv->group_list, node) {
+			int group_id;
+			struct iommu_group *grp;
+
+			if (kvg->vfio_group != vfio_group)
+				continue;
+
+			group_id = kvm_vfio_external_user_iommu_id(
+					kvg->vfio_group);
+			grp = iommu_group_get_by_id(group_id);
+			if (!grp) {
+				ret = -EFAULT;
+				break;
+			}
+
+			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
+					param.liobn, param.start_addr,
+					grp);
+			if (ret)
+				iommu_group_put(grp);
+			break;
+		}
+
+		mutex_unlock(&kv->lock);
+
+		kvm_vfio_group_put_external_user(vfio_group);
+
+		return ret;
+	}
+#endif /* CONFIG_SPAPR_TCE_IOMMU */
 	}
 
 	return -ENXIO;
@@ -225,6 +328,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
 		switch (attr->attr) {
 		case KVM_DEV_VFIO_GROUP_ADD:
 		case KVM_DEV_VFIO_GROUP_DEL:
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN:
+#endif
 			return 0;
 		}
 
-- 
2.5.0.rc3


^ permalink raw reply related	[flat|nested] 92+ messages in thread

* Re: [PATCH kernel 1/9] KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number
  2016-03-07  3:41   ` Alexey Kardashevskiy
@ 2016-03-07  4:58     ` David Gibson
  0 siblings, 0 replies; 92+ messages in thread
From: David Gibson @ 2016-03-07  4:58 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Paul Mackerras, Alex Williamson, kvm-ppc, kvm

[-- Attachment #1: Type: text/plain, Size: 1347 bytes --]

On Mon, Mar 07, 2016 at 02:41:09PM +1100, Alexey Kardashevskiy wrote:
> This adds a capability number for in-kernel support for VFIO on
> SPAPR platform.
> 
> The capability will tell user space whether the in-kernel handlers of
> H_PUT_TCE can handle VFIO-targeted requests or not. If not, user space
> must not attempt allocating a TCE table in the host kernel via
> the KVM_CREATE_SPAPR_TCE KVM ioctl because in that case TCE requests
> will not be passed to user space, which is the desired action in
> that situation.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
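
The check described above would look like this from user space (an
editorial sketch; the VM-fd form of KVM_CHECK_EXTENSION is assumed):

	if (ioctl(vmfd, KVM_CHECK_EXTENSION, KVM_CAP_SPAPR_TCE_VFIO) <= 0)
		;	/* no in-kernel VFIO TCE handling: do not create an
			 * in-kernel TCE table for VFIO-backed LIOBNs, let
			 * user space emulate H_PUT_TCE instead */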

> ---
>  include/uapi/linux/kvm.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index c251f06..080ffbf 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -863,6 +863,7 @@ struct kvm_ppc_smmu_info {
>  #define KVM_CAP_HYPERV_SYNIC 123
>  #define KVM_CAP_S390_RI 124
>  #define KVM_CAP_SPAPR_TCE_64 125
> +#define KVM_CAP_SPAPR_TCE_VFIO 126
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH kernel 2/9] powerpc/mmu: Add real mode support for IOMMU preregistered memory
  2016-03-07  3:41   ` Alexey Kardashevskiy
@ 2016-03-07  5:30     ` David Gibson
  -1 siblings, 0 replies; 92+ messages in thread
From: David Gibson @ 2016-03-07  5:30 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Paul Mackerras, Alex Williamson, kvm-ppc, kvm

[-- Attachment #1: Type: text/plain, Size: 5050 bytes --]

On Mon, Mar 07, 2016 at 02:41:10PM +1100, Alexey Kardashevskiy wrote:
> This makes mm_iommu_lookup() work in real mode by replacing
> list_for_each_entry_rcu() (whose debug checks can fail in real mode)
> with list_for_each_entry_lockless().
> 
> This adds a realmode version of mm_iommu_ua_to_hpa() which performs
> an explicit vmalloc'd-to-linear address conversion.
> Unlike mm_iommu_ua_to_hpa(), mm_iommu_rm_ua_to_hpa() can fail.
> 
> This changes mm_iommu_preregistered() to receive @mm because in real
> mode @current does not always hold a valid pointer.

So, I'd generally expect a parameter called @mm to be an mm_struct *,
not a mm_context_t.

> 
> This adds a realmode version of mm_iommu_lookup() which receives @mm
> (for the same reason as mm_iommu_preregistered()) and uses the
> lockless variant of list_for_each_entry_rcu().
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>



> ---
>  arch/powerpc/include/asm/mmu_context.h |  6 ++++-
>  arch/powerpc/mm/mmu_context_iommu.c    | 45 ++++++++++++++++++++++++++++++----
>  2 files changed, 45 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
> index 878c277..3ba652a 100644
> --- a/arch/powerpc/include/asm/mmu_context.h
> +++ b/arch/powerpc/include/asm/mmu_context.h
> @@ -18,7 +18,7 @@ extern void destroy_context(struct mm_struct *mm);
>  #ifdef CONFIG_SPAPR_TCE_IOMMU
>  struct mm_iommu_table_group_mem_t;
>  
> -extern bool mm_iommu_preregistered(void);
> +extern bool mm_iommu_preregistered(mm_context_t *mm);
>  extern long mm_iommu_get(unsigned long ua, unsigned long entries,
>  		struct mm_iommu_table_group_mem_t **pmem);
>  extern long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem);
> @@ -26,10 +26,14 @@ extern void mm_iommu_init(mm_context_t *ctx);
>  extern void mm_iommu_cleanup(mm_context_t *ctx);
>  extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
>  		unsigned long size);
> +extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(mm_context_t *mm,
> +		unsigned long ua, unsigned long size);
>  extern struct mm_iommu_table_group_mem_t *mm_iommu_find(unsigned long ua,
>  		unsigned long entries);
>  extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
>  		unsigned long ua, unsigned long *hpa);
> +extern long mm_iommu_rm_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
> +		unsigned long ua, unsigned long *hpa);
>  extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem);
>  extern void mm_iommu_mapped_dec(struct mm_iommu_table_group_mem_t *mem);
>  #endif
> diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
> index da6a216..aa1565d 100644
> --- a/arch/powerpc/mm/mmu_context_iommu.c
> +++ b/arch/powerpc/mm/mmu_context_iommu.c
> @@ -63,12 +63,9 @@ static long mm_iommu_adjust_locked_vm(struct mm_struct *mm,
>  	return ret;
>  }
>  
> -bool mm_iommu_preregistered(void)
> +bool mm_iommu_preregistered(mm_context_t *mm)
>  {
> -	if (!current || !current->mm)
> -		return false;
> -
> -	return !list_empty(&current->mm->context.iommu_group_mem_list);
> +	return !list_empty(&mm->iommu_group_mem_list);
>  }
>  EXPORT_SYMBOL_GPL(mm_iommu_preregistered);
>  
> @@ -231,6 +228,24 @@ unlock_exit:
>  }
>  EXPORT_SYMBOL_GPL(mm_iommu_put);
>  
> +struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(mm_context_t *mm,
> +		unsigned long ua, unsigned long size)
> +{
> +	struct mm_iommu_table_group_mem_t *mem, *ret = NULL;

I think you could do with a comment here explaining why the lockless
traversal is safe.

> +	list_for_each_entry_lockless(mem, &mm->iommu_group_mem_list, next) {
> +		if ((mem->ua <= ua) &&
> +				(ua + size <= mem->ua +
> +				 (mem->entries << PAGE_SHIFT))) {
> +			ret = mem;
> +			break;
> +		}
> +	}
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(mm_iommu_lookup_rm);
> +
>  struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
>  		unsigned long size)
>  {
> @@ -284,6 +299,26 @@ long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
>  }
>  EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa);
>  
> +long mm_iommu_rm_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
> +		unsigned long ua, unsigned long *hpa)
> +{
> +	const long entry = (ua - mem->ua) >> PAGE_SHIFT;
> +	void *va = &mem->hpas[entry];
> +	unsigned long *ra;
> +
> +	if (entry >= mem->entries)
> +		return -EFAULT;
> +
> +	ra = (void *) vmalloc_to_phys(va);
> +	if (!ra)
> +		return -EFAULT;
> +
> +	*hpa = *ra | (ua & ~PAGE_MASK);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(mm_iommu_rm_ua_to_hpa);
> +
>  long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem)
>  {
>  	if (atomic64_inc_not_zero(&mem->mapped))

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 92+ messages in thread
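
To make the mm_iommu_rm_ua_to_hpa() arithmetic in the hunk above concrete, here is a small self-contained example. The 64K page size and all addresses are assumptions chosen for illustration, and the vmalloc_to_phys() indirection of the real code is elided (hpas[] is treated as holding physical addresses directly).

#include <stdio.h>

#define PAGE_SHIFT 16UL			/* assumption: 64K pages */
#define PAGE_MASK (~((1UL << PAGE_SHIFT) - 1))

int main(void)
{
	unsigned long mem_ua = 0x100000000UL;	/* preregistered region start */
	unsigned long ua = 0x100012345UL;	/* address to translate */
	unsigned long hpas[] = { 0x2000000000UL, 0x2000010000UL };

	/* the page index within the region selects the hpas[] entry... */
	unsigned long entry = (ua - mem_ua) >> PAGE_SHIFT;
	/* ...and the low bits of ua supply the offset within the page */
	unsigned long hpa = hpas[entry] | (ua & ~PAGE_MASK);

	printf("entry=%lu hpa=0x%lx\n", entry, hpa);	/* entry=1 hpa=0x2000012345 */
	return 0;
}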

* Re: [PATCH kernel 3/9] KVM: PPC: Use preregistered memory API to access TCE list
  2016-03-07  3:41   ` Alexey Kardashevskiy
@ 2016-03-07  6:00     ` David Gibson
  -1 siblings, 0 replies; 92+ messages in thread
From: David Gibson @ 2016-03-07  6:00 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Paul Mackerras, Alex Williamson, kvm-ppc, kvm

[-- Attachment #1: Type: text/plain, Size: 4573 bytes --]

On Mon, Mar 07, 2016 at 02:41:11PM +1100, Alexey Kardashevskiy wrote:
> VFIO on sPAPR already implements guest memory pre-registration
> when the entire guest RAM gets pinned. This can be used to translate
> the physical address of a guest page containing the TCE list
> from H_PUT_TCE_INDIRECT.
> 
> This makes use of the pre-registered memory API to access TCE list
> pages in order to avoid unnecessary locking on the KVM memory
> reverse map.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Ok.. so, what's the benefit of not having to lock the rmap?

> ---
>  arch/powerpc/kvm/book3s_64_vio_hv.c | 86 ++++++++++++++++++++++++++++++-------
>  1 file changed, 70 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> index 44be73e..af155f6 100644
> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -180,6 +180,38 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
>  EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
>  
>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> +static mm_context_t *kvmppc_mm_context(struct kvm_vcpu *vcpu)
> +{
> +	struct task_struct *task;
> +
> +	task = vcpu->arch.run_task;
> +	if (unlikely(!task || !task->mm))
> +		return NULL;
> +
> +	return &task->mm->context;
> +}
> +
> +static inline bool kvmppc_preregistered(struct kvm_vcpu *vcpu)
> +{
> +	mm_context_t *mm = kvmppc_mm_context(vcpu);
> +
> +	if (unlikely(!mm))
> +		return false;
> +
> +	return mm_iommu_preregistered(mm);
> +}
> +
> +static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
> +		struct kvm_vcpu *vcpu, unsigned long ua, unsigned long size)
> +{
> +	mm_context_t *mm = kvmppc_mm_context(vcpu);
> +
> +	if (unlikely(!mm))
> +		return NULL;
> +
> +	return mm_iommu_lookup_rm(mm, ua, size);
> +}
> +
>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		      unsigned long ioba, unsigned long tce)
>  {
> @@ -261,23 +293,44 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	if (ret != H_SUCCESS)
>  		return ret;
>  
> -	if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
> -		return H_TOO_HARD;
> +	if (kvmppc_preregistered(vcpu)) {
> +		/*
> +		 * We get here if guest memory was pre-registered which
> +		 * is normally VFIO case and gpa->hpa translation does not
> +		 * depend on hpt.
> +		 */
> +		struct mm_iommu_table_group_mem_t *mem;
>  
> -	rmap = (void *) vmalloc_to_phys(rmap);
> +		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
> +			return H_TOO_HARD;
>  
> -	/*
> -	 * Synchronize with the MMU notifier callbacks in
> -	 * book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
> -	 * While we have the rmap lock, code running on other CPUs
> -	 * cannot finish unmapping the host real page that backs
> -	 * this guest real page, so we are OK to access the host
> -	 * real page.
> -	 */
> -	lock_rmap(rmap);
> -	if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
> -		ret = H_TOO_HARD;
> -		goto unlock_exit;
> +		mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K);
> +		if (!mem || mm_iommu_rm_ua_to_hpa(mem, ua, &tces))
> +			return H_TOO_HARD;
> +	} else {
> +		/*
> +		 * This is emulated devices case.
> +		 * We do not require memory to be preregistered in this case
> +		 * so lock rmap and do __find_linux_pte_or_hugepte().
> +		 */
> +		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
> +			return H_TOO_HARD;
> +
> +		rmap = (void *) vmalloc_to_phys(rmap);
> +
> +		/*
> +		 * Synchronize with the MMU notifier callbacks in
> +		 * book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
> +		 * While we have the rmap lock, code running on other CPUs
> +		 * cannot finish unmapping the host real page that backs
> +		 * this guest real page, so we are OK to access the host
> +		 * real page.
> +		 */
> +		lock_rmap(rmap);
> +		if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
> +			ret = H_TOO_HARD;
> +			goto unlock_exit;
> +		}
>  	}
>  
>  	for (i = 0; i < npages; ++i) {
> @@ -291,7 +344,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	}
>  
>  unlock_exit:
> -	unlock_rmap(rmap);
> +	if (rmap)

I don't see where rmap is initialized to NULL in the case where it's
not being used.

> +		unlock_rmap(rmap);
>  
>  	return ret;
>  }

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 92+ messages in thread
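
The initialisation being asked about would follow the usual idiom: start the lock token as NULL so the common exit path can tell whether the lock was ever taken. Below is a generic sketch of that idiom with stand-in types and stubbed lock helpers -- an illustration of the implied fix, not the author's actual follow-up.

#include <stddef.h>

struct rmap;					/* opaque stand-in */
static void lock_rmap(struct rmap *r)   { (void)r; /* stub */ }
static void unlock_rmap(struct rmap *r) { (void)r; /* stub */ }

static long put_tce_indirect(int preregistered, struct rmap *entry)
{
	struct rmap *rmap = NULL;	/* stays NULL on the preregistered path */
	long ret = 0;

	if (!preregistered) {
		rmap = entry;
		lock_rmap(rmap);
		/* translate via the hash page table while the lock is held */
	}

	/* common TCE processing would go here */

	if (rmap)			/* only unlock what was locked */
		unlock_rmap(rmap);
	return ret;
}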

* Re: [PATCH kernel 4/9] powerpc/powernv/iommu: Add real mode version of xchg()
  2016-03-07  3:41   ` Alexey Kardashevskiy
@ 2016-03-07  6:05     ` David Gibson
  -1 siblings, 0 replies; 92+ messages in thread
From: David Gibson @ 2016-03-07  6:05 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Paul Mackerras, Alex Williamson, kvm-ppc, kvm

[-- Attachment #1: Type: text/plain, Size: 5742 bytes --]

On Mon, Mar 07, 2016 at 02:41:12PM +1100, Alexey Kardashevskiy wrote:
> In real mode, TCE tables are invalidated using cache-inhibited
> store instructions which differ from the ones used in
> the virtual mode.
> 
> This defines and implements the exchange_rm() callback. It does not
> define set_rm/clear_rm/flush_rm callbacks as there are no users for those -
> exchange/exchange_rm are only to be used by KVM for VFIO.
> 
> The exchange_rm callback is defined for IODA1/IODA2 powernv platforms.
> 
> This replaces list_for_each_entry_rcu with its lockless version as
> from now on pnv_pci_ioda2_tce_invalidate() can be called in
> real mode too.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  arch/powerpc/include/asm/iommu.h          |  7 +++++++
>  arch/powerpc/kernel/iommu.c               | 15 +++++++++++++++
>  arch/powerpc/platforms/powernv/pci-ioda.c | 28 +++++++++++++++++++++++++++-
>  3 files changed, 49 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> index 7b87bab..3ca877a 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -64,6 +64,11 @@ struct iommu_table_ops {
>  			long index,
>  			unsigned long *hpa,
>  			enum dma_data_direction *direction);
> +	/* Real mode */
> +	int (*exchange_rm)(struct iommu_table *tbl,
> +			long index,
> +			unsigned long *hpa,
> +			enum dma_data_direction *direction);
>  #endif
>  	void (*clear)(struct iommu_table *tbl,
>  			long index, long npages);
> @@ -208,6 +213,8 @@ extern void iommu_del_device(struct device *dev);
>  extern int __init tce_iommu_bus_notifier_init(void);
>  extern long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
>  		unsigned long *hpa, enum dma_data_direction *direction);
> +extern long iommu_tce_xchg_rm(struct iommu_table *tbl, unsigned long entry,
> +		unsigned long *hpa, enum dma_data_direction *direction);
>  #else
>  static inline void iommu_register_group(struct iommu_table_group *table_group,
>  					int pci_domain_number,
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index a8e3490..2fcc48b 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -1062,6 +1062,21 @@ void iommu_release_ownership(struct iommu_table *tbl)
>  }
>  EXPORT_SYMBOL_GPL(iommu_release_ownership);
>  
> +long iommu_tce_xchg_rm(struct iommu_table *tbl, unsigned long entry,
> +		unsigned long *hpa, enum dma_data_direction *direction)
> +{
> +	long ret;
> +
> +	ret = tbl->it_ops->exchange_rm(tbl, entry, hpa, direction);
> +
> +	if (!ret && ((*direction == DMA_FROM_DEVICE) ||
> +			(*direction == DMA_BIDIRECTIONAL)))
> +		SetPageDirty(realmode_pfn_to_page(*hpa >> PAGE_SHIFT));
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_tce_xchg_rm);

>  int iommu_add_device(struct device *dev)
>  {
>  	struct iommu_table *tbl;
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index c5baaf3..bed1944 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1791,6 +1791,18 @@ static int pnv_ioda1_tce_xchg(struct iommu_table *tbl, long index,
>  
>  	return ret;
>  }
> +
> +static int pnv_ioda1_tce_xchg_rm(struct iommu_table *tbl, long index,
> +		unsigned long *hpa, enum dma_data_direction *direction)
> +{
> +	long ret = pnv_tce_xchg(tbl, index, hpa, direction);
> +
> +	if (!ret && (tbl->it_type &
> +			(TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE)))
> +		pnv_pci_ioda1_tce_invalidate(tbl, index, 1, true);
> +
> +	return ret;
> +}
>  #endif

Both your _rm variants are identical to the non _rm versions.  Why not
just set the function pointer to the same thing, rather than copying
the whole function?

>  static void pnv_ioda1_tce_free(struct iommu_table *tbl, long index,
> @@ -1806,6 +1818,7 @@ static struct iommu_table_ops pnv_ioda1_iommu_ops = {
>  	.set = pnv_ioda1_tce_build,
>  #ifdef CONFIG_IOMMU_API
>  	.exchange = pnv_ioda1_tce_xchg,
> +	.exchange_rm = pnv_ioda1_tce_xchg_rm,
>  #endif
>  	.clear = pnv_ioda1_tce_free,
>  	.get = pnv_tce_get,
> @@ -1866,7 +1879,7 @@ static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
>  {
>  	struct iommu_table_group_link *tgl;
>  
> -	list_for_each_entry_rcu(tgl, &tbl->it_group_list, next) {
> +	list_for_each_entry_lockless(tgl, &tbl->it_group_list, next) {
>  		struct pnv_ioda_pe *npe;
>  		struct pnv_ioda_pe *pe = container_of(tgl->table_group,
>  				struct pnv_ioda_pe, table_group);
> @@ -1918,6 +1931,18 @@ static int pnv_ioda2_tce_xchg(struct iommu_table *tbl, long index,
>  
>  	return ret;
>  }
> +
> +static int pnv_ioda2_tce_xchg_rm(struct iommu_table *tbl, long index,
> +		unsigned long *hpa, enum dma_data_direction *direction)
> +{
> +	long ret = pnv_tce_xchg(tbl, index, hpa, direction);
> +
> +	if (!ret && (tbl->it_type &
> +			(TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE)))
> +		pnv_pci_ioda2_tce_invalidate(tbl, index, 1, true);
> +
> +	return ret;
> +}
>  #endif
>  
>  static void pnv_ioda2_tce_free(struct iommu_table *tbl, long index,
> @@ -1939,6 +1964,7 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = {
>  	.set = pnv_ioda2_tce_build,
>  #ifdef CONFIG_IOMMU_API
>  	.exchange = pnv_ioda2_tce_xchg,
> +	.exchange_rm = pnv_ioda2_tce_xchg_rm,
>  #endif
>  	.clear = pnv_ioda2_tce_free,
>  	.get = pnv_tce_get,

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 92+ messages in thread
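
The SetPageDirty() condition in iommu_tce_xchg_rm() encodes a general rule: if the DMA direction allows the device to write to memory, the backing page must be marked dirty once the mapping is exchanged. A sketch of the predicate (the helper name is made up for illustration):

#include <linux/dma-direction.h>
#include <linux/types.h>

/* Hypothetical helper: true when the device may have written to the
 * page, i.e. the page must be marked dirty before reclaim/migration. */
static inline bool dma_dir_writes_memory(enum dma_data_direction dir)
{
	return dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL;
}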

* Re: [PATCH kernel 6/9] KVM: PPC: Associate IOMMU group with guest view of TCE table
  2016-03-07  3:41   ` Alexey Kardashevskiy
@ 2016-03-07  6:25     ` David Gibson
  -1 siblings, 0 replies; 92+ messages in thread
From: David Gibson @ 2016-03-07  6:25 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Paul Mackerras, Alex Williamson, kvm-ppc, kvm

[-- Attachment #1: Type: text/plain, Size: 7711 bytes --]

On Mon, Mar 07, 2016 at 02:41:14PM +1100, Alexey Kardashevskiy wrote:
> The existing in-kernel TCE table for emulated devices contains
> guest physical addresses which are accessed by emulated devices.
> Since we need to keep this information for VFIO devices too
> in order to implement H_GET_TCE, we are reusing it.
> 
> This adds an IOMMU group list to kvmppc_spapr_tce_table. Each group
> will have an iommu_table pointer.
> 
> This adds kvm_spapr_tce_attach_iommu_group() helper and its detach
> counterpart to manage the lists.
> 
> This puts a group when:
> - the guest copy of the TCE table is destroyed when the TCE table fd
> is closed;
> - kvm_spapr_tce_detach_iommu_group() is called from
> the KVM_DEV_VFIO_GROUP_DEL ioctl handler in the case of vfio-pci
> hotunplug (will be added in the following patch).
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  arch/powerpc/include/asm/kvm_host.h |   8 +++
>  arch/powerpc/include/asm/kvm_ppc.h  |   6 ++
>  arch/powerpc/kvm/book3s_64_vio.c    | 108 ++++++++++++++++++++++++++++++++++++
>  3 files changed, 122 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index 2e7c791..2c5c823 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -178,6 +178,13 @@ struct kvmppc_pginfo {
>  	atomic_t refcnt;
>  };
>  
> +struct kvmppc_spapr_tce_group {
> +	struct list_head next;
> +	struct rcu_head rcu;
> +	struct iommu_group *refgrp;/* for reference counting only */
> +	struct iommu_table *tbl;
> +};
> +
>  struct kvmppc_spapr_tce_table {
>  	struct list_head list;
>  	struct kvm *kvm;
> @@ -186,6 +193,7 @@ struct kvmppc_spapr_tce_table {
>  	u32 page_shift;
>  	u64 offset;		/* in pages */
>  	u64 size;		/* window size in pages */
> +	struct list_head groups;
>  	struct page *pages[0];
>  };
>  
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> index 2544eda..d1482dc 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -164,6 +164,12 @@ extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
>  			struct kvm_memory_slot *memslot, unsigned long porder);
>  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
>  
> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm,
> +				unsigned long liobn,
> +				phys_addr_t start_addr,
> +				struct iommu_group *grp);
> +extern void kvm_spapr_tce_detach_iommu_group(struct kvm *kvm,
> +				struct iommu_group *grp);
>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  				struct kvm_create_spapr_tce_64 *args);
>  extern struct kvmppc_spapr_tce_table *kvmppc_find_table(
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> index 2c2d103..846d16d 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -27,6 +27,7 @@
>  #include <linux/hugetlb.h>
>  #include <linux/list.h>
>  #include <linux/anon_inodes.h>
> +#include <linux/iommu.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/kvm_ppc.h>
> @@ -95,10 +96,18 @@ static void release_spapr_tce_table(struct rcu_head *head)
>  	struct kvmppc_spapr_tce_table *stt = container_of(head,
>  			struct kvmppc_spapr_tce_table, rcu);
>  	unsigned long i, npages = kvmppc_tce_pages(stt->size);
> +	struct kvmppc_spapr_tce_group *kg;
>  
>  	for (i = 0; i < npages; i++)
>  		__free_page(stt->pages[i]);
>  
> +	while (!list_empty(&stt->groups)) {
> +		kg = list_first_entry(&stt->groups,
> +				struct kvmppc_spapr_tce_group, next);
> +		list_del(&kg->next);
> +		kfree(kg);
> +	}
> +
>  	kfree(stt);
>  }
>  
> @@ -129,9 +138,15 @@ static int kvm_spapr_tce_mmap(struct file *file, struct vm_area_struct *vma)
>  static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
>  {
>  	struct kvmppc_spapr_tce_table *stt = filp->private_data;
> +	struct kvmppc_spapr_tce_group *kg;
>  
>  	list_del_rcu(&stt->list);
>  
> +	list_for_each_entry_rcu(kg, &stt->groups, next)	{
> +		iommu_group_put(kg->refgrp);
> +		kg->refgrp = NULL;
> +	}

What's the reason for this kind of two-phase deletion?  Dereffing the
group here, and setting to NULL, then actually removing from the list above.


>  	kvm_put_kvm(stt->kvm);
>  
>  	kvmppc_account_memlimit(
> @@ -146,6 +161,98 @@ static const struct file_operations kvm_spapr_tce_fops = {
>  	.release	= kvm_spapr_tce_release,
>  };
>  
> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm,
> +				unsigned long liobn,
> +				phys_addr_t start_addr,
> +				struct iommu_group *grp)
> +{
> +	struct kvmppc_spapr_tce_table *stt = NULL;
> +	struct iommu_table_group *table_group;
> +	long i;
> +	bool found = false;
> +	struct kvmppc_spapr_tce_group *kg;
> +	struct iommu_table *tbltmp;
> +
> +	/* Check this LIOBN hasn't been previously allocated */

This comment does not appear to be correct.

> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
> +		if (stt->liobn == liobn) {
> +			if ((stt->offset << stt->page_shift) != start_addr)
> +				return -EINVAL;
> +
> +			found = true;
> +			break;
> +		}
> +	}
> +
> +	if (!found)
> +		return -ENODEV;
> +
> +	/* Find IOMMU group and table at @start_addr */
> +	table_group = iommu_group_get_iommudata(grp);
> +	if (!table_group)
> +		return -EFAULT;
> +
> +	tbltmp = NULL;
> +	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
> +		struct iommu_table *tbl = table_group->tables[i];
> +
> +		if (!tbl)
> +			continue;
> +
> +		if ((tbl->it_page_shift == stt->page_shift) &&
> +				(tbl->it_offset == stt->offset)) {
> +			tbltmp = tbl;
> +			break;
> +		}
> +	}
> +	if (!tbltmp)
> +		return -ENODEV;
> +
> +	list_for_each_entry_rcu(kg, &stt->groups, next) {
> +		if (kg->refgrp == grp)
> +			return -EBUSY;
> +	}
> +
> +	kg = kzalloc(sizeof(*kg), GFP_KERNEL);
> +	kg->refgrp = grp;
> +	kg->tbl = tbltmp;
> +	list_add_rcu(&kg->next, &stt->groups);
> +
> +	return 0;
> +}
> +
> +static void kvm_spapr_tce_put_group(struct rcu_head *head)
> +{
> +	struct kvmppc_spapr_tce_group *kg = container_of(head,
> +			struct kvmppc_spapr_tce_group, rcu);
> +
> +	iommu_group_put(kg->refgrp);
> +	kg->refgrp = NULL;
> +	kfree(kg);
> +}
> +
> +extern void kvm_spapr_tce_detach_iommu_group(struct kvm *kvm,
> +				struct iommu_group *grp)

Hrm.  attach takes an explicit liobn, but this one iterates over all
liobns.  Why the asymmetry?

> +{
> +	struct kvmppc_spapr_tce_table *stt;
> +	struct iommu_table_group *table_group;
> +	struct kvmppc_spapr_tce_group *kg;
> +
> +	table_group = iommu_group_get_iommudata(grp);
> +	if (!table_group)
> +		return;
> +
> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
> +		list_for_each_entry_rcu(kg, &stt->groups, next) {
> +			if (kg->refgrp == grp) {
> +				list_del_rcu(&kg->next);
> +				call_rcu(&kg->rcu, kvm_spapr_tce_put_group);
> +				break;
> +			}
> +		}
> +	}
> +}
> +
>  long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  				   struct kvm_create_spapr_tce_64 *args)
>  {
> @@ -181,6 +288,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  	stt->offset = args->offset;
>  	stt->size = size;
>  	stt->kvm = kvm;
> +	INIT_LIST_HEAD_RCU(&stt->groups);
>  
>  	for (i = 0; i < npages; i++) {
>  		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 92+ messages in thread
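
The call_rcu() path under discussion follows the standard RCU list-removal pattern: unlink the element first, then free it only after a grace period so that lockless readers never touch freed memory. A generic sketch with stand-in types, assuming readers traverse the list under rcu_read_lock():

#include <linux/kernel.h>
#include <linux/list.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct item {
	struct list_head next;
	struct rcu_head rcu;
};

static void item_free_rcu(struct rcu_head *head)
{
	kfree(container_of(head, struct item, rcu));
}

static void item_remove(struct item *it)
{
	list_del_rcu(&it->next);		/* readers may still see it */
	call_rcu(&it->rcu, item_free_rcu);	/* free after a grace period */
}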

* Re: [PATCH kernel 4/9] powerpc/powernv/iommu: Add real mode version of xchg()
  2016-03-07  6:05     ` David Gibson
@ 2016-03-07  7:32       ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 92+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-07  7:32 UTC (permalink / raw)
  To: David Gibson; +Cc: linuxppc-dev, Paul Mackerras, Alex Williamson, kvm-ppc, kvm

On 03/07/2016 05:05 PM, David Gibson wrote:
> On Mon, Mar 07, 2016 at 02:41:12PM +1100, Alexey Kardashevskiy wrote:
>> In real mode, TCE tables are invalidated using cache-inhibited
>> store instructions which differ from the ones used in
>> the virtual mode.
>>
>> This defines and implements the exchange_rm() callback. It does not
>> define set_rm/clear_rm/flush_rm callbacks as there are no users for those -
>> exchange/exchange_rm are only to be used by KVM for VFIO.
>>
>> The exchange_rm callback is defined for IODA1/IODA2 powernv platforms.
>>
>> This replaces list_for_each_entry_rcu with its lockless version as
>> from now on pnv_pci_ioda2_tce_invalidate() can be called in
>> real mode too.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>   arch/powerpc/include/asm/iommu.h          |  7 +++++++
>>   arch/powerpc/kernel/iommu.c               | 15 +++++++++++++++
>>   arch/powerpc/platforms/powernv/pci-ioda.c | 28 +++++++++++++++++++++++++++-
>>   3 files changed, 49 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
>> index 7b87bab..3ca877a 100644
>> --- a/arch/powerpc/include/asm/iommu.h
>> +++ b/arch/powerpc/include/asm/iommu.h
>> @@ -64,6 +64,11 @@ struct iommu_table_ops {
>>   			long index,
>>   			unsigned long *hpa,
>>   			enum dma_data_direction *direction);
>> +	/* Real mode */
>> +	int (*exchange_rm)(struct iommu_table *tbl,
>> +			long index,
>> +			unsigned long *hpa,
>> +			enum dma_data_direction *direction);
>>   #endif
>>   	void (*clear)(struct iommu_table *tbl,
>>   			long index, long npages);
>> @@ -208,6 +213,8 @@ extern void iommu_del_device(struct device *dev);
>>   extern int __init tce_iommu_bus_notifier_init(void);
>>   extern long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
>>   		unsigned long *hpa, enum dma_data_direction *direction);
>> +extern long iommu_tce_xchg_rm(struct iommu_table *tbl, unsigned long entry,
>> +		unsigned long *hpa, enum dma_data_direction *direction);
>>   #else
>>   static inline void iommu_register_group(struct iommu_table_group *table_group,
>>   					int pci_domain_number,
>> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
>> index a8e3490..2fcc48b 100644
>> --- a/arch/powerpc/kernel/iommu.c
>> +++ b/arch/powerpc/kernel/iommu.c
>> @@ -1062,6 +1062,21 @@ void iommu_release_ownership(struct iommu_table *tbl)
>>   }
>>   EXPORT_SYMBOL_GPL(iommu_release_ownership);
>>
>> +long iommu_tce_xchg_rm(struct iommu_table *tbl, unsigned long entry,
>> +		unsigned long *hpa, enum dma_data_direction *direction)
>> +{
>> +	long ret;
>> +
>> +	ret = tbl->it_ops->exchange_rm(tbl, entry, hpa, direction);
>> +
>> +	if (!ret && ((*direction == DMA_FROM_DEVICE) ||
>> +			(*direction == DMA_BIDIRECTIONAL)))
>> +		SetPageDirty(realmode_pfn_to_page(*hpa >> PAGE_SHIFT));
>> +
>> +	return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(iommu_tce_xchg_rm);
>
>>   int iommu_add_device(struct device *dev)
>>   {
>>   	struct iommu_table *tbl;
>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>> index c5baaf3..bed1944 100644
>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>> @@ -1791,6 +1791,18 @@ static int pnv_ioda1_tce_xchg(struct iommu_table *tbl, long index,
>>
>>   	return ret;
>>   }
>> +
>> +static int pnv_ioda1_tce_xchg_rm(struct iommu_table *tbl, long index,
>> +		unsigned long *hpa, enum dma_data_direction *direction)
>> +{
>> +	long ret = pnv_tce_xchg(tbl, index, hpa, direction);
>> +
>> +	if (!ret && (tbl->it_type &
>> +			(TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE)))
>> +		pnv_pci_ioda1_tce_invalidate(tbl, index, 1, true);
>> +
>> +	return ret;
>> +}
>>   #endif
>
> Both your _rm variants are identical to the non _rm versions.  Why not
> just set the function pointer to the same thing, rather than copying
> the whole function?


The last parameter - "rm" - to pnv_pci_ioda1_tce_invalidate() is different.


>
>>   static void pnv_ioda1_tce_free(struct iommu_table *tbl, long index,
>> @@ -1806,6 +1818,7 @@ static struct iommu_table_ops pnv_ioda1_iommu_ops = {
>>   	.set = pnv_ioda1_tce_build,
>>   #ifdef CONFIG_IOMMU_API
>>   	.exchange = pnv_ioda1_tce_xchg,
>> +	.exchange_rm = pnv_ioda1_tce_xchg_rm,
>>   #endif
>>   	.clear = pnv_ioda1_tce_free,
>>   	.get = pnv_tce_get,
>> @@ -1866,7 +1879,7 @@ static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
>>   {
>>   	struct iommu_table_group_link *tgl;
>>
>> -	list_for_each_entry_rcu(tgl, &tbl->it_group_list, next) {
>> +	list_for_each_entry_lockless(tgl, &tbl->it_group_list, next) {
>>   		struct pnv_ioda_pe *npe;
>>   		struct pnv_ioda_pe *pe = container_of(tgl->table_group,
>>   				struct pnv_ioda_pe, table_group);
>> @@ -1918,6 +1931,18 @@ static int pnv_ioda2_tce_xchg(struct iommu_table *tbl, long index,
>>
>>   	return ret;
>>   }
>> +
>> +static int pnv_ioda2_tce_xchg_rm(struct iommu_table *tbl, long index,
>> +		unsigned long *hpa, enum dma_data_direction *direction)
>> +{
>> +	long ret = pnv_tce_xchg(tbl, index, hpa, direction);
>> +
>> +	if (!ret && (tbl->it_type &
>> +			(TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE)))
>> +		pnv_pci_ioda2_tce_invalidate(tbl, index, 1, true);
>> +
>> +	return ret;
>> +}
>>   #endif
>>
>>   static void pnv_ioda2_tce_free(struct iommu_table *tbl, long index,
>> @@ -1939,6 +1964,7 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = {
>>   	.set = pnv_ioda2_tce_build,
>>   #ifdef CONFIG_IOMMU_API
>>   	.exchange = pnv_ioda2_tce_xchg,
>> +	.exchange_rm = pnv_ioda2_tce_xchg_rm,
>>   #endif
>>   	.clear = pnv_ioda2_tce_free,
>>   	.get = pnv_tce_get,
>


-- 
Alexey

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH kernel 6/9] KVM: PPC: Associate IOMMU group with guest view of TCE table
  2016-03-07  6:25     ` David Gibson
@ 2016-03-07  9:38       ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 92+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-07  9:38 UTC (permalink / raw)
  To: David Gibson; +Cc: linuxppc-dev, Paul Mackerras, Alex Williamson, kvm-ppc, kvm

On 03/07/2016 05:25 PM, David Gibson wrote:
> On Mon, Mar 07, 2016 at 02:41:14PM +1100, Alexey Kardashevskiy wrote:
>> The existing in-kernel TCE table for emulated devices contains
>> guest physical addresses which are accesses by emulated devices.
>> Since we need to keep this information for VFIO devices too
>> in order to implement H_GET_TCE, we are reusing it.
>>
>> This adds IOMMU group list to kvmppc_spapr_tce_table. Each group
>> will have an iommu_table pointer.
>>
>> This adds kvm_spapr_tce_attach_iommu_group() helper and its detach
>> counterpart to manage the lists.
>>
>> This puts a group when:
>> - guest copy of TCE table is destroyed when TCE table fd is closed;
>> - kvm_spapr_tce_detach_iommu_group() is called from
>> the KVM_DEV_VFIO_GROUP_DEL ioctl handler in the case of vfio-pci hotunplug
>> (will be added in the following patch).
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>   arch/powerpc/include/asm/kvm_host.h |   8 +++
>>   arch/powerpc/include/asm/kvm_ppc.h  |   6 ++
>>   arch/powerpc/kvm/book3s_64_vio.c    | 108 ++++++++++++++++++++++++++++++++++++
>>   3 files changed, 122 insertions(+)
>>
>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
>> index 2e7c791..2c5c823 100644
>> --- a/arch/powerpc/include/asm/kvm_host.h
>> +++ b/arch/powerpc/include/asm/kvm_host.h
>> @@ -178,6 +178,13 @@ struct kvmppc_pginfo {
>>   	atomic_t refcnt;
>>   };
>>
>> +struct kvmppc_spapr_tce_group {
>> +	struct list_head next;
>> +	struct rcu_head rcu;
>> +	struct iommu_group *refgrp;/* for reference counting only */
>> +	struct iommu_table *tbl;
>> +};
>> +
>>   struct kvmppc_spapr_tce_table {
>>   	struct list_head list;
>>   	struct kvm *kvm;
>> @@ -186,6 +193,7 @@ struct kvmppc_spapr_tce_table {
>>   	u32 page_shift;
>>   	u64 offset;		/* in pages */
>>   	u64 size;		/* window size in pages */
>> +	struct list_head groups;
>>   	struct page *pages[0];
>>   };
>>
>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
>> index 2544eda..d1482dc 100644
>> --- a/arch/powerpc/include/asm/kvm_ppc.h
>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
>> @@ -164,6 +164,12 @@ extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
>>   			struct kvm_memory_slot *memslot, unsigned long porder);
>>   extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
>>
>> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm,
>> +				unsigned long liobn,
>> +				phys_addr_t start_addr,
>> +				struct iommu_group *grp);
>> +extern void kvm_spapr_tce_detach_iommu_group(struct kvm *kvm,
>> +				struct iommu_group *grp);
>>   extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>   				struct kvm_create_spapr_tce_64 *args);
>>   extern struct kvmppc_spapr_tce_table *kvmppc_find_table(
>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
>> index 2c2d103..846d16d 100644
>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>> @@ -27,6 +27,7 @@
>>   #include <linux/hugetlb.h>
>>   #include <linux/list.h>
>>   #include <linux/anon_inodes.h>
>> +#include <linux/iommu.h>
>>
>>   #include <asm/tlbflush.h>
>>   #include <asm/kvm_ppc.h>
>> @@ -95,10 +96,18 @@ static void release_spapr_tce_table(struct rcu_head *head)
>>   	struct kvmppc_spapr_tce_table *stt = container_of(head,
>>   			struct kvmppc_spapr_tce_table, rcu);
>>   	unsigned long i, npages = kvmppc_tce_pages(stt->size);
>> +	struct kvmppc_spapr_tce_group *kg;
>>
>>   	for (i = 0; i < npages; i++)
>>   		__free_page(stt->pages[i]);
>>
>> +	while (!list_empty(&stt->groups)) {
>> +		kg = list_first_entry(&stt->groups,
>> +				struct kvmppc_spapr_tce_group, next);
>> +		list_del(&kg->next);
>> +		kfree(kg);
>> +	}
>> +
>>   	kfree(stt);
>>   }
>>
>> @@ -129,9 +138,15 @@ static int kvm_spapr_tce_mmap(struct file *file, struct vm_area_struct *vma)
>>   static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
>>   {
>>   	struct kvmppc_spapr_tce_table *stt = filp->private_data;
>> +	struct kvmppc_spapr_tce_group *kg;
>>
>>   	list_del_rcu(&stt->list);
>>
>> +	list_for_each_entry_rcu(kg, &stt->groups, next)	{
>> +		iommu_group_put(kg->refgrp);
>> +		kg->refgrp = NULL;
>> +	}
>
> What's the reason for this kind of two-phase deletion?  Dereffing the
> group here, and setting to NULL, then actually removing from the list above.

Well, this way I have only one RCU-delayed release_spapr_tce_table(). The 
other option would be to call for each @kg:
- list_del(&kg->next);
- call_rcu()

as release_spapr_tce_table() won't be able to delete them - they are not in 
the list anymore.

I suppose I can reuse kvm_spapr_tce_put_group(); this looks inaccurate...
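
Spelled out, that alternative would be roughly this (an untested sketch;
it assumes a second cursor @tmp declared next to @kg so entries can be
unlinked while walking):

	list_for_each_entry_safe(kg, tmp, &stt->groups, next) {
		list_del_rcu(&kg->next);
		/* puts kg->refgrp and kfree()s kg after a grace period */
		call_rcu(&kg->rcu, kvm_spapr_tce_put_group);
	}

i.e. one RCU callback per group instead of a single delayed release of
the whole table.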



>
>>   	kvm_put_kvm(stt->kvm);
>>
>>   	kvmppc_account_memlimit(
>> @@ -146,6 +161,98 @@ static const struct file_operations kvm_spapr_tce_fops = {
>>   	.release	= kvm_spapr_tce_release,
>>   };
>>
>> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm,
>> +				unsigned long liobn,
>> +				phys_addr_t start_addr,
>> +				struct iommu_group *grp)
>> +{
>> +	struct kvmppc_spapr_tce_table *stt = NULL;
>> +	struct iommu_table_group *table_group;
>> +	long i;
>> +	bool found = false;
>> +	struct kvmppc_spapr_tce_group *kg;
>> +	struct iommu_table *tbltmp;
>> +
>> +	/* Check this LIOBN hasn't been previously allocated */
>
> This comment does not appear to be correct.
>
>> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
>> +		if (stt->liobn == liobn) {
>> +			if ((stt->offset << stt->page_shift) != start_addr)
>> +				return -EINVAL;
>> +
>> +			found = true;
>> +			break;
>> +		}
>> +	}
>> +
>> +	if (!found)
>> +		return -ENODEV;
>> +
>> +	/* Find IOMMU group and table at @start_addr */
>> +	table_group = iommu_group_get_iommudata(grp);
>> +	if (!table_group)
>> +		return -EFAULT;
>> +
>> +	tbltmp = NULL;
>> +	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
>> +		struct iommu_table *tbl = table_group->tables[i];
>> +
>> +		if (!tbl)
>> +			continue;
>> +
>> +		if ((tbl->it_page_shift == stt->page_shift) &&
>> +				(tbl->it_offset == stt->offset)) {
>> +			tbltmp = tbl;
>> +			break;
>> +		}
>> +	}
>> +	if (!tbltmp)
>> +		return -ENODEV;
>> +
>> +	list_for_each_entry_rcu(kg, &stt->groups, next) {
>> +		if (kg->refgrp == grp)
>> +			return -EBUSY;
>> +	}
>> +
>> +	kg = kzalloc(sizeof(*kg), GFP_KERNEL);
>> +	kg->refgrp = grp;
>> +	kg->tbl = tbltmp;
>> +	list_add_rcu(&kg->next, &stt->groups);
>> +
>> +	return 0;
>> +}
>> +
>> +static void kvm_spapr_tce_put_group(struct rcu_head *head)
>> +{
>> +	struct kvmppc_spapr_tce_group *kg = container_of(head,
>> +			struct kvmppc_spapr_tce_group, rcu);
>> +
>> +	iommu_group_put(kg->refgrp);
>> +	kg->refgrp = NULL;
>> +	kfree(kg);
>> +}
>> +
>> +extern void kvm_spapr_tce_detach_iommu_group(struct kvm *kvm,
>> +				struct iommu_group *grp)
>
> Hrm.  attach takes an explicit liobn, but this one iterates over all
> liobns.  Why the asymmetry?


For attach(), LIOBN is specified in an additional (to VFIO KVM device's 
"add group") ioctl(). There is no need for "detach" ioctl() as we only want 
this detach() to happen when a group is removed from a container, and in 
this case the usual KVM_DEV_VFIO_GROUP_DEL is a good enough hint that we need 
to detach LIOBN. Since _DEL does not take LIOBN, here I have a loop.

I'll put this in the commit log next time.
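
For illustration, the deletion path in the VFIO KVM device ends up shaped
roughly like this (a sketch of the follow-up patch; the helper name here
is made up and the exact plumbing may differ):

static void kvm_vfio_spapr_tce_group_del(struct kvm_device *dev,
		struct iommu_group *grp)
{
	/* KVM_DEV_VFIO_GROUP_DEL passes only the group fd, no LIOBN */
	kvm_spapr_tce_detach_iommu_group(dev->kvm, grp);
	iommu_group_put(grp);
}

so detach has nothing but the IOMMU group to identify the mapping by, and
walking all LIOBNs is the only option.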


>
>> +{
>> +	struct kvmppc_spapr_tce_table *stt;
>> +	struct iommu_table_group *table_group;
>> +	struct kvmppc_spapr_tce_group *kg;
>> +
>> +	table_group = iommu_group_get_iommudata(grp);
>> +	if (!table_group)
>> +		return;
>> +
>> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
>> +		list_for_each_entry_rcu(kg, &stt->groups, next) {
>> +			if (kg->refgrp == grp) {
>> +				list_del_rcu(&kg->next);
>> +				call_rcu(&kg->rcu, kvm_spapr_tce_put_group);
>> +				break;
>> +			}
>> +		}
>> +	}
>> +}
>> +
>>   long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>   				   struct kvm_create_spapr_tce_64 *args)
>>   {
>> @@ -181,6 +288,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>   	stt->offset = args->offset;
>>   	stt->size = size;
>>   	stt->kvm = kvm;
>> +	INIT_LIST_HEAD_RCU(&stt->groups);
>>
>>   	for (i = 0; i < npages; i++) {
>>   		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
>


-- 
Alexey

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH kernel 4/9] powerpc/powernv/iommu: Add real mode version of xchg()
  2016-03-07  7:32       ` Alexey Kardashevskiy
@ 2016-03-08  4:50         ` David Gibson
  -1 siblings, 0 replies; 92+ messages in thread
From: David Gibson @ 2016-03-08  4:50 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Paul Mackerras, Alex Williamson, kvm-ppc, kvm

On Mon, Mar 07, 2016 at 06:32:23PM +1100, Alexey Kardashevskiy wrote:
> On 03/07/2016 05:05 PM, David Gibson wrote:
> >On Mon, Mar 07, 2016 at 02:41:12PM +1100, Alexey Kardashevskiy wrote:
> >>In real mode, TCE tables are invalidated using cache-inhibited
> >>store instructions that differ from the ones used in virtual mode.
> >>
> >>This defines and implements exchange_rm() callback. This does not
> >>define set_rm/clear_rm/flush_rm callbacks as there is no user for those -
> >>exchange/exchange_rm are only to be used by KVM for VFIO.
> >>
> >>The exchange_rm callback is defined for IODA1/IODA2 powernv platforms.
> >>
> >>This replaces list_for_each_entry_rcu with its lockless version as
> >>from now on pnv_pci_ioda2_tce_invalidate() can be called in
> >>the real mode too.
> >>
> >>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>---
> >>  arch/powerpc/include/asm/iommu.h          |  7 +++++++
> >>  arch/powerpc/kernel/iommu.c               | 15 +++++++++++++++
> >>  arch/powerpc/platforms/powernv/pci-ioda.c | 28 +++++++++++++++++++++++++++-
> >>  3 files changed, 49 insertions(+), 1 deletion(-)
> >>
> >>diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> >>index 7b87bab..3ca877a 100644
> >>--- a/arch/powerpc/include/asm/iommu.h
> >>+++ b/arch/powerpc/include/asm/iommu.h
> >>@@ -64,6 +64,11 @@ struct iommu_table_ops {
> >>  			long index,
> >>  			unsigned long *hpa,
> >>  			enum dma_data_direction *direction);
> >>+	/* Real mode */
> >>+	int (*exchange_rm)(struct iommu_table *tbl,
> >>+			long index,
> >>+			unsigned long *hpa,
> >>+			enum dma_data_direction *direction);
> >>  #endif
> >>  	void (*clear)(struct iommu_table *tbl,
> >>  			long index, long npages);
> >>@@ -208,6 +213,8 @@ extern void iommu_del_device(struct device *dev);
> >>  extern int __init tce_iommu_bus_notifier_init(void);
> >>  extern long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
> >>  		unsigned long *hpa, enum dma_data_direction *direction);
> >>+extern long iommu_tce_xchg_rm(struct iommu_table *tbl, unsigned long entry,
> >>+		unsigned long *hpa, enum dma_data_direction *direction);
> >>  #else
> >>  static inline void iommu_register_group(struct iommu_table_group *table_group,
> >>  					int pci_domain_number,
> >>diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> >>index a8e3490..2fcc48b 100644
> >>--- a/arch/powerpc/kernel/iommu.c
> >>+++ b/arch/powerpc/kernel/iommu.c
> >>@@ -1062,6 +1062,21 @@ void iommu_release_ownership(struct iommu_table *tbl)
> >>  }
> >>  EXPORT_SYMBOL_GPL(iommu_release_ownership);
> >>
> >>+long iommu_tce_xchg_rm(struct iommu_table *tbl, unsigned long entry,
> >>+		unsigned long *hpa, enum dma_data_direction *direction)
> >>+{
> >>+	long ret;
> >>+
> >>+	ret = tbl->it_ops->exchange_rm(tbl, entry, hpa, direction);
> >>+
> >>+	if (!ret && ((*direction == DMA_FROM_DEVICE) ||
> >>+			(*direction == DMA_BIDIRECTIONAL)))
> >>+		SetPageDirty(realmode_pfn_to_page(*hpa >> PAGE_SHIFT));
> >>+
> >>+	return ret;
> >>+}
> >>+EXPORT_SYMBOL_GPL(iommu_tce_xchg_rm);
> >
> >>  int iommu_add_device(struct device *dev)
> >>  {
> >>  	struct iommu_table *tbl;
> >>diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> >>index c5baaf3..bed1944 100644
> >>--- a/arch/powerpc/platforms/powernv/pci-ioda.c
> >>+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> >>@@ -1791,6 +1791,18 @@ static int pnv_ioda1_tce_xchg(struct iommu_table *tbl, long index,
> >>
> >>  	return ret;
> >>  }
> >>+
> >>+static int pnv_ioda1_tce_xchg_rm(struct iommu_table *tbl, long index,
> >>+		unsigned long *hpa, enum dma_data_direction *direction)
> >>+{
> >>+	long ret = pnv_tce_xchg(tbl, index, hpa, direction);
> >>+
> >>+	if (!ret && (tbl->it_type &
> >>+			(TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE)))
> >>+		pnv_pci_ioda1_tce_invalidate(tbl, index, 1, true);
> >>+
> >>+	return ret;
> >>+}
> >>  #endif
> >
> >Both your _rm variants are identical to the non _rm versions.  Why not
> >just set the function poiinter to the same thing, rather than copying
> >the whole function.
> 
> 
> The last parameter - "rm" - to pnv_pci_ioda1_tce_invalidate() is
> different.

Ah, missed that, sorry.

> 
> 
> >
> >>  static void pnv_ioda1_tce_free(struct iommu_table *tbl, long index,
> >>@@ -1806,6 +1818,7 @@ static struct iommu_table_ops pnv_ioda1_iommu_ops = {
> >>  	.set = pnv_ioda1_tce_build,
> >>  #ifdef CONFIG_IOMMU_API
> >>  	.exchange = pnv_ioda1_tce_xchg,
> >>+	.exchange_rm = pnv_ioda1_tce_xchg_rm,
> >>  #endif
> >>  	.clear = pnv_ioda1_tce_free,
> >>  	.get = pnv_tce_get,
> >>@@ -1866,7 +1879,7 @@ static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
> >>  {
> >>  	struct iommu_table_group_link *tgl;
> >>
> >>-	list_for_each_entry_rcu(tgl, &tbl->it_group_list, next) {
> >>+	list_for_each_entry_lockless(tgl, &tbl->it_group_list, next) {
> >>  		struct pnv_ioda_pe *npe;
> >>  		struct pnv_ioda_pe *pe = container_of(tgl->table_group,
> >>  				struct pnv_ioda_pe, table_group);
> >>@@ -1918,6 +1931,18 @@ static int pnv_ioda2_tce_xchg(struct iommu_table *tbl, long index,
> >>
> >>  	return ret;
> >>  }
> >>+
> >>+static int pnv_ioda2_tce_xchg_rm(struct iommu_table *tbl, long index,
> >>+		unsigned long *hpa, enum dma_data_direction *direction)
> >>+{
> >>+	long ret = pnv_tce_xchg(tbl, index, hpa, direction);
> >>+
> >>+	if (!ret && (tbl->it_type &
> >>+			(TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE)))
> >>+		pnv_pci_ioda2_tce_invalidate(tbl, index, 1, true);
> >>+
> >>+	return ret;
> >>+}
> >>  #endif
> >>
> >>  static void pnv_ioda2_tce_free(struct iommu_table *tbl, long index,
> >>@@ -1939,6 +1964,7 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = {
> >>  	.set = pnv_ioda2_tce_build,
> >>  #ifdef CONFIG_IOMMU_API
> >>  	.exchange = pnv_ioda2_tce_xchg,
> >>+	.exchange_rm = pnv_ioda2_tce_xchg_rm,
> >>  #endif
> >>  	.clear = pnv_ioda2_tce_free,
> >>  	.get = pnv_tce_get,
> >
> 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH kernel 6/9] KVM: PPC: Associate IOMMU group with guest view of TCE table
  2016-03-07  9:38       ` Alexey Kardashevskiy
@ 2016-03-08  4:55         ` David Gibson
  -1 siblings, 0 replies; 92+ messages in thread
From: David Gibson @ 2016-03-08  4:55 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Paul Mackerras, Alex Williamson, kvm-ppc, kvm

On Mon, Mar 07, 2016 at 08:38:13PM +1100, Alexey Kardashevskiy wrote:
> On 03/07/2016 05:25 PM, David Gibson wrote:
> >On Mon, Mar 07, 2016 at 02:41:14PM +1100, Alexey Kardashevskiy wrote:
> >>The existing in-kernel TCE table for emulated devices contains
> >>guest physical addresses which are accessed by emulated devices.
> >>Since we need to keep this information for VFIO devices too
> >>in order to implement H_GET_TCE, we are reusing it.
> >>
> >>This adds IOMMU group list to kvmppc_spapr_tce_table. Each group
> >>will have an iommu_table pointer.
> >>
> >>This adds kvm_spapr_tce_attach_iommu_group() helper and its detach
> >>counterpart to manage the lists.
> >>
> >>This puts a group when:
> >>- guest copy of TCE table is destroyed when TCE table fd is closed;
> >>- kvm_spapr_tce_detach_iommu_group() is called from
> >>the KVM_DEV_VFIO_GROUP_DEL ioctl handler in the case of vfio-pci hotunplug
> >>(will be added in the following patch).
> >>
> >>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>---
> >>  arch/powerpc/include/asm/kvm_host.h |   8 +++
> >>  arch/powerpc/include/asm/kvm_ppc.h  |   6 ++
> >>  arch/powerpc/kvm/book3s_64_vio.c    | 108 ++++++++++++++++++++++++++++++++++++
> >>  3 files changed, 122 insertions(+)
> >>
> >>diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> >>index 2e7c791..2c5c823 100644
> >>--- a/arch/powerpc/include/asm/kvm_host.h
> >>+++ b/arch/powerpc/include/asm/kvm_host.h
> >>@@ -178,6 +178,13 @@ struct kvmppc_pginfo {
> >>  	atomic_t refcnt;
> >>  };
> >>
> >>+struct kvmppc_spapr_tce_group {
> >>+	struct list_head next;
> >>+	struct rcu_head rcu;
> >>+	struct iommu_group *refgrp;/* for reference counting only */
> >>+	struct iommu_table *tbl;
> >>+};
> >>+
> >>  struct kvmppc_spapr_tce_table {
> >>  	struct list_head list;
> >>  	struct kvm *kvm;
> >>@@ -186,6 +193,7 @@ struct kvmppc_spapr_tce_table {
> >>  	u32 page_shift;
> >>  	u64 offset;		/* in pages */
> >>  	u64 size;		/* window size in pages */
> >>+	struct list_head groups;
> >>  	struct page *pages[0];
> >>  };
> >>
> >>diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> >>index 2544eda..d1482dc 100644
> >>--- a/arch/powerpc/include/asm/kvm_ppc.h
> >>+++ b/arch/powerpc/include/asm/kvm_ppc.h
> >>@@ -164,6 +164,12 @@ extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
> >>  			struct kvm_memory_slot *memslot, unsigned long porder);
> >>  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
> >>
> >>+extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm,
> >>+				unsigned long liobn,
> >>+				phys_addr_t start_addr,
> >>+				struct iommu_group *grp);
> >>+extern void kvm_spapr_tce_detach_iommu_group(struct kvm *kvm,
> >>+				struct iommu_group *grp);
> >>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> >>  				struct kvm_create_spapr_tce_64 *args);
> >>  extern struct kvmppc_spapr_tce_table *kvmppc_find_table(
> >>diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> >>index 2c2d103..846d16d 100644
> >>--- a/arch/powerpc/kvm/book3s_64_vio.c
> >>+++ b/arch/powerpc/kvm/book3s_64_vio.c
> >>@@ -27,6 +27,7 @@
> >>  #include <linux/hugetlb.h>
> >>  #include <linux/list.h>
> >>  #include <linux/anon_inodes.h>
> >>+#include <linux/iommu.h>
> >>
> >>  #include <asm/tlbflush.h>
> >>  #include <asm/kvm_ppc.h>
> >>@@ -95,10 +96,18 @@ static void release_spapr_tce_table(struct rcu_head *head)
> >>  	struct kvmppc_spapr_tce_table *stt = container_of(head,
> >>  			struct kvmppc_spapr_tce_table, rcu);
> >>  	unsigned long i, npages = kvmppc_tce_pages(stt->size);
> >>+	struct kvmppc_spapr_tce_group *kg;
> >>
> >>  	for (i = 0; i < npages; i++)
> >>  		__free_page(stt->pages[i]);
> >>
> >>+	while (!list_empty(&stt->groups)) {
> >>+		kg = list_first_entry(&stt->groups,
> >>+				struct kvmppc_spapr_tce_group, next);
> >>+		list_del(&kg->next);
> >>+		kfree(kg);
> >>+	}
> >>+
> >>  	kfree(stt);
> >>  }
> >>
> >>@@ -129,9 +138,15 @@ static int kvm_spapr_tce_mmap(struct file *file, struct vm_area_struct *vma)
> >>  static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
> >>  {
> >>  	struct kvmppc_spapr_tce_table *stt = filp->private_data;
> >>+	struct kvmppc_spapr_tce_group *kg;
> >>
> >>  	list_del_rcu(&stt->list);
> >>
> >>+	list_for_each_entry_rcu(kg, &stt->groups, next)	{
> >>+		iommu_group_put(kg->refgrp);
> >>+		kg->refgrp = NULL;
> >>+	}
> >
> >What's the reason for this kind of two-phase deletion?  Dereffing the
> >group here, and setting to NULL, then actually removing from the list above.
> 
> Well, this way I have only one RCU-delayed release_spapr_tce_table(). The
> other option would be to call for each @kg:
> - list_del(&kg->next);
> - call_rcu()
> 
> as release_spapr_tce_table() won't be able to delete them - they are not in
> the list anymore.

Ah, ok, that makes sense.

> I suppose I can reuse kvm_spapr_tce_put_group(); this looks inaccurate...
> 
> 
> 
> >
> >>  	kvm_put_kvm(stt->kvm);
> >>
> >>  	kvmppc_account_memlimit(
> >>@@ -146,6 +161,98 @@ static const struct file_operations kvm_spapr_tce_fops = {
> >>  	.release	= kvm_spapr_tce_release,
> >>  };
> >>
> >>+extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm,
> >>+				unsigned long liobn,
> >>+				phys_addr_t start_addr,
> >>+				struct iommu_group *grp)
> >>+{
> >>+	struct kvmppc_spapr_tce_table *stt = NULL;
> >>+	struct iommu_table_group *table_group;
> >>+	long i;
> >>+	bool found = false;
> >>+	struct kvmppc_spapr_tce_group *kg;
> >>+	struct iommu_table *tbltmp;
> >>+
> >>+	/* Check this LIOBN hasn't been previously allocated */
> >
> >This comment does not appear to be correct.
> >
> >>+	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
> >>+		if (stt->liobn == liobn) {
> >>+			if ((stt->offset << stt->page_shift) != start_addr)
> >>+				return -EINVAL;
> >>+
> >>+			found = true;
> >>+			break;
> >>+		}
> >>+	}
> >>+
> >>+	if (!found)
> >>+		return -ENODEV;
> >>+
> >>+	/* Find IOMMU group and table at @start_addr */
> >>+	table_group = iommu_group_get_iommudata(grp);
> >>+	if (!table_group)
> >>+		return -EFAULT;
> >>+
> >>+	tbltmp = NULL;
> >>+	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
> >>+		struct iommu_table *tbl = table_group->tables[i];
> >>+
> >>+		if (!tbl)
> >>+			continue;
> >>+
> >>+		if ((tbl->it_page_shift == stt->page_shift) &&
> >>+				(tbl->it_offset == stt->offset)) {
> >>+			tbltmp = tbl;
> >>+			break;
> >>+		}
> >>+	}
> >>+	if (!tbltmp)
> >>+		return -ENODEV;
> >>+
> >>+	list_for_each_entry_rcu(kg, &stt->groups, next) {
> >>+		if (kg->refgrp == grp)
> >>+			return -EBUSY;
> >>+	}
> >>+
> >>+	kg = kzalloc(sizeof(*kg), GFP_KERNEL);
> >>+	kg->refgrp = grp;
> >>+	kg->tbl = tbltmp;
> >>+	list_add_rcu(&kg->next, &stt->groups);
> >>+
> >>+	return 0;
> >>+}
> >>+
> >>+static void kvm_spapr_tce_put_group(struct rcu_head *head)
> >>+{
> >>+	struct kvmppc_spapr_tce_group *kg = container_of(head,
> >>+			struct kvmppc_spapr_tce_group, rcu);
> >>+
> >>+	iommu_group_put(kg->refgrp);
> >>+	kg->refgrp = NULL;
> >>+	kfree(kg);
> >>+}
> >>+
> >>+extern void kvm_spapr_tce_detach_iommu_group(struct kvm *kvm,
> >>+				struct iommu_group *grp)
> >
> >Hrm.  attach takes an explicit liobn, but this one iterates over all
> >liobns.  Why the asymmetry?
> 
> 
> For attach(), LIOBN is specified in an additional (to VFIO KVM device's "add
> group") ioctl(). There is no need for "detach" ioctl() as we only want this
> detach() to happen when a group is removed from a container, and in this
> case the usual KVM_DEV_VFIO_GROUP_DEL is a good enough hint that we need to
> detach LIOBN. Since _DEL does not take LIOBN, here I have a loop.
> 
> I'll put this in the commit log next time.
> 
> 
> >
> >>+{
> >>+	struct kvmppc_spapr_tce_table *stt;
> >>+	struct iommu_table_group *table_group;
> >>+	struct kvmppc_spapr_tce_group *kg;
> >>+
> >>+	table_group = iommu_group_get_iommudata(grp);
> >>+	if (!table_group)
> >>+		return;
> >>+
> >>+	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
> >>+		list_for_each_entry_rcu(kg, &stt->groups, next) {
> >>+			if (kg->refgrp == grp) {
> >>+				list_del_rcu(&kg->next);
> >>+				call_rcu(&kg->rcu, kvm_spapr_tce_put_group);
> >>+				break;
> >>+			}
> >>+		}
> >>+	}
> >>+}
> >>+
> >>  long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> >>  				   struct kvm_create_spapr_tce_64 *args)
> >>  {
> >>@@ -181,6 +288,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> >>  	stt->offset = args->offset;
> >>  	stt->size = size;
> >>  	stt->kvm = kvm;
> >>+	INIT_LIST_HEAD_RCU(&stt->groups);
> >>
> >>  	for (i = 0; i < npages; i++) {
> >>  		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
> >
> 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH kernel 3/9] KVM: PPC: Use preregistered memory API to access TCE list
  2016-03-07  6:00     ` David Gibson
@ 2016-03-08  5:47       ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 92+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-08  5:47 UTC (permalink / raw)
  To: David Gibson; +Cc: linuxppc-dev, Paul Mackerras, Alex Williamson, kvm-ppc, kvm

On 03/07/2016 05:00 PM, David Gibson wrote:
> On Mon, Mar 07, 2016 at 02:41:11PM +1100, Alexey Kardashevskiy wrote:
>> VFIO on sPAPR already implements guest memory pre-registration,
>> whereby the entire guest RAM gets pinned. This can be used to translate
>> the physical address of a guest page containing the TCE list
>> from H_PUT_TCE_INDIRECT.
>>
>> This makes use of the pre-registered memory API to access TCE list
>> pages in order to avoid unnecessary locking on the KVM memory
>> reverse map.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>
> Ok.. so, what's the benefit of not having to lock the rmap?

Less locking -> less racing == good, no?


>
>> ---
>>   arch/powerpc/kvm/book3s_64_vio_hv.c | 86 ++++++++++++++++++++++++++++++-------
>>   1 file changed, 70 insertions(+), 16 deletions(-)
>>
>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
>> index 44be73e..af155f6 100644
>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
>> @@ -180,6 +180,38 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
>>   EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
>>
>>   #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
>> +static mm_context_t *kvmppc_mm_context(struct kvm_vcpu *vcpu)
>> +{
>> +	struct task_struct *task;
>> +
>> +	task = vcpu->arch.run_task;
>> +	if (unlikely(!task || !task->mm))
>> +		return NULL;
>> +
>> +	return &task->mm->context;
>> +}
>> +
>> +static inline bool kvmppc_preregistered(struct kvm_vcpu *vcpu)
>> +{
>> +	mm_context_t *mm = kvmppc_mm_context(vcpu);
>> +
>> +	if (unlikely(!mm))
>> +		return false;
>> +
>> +	return mm_iommu_preregistered(mm);
>> +}
>> +
>> +static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
>> +		struct kvm_vcpu *vcpu, unsigned long ua, unsigned long size)
>> +{
>> +	mm_context_t *mm = kvmppc_mm_context(vcpu);
>> +
>> +	if (unlikely(!mm))
>> +		return NULL;
>> +
>> +	return mm_iommu_lookup_rm(mm, ua, size);
>> +}
>> +
>>   long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>   		      unsigned long ioba, unsigned long tce)
>>   {
>> @@ -261,23 +293,44 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>   	if (ret != H_SUCCESS)
>>   		return ret;
>>
>> -	if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
>> -		return H_TOO_HARD;
>> +	if (kvmppc_preregistered(vcpu)) {
>> +		/*
>> +		 * We get here if guest memory was pre-registered which
>> +		 * is normally VFIO case and gpa->hpa translation does not
>> +		 * depend on hpt.
>> +		 */
>> +		struct mm_iommu_table_group_mem_t *mem;
>>
>> -	rmap = (void *) vmalloc_to_phys(rmap);
>> +		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
>> +			return H_TOO_HARD;
>>
>> -	/*
>> -	 * Synchronize with the MMU notifier callbacks in
>> -	 * book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
>> -	 * While we have the rmap lock, code running on other CPUs
>> -	 * cannot finish unmapping the host real page that backs
>> -	 * this guest real page, so we are OK to access the host
>> -	 * real page.
>> -	 */
>> -	lock_rmap(rmap);
>> -	if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
>> -		ret = H_TOO_HARD;
>> -		goto unlock_exit;
>> +		mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K);
>> +		if (!mem || mm_iommu_rm_ua_to_hpa(mem, ua, &tces))
>> +			return H_TOO_HARD;
>> +	} else {
>> +		/*
>> +		 * This is emulated devices case.
>> +		 * We do not require memory to be preregistered in this case
>> +		 * so lock rmap and do __find_linux_pte_or_hugepte().
>> +		 */
>> +		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
>> +			return H_TOO_HARD;
>> +
>> +		rmap = (void *) vmalloc_to_phys(rmap);
>> +
>> +		/*
>> +		 * Synchronize with the MMU notifier callbacks in
>> +		 * book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
>> +		 * While we have the rmap lock, code running on other CPUs
>> +		 * cannot finish unmapping the host real page that backs
>> +		 * this guest real page, so we are OK to access the host
>> +		 * real page.
>> +		 */
>> +		lock_rmap(rmap);
>> +		if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
>> +			ret = H_TOO_HARD;
>> +			goto unlock_exit;
>> +		}
>>   	}
>>
>>   	for (i = 0; i < npages; ++i) {
>> @@ -291,7 +344,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>   	}
>>
>>   unlock_exit:
>> -	unlock_rmap(rmap);
>> +	if (rmap)
>
> I don't see where rmap is initialized to NULL in the case where it's
> not being used.

@rmap is not new to this function, and it has always been initialized to 
NULL as it was returned via a pointer from kvmppc_gpa_to_ua().


>
>> +		unlock_rmap(rmap);
>>
>>   	return ret;
>>   }
>


-- 
Alexey

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH kernel 3/9] KVM: PPC: Use preregistered memory API to access TCE list
  2016-03-08  5:47       ` Alexey Kardashevskiy
@ 2016-03-08  6:30         ` David Gibson
  -1 siblings, 0 replies; 92+ messages in thread
From: David Gibson @ 2016-03-08  6:30 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Paul Mackerras, Alex Williamson, kvm-ppc, kvm

On Tue, Mar 08, 2016 at 04:47:20PM +1100, Alexey Kardashevskiy wrote:
> On 03/07/2016 05:00 PM, David Gibson wrote:
> >On Mon, Mar 07, 2016 at 02:41:11PM +1100, Alexey Kardashevskiy wrote:
> >>VFIO on sPAPR already implements guest memory pre-registration
> >>when the entire guest RAM gets pinned. This can be used to translate
> >>the physical address of a guest page containing the TCE list
> >>from H_PUT_TCE_INDIRECT.
> >>
> >>This makes use of the pre-registered memory API to access TCE list
> >>pages in order to avoid unnecessary locking on the KVM memory
> >>reverse map.
> >>
> >>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >
> >Ok.. so, what's the benefit of not having to lock the rmap?
> 
> Less locking -> less racing == good, no?

Well.. maybe.  The increased difficulty in verifying that the code is
correct isn't always a good price to pay.

> >>---
> >>  arch/powerpc/kvm/book3s_64_vio_hv.c | 86 ++++++++++++++++++++++++++++++-------
> >>  1 file changed, 70 insertions(+), 16 deletions(-)
> >>
> >>diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> >>index 44be73e..af155f6 100644
> >>--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> >>+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> >>@@ -180,6 +180,38 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
> >>  EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
> >>
> >>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> >>+static mm_context_t *kvmppc_mm_context(struct kvm_vcpu *vcpu)
> >>+{
> >>+	struct task_struct *task;
> >>+
> >>+	task = vcpu->arch.run_task;
> >>+	if (unlikely(!task || !task->mm))
> >>+		return NULL;
> >>+
> >>+	return &task->mm->context;
> >>+}
> >>+
> >>+static inline bool kvmppc_preregistered(struct kvm_vcpu *vcpu)
> >>+{
> >>+	mm_context_t *mm = kvmppc_mm_context(vcpu);
> >>+
> >>+	if (unlikely(!mm))
> >>+		return false;
> >>+
> >>+	return mm_iommu_preregistered(mm);
> >>+}
> >>+
> >>+static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
> >>+		struct kvm_vcpu *vcpu, unsigned long ua, unsigned long size)
> >>+{
> >>+	mm_context_t *mm = kvmppc_mm_context(vcpu);
> >>+
> >>+	if (unlikely(!mm))
> >>+		return NULL;
> >>+
> >>+	return mm_iommu_lookup_rm(mm, ua, size);
> >>+}
> >>+
> >>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> >>  		      unsigned long ioba, unsigned long tce)
> >>  {
> >>@@ -261,23 +293,44 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>  	if (ret != H_SUCCESS)
> >>  		return ret;
> >>
> >>-	if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
> >>-		return H_TOO_HARD;
> >>+	if (kvmppc_preregistered(vcpu)) {
> >>+		/*
> >>+		 * We get here if guest memory was pre-registered which
> >>+		 * is normally VFIO case and gpa->hpa translation does not
> >>+		 * depend on hpt.
> >>+		 */
> >>+		struct mm_iommu_table_group_mem_t *mem;
> >>
> >>-	rmap = (void *) vmalloc_to_phys(rmap);
> >>+		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
> >>+			return H_TOO_HARD;
> >>
> >>-	/*
> >>-	 * Synchronize with the MMU notifier callbacks in
> >>-	 * book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
> >>-	 * While we have the rmap lock, code running on other CPUs
> >>-	 * cannot finish unmapping the host real page that backs
> >>-	 * this guest real page, so we are OK to access the host
> >>-	 * real page.
> >>-	 */
> >>-	lock_rmap(rmap);
> >>-	if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
> >>-		ret = H_TOO_HARD;
> >>-		goto unlock_exit;
> >>+		mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K);
> >>+		if (!mem || mm_iommu_rm_ua_to_hpa(mem, ua, &tces))
> >>+			return H_TOO_HARD;
> >>+	} else {
> >>+		/*
> >>+		 * This is emulated devices case.
> >>+		 * We do not require memory to be preregistered in this case
> >>+		 * so lock rmap and do __find_linux_pte_or_hugepte().
> >>+		 */
> >>+		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
> >>+			return H_TOO_HARD;
> >>+
> >>+		rmap = (void *) vmalloc_to_phys(rmap);
> >>+
> >>+		/*
> >>+		 * Synchronize with the MMU notifier callbacks in
> >>+		 * book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
> >>+		 * While we have the rmap lock, code running on other CPUs
> >>+		 * cannot finish unmapping the host real page that backs
> >>+		 * this guest real page, so we are OK to access the host
> >>+		 * real page.
> >>+		 */
> >>+		lock_rmap(rmap);
> >>+		if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
> >>+			ret = H_TOO_HARD;
> >>+			goto unlock_exit;
> >>+		}
> >>  	}
> >>
> >>  	for (i = 0; i < npages; ++i) {
> >>@@ -291,7 +344,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>  	}
> >>
> >>  unlock_exit:
> >>-	unlock_rmap(rmap);
> >>+	if (rmap)
> >
> >I don't see where rmap is initialized to NULL in the case where it's
> >not being used.
> 
> @rmap is not new to this function, and it has always been initialized to
> NULL as it was returned via a pointer from kvmppc_gpa_to_ua().

This comment confuses me.  Looking closer at the code I see you're
right, and it's initialized to NULL where defined, which I missed.

But that has nothing to do with being returned by pointer from
kvmppc_gpa_to_ua(), since one of your branches in the new code no
longer passes &rmap to that function.
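
Reduced to its essentials, the pattern in question looks like the following
standalone sketch (stand-in lock helpers, not the kernel code itself); the
initialisation at the definition is what makes the final test sound,
regardless of which branch ran:

#include <stdio.h>

/* Stand-ins for the kernel's rmap bit-lock primitives. */
static void lock_rmap(unsigned long *rmap)   { (void)rmap; /* take lock */ }
static void unlock_rmap(unsigned long *rmap) { (void)rmap; /* drop lock */ }

static long handler(int preregistered)
{
	unsigned long token = 0;
	unsigned long *rmap = NULL;	/* stays NULL on the prereg path */

	if (!preregistered) {
		rmap = &token;		/* only this branch takes the lock */
		lock_rmap(rmap);
	}

	/* ... translation work happens here on either path ... */

	if (rmap)			/* NULL means the lock was never taken */
		unlock_rmap(rmap);
	return 0;
}

int main(void)
{
	/* Both paths are balanced: no unlock without a matching lock. */
	printf("%ld %ld\n", handler(0), handler(1));
	return 0;
}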

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH kernel 7/9] KVM: PPC: Create a virtual-mode only TCE table handlers
  2016-03-07  3:41   ` Alexey Kardashevskiy
@ 2016-03-08  6:32     ` David Gibson
  -1 siblings, 0 replies; 92+ messages in thread
From: David Gibson @ 2016-03-08  6:32 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Paul Mackerras, Alex Williamson, kvm-ppc, kvm

On Mon, Mar 07, 2016 at 02:41:15PM +1100, Alexey Kardashevskiy wrote:
> In-kernel VFIO acceleration needs different handling in real and virtual
> modes, which makes it hard to support both modes in the same handler.
> 
> This creates a copy of kvmppc_rm_h_stuff_tce and kvmppc_rm_h_put_tce
> in addition to the existing kvmppc_rm_h_put_tce_indirect.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  arch/powerpc/kvm/book3s_64_vio.c        | 52 +++++++++++++++++++++++++++++++++
>  arch/powerpc/kvm/book3s_64_vio_hv.c     |  8 ++---
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S |  4 +--
>  3 files changed, 57 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> index 846d16d..7965fc7 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -317,6 +317,32 @@ fail:
>  	return ret;
>  }
>  
> +long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> +		      unsigned long ioba, unsigned long tce)
> +{
> +	struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
> +	long ret;
> +
> +	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
> +	/* 	    liobn, ioba, tce); */
> +
> +	if (!stt)
> +		return H_TOO_HARD;
> +
> +	ret = kvmppc_ioba_validate(stt, ioba, 1);
> +	if (ret != H_SUCCESS)
> +		return ret;
> +
> +	ret = kvmppc_tce_validate(stt, tce);
> +	if (ret != H_SUCCESS)
> +		return ret;
> +
> +	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
> +
> +	return H_SUCCESS;
> +}
> +EXPORT_SYMBOL_GPL(kvmppc_h_put_tce);
> +
>  long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  		unsigned long liobn, unsigned long ioba,
>  		unsigned long tce_list, unsigned long npages)
> @@ -372,3 +398,29 @@ unlock_exit:
>  	return ret;
>  }
>  EXPORT_SYMBOL_GPL(kvmppc_h_put_tce_indirect);
> +
> +long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_value, unsigned long npages)
> +{
> +	struct kvmppc_spapr_tce_table *stt;
> +	long i, ret;
> +
> +	stt = kvmppc_find_table(vcpu, liobn);
> +	if (!stt)
> +		return H_TOO_HARD;
> +
> +	ret = kvmppc_ioba_validate(stt, ioba, npages);
> +	if (ret != H_SUCCESS)
> +		return ret;
> +
> +	/* Check permission bits only to allow userspace poison TCE for debug */
> +	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
> +		return H_PARAMETER;
> +
> +	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
> +		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
> +
> +	return H_SUCCESS;
> +}
> +EXPORT_SYMBOL_GPL(kvmppc_h_stuff_tce);
> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> index af155f6..11163ae 100644
> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -212,8 +212,8 @@ static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
>  	return mm_iommu_lookup_rm(mm, ua, size);
>  }
>  
> -long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> -		      unsigned long ioba, unsigned long tce)
> +long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> +		unsigned long ioba, unsigned long tce)
>  {
>  	struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
>  	long ret;
> @@ -236,7 +236,6 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  
>  	return H_SUCCESS;
>  }
> -EXPORT_SYMBOL_GPL(kvmppc_h_put_tce);
>  
>  static long kvmppc_rm_ua_to_hpa(struct kvm_vcpu *vcpu,
>  		unsigned long ua, unsigned long *phpa)
> @@ -350,7 +349,7 @@ unlock_exit:
>  	return ret;
>  }
>  
> -long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
> +long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>  		unsigned long liobn, unsigned long ioba,
>  		unsigned long tce_value, unsigned long npages)
>  {
> @@ -374,7 +373,6 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>  
>  	return H_SUCCESS;
>  }
> -EXPORT_SYMBOL_GPL(kvmppc_h_stuff_tce);
>  
>  long kvmppc_h_get_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		      unsigned long ioba)
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index ed16182..d6dad2c 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -1928,7 +1928,7 @@ hcall_real_table:
>  	.long	DOTSYM(kvmppc_h_clear_ref) - hcall_real_table
>  	.long	DOTSYM(kvmppc_h_protect) - hcall_real_table
>  	.long	DOTSYM(kvmppc_h_get_tce) - hcall_real_table
> -	.long	DOTSYM(kvmppc_h_put_tce) - hcall_real_table
> +	.long	DOTSYM(kvmppc_rm_h_put_tce) - hcall_real_table
>  	.long	0		/* 0x24 - H_SET_SPRG0 */
>  	.long	DOTSYM(kvmppc_h_set_dabr) - hcall_real_table
>  	.long	0		/* 0x2c */
> @@ -2006,7 +2006,7 @@ hcall_real_table:
>  	.long	0		/* 0x12c */
>  	.long	0		/* 0x130 */
>  	.long	DOTSYM(kvmppc_h_set_xdabr) - hcall_real_table
> -	.long	DOTSYM(kvmppc_h_stuff_tce) - hcall_real_table
> +	.long	DOTSYM(kvmppc_rm_h_stuff_tce) - hcall_real_table
>  	.long	DOTSYM(kvmppc_rm_h_put_tce_indirect) - hcall_real_table
>  	.long	0		/* 0x140 */
>  	.long	0		/* 0x144 */

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH kernel 8/9] KVM: PPC: Add in-kernel handling for VFIO
  2016-03-07  3:41   ` Alexey Kardashevskiy
@ 2016-03-08 11:08     ` David Gibson
  -1 siblings, 0 replies; 92+ messages in thread
From: David Gibson @ 2016-03-08 11:08 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Paul Mackerras, Alex Williamson, kvm-ppc, kvm

On Mon, Mar 07, 2016 at 02:41:16PM +1100, Alexey Kardashevskiy wrote:
> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> and H_STUFF_TCE requests targeting an IOMMU TCE table used for VFIO
> without passing them to user space, which saves the time otherwise
> spent switching to user space and back.
> 
> Both real and virtual modes are supported. The kernel first tries to
> handle a TCE request in real mode; if that fails, it passes the request
> to the virtual mode handler to complete the operation. If the virtual
> mode handler fails as well, the request is passed to user space, although
> this is never expected to happen.

Well... not expected to happen with a qemu which uses this.  Presumably
it will fall back to userspace routinely if you have an old qemu that
doesn't add the liobn mappings.
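
For readers following along, the dispatch order the commit message describes
can be sketched in self-contained C; the H_SUCCESS/H_TOO_HARD values mirror
the kernel's hvcall.h, and the handler bodies are placeholders:

#define H_SUCCESS	0
#define H_TOO_HARD	9999	/* internal: "retry at the next level up" */

/* Real mode: no sleeping, no page faults; bails out early when unsure. */
static long rm_h_put_tce(void)   { return H_TOO_HARD; }
/* Virtual mode: may take locks and touch userspace memory. */
static long virt_h_put_tce(void) { return H_SUCCESS; }

static long handle_h_put_tce(void)
{
	long ret = rm_h_put_tce();

	if (ret != H_TOO_HARD)
		return ret;		/* fast path: done in real mode */

	ret = virt_h_put_tce();
	if (ret != H_TOO_HARD)
		return ret;		/* done in the host kernel */

	/* Last resort: exit to userspace (QEMU), the pre-series behaviour. */
	return -1;
}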

> The first user of this is VFIO on POWER. Trampolines to the VFIO external
> user API functions are required for this patch.

I'm not sure what you mean by "trampoline" here.

> This uses a VFIO KVM device to associate a logical bus number (LIOBN)
> with a VFIO IOMMU group fd and enable in-kernel handling of map/unmap
> requests.

Group fd?  Or container fd?  The group fd wouldn't make a lot of
sense.

> To make use of the feature, the user space has to create a guest view
> of the TCE table via KVM_CAP_SPAPR_TCE/KVM_CAP_SPAPR_TCE_64 and
> then associate a LIOBN with this table via the VFIO KVM device,
> a KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN property (which is added in
> the next patch).
> 
> Tests show that this patch increases transmission speed from 220MB/s
> to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb ethernet card).

Is that with or without DDW (i.e. with or without a 64-bit DMA window)?
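
As a reference for the first of the two userspace steps quoted above, a
sketch of creating the guest view of the table; the struct layout follows
the 64-bit TCE series this one is based on, so verify against the uapi
header rather than trusting the sketch:

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Creates the guest view of a TCE table; returns an fd backing it. */
static int create_guest_tce_table(int vm_fd)
{
	struct kvm_create_spapr_tce_64 args = {
		.liobn		= 0x80000000,	/* example LIOBN */
		.page_shift	= 12,		/* 4K IOMMU pages */
		.offset		= 0,		/* window starts at DMA address 0 */
		.size		= 1ULL << 18,	/* 2^18 pages, i.e. a 1GB window */
	};

	return ioctl(vm_fd, KVM_CREATE_SPAPR_TCE_64, &args);
}

The second step, associating the LIOBN with the VFIO group through the
VFIO KVM device, is sketched earlier in this thread.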

> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  arch/powerpc/kvm/book3s_64_vio.c    | 184 +++++++++++++++++++++++++++++++++++
>  arch/powerpc/kvm/book3s_64_vio_hv.c | 186 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 370 insertions(+)
> 
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> index 7965fc7..9417d12 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -33,6 +33,7 @@
>  #include <asm/kvm_ppc.h>
>  #include <asm/kvm_book3s.h>
>  #include <asm/mmu-hash64.h>
> +#include <asm/mmu_context.h>
>  #include <asm/hvcall.h>
>  #include <asm/synch.h>
>  #include <asm/ppc-opcode.h>
> @@ -317,11 +318,161 @@ fail:
>  	return ret;
>  }
>  
> +static long kvmppc_tce_iommu_mapped_dec(struct iommu_table *tbl,
> +		unsigned long entry)
> +{
> +	struct mm_iommu_table_group_mem_t *mem = NULL;
> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +
> +	if (!pua)
> +		return H_HARDWARE;
> +
> +	mem = mm_iommu_lookup(*pua, pgsize);
> +	if (!mem)
> +		return H_HARDWARE;
> +
> +	mm_iommu_mapped_dec(mem);
> +
> +	*pua = 0;
> +
> +	return H_SUCCESS;
> +}
> +
> +static long kvmppc_tce_iommu_unmap(struct iommu_table *tbl,
> +		unsigned long entry)
> +{
> +	enum dma_data_direction dir = DMA_NONE;
> +	unsigned long hpa = 0;
> +
> +	if (iommu_tce_xchg(tbl, entry, &hpa, &dir))
> +		return H_HARDWARE;
> +
> +	if (dir == DMA_NONE)
> +		return H_SUCCESS;
> +
> +	return kvmppc_tce_iommu_mapped_dec(tbl, entry);
> +}
> +
> +long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
> +		unsigned long entry, unsigned long gpa,
> +		enum dma_data_direction dir)
> +{
> +	long ret;
> +	unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +	struct mm_iommu_table_group_mem_t *mem;
> +
> +	if (!pua)
> +		return H_HARDWARE;

H_HARDWARE?  Or H_PARAMETER?  This essentially means the guest has
supplied a bad physical address, doesn't it?

> +	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
> +		return H_HARDWARE;
> +
> +	mem = mm_iommu_lookup(ua, 1ULL << tbl->it_page_shift);
> +	if (!mem)
> +		return H_HARDWARE;
> +
> +	if (mm_iommu_ua_to_hpa(mem, ua, &hpa))
> +		return H_HARDWARE;
> +
> +	if (mm_iommu_mapped_inc(mem))
> +		return H_HARDWARE;
> +
> +	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
> +	if (ret) {
> +		mm_iommu_mapped_dec(mem);
> +		return H_TOO_HARD;
> +	}
> +
> +	if (dir != DMA_NONE)
> +		kvmppc_tce_iommu_mapped_dec(tbl, entry);
> +
> +	*pua = ua;

IIUC this means you have a copy of the UA for every group attached to
the TCE table, but they'll all be the same.  Any way to avoid that
duplication?

> +	return 0;
> +}
> +
> +long kvmppc_h_put_tce_iommu(struct kvm_vcpu *vcpu,
> +		struct iommu_table *tbl,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce)
> +{
> +	long idx, ret = H_HARDWARE;
> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> +	const unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +	const enum dma_data_direction dir = iommu_tce_direction(tce);
> +
> +	/* Clear TCE */
> +	if (dir == DMA_NONE) {
> +		if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
> +			return H_PARAMETER;
> +
> +		return kvmppc_tce_iommu_unmap(tbl, entry);
> +	}
> +
> +	/* Put TCE */
> +	if (iommu_tce_put_param_check(tbl, ioba, tce))
> +		return H_PARAMETER;
> +
> +	idx = srcu_read_lock(&vcpu->kvm->srcu);
> +	ret = kvmppc_tce_iommu_map(vcpu->kvm, tbl, entry, gpa, dir);
> +	srcu_read_unlock(&vcpu->kvm->srcu, idx);
> +
> +	return ret;
> +}
> +
> +static long kvmppc_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
> +		struct iommu_table *tbl, unsigned long ioba,
> +		u64 __user *tces, unsigned long npages)
> +{
> +	unsigned long i, ret, tce, gpa;
> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> +
> +	for (i = 0; i < npages; ++i) {
> +		gpa = be64_to_cpu(tces[i]) & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +
> +		if (iommu_tce_put_param_check(tbl, ioba +
> +				(i << tbl->it_page_shift), gpa))
> +			return H_PARAMETER;
> +	}
> +
> +	for (i = 0; i < npages; ++i) {
> +		tce = be64_to_cpu(tces[i]);

tces is a user address, which means it should only be dereferenced via
get_user() or copy_from_user() helpers.
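
A sketch of the mapping loop restructured per that rule, as a fragment in
the patch's own kernel context (identifiers reused from the quoted hunk);
the earlier parameter-check loop would need the same treatment:

	u64 tce;

	for (i = 0; i < npages; ++i) {
		/* tces is a __user pointer: fetch each entry safely. */
		if (get_user(tce, tces + i))
			return H_TOO_HARD;
		tce = be64_to_cpu(tce);

		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
		ret = kvmppc_tce_iommu_map(vcpu->kvm, tbl, entry + i, gpa,
				iommu_tce_direction(tce));
		if (ret != H_SUCCESS)
			return ret;
	}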

> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +
> +		ret = kvmppc_tce_iommu_map(vcpu->kvm, tbl, entry + i, gpa,
> +				iommu_tce_direction(tce));
> +		if (ret != H_SUCCESS)
> +			return ret;
> +	}
> +
> +	return H_SUCCESS;
> +}
> +
> +long kvmppc_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
> +		struct iommu_table *tbl,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_value, unsigned long npages)
> +{
> +	unsigned long i;
> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> +
> +	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
> +		return H_PARAMETER;
> +
> +	for (i = 0; i < npages; ++i)
> +		kvmppc_tce_iommu_unmap(tbl, entry + i);
> +
> +	return H_SUCCESS;
> +}
> +
>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		      unsigned long ioba, unsigned long tce)
>  {
>  	struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
>  	long ret;
> +	struct kvmppc_spapr_tce_group *kg;
> +	struct iommu_table *tbltmp = NULL;
>  
>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>  	/* 	    liobn, ioba, tce); */
> @@ -337,6 +488,15 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  	if (ret != H_SUCCESS)
>  		return ret;
>  
> +	list_for_each_entry_lockless(kg, &stt->groups, next) {
> +		if (kg->tbl == tbltmp)
> +			continue;
> +		tbltmp = kg->tbl;
> +		ret = kvmppc_h_put_tce_iommu(vcpu, kg->tbl, liobn, ioba, tce);
> +		if (ret != H_SUCCESS)
> +			return ret;
> +	}
> +
>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
>  
>  	return H_SUCCESS;
> @@ -351,6 +511,8 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	long i, ret = H_SUCCESS, idx;
>  	unsigned long entry, ua = 0;
>  	u64 __user *tces, tce;
> +	struct kvmppc_spapr_tce_group *kg;
> +	struct iommu_table *tbltmp = NULL;
>  
>  	stt = kvmppc_find_table(vcpu, liobn);
>  	if (!stt)
> @@ -378,6 +540,16 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	}
>  	tces = (u64 __user *) ua;
>  
> +	list_for_each_entry_lockless(kg, &stt->groups, next) {
> +		if (kg->tbl == tbltmp)
> +			continue;
> +		tbltmp = kg->tbl;
> +		ret = kvmppc_h_put_tce_indirect_iommu(vcpu,
> +				kg->tbl, ioba, tces, npages);
> +		if (ret != H_SUCCESS)
> +			goto unlock_exit;
> +	}
> +
>  	for (i = 0; i < npages; ++i) {
>  		if (get_user(tce, tces + i)) {
>  			ret = H_TOO_HARD;
> @@ -405,6 +577,8 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret;
> +	struct kvmppc_spapr_tce_group *kg;
> +	struct iommu_table *tbltmp = NULL;
>  
>  	stt = kvmppc_find_table(vcpu, liobn);
>  	if (!stt)
> @@ -418,6 +592,16 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>  		return H_PARAMETER;
>  
> +	list_for_each_entry_lockless(kg, &stt->groups, next) {
> +		if (kg->tbl == tbltmp)
> +			continue;
> +		tbltmp = kg->tbl;
> +		ret = kvmppc_h_stuff_tce_iommu(vcpu, kg->tbl, liobn, ioba,
> +				tce_value, npages);
> +		if (ret != H_SUCCESS)
> +			return ret;
> +	}
> +
>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>  
> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> index 11163ae..6567d6c 100644
> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -26,6 +26,7 @@
>  #include <linux/slab.h>
>  #include <linux/hugetlb.h>
>  #include <linux/list.h>
> +#include <linux/iommu.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/kvm_ppc.h>
> @@ -212,11 +213,162 @@ static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
>  	return mm_iommu_lookup_rm(mm, ua, size);
>  }
>  
> +static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm_vcpu *vcpu,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	struct mm_iommu_table_group_mem_t *mem = NULL;
> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +
> +	if (!pua)
> +		return H_SUCCESS;
> +
> +	pua = (void *) vmalloc_to_phys(pua);
> +	if (!pua)
> +		return H_SUCCESS;
> +
> +	mem = kvmppc_rm_iommu_lookup(vcpu, *pua, pgsize);
> +	if (!mem)
> +		return H_HARDWARE;
> +
> +	mm_iommu_mapped_dec(mem);
> +
> +	*pua = 0;
> +
> +	return H_SUCCESS;
> +}
> +
> +static long kvmppc_rm_tce_iommu_unmap(struct kvm_vcpu *vcpu,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	enum dma_data_direction dir = DMA_NONE;
> +	unsigned long hpa = 0;
> +
> +	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
> +		return H_HARDWARE;
> +
> +	if (dir == DMA_NONE)
> +		return H_SUCCESS;
> +
> +	return kvmppc_rm_tce_iommu_mapped_dec(vcpu, tbl, entry);
> +}
> +
> +long kvmppc_rm_tce_iommu_map(struct kvm_vcpu *vcpu, struct iommu_table *tbl,
> +		unsigned long entry, unsigned long gpa,
> +		enum dma_data_direction dir)
> +{
> +	long ret;
> +	unsigned long hpa = 0, ua;
> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +	struct mm_iommu_table_group_mem_t *mem;
> +
> +	if (kvmppc_gpa_to_ua(vcpu->kvm, gpa, &ua, NULL))
> +		return H_HARDWARE;
> +	mem = kvmppc_rm_iommu_lookup(vcpu, ua, 1ULL << tbl->it_page_shift);
> +	if (!mem)
> +		return H_HARDWARE;
> +
> +	if (mm_iommu_rm_ua_to_hpa(mem, ua, &hpa))
> +		return H_HARDWARE;
> +
> +	pua = (void *) vmalloc_to_phys(pua);
> +	if (!pua)
> +		return H_HARDWARE;
> +
> +	if (mm_iommu_mapped_inc(mem))
> +		return H_HARDWARE;
> +
> +	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> +	if (ret) {
> +		mm_iommu_mapped_dec(mem);
> +		return H_TOO_HARD;
> +	}
> +
> +	if (dir != DMA_NONE)
> +		kvmppc_rm_tce_iommu_mapped_dec(vcpu, tbl, entry);
> +
> +	*pua = ua;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(kvmppc_rm_tce_iommu_map);
> +
> +static long kvmppc_rm_h_put_tce_iommu(struct kvm_vcpu *vcpu,
> +		struct iommu_table *tbl, unsigned long liobn,
> +		unsigned long ioba, unsigned long tce)
> +{
> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> +	const unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +	const enum dma_data_direction dir = iommu_tce_direction(tce);
> +
> +	/* Clear TCE */
> +	if (dir == DMA_NONE) {
> +		if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
> +			return H_PARAMETER;
> +
> +		return kvmppc_rm_tce_iommu_unmap(vcpu, tbl, entry);
> +	}
> +
> +	/* Put TCE */
> +	if (iommu_tce_put_param_check(tbl, ioba, gpa))
> +		return H_PARAMETER;
> +
> +	return kvmppc_rm_tce_iommu_map(vcpu, tbl, entry, gpa, dir);
> +}
> +
> +static long kvmppc_rm_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
> +		struct iommu_table *tbl, unsigned long ioba,
> +		u64 *tces, unsigned long npages)
> +{
> +	unsigned long i, ret, tce, gpa;
> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> +
> +	for (i = 0; i < npages; ++i) {
> +		gpa = be64_to_cpu(tces[i]) & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +
> +		if (iommu_tce_put_param_check(tbl, ioba +
> +				(i << tbl->it_page_shift), gpa))
> +			return H_PARAMETER;
> +	}
> +
> +	for (i = 0; i < npages; ++i) {
> +		tce = be64_to_cpu(tces[i]);
> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +
> +		ret = kvmppc_rm_tce_iommu_map(vcpu, tbl, entry + i, gpa,
> +				iommu_tce_direction(tce));
> +		if (ret != H_SUCCESS)
> +			return ret;
> +	}
> +
> +	return H_SUCCESS;
> +}
> +
> +static long kvmppc_rm_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
> +		struct iommu_table *tbl,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_value, unsigned long npages)
> +{
> +	unsigned long i;
> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> +
> +	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
> +		return H_PARAMETER;
> +
> +	for (i = 0; i < npages; ++i)
> +		kvmppc_rm_tce_iommu_unmap(vcpu, tbl, entry + i);
> +
> +	return H_SUCCESS;
> +}
> +
>  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		unsigned long ioba, unsigned long tce)
>  {
>  	struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
>  	long ret;
> +	struct kvmppc_spapr_tce_group *kg;
> +	struct iommu_table *tbltmp = NULL;
>  
>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>  	/* 	    liobn, ioba, tce); */
> @@ -232,6 +384,16 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  	if (ret != H_SUCCESS)
>  		return ret;
>  
> +	list_for_each_entry_lockless(kg, &stt->groups, next) {
> +		if (kg->tbl == tbltmp)
> +			continue;
> +		tbltmp = kg->tbl;
> +		ret = kvmppc_rm_h_put_tce_iommu(vcpu, kg->tbl,
> +				liobn, ioba, tce);
> +		if (ret != H_SUCCESS)
> +			return ret;
> +	}
> +
>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
>  
>  	return H_SUCCESS;
> @@ -272,6 +434,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	long i, ret = H_SUCCESS;
>  	unsigned long tces, entry, ua = 0;
>  	unsigned long *rmap = NULL;
> +	struct iommu_table *tbltmp = NULL;
>  
>  	stt = kvmppc_find_table(vcpu, liobn);
>  	if (!stt)
> @@ -299,6 +462,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  		 * depend on hpt.
>  		 */
>  		struct mm_iommu_table_group_mem_t *mem;
> +		struct kvmppc_spapr_tce_group *kg;
>  
>  		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
>  			return H_TOO_HARD;
> @@ -306,6 +470,16 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  		mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K);
>  		if (!mem || mm_iommu_rm_ua_to_hpa(mem, ua, &tces))
>  			return H_TOO_HARD;
> +
> +		list_for_each_entry_lockless(kg, &stt->groups, next) {
> +			if (kg->tbl == tbltmp)
> +				continue;
> +			tbltmp = kg->tbl;
> +			ret = kvmppc_rm_h_put_tce_indirect_iommu(vcpu,
> +					kg->tbl, ioba, (u64 *)tces, npages);
> +			if (ret != H_SUCCESS)
> +				return ret;
> +		}
>  	} else {
>  		/*
>  		 * This is emulated devices case.
> @@ -355,6 +529,8 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret;
> +	struct kvmppc_spapr_tce_group *kg;
> +	struct iommu_table *tbltmp = NULL;
>  
>  	stt = kvmppc_find_table(vcpu, liobn);
>  	if (!stt)
> @@ -368,6 +544,16 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>  		return H_PARAMETER;
>  
> +	list_for_each_entry_lockless(kg, &stt->groups, next) {
> +		if (kg->tbl == tbltmp)
> +			continue;
> +		tbltmp = kg->tbl;
> +		ret = kvmppc_rm_h_stuff_tce_iommu(vcpu, kg->tbl,
> +				liobn, ioba, tce_value, npages);
> +		if (ret != H_SUCCESS)
> +			return ret;
> +	}
> +
>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>  

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH kernel 8/9] KVM: PPC: Add in-kernel handling for VFIO
@ 2016-03-08 11:08     ` David Gibson
  0 siblings, 0 replies; 92+ messages in thread
From: David Gibson @ 2016-03-08 11:08 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Paul Mackerras, Alex Williamson, kvm-ppc, kvm

[-- Attachment #1: Type: text/plain, Size: 17404 bytes --]

On Mon, Mar 07, 2016 at 02:41:16PM +1100, Alexey Kardashevskiy wrote:
> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO
> without passing them to user space which saves time on switching
> to user space and back.
> 
> Both real and virtual modes are supported. The kernel tries to
> handle a TCE request in the real mode, if fails it passes the request
> to the virtual mode to complete the operation. If it a virtual mode
> handler fails, the request is passed to user space; this is not expected
> to happen ever though.

Well... not expect to happen with a qemu which uses this.  Presumably
it will fall back to userspace routinely if you have an old qemu that
doesn't add the liobn mappings.

> The first user of this is VFIO on POWER. Trampolines to the VFIO external
> user API functions are required for this patch.

I'm not sure what you mean by "trampoline" here.

> This uses a VFIO KVM device to associate a logical bus number (LIOBN)
> with an VFIO IOMMU group fd and enable in-kernel handling of map/unmap
> requests.

Group fd?  Or container fd?  The group fd wouldn't make a lot of
sense.

> To make use of the feature, the user space has to create a guest view
> of the TCE table via KVM_CAP_SPAPR_TCE/KVM_CAP_SPAPR_TCE_64 and
> then associate a LIOBN with this table via VFIO KVM device,
> a KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN property (which is added in
> the next patch).
> 
> Tests show that this patch increases transmission speed from 220MB/s
> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).

Is that with or without DDW (i.e. with or without a 64-bit DMA window)?

> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  arch/powerpc/kvm/book3s_64_vio.c    | 184 +++++++++++++++++++++++++++++++++++
>  arch/powerpc/kvm/book3s_64_vio_hv.c | 186 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 370 insertions(+)
> 
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> index 7965fc7..9417d12 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -33,6 +33,7 @@
>  #include <asm/kvm_ppc.h>
>  #include <asm/kvm_book3s.h>
>  #include <asm/mmu-hash64.h>
> +#include <asm/mmu_context.h>
>  #include <asm/hvcall.h>
>  #include <asm/synch.h>
>  #include <asm/ppc-opcode.h>
> @@ -317,11 +318,161 @@ fail:
>  	return ret;
>  }
>  
> +static long kvmppc_tce_iommu_mapped_dec(struct iommu_table *tbl,
> +		unsigned long entry)
> +{
> +	struct mm_iommu_table_group_mem_t *mem = NULL;
> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +
> +	if (!pua)
> +		return H_HARDWARE;
> +
> +	mem = mm_iommu_lookup(*pua, pgsize);
> +	if (!mem)
> +		return H_HARDWARE;
> +
> +	mm_iommu_mapped_dec(mem);
> +
> +	*pua = 0;
> +
> +	return H_SUCCESS;
> +}
> +
> +static long kvmppc_tce_iommu_unmap(struct iommu_table *tbl,
> +		unsigned long entry)
> +{
> +	enum dma_data_direction dir = DMA_NONE;
> +	unsigned long hpa = 0;
> +
> +	if (iommu_tce_xchg(tbl, entry, &hpa, &dir))
> +		return H_HARDWARE;
> +
> +	if (dir == DMA_NONE)
> +		return H_SUCCESS;
> +
> +	return kvmppc_tce_iommu_mapped_dec(tbl, entry);
> +}
> +
> +long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
> +		unsigned long entry, unsigned long gpa,
> +		enum dma_data_direction dir)
> +{
> +	long ret;
> +	unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +	struct mm_iommu_table_group_mem_t *mem;
> +
> +	if (!pua)
> +		return H_HARDWARE;

H_HARDWARE?  Or H_PARAMETER?  This essentially means the guest has
supplied a bad physical address, doesn't it?

> +	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
> +		return H_HARDWARE;
> +
> +	mem = mm_iommu_lookup(ua, 1ULL << tbl->it_page_shift);
> +	if (!mem)
> +		return H_HARDWARE;
> +
> +	if (mm_iommu_ua_to_hpa(mem, ua, &hpa))
> +		return H_HARDWARE;
> +
> +	if (mm_iommu_mapped_inc(mem))
> +		return H_HARDWARE;
> +
> +	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
> +	if (ret) {
> +		mm_iommu_mapped_dec(mem);
> +		return H_TOO_HARD;
> +	}
> +
> +	if (dir != DMA_NONE)
> +		kvmppc_tce_iommu_mapped_dec(tbl, entry);
> +
> +	*pua = ua;

IIUC this means you have a copy of the UA for every group attached to
the TCE table, but they'll all be the same.  Any way to avoid that
duplication?

> +	return 0;
> +}
> +
> +long kvmppc_h_put_tce_iommu(struct kvm_vcpu *vcpu,
> +		struct iommu_table *tbl,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce)
> +{
> +	long idx, ret = H_HARDWARE;
> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> +	const unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +	const enum dma_data_direction dir = iommu_tce_direction(tce);
> +
> +	/* Clear TCE */
> +	if (dir == DMA_NONE) {
> +		if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
> +			return H_PARAMETER;
> +
> +		return kvmppc_tce_iommu_unmap(tbl, entry);
> +	}
> +
> +	/* Put TCE */
> +	if (iommu_tce_put_param_check(tbl, ioba, tce))
> +		return H_PARAMETER;
> +
> +	idx = srcu_read_lock(&vcpu->kvm->srcu);
> +	ret = kvmppc_tce_iommu_map(vcpu->kvm, tbl, entry, gpa, dir);
> +	srcu_read_unlock(&vcpu->kvm->srcu, idx);
> +
> +	return ret;
> +}
> +
> +static long kvmppc_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
> +		struct iommu_table *tbl, unsigned long ioba,
> +		u64 __user *tces, unsigned long npages)
> +{
> +	unsigned long i, ret, tce, gpa;
> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> +
> +	for (i = 0; i < npages; ++i) {
> +		gpa = be64_to_cpu(tces[i]) & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +
> +		if (iommu_tce_put_param_check(tbl, ioba +
> +				(i << tbl->it_page_shift), gpa))
> +			return H_PARAMETER;
> +	}
> +
> +	for (i = 0; i < npages; ++i) {
> +		tce = be64_to_cpu(tces[i]);

tces is a user address, which means it should only be dereferenced via
get_user() or copy_from_user() helpers.
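
For instance (sketch only, reusing the names from this hunk), each
iteration could fetch the entry the same way the virtual mode handler
already does:

	u64 tce_be;

	/* tces points into userspace, so fetch entries via get_user() */
	if (get_user(tce_be, tces + i))
		return H_TOO_HARD;

	tce = be64_to_cpu(tce_be);
	gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);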

> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +
> +		ret = kvmppc_tce_iommu_map(vcpu->kvm, tbl, entry + i, gpa,
> +				iommu_tce_direction(tce));
> +		if (ret != H_SUCCESS)
> +			return ret;
> +	}
> +
> +	return H_SUCCESS;
> +}
> +
> +long kvmppc_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
> +		struct iommu_table *tbl,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_value, unsigned long npages)
> +{
> +	unsigned long i;
> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> +
> +	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
> +		return H_PARAMETER;
> +
> +	for (i = 0; i < npages; ++i)
> +		kvmppc_tce_iommu_unmap(tbl, entry + i);
> +
> +	return H_SUCCESS;
> +}
> +
>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		      unsigned long ioba, unsigned long tce)
>  {
>  	struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
>  	long ret;
> +	struct kvmppc_spapr_tce_group *kg;
> +	struct iommu_table *tbltmp = NULL;
>  
>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>  	/* 	    liobn, ioba, tce); */
> @@ -337,6 +488,15 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  	if (ret != H_SUCCESS)
>  		return ret;
>  
> +	list_for_each_entry_lockless(kg, &stt->groups, next) {
> +		if (kg->tbl == tbltmp)
> +			continue;
> +		tbltmp = kg->tbl;
> +		ret = kvmppc_h_put_tce_iommu(vcpu, kg->tbl, liobn, ioba, tce);
> +		if (ret != H_SUCCESS)
> +			return ret;
> +	}
> +
>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
>  
>  	return H_SUCCESS;
> @@ -351,6 +511,8 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	long i, ret = H_SUCCESS, idx;
>  	unsigned long entry, ua = 0;
>  	u64 __user *tces, tce;
> +	struct kvmppc_spapr_tce_group *kg;
> +	struct iommu_table *tbltmp = NULL;
>  
>  	stt = kvmppc_find_table(vcpu, liobn);
>  	if (!stt)
> @@ -378,6 +540,16 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	}
>  	tces = (u64 __user *) ua;
>  
> +	list_for_each_entry_lockless(kg, &stt->groups, next) {
> +		if (kg->tbl == tbltmp)
> +			continue;
> +		tbltmp = kg->tbl;
> +		ret = kvmppc_h_put_tce_indirect_iommu(vcpu,
> +				kg->tbl, ioba, tces, npages);
> +		if (ret != H_SUCCESS)
> +			goto unlock_exit;
> +	}
> +
>  	for (i = 0; i < npages; ++i) {
>  		if (get_user(tce, tces + i)) {
>  			ret = H_TOO_HARD;
> @@ -405,6 +577,8 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret;
> +	struct kvmppc_spapr_tce_group *kg;
> +	struct iommu_table *tbltmp = NULL;
>  
>  	stt = kvmppc_find_table(vcpu, liobn);
>  	if (!stt)
> @@ -418,6 +592,16 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>  		return H_PARAMETER;
>  
> +	list_for_each_entry_lockless(kg, &stt->groups, next) {
> +		if (kg->tbl == tbltmp)
> +			continue;
> +		tbltmp = kg->tbl;
> +		ret = kvmppc_h_stuff_tce_iommu(vcpu, kg->tbl, liobn, ioba,
> +				tce_value, npages);
> +		if (ret != H_SUCCESS)
> +			return ret;
> +	}
> +
>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>  
> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> index 11163ae..6567d6c 100644
> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -26,6 +26,7 @@
>  #include <linux/slab.h>
>  #include <linux/hugetlb.h>
>  #include <linux/list.h>
> +#include <linux/iommu.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/kvm_ppc.h>
> @@ -212,11 +213,162 @@ static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
>  	return mm_iommu_lookup_rm(mm, ua, size);
>  }
>  
> +static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm_vcpu *vcpu,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	struct mm_iommu_table_group_mem_t *mem = NULL;
> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +
> +	if (!pua)
> +		return H_SUCCESS;
> +
> +	pua = (void *) vmalloc_to_phys(pua);
> +	if (!pua)
> +		return H_SUCCESS;
> +
> +	mem = kvmppc_rm_iommu_lookup(vcpu, *pua, pgsize);
> +	if (!mem)
> +		return H_HARDWARE;
> +
> +	mm_iommu_mapped_dec(mem);
> +
> +	*pua = 0;
> +
> +	return H_SUCCESS;
> +}
> +
> +static long kvmppc_rm_tce_iommu_unmap(struct kvm_vcpu *vcpu,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	enum dma_data_direction dir = DMA_NONE;
> +	unsigned long hpa = 0;
> +
> +	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
> +		return H_HARDWARE;
> +
> +	if (dir == DMA_NONE)
> +		return H_SUCCESS;
> +
> +	return kvmppc_rm_tce_iommu_mapped_dec(vcpu, tbl, entry);
> +}
> +
> +long kvmppc_rm_tce_iommu_map(struct kvm_vcpu *vcpu, struct iommu_table *tbl,
> +		unsigned long entry, unsigned long gpa,
> +		enum dma_data_direction dir)
> +{
> +	long ret;
> +	unsigned long hpa = 0, ua;
> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +	struct mm_iommu_table_group_mem_t *mem;
> +
> +	if (kvmppc_gpa_to_ua(vcpu->kvm, gpa, &ua, NULL))
> +		return H_HARDWARE;
> +	mem = kvmppc_rm_iommu_lookup(vcpu, ua, 1ULL << tbl->it_page_shift);
> +	if (!mem)
> +		return H_HARDWARE;
> +
> +	if (mm_iommu_rm_ua_to_hpa(mem, ua, &hpa))
> +		return H_HARDWARE;
> +
> +	pua = (void *) vmalloc_to_phys(pua);
> +	if (!pua)
> +		return H_HARDWARE;
> +
> +	if (mm_iommu_mapped_inc(mem))
> +		return H_HARDWARE;
> +
> +	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> +	if (ret) {
> +		mm_iommu_mapped_dec(mem);
> +		return H_TOO_HARD;
> +	}
> +
> +	if (dir != DMA_NONE)
> +		kvmppc_rm_tce_iommu_mapped_dec(vcpu, tbl, entry);
> +
> +	*pua = ua;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(kvmppc_rm_tce_iommu_map);
> +
> +static long kvmppc_rm_h_put_tce_iommu(struct kvm_vcpu *vcpu,
> +		struct iommu_table *tbl, unsigned long liobn,
> +		unsigned long ioba, unsigned long tce)
> +{
> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> +	const unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +	const enum dma_data_direction dir = iommu_tce_direction(tce);
> +
> +	/* Clear TCE */
> +	if (dir == DMA_NONE) {
> +		if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
> +			return H_PARAMETER;
> +
> +		return kvmppc_rm_tce_iommu_unmap(vcpu, tbl, entry);
> +	}
> +
> +	/* Put TCE */
> +	if (iommu_tce_put_param_check(tbl, ioba, gpa))
> +		return H_PARAMETER;
> +
> +	return kvmppc_rm_tce_iommu_map(vcpu, tbl, entry, gpa, dir);
> +}
> +
> +static long kvmppc_rm_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
> +		struct iommu_table *tbl, unsigned long ioba,
> +		u64 *tces, unsigned long npages)
> +{
> +	unsigned long i, ret, tce, gpa;
> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> +
> +	for (i = 0; i < npages; ++i) {
> +		gpa = be64_to_cpu(tces[i]) & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +
> +		if (iommu_tce_put_param_check(tbl, ioba +
> +				(i << tbl->it_page_shift), gpa))
> +			return H_PARAMETER;
> +	}
> +
> +	for (i = 0; i < npages; ++i) {
> +		tce = be64_to_cpu(tces[i]);
> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +
> +		ret = kvmppc_rm_tce_iommu_map(vcpu, tbl, entry + i, gpa,
> +				iommu_tce_direction(tce));
> +		if (ret != H_SUCCESS)
> +			return ret;
> +	}
> +
> +	return H_SUCCESS;
> +}
> +
> +static long kvmppc_rm_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
> +		struct iommu_table *tbl,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_value, unsigned long npages)
> +{
> +	unsigned long i;
> +	const unsigned long entry = ioba >> tbl->it_page_shift;
> +
> +	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
> +		return H_PARAMETER;
> +
> +	for (i = 0; i < npages; ++i)
> +		kvmppc_rm_tce_iommu_unmap(vcpu, tbl, entry + i);
> +
> +	return H_SUCCESS;
> +}
> +
>  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		unsigned long ioba, unsigned long tce)
>  {
>  	struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
>  	long ret;
> +	struct kvmppc_spapr_tce_group *kg;
> +	struct iommu_table *tbltmp = NULL;
>  
>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>  	/* 	    liobn, ioba, tce); */
> @@ -232,6 +384,16 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  	if (ret != H_SUCCESS)
>  		return ret;
>  
> +	list_for_each_entry_lockless(kg, &stt->groups, next) {
> +		if (kg->tbl == tbltmp)
> +			continue;
> +		tbltmp = kg->tbl;
> +		ret = kvmppc_rm_h_put_tce_iommu(vcpu, kg->tbl,
> +				liobn, ioba, tce);
> +		if (ret != H_SUCCESS)
> +			return ret;
> +	}
> +
>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
>  
>  	return H_SUCCESS;
> @@ -272,6 +434,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	long i, ret = H_SUCCESS;
>  	unsigned long tces, entry, ua = 0;
>  	unsigned long *rmap = NULL;
> +	struct iommu_table *tbltmp = NULL;
>  
>  	stt = kvmppc_find_table(vcpu, liobn);
>  	if (!stt)
> @@ -299,6 +462,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  		 * depend on hpt.
>  		 */
>  		struct mm_iommu_table_group_mem_t *mem;
> +		struct kvmppc_spapr_tce_group *kg;
>  
>  		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
>  			return H_TOO_HARD;
> @@ -306,6 +470,16 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  		mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K);
>  		if (!mem || mm_iommu_rm_ua_to_hpa(mem, ua, &tces))
>  			return H_TOO_HARD;
> +
> +		list_for_each_entry_lockless(kg, &stt->groups, next) {
> +			if (kg->tbl == tbltmp)
> +				continue;
> +			tbltmp = kg->tbl;
> +			ret = kvmppc_rm_h_put_tce_indirect_iommu(vcpu,
> +					kg->tbl, ioba, (u64 *)tces, npages);
> +			if (ret != H_SUCCESS)
> +				return ret;
> +		}
>  	} else {
>  		/*
>  		 * This is emulated devices case.
> @@ -355,6 +529,8 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret;
> +	struct kvmppc_spapr_tce_group *kg;
> +	struct iommu_table *tbltmp = NULL;
>  
>  	stt = kvmppc_find_table(vcpu, liobn);
>  	if (!stt)
> @@ -368,6 +544,16 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>  		return H_PARAMETER;
>  
> +	list_for_each_entry_lockless(kg, &stt->groups, next) {
> +		if (kg->tbl == tbltmp)
> +			continue;
> +		tbltmp = kg->tbl;
> +		ret = kvmppc_rm_h_stuff_tce_iommu(vcpu, kg->tbl,
> +				liobn, ioba, tce_value, npages);
> +		if (ret != H_SUCCESS)
> +			return ret;
> +	}
> +
>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>  

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH kernel 9/9] KVM: PPC: VFIO device: support SPAPR TCE
  2016-03-07  3:41   ` Alexey Kardashevskiy
@ 2016-03-09  5:45     ` David Gibson
  -1 siblings, 0 replies; 92+ messages in thread
From: David Gibson @ 2016-03-09  5:45 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Paul Mackerras, Alex Williamson, kvm-ppc, kvm

[-- Attachment #1: Type: text/plain, Size: 10428 bytes --]

On Mon, Mar 07, 2016 at 02:41:17PM +1100, Alexey Kardashevskiy wrote:
> sPAPR TCE IOMMU is para-virtualized and the guest does map/unmap
> via hypercalls which take a logical bus id (LIOBN) as a target IOMMU
> identifier. LIOBNs are made up, advertised to guest systems and
> linked to IOMMU groups by the user space.
> In order to enable acceleration for IOMMU operations in KVM, we need
> to tell KVM the information about the LIOBN-to-group mapping.
> 
> For that, a new KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN parameter
> is added which accepts:
> - a VFIO group fd and IO base address to find the actual hardware
> TCE table;
> - a LIOBN to assign to the found table.
> 
> Before notifying KVM about the new link, this checks that the group is
> registered with the KVM device so that it can be released if KVM
> terminates unexpectedly.
> 
> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> space.
> 
> While we are here, this also fixes the VFIO KVM device build so that it
> can link into a KVM module.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  Documentation/virtual/kvm/devices/vfio.txt |  21 +++++-
>  arch/powerpc/kvm/Kconfig                   |   1 +
>  arch/powerpc/kvm/Makefile                  |   5 +-
>  arch/powerpc/kvm/powerpc.c                 |   1 +
>  include/uapi/linux/kvm.h                   |   9 +++
>  virt/kvm/vfio.c                            | 106 +++++++++++++++++++++++++++++
>  6 files changed, 140 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> index ef51740..c0d3eb7 100644
> --- a/Documentation/virtual/kvm/devices/vfio.txt
> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> @@ -16,7 +16,24 @@ Groups:
>  
>  KVM_DEV_VFIO_GROUP attributes:
>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> +	kvm_device_attr.addr points to an int32_t file descriptor
> +	for the VFIO group.

AFAICT these changes are accurate for VFIO as it is already, in which
case it might be clearer to put them in a separate patch.

>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
> +	kvm_device_attr.addr points to an int32_t file descriptor
> +	for the VFIO group.
>  
> -For each, kvm_device_attr.addr points to an int32_t file descriptor
> -for the VFIO group.
> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: sets a liobn for a VFIO group
> +	kvm_device_attr.addr points to a struct:
> +		struct kvm_vfio_spapr_tce_liobn {
> +			__u32	argsz;
> +			__s32	fd;
> +			__u32	liobn;
> +			__u8	pad[4];
> +			__u64	start_addr;
> +		};
> +		where
> +		@argsz is the size of kvm_vfio_spapr_tce_liobn;
> +		@fd is a file descriptor for a VFIO group;
> +		@liobn is a logical bus id to be associated with the group;
> +		@start_addr is a DMA window offset on the IO (PCI) bus

For the case of DDW and multiple windows, I'm assuming you can call
this multiple times with different LIOBNs and the same IOMMU group?

> diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
> index 1059846..dfa3488 100644
> --- a/arch/powerpc/kvm/Kconfig
> +++ b/arch/powerpc/kvm/Kconfig
> @@ -65,6 +65,7 @@ config KVM_BOOK3S_64
>  	select KVM
>  	select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
>  	select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
> +	select KVM_VFIO if VFIO
>  	---help---
>  	  Support running unmodified book3s_64 and book3s_32 guest kernels
>  	  in virtual machines on book3s_64 host processors.
> diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
> index 7f7b6d8..71f577c 100644
> --- a/arch/powerpc/kvm/Makefile
> +++ b/arch/powerpc/kvm/Makefile
> @@ -8,7 +8,7 @@ ccflags-y := -Ivirt/kvm -Iarch/powerpc/kvm
>  KVM := ../../../virt/kvm
>  
>  common-objs-y = $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
> -		$(KVM)/eventfd.o $(KVM)/vfio.o
> +		$(KVM)/eventfd.o

Please don't disable the VFIO device for the non-book3s case.  I added
it (even though it didn't do anything until now) so that libvirt
wouldn't choke when it finds it's not available.  Obviously the new
ioctl needs to be only for the right IOMMU setup, but the device
itself should be available always.

>  CFLAGS_e500_mmu.o := -I.
>  CFLAGS_e500_mmu_host.o := -I.
> @@ -87,6 +87,9 @@ endif
>  kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \
>  	book3s_xics.o
>  
> +kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) += \
> +	$(KVM)/vfio.o \
> +
>  kvm-book3s_64-module-objs += \
>  	$(KVM)/kvm_main.o \
>  	$(KVM)/eventfd.o \
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index 19aa59b..63f188d 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -521,6 +521,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  #ifdef CONFIG_PPC_BOOK3S_64
>  	case KVM_CAP_SPAPR_TCE:
>  	case KVM_CAP_SPAPR_TCE_64:
> +	case KVM_CAP_SPAPR_TCE_VFIO:
>  	case KVM_CAP_PPC_ALLOC_HTAB:
>  	case KVM_CAP_PPC_RTAS:
>  	case KVM_CAP_PPC_FIXUP_HCALL:
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 080ffbf..f1abbea 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1056,6 +1056,7 @@ struct kvm_device_attr {
>  #define  KVM_DEV_VFIO_GROUP			1
>  #define   KVM_DEV_VFIO_GROUP_ADD			1
>  #define   KVM_DEV_VFIO_GROUP_DEL			2
> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN	3
>  
>  enum kvm_device_type {
>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
> @@ -1075,6 +1076,14 @@ enum kvm_device_type {
>  	KVM_DEV_TYPE_MAX,
>  };
>  
> +struct kvm_vfio_spapr_tce_liobn {
> +	__u32	argsz;
> +	__s32	fd;
> +	__u32	liobn;
> +	__u8	pad[4];
> +	__u64	start_addr;
> +};
> +
>  /*
>   * ioctls for VM fds
>   */
> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> index 1dd087d..87c771e 100644
> --- a/virt/kvm/vfio.c
> +++ b/virt/kvm/vfio.c
> @@ -20,6 +20,10 @@
>  #include <linux/vfio.h>
>  #include "vfio.h"
>  
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +#include <asm/kvm_ppc.h>
> +#endif
> +
>  struct kvm_vfio_group {
>  	struct list_head node;
>  	struct vfio_group *vfio_group;
> @@ -60,6 +64,22 @@ static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
>  	symbol_put(vfio_group_put_external_user);
>  }
>  
> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> +{
> +	int (*fn)(struct vfio_group *);
> +	int ret = -1;

Should this be -ESOMETHING?

> +	fn = symbol_get(vfio_external_user_iommu_id);
> +	if (!fn)
> +		return ret;
> +
> +	ret = fn(vfio_group);
> +
> +	symbol_put(vfio_external_user_iommu_id);
> +
> +	return ret;
> +}
> +
>  static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
>  {
>  	long (*fn)(struct vfio_group *, unsigned long);
> @@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct kvm_device *dev)
>  	mutex_unlock(&kv->lock);
>  }
>  
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
> +		struct vfio_group *vfio_group)


Shouldn't this go in the same patch that introduced the attach
function?

> +{
> +	int group_id;
> +	struct iommu_group *grp;
> +
> +	group_id = kvm_vfio_external_user_iommu_id(vfio_group);
> +	grp = iommu_group_get_by_id(group_id);
> +	if (grp) {
> +		kvm_spapr_tce_detach_iommu_group(kvm, grp);
> +		iommu_group_put(grp);
> +	}
> +}
> +#endif
> +
>  static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>  {
>  	struct kvm_vfio *kv = dev->private;
> @@ -186,6 +222,10 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>  				continue;
>  
>  			list_del(&kvg->node);
> +#ifdef CONFIG_SPAPR_TCE_IOMMU

Better to make a no-op version of the call than have to #ifdef at the
callsite.
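
I.e. keep the real implementation under CONFIG_SPAPR_TCE_IOMMU and add a
stub next to it, something like (a sketch using the names from this
patch):

#else
/* No-op when SPAPR TCE IOMMU support is not configured, so callers
 * do not need an #ifdef of their own */
static inline void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
		struct vfio_group *vfio_group)
{
}
#endif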

> +			kvm_vfio_spapr_detach_iommu_group(dev->kvm,
> +					kvg->vfio_group);
> +#endif
>  			kvm_vfio_group_put_external_user(kvg->vfio_group);
>  			kfree(kvg);
>  			ret = 0;
> @@ -201,6 +241,69 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>  		kvm_vfio_update_coherency(dev);
>  
>  		return ret;
> +
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: {
> +		struct kvm_vfio_spapr_tce_liobn param;
> +		unsigned long minsz;
> +		struct kvm_vfio *kv = dev->private;
> +		struct vfio_group *vfio_group;
> +		struct kvm_vfio_group *kvg;
> +		struct fd f;
> +
> +		minsz = offsetofend(struct kvm_vfio_spapr_tce_liobn,
> +				start_addr);
> +
> +		if (copy_from_user(&param, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (param.argsz < minsz)
> +			return -EINVAL;
> +
> +		f = fdget(param.fd);
> +		if (!f.file)
> +			return -EBADF;
> +
> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
> +		fdput(f);
> +
> +		if (IS_ERR(vfio_group))
> +			return PTR_ERR(vfio_group);
> +
> +		ret = -ENOENT;

Shouldn't there be some runtime test for the type of the IOMMU?  It's
possible a kernel could be built for a platform supporting multiple
IOMMU types.

> +		mutex_lock(&kv->lock);
> +
> +		list_for_each_entry(kvg, &kv->group_list, node) {
> +			int group_id;
> +			struct iommu_group *grp;
> +
> +			if (kvg->vfio_group != vfio_group)
> +				continue;
> +
> +			group_id = kvm_vfio_external_user_iommu_id(
> +					kvg->vfio_group);
> +			grp = iommu_group_get_by_id(group_id);
> +			if (!grp) {
> +				ret = -EFAULT;
> +				break;
> +			}
> +
> +			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
> +					param.liobn, param.start_addr,
> +					grp);
> +			if (ret)
> +				iommu_group_put(grp);
> +			break;
> +		}
> +
> +		mutex_unlock(&kv->lock);
> +
> +		kvm_vfio_group_put_external_user(vfio_group);
> +
> +		return ret;
> +	}
> +#endif /* CONFIG_SPAPR_TCE_IOMMU */
>  	}
>  
>  	return -ENXIO;
> @@ -225,6 +328,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
>  		switch (attr->attr) {
>  		case KVM_DEV_VFIO_GROUP_ADD:
>  		case KVM_DEV_VFIO_GROUP_DEL:
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN:
> +#endif
>  			return 0;
>  		}
>  

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH kernel 8/9] KVM: PPC: Add in-kernel handling for VFIO
  2016-03-08 11:08     ` David Gibson
@ 2016-03-09  8:46       ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 92+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-09  8:46 UTC (permalink / raw)
  To: David Gibson; +Cc: linuxppc-dev, Paul Mackerras, Alex Williamson, kvm-ppc, kvm

On 03/08/2016 10:08 PM, David Gibson wrote:
> On Mon, Mar 07, 2016 at 02:41:16PM +1100, Alexey Kardashevskiy wrote:
>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
>> and H_STUFF_TCE requests targeting an IOMMU TCE table used for VFIO
>> without passing them to user space which saves time on switching
>> to user space and back.
>>
>> Both real and virtual modes are supported. The kernel tries to
>> handle a TCE request in real mode; if that fails, it passes the request
>> to the virtual mode handler to complete the operation. If the virtual
>> mode handler fails as well, the request is passed to user space; this
>> is not expected to ever happen though.
>
> Well... not expected to happen with a qemu which uses this.  Presumably
> it will fall back to userspace routinely if you have an old qemu that
> doesn't add the liobn mappings.


Ah. Ok, thanks, I'll add this to the commit log.


>> The first user of this is VFIO on POWER. Trampolines to the VFIO external
>> user API functions are required for this patch.
>
> I'm not sure what you mean by "trampoline" here.

For example, look at kvm_vfio_group_get_external_user. It calls 
symbol_get(vfio_group_get_external_user) and then calls a function via the 
returned pointer.
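
I.e. the pattern is (just a sketch of what that helper does):

	struct vfio_group *(*fn)(struct file *);
	struct vfio_group *ret;

	/* Resolve the symbol at runtime so kvm.ko does not get a hard
	 * module dependency on vfio */
	fn = symbol_get(vfio_group_get_external_user);
	if (!fn)
		return ERR_PTR(-EINVAL);

	ret = fn(filep);

	symbol_put(vfio_group_get_external_user);

	return ret;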

Is there a better word for this?


>> This uses a VFIO KVM device to associate a logical bus number (LIOBN)
>> with a VFIO IOMMU group fd and enable in-kernel handling of map/unmap
>> requests.
>
> Group fd?  Or container fd?  The group fd wouldn't make a lot of
> sense.


Group. KVM has no idea about containers.


>> To make use of the feature, the user space has to create a guest view
>> of the TCE table via KVM_CAP_SPAPR_TCE/KVM_CAP_SPAPR_TCE_64 and
>> then associate a LIOBN with this table via VFIO KVM device,
>> a KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN property (which is added in
>> the next patch).
>>
>> Tests show that this patch increases transmission speed from 220MB/s
>> to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb ethernet card).
>
> Is that with or without DDW (i.e. with or without a 64-bit DMA window)?


Without DDW, I should have mentioned this. The patch is from the times when 
there was no DDW :(



>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>   arch/powerpc/kvm/book3s_64_vio.c    | 184 +++++++++++++++++++++++++++++++++++
>>   arch/powerpc/kvm/book3s_64_vio_hv.c | 186 ++++++++++++++++++++++++++++++++++++
>>   2 files changed, 370 insertions(+)
>>
>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
>> index 7965fc7..9417d12 100644
>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>> @@ -33,6 +33,7 @@
>>   #include <asm/kvm_ppc.h>
>>   #include <asm/kvm_book3s.h>
>>   #include <asm/mmu-hash64.h>
>> +#include <asm/mmu_context.h>
>>   #include <asm/hvcall.h>
>>   #include <asm/synch.h>
>>   #include <asm/ppc-opcode.h>
>> @@ -317,11 +318,161 @@ fail:
>>   	return ret;
>>   }
>>
>> +static long kvmppc_tce_iommu_mapped_dec(struct iommu_table *tbl,
>> +		unsigned long entry)
>> +{
>> +	struct mm_iommu_table_group_mem_t *mem = NULL;
>> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
>> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>> +
>> +	if (!pua)
>> +		return H_HARDWARE;
>> +
>> +	mem = mm_iommu_lookup(*pua, pgsize);
>> +	if (!mem)
>> +		return H_HARDWARE;
>> +
>> +	mm_iommu_mapped_dec(mem);
>> +
>> +	*pua = 0;
>> +
>> +	return H_SUCCESS;
>> +}
>> +
>> +static long kvmppc_tce_iommu_unmap(struct iommu_table *tbl,
>> +		unsigned long entry)
>> +{
>> +	enum dma_data_direction dir = DMA_NONE;
>> +	unsigned long hpa = 0;
>> +
>> +	if (iommu_tce_xchg(tbl, entry, &hpa, &dir))
>> +		return H_HARDWARE;
>> +
>> +	if (dir == DMA_NONE)
>> +		return H_SUCCESS;
>> +
>> +	return kvmppc_tce_iommu_mapped_dec(tbl, entry);
>> +}
>> +
>> +long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
>> +		unsigned long entry, unsigned long gpa,
>> +		enum dma_data_direction dir)
>> +{
>> +	long ret;
>> +	unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>> +	struct mm_iommu_table_group_mem_t *mem;
>> +
>> +	if (!pua)
>> +		return H_HARDWARE;
>
> H_HARDWARE?  Or H_PARAMETER?  This essentially means the guest has
> supplied a bad physical address, doesn't it?

Well, maybe. I'll change it. If it is not H_TOO_HARD, it does not make any 
difference after all :)



>> +	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
>> +		return H_HARDWARE;
>> +
>> +	mem = mm_iommu_lookup(ua, 1ULL << tbl->it_page_shift);
>> +	if (!mem)
>> +		return H_HARDWARE;
>> +
>> +	if (mm_iommu_ua_to_hpa(mem, ua, &hpa))
>> +		return H_HARDWARE;
>> +
>> +	if (mm_iommu_mapped_inc(mem))
>> +		return H_HARDWARE;
>> +
>> +	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
>> +	if (ret) {
>> +		mm_iommu_mapped_dec(mem);
>> +		return H_TOO_HARD;
>> +	}
>> +
>> +	if (dir != DMA_NONE)
>> +		kvmppc_tce_iommu_mapped_dec(tbl, entry);
>> +
>> +	*pua = ua;
>
> IIUC this means you have a copy of the UA for every group attached to
> the TCE table, but they'll all be the same. Any way to avoid that
> duplication?

It is for every container, not a group. On P8, I allow multiple groups to 
go to the same container, which means that a container has one or two 
iommu_tables, and each iommu_table has this "ua" list. But since the tables 
are different (window size, page size, content), these "ua" arrays are also 
different.





-- 
Alexey

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH kernel 3/9] KVM: PPC: Use preregistered memory API to access TCE list
  2016-03-08  6:30         ` David Gibson
@ 2016-03-09  8:55           ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 92+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-09  8:55 UTC (permalink / raw)
  To: David Gibson; +Cc: linuxppc-dev, Paul Mackerras, Alex Williamson, kvm-ppc, kvm

On 03/08/2016 05:30 PM, David Gibson wrote:
> On Tue, Mar 08, 2016 at 04:47:20PM +1100, Alexey Kardashevskiy wrote:
>> On 03/07/2016 05:00 PM, David Gibson wrote:
>>> On Mon, Mar 07, 2016 at 02:41:11PM +1100, Alexey Kardashevskiy wrote:
>>>> VFIO on sPAPR already implements guest memory pre-registration
>>>> when the entire guest RAM gets pinned. This can be used to translate
>>>> the physical address of a guest page containing the TCE list
>>>> from H_PUT_TCE_INDIRECT.
>>>>
>>>> This makes use of the pre-registered memory API to access TCE list
>>>> pages in order to avoid unnecessary locking on the KVM memory
>>>> reverse map.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>
>>> Ok.. so, what's the benefit of not having to lock the rmap?
>>
>> Less locking -> less racing == good, no?
>
> Well.. maybe.  The increased difficulty in verifying that the code is
> correct isn't always a good price to pay.
>
>>>> ---
>>>>   arch/powerpc/kvm/book3s_64_vio_hv.c | 86 ++++++++++++++++++++++++++++++-------
>>>>   1 file changed, 70 insertions(+), 16 deletions(-)
>>>>
>>>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>> index 44be73e..af155f6 100644
>>>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>> @@ -180,6 +180,38 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
>>>>   EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
>>>>
>>>>   #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
>>>> +static mm_context_t *kvmppc_mm_context(struct kvm_vcpu *vcpu)
>>>> +{
>>>> +	struct task_struct *task;
>>>> +
>>>> +	task = vcpu->arch.run_task;
>>>> +	if (unlikely(!task || !task->mm))
>>>> +		return NULL;
>>>> +
>>>> +	return &task->mm->context;
>>>> +}
>>>> +
>>>> +static inline bool kvmppc_preregistered(struct kvm_vcpu *vcpu)
>>>> +{
>>>> +	mm_context_t *mm = kvmppc_mm_context(vcpu);
>>>> +
>>>> +	if (unlikely(!mm))
>>>> +		return false;
>>>> +
>>>> +	return mm_iommu_preregistered(mm);
>>>> +}
>>>> +
>>>> +static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
>>>> +		struct kvm_vcpu *vcpu, unsigned long ua, unsigned long size)
>>>> +{
>>>> +	mm_context_t *mm = kvmppc_mm_context(vcpu);
>>>> +
>>>> +	if (unlikely(!mm))
>>>> +		return NULL;
>>>> +
>>>> +	return mm_iommu_lookup_rm(mm, ua, size);
>>>> +}
>>>> +
>>>>   long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>>>   		      unsigned long ioba, unsigned long tce)
>>>>   {
>>>> @@ -261,23 +293,44 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>>   	if (ret != H_SUCCESS)
>>>>   		return ret;
>>>>
>>>> -	if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
>>>> -		return H_TOO_HARD;
>>>> +	if (kvmppc_preregistered(vcpu)) {
>>>> +		/*
>>>> +		 * We get here if guest memory was pre-registered which
>>>> +		 * is normally VFIO case and gpa->hpa translation does not
>>>> +		 * depend on hpt.
>>>> +		 */
>>>> +		struct mm_iommu_table_group_mem_t *mem;
>>>>
>>>> -	rmap = (void *) vmalloc_to_phys(rmap);
>>>> +		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
>>>> +			return H_TOO_HARD;
>>>>
>>>> -	/*
>>>> -	 * Synchronize with the MMU notifier callbacks in
>>>> -	 * book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
>>>> -	 * While we have the rmap lock, code running on other CPUs
>>>> -	 * cannot finish unmapping the host real page that backs
>>>> -	 * this guest real page, so we are OK to access the host
>>>> -	 * real page.
>>>> -	 */
>>>> -	lock_rmap(rmap);
>>>> -	if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
>>>> -		ret = H_TOO_HARD;
>>>> -		goto unlock_exit;
>>>> +		mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K);
>>>> +		if (!mem || mm_iommu_rm_ua_to_hpa(mem, ua, &tces))
>>>> +			return H_TOO_HARD;
>>>> +	} else {
>>>> +		/*
>>>> +		 * This is emulated devices case.
>>>> +		 * We do not require memory to be preregistered in this case
>>>> +		 * so lock rmap and do __find_linux_pte_or_hugepte().
>>>> +		 */
>>>> +		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
>>>> +			return H_TOO_HARD;
>>>> +
>>>> +		rmap = (void *) vmalloc_to_phys(rmap);
>>>> +
>>>> +		/*
>>>> +		 * Synchronize with the MMU notifier callbacks in
>>>> +		 * book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
>>>> +		 * While we have the rmap lock, code running on other CPUs
>>>> +		 * cannot finish unmapping the host real page that backs
>>>> +		 * this guest real page, so we are OK to access the host
>>>> +		 * real page.
>>>> +		 */
>>>> +		lock_rmap(rmap);
>>>> +		if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
>>>> +			ret = H_TOO_HARD;
>>>> +			goto unlock_exit;
>>>> +		}
>>>>   	}
>>>>
>>>>   	for (i = 0; i < npages; ++i) {
>>>> @@ -291,7 +344,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>>   	}
>>>>
>>>>   unlock_exit:
>>>> -	unlock_rmap(rmap);
>>>> +	if (rmap)
>>>
>>> I don't see where rmap is initialized to NULL in the case where it's
>>> not being used.
>>
>> @rmap is not new to this function, and it has always been initialized to
>> NULL as it was returned via a pointer from kvmppc_gpa_to_ua().
>
> This comment confuses me.  Looking closer at the code I see you're
> right, and it's initialized to NULL where defined, which I missed.
>
> But that has nothing to do with being returned by pointer from
> kvmppc_gpa_to_ua(), since one of your branches in the new code no
> longer passes &rmap to that function.


So? The code is still correct - the "preregistered branch" does not touch 
the NULL pointer, it remains NULL and unlock_rmap() is not called. I agree 
the patch is not the easiest to read, but how can I improve it to get your 
"rb"? Replace "if (rmap)" with "if (kvmppc_preregistered(vcpu))"? Move that 
loop between lock_rmap/unlock_rmap to a helper? kvmppc_rm_h_put_tce_indirect() 
is not big enough to justify splitting, though the comments inside it are...
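
E.g. (a sketch of the first option) the exit path would then be:

unlock_exit:
	/* rmap is only locked in the emulated devices case */
	if (!kvmppc_preregistered(vcpu))
		unlock_rmap(rmap);

	return ret;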



-- 
Alexey

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH kernel 9/9] KVM: PPC: VFIO device: support SPAPR TCE
  2016-03-09  5:45     ` David Gibson
@ 2016-03-09  9:20       ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 92+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-09  9:20 UTC (permalink / raw)
  To: David Gibson; +Cc: linuxppc-dev, Paul Mackerras, Alex Williamson, kvm-ppc, kvm

On 03/09/2016 04:45 PM, David Gibson wrote:
> On Mon, Mar 07, 2016 at 02:41:17PM +1100, Alexey Kardashevskiy wrote:
>> sPAPR TCE IOMMU is para-virtualized and the guest does map/unmap
>> via hypercalls which take a logical bus id (LIOBN) as a target IOMMU
>> identifier. LIOBNs are made up, advertised to guest systems and
>> linked to IOMMU groups by the user space.
>> In order to enable acceleration for IOMMU operations in KVM, we need
>> to tell KVM the information about the LIOBN-to-group mapping.
>>
>> For that, a new KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN parameter
>> is added which accepts:
>> - a VFIO group fd and IO base address to find the actual hardware
>> TCE table;
>> - a LIOBN to assign to the found table.
>>
>> Before notifying KVM about the new link, this checks that the group is
>> registered with the KVM device so that it can be released at unexpected
>> KVM finish.
>>
>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
>> space.
>>
>> While we are here, this also fixes VFIO KVM device compiling to let it
>> link to a KVM module.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>   Documentation/virtual/kvm/devices/vfio.txt |  21 +++++-
>>   arch/powerpc/kvm/Kconfig                   |   1 +
>>   arch/powerpc/kvm/Makefile                  |   5 +-
>>   arch/powerpc/kvm/powerpc.c                 |   1 +
>>   include/uapi/linux/kvm.h                   |   9 +++
>>   virt/kvm/vfio.c                            | 106 +++++++++++++++++++++++++++++
>>   6 files changed, 140 insertions(+), 3 deletions(-)
>>
>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
>> index ef51740..c0d3eb7 100644
>> --- a/Documentation/virtual/kvm/devices/vfio.txt
>> +++ b/Documentation/virtual/kvm/devices/vfio.txt
>> @@ -16,7 +16,24 @@ Groups:
>>
>>   KVM_DEV_VFIO_GROUP attributes:
>>     KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
>> +	kvm_device_attr.addr points to an int32_t file descriptor
>> +	for the VFIO group.
>
> AFAICT these changes are accurate for VFIO as it is already, in which
> case it might be clearer to put them in a separate patch.
>
>>     KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
>> +	kvm_device_attr.addr points to an int32_t file descriptor
>> +	for the VFIO group.
>>
>> -For each, kvm_device_attr.addr points to an int32_t file descriptor
>> -for the VFIO group.
>> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: sets a liobn for a VFIO group
>> +	kvm_device_attr.addr points to a struct:
>> +		struct kvm_vfio_spapr_tce_liobn {
>> +			__u32	argsz;
>> +			__s32	fd;
>> +			__u32	liobn;
>> +			__u8	pad[4];
>> +			__u64	start_addr;
>> +		};
>> +		where
>> +		@argsz is the size of kvm_vfio_spapr_tce_liobn;
>> +		@fd is a file descriptor for a VFIO group;
>> +		@liobn is a logical bus id to be associated with the group;
>> +		@start_addr is a DMA window offset on the IO (PCI) bus
>
> For the case of DDW and multiple windows, I'm assuming you can call
> this multiple times with different LIOBNs and the same IOMMU group?


Yes. It is called twice per group (when DDW is activated) - once for the 
32-bit window and once for the 64-bit window; this is why @start_addr is there.
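
To illustrate the intended calling sequence from user space - a sketch where 
kdev_fd is a VFIO KVM device fd from KVM_CREATE_DEVICE, group_fd/liobn32/liobn64 
are assumed inputs, and the 64-bit window offset is an example value only:

	struct kvm_vfio_spapr_tce_liobn param = {
		.argsz      = sizeof(param),
		.fd         = group_fd,
		.liobn      = liobn32,
		.start_addr = 0,		/* default 32-bit window */
	};
	struct kvm_device_attr attr = {
		.group = KVM_DEV_VFIO_GROUP,
		.attr  = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN,
		.addr  = (__u64)(unsigned long)&param,
	};

	ioctl(kdev_fd, KVM_SET_DEVICE_ATTR, &attr);

	/* second call for the same group once DDW creates the 64-bit window */
	param.liobn = liobn64;
	param.start_addr = 0x800000000000000ULL;	/* example offset */
	ioctl(kdev_fd, KVM_SET_DEVICE_ATTR, &attr);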


>> diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
>> index 1059846..dfa3488 100644
>> --- a/arch/powerpc/kvm/Kconfig
>> +++ b/arch/powerpc/kvm/Kconfig
>> @@ -65,6 +65,7 @@ config KVM_BOOK3S_64
>>   	select KVM
>>   	select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
>>   	select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
>> +	select KVM_VFIO if VFIO
>>   	---help---
>>   	  Support running unmodified book3s_64 and book3s_32 guest kernels
>>   	  in virtual machines on book3s_64 host processors.
>> diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
>> index 7f7b6d8..71f577c 100644
>> --- a/arch/powerpc/kvm/Makefile
>> +++ b/arch/powerpc/kvm/Makefile
>> @@ -8,7 +8,7 @@ ccflags-y := -Ivirt/kvm -Iarch/powerpc/kvm
>>   KVM := ../../../virt/kvm
>>
>>   common-objs-y = $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
>> -		$(KVM)/eventfd.o $(KVM)/vfio.o
>> +		$(KVM)/eventfd.o
>
> Please don't disable the VFIO device for the non-book3s case.  I added
> it (even though it didn't do anything until now) so that libvirt
> wouldn't choke when it finds it's not available.  Obviously the new
> ioctl needs to be only for the right IOMMU setup, but the device
> itself should be available always.

Ah. Ok, I'll fix this. I just wanted to be able to compile kvm as a module.


>>   CFLAGS_e500_mmu.o := -I.
>>   CFLAGS_e500_mmu_host.o := -I.
>> @@ -87,6 +87,9 @@ endif
>>   kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \
>>   	book3s_xics.o
>>
>> +kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) += \
>> +	$(KVM)/vfio.o \
>> +
>>   kvm-book3s_64-module-objs += \
>>   	$(KVM)/kvm_main.o \
>>   	$(KVM)/eventfd.o \
>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>> index 19aa59b..63f188d 100644
>> --- a/arch/powerpc/kvm/powerpc.c
>> +++ b/arch/powerpc/kvm/powerpc.c
>> @@ -521,6 +521,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>>   #ifdef CONFIG_PPC_BOOK3S_64
>>   	case KVM_CAP_SPAPR_TCE:
>>   	case KVM_CAP_SPAPR_TCE_64:
>> +	case KVM_CAP_SPAPR_TCE_VFIO:
>>   	case KVM_CAP_PPC_ALLOC_HTAB:
>>   	case KVM_CAP_PPC_RTAS:
>>   	case KVM_CAP_PPC_FIXUP_HCALL:
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index 080ffbf..f1abbea 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -1056,6 +1056,7 @@ struct kvm_device_attr {
>>   #define  KVM_DEV_VFIO_GROUP			1
>>   #define   KVM_DEV_VFIO_GROUP_ADD			1
>>   #define   KVM_DEV_VFIO_GROUP_DEL			2
>> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN	3
>>
>>   enum kvm_device_type {
>>   	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
>> @@ -1075,6 +1076,14 @@ enum kvm_device_type {
>>   	KVM_DEV_TYPE_MAX,
>>   };
>>
>> +struct kvm_vfio_spapr_tce_liobn {
>> +	__u32	argsz;
>> +	__s32	fd;
>> +	__u32	liobn;
>> +	__u8	pad[4];
>> +	__u64	start_addr;
>> +};
>> +
>>   /*
>>    * ioctls for VM fds
>>    */
>> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
>> index 1dd087d..87c771e 100644
>> --- a/virt/kvm/vfio.c
>> +++ b/virt/kvm/vfio.c
>> @@ -20,6 +20,10 @@
>>   #include <linux/vfio.h>
>>   #include "vfio.h"
>>
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +#include <asm/kvm_ppc.h>
>> +#endif
>> +
>>   struct kvm_vfio_group {
>>   	struct list_head node;
>>   	struct vfio_group *vfio_group;
>> @@ -60,6 +64,22 @@ static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
>>   	symbol_put(vfio_group_put_external_user);
>>   }
>>
>> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
>> +{
>> +	int (*fn)(struct vfio_group *);
>> +	int ret = -1;
>
> Should this be -ESOMETHING?
>
>> +	fn = symbol_get(vfio_external_user_iommu_id);
>> +	if (!fn)
>> +		return ret;
>> +
>> +	ret = fn(vfio_group);
>> +
>> +	symbol_put(vfio_external_user_iommu_id);
>> +
>> +	return ret;
>> +}
>> +
>>   static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
>>   {
>>   	long (*fn)(struct vfio_group *, unsigned long);
>> @@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct kvm_device *dev)
>>   	mutex_unlock(&kv->lock);
>>   }
>>
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
>> +		struct vfio_group *vfio_group)
>
>
> Shouldn't this go in the same patch that introduced the attach
> function?

Having fewer patches which touch different maintainers' areas is better. I 
cannot avoid touching both PPC KVM and VFIO in this patch, but I can in 
"[PATCH kernel 6/9] KVM: PPC: Associate IOMMU group with guest view of TCE 
table".


>
>> +{
>> +	int group_id;
>> +	struct iommu_group *grp;
>> +
>> +	group_id = kvm_vfio_external_user_iommu_id(vfio_group);
>> +	grp = iommu_group_get_by_id(group_id);
>> +	if (grp) {
>> +		kvm_spapr_tce_detach_iommu_group(kvm, grp);
>> +		iommu_group_put(grp);
>> +	}
>> +}
>> +#endif
>> +
>>   static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>>   {
>>   	struct kvm_vfio *kv = dev->private;
>> @@ -186,6 +222,10 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>>   				continue;
>>
>>   			list_del(&kvg->node);
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>
> Better to make a no-op version of the call than have to #ifdef at the
> callsite.

That is questionable. An x86 reader may decide that 
KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN is implemented for x86 and get confused.
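
For reference, the stub approach being suggested is the usual kernel 
pattern - a minimal sketch, assuming both variants live in virt/kvm/vfio.c:

#ifdef CONFIG_SPAPR_TCE_IOMMU
static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
		struct vfio_group *vfio_group)
{
	/* the real detach, as in the patch */
}
#else
static inline void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
		struct vfio_group *vfio_group)
{
	/* no-op when the sPAPR TCE IOMMU is not configured */
}
#endif

This would drop the #ifdef from the call site, at the cost of the ambiguity 
described above.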


>
>> +			kvm_vfio_spapr_detach_iommu_group(dev->kvm,
>> +					kvg->vfio_group);
>> +#endif
>>   			kvm_vfio_group_put_external_user(kvg->vfio_group);
>>   			kfree(kvg);
>>   			ret = 0;
>> @@ -201,6 +241,69 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>>   		kvm_vfio_update_coherency(dev);
>>
>>   		return ret;
>> +
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: {
>> +		struct kvm_vfio_spapr_tce_liobn param;
>> +		unsigned long minsz;
>> +		struct kvm_vfio *kv = dev->private;
>> +		struct vfio_group *vfio_group;
>> +		struct kvm_vfio_group *kvg;
>> +		struct fd f;
>> +
>> +		minsz = offsetofend(struct kvm_vfio_spapr_tce_liobn,
>> +				start_addr);
>> +
>> +		if (copy_from_user(&param, (void __user *)arg, minsz))
>> +			return -EFAULT;
>> +
>> +		if (param.argsz < minsz)
>> +			return -EINVAL;
>> +
>> +		f = fdget(param.fd);
>> +		if (!f.file)
>> +			return -EBADF;
>> +
>> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
>> +		fdput(f);
>> +
>> +		if (IS_ERR(vfio_group))
>> +			return PTR_ERR(vfio_group);
>> +
>> +		ret = -ENOENT;
>
> Shouldn't there be some runtime test for the type of the IOMMU?  It's
> possible a kernel could be built for a platform supporting multiple
> IOMMU types.

Well, it may make sense, but I do not know how to test for that. The IOMMU 
type is a VFIO container property, not a group property, and here (in KVM) 
we only have groups.

And calling iommu_group_get_iommudata() is quite useless as it returns a 
void pointer... I could probably check that the release() callback is the 
one I set via iommu_group_set_iommudata(), but there is no API to get it 
from a group.

And I cannot really imagine a kernel with CONFIG_PPC_BOOK3S_64 (and 
therefore KVM_CAP_SPAPR_TCE_VFIO enabled) with different IOMMU types. Can 
the same kernel binary image work on both BOOK3S and embedded PPC? Where 
would these other types come from?
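
To make the question concrete: the check that cannot currently be written 
would look roughly like this, where iommu_group_is_spapr_tce() is invented 
for illustration and does not exist - which is exactly the gap above:

	grp = iommu_group_get_by_id(group_id);
	if (grp && !iommu_group_is_spapr_tce(grp)) {	/* hypothetical */
		iommu_group_put(grp);
		return -EINVAL;
	}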


>
>> +		mutex_lock(&kv->lock);
>> +
>> +		list_for_each_entry(kvg, &kv->group_list, node) {
>> +			int group_id;
>> +			struct iommu_group *grp;
>> +
>> +			if (kvg->vfio_group != vfio_group)
>> +				continue;
>> +
>> +			group_id = kvm_vfio_external_user_iommu_id(
>> +					kvg->vfio_group);
>> +			grp = iommu_group_get_by_id(group_id);
>> +			if (!grp) {
>> +				ret = -EFAULT;
>> +				break;
>> +			}
>> +
>> +			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
>> +					param.liobn, param.start_addr,
>> +					grp);
>> +			if (ret)
>> +				iommu_group_put(grp);
>> +			break;
>> +		}
>> +
>> +		mutex_unlock(&kv->lock);
>> +
>> +		kvm_vfio_group_put_external_user(vfio_group);
>> +
>> +		return ret;
>> +	}
>> +#endif /* CONFIG_SPAPR_TCE_IOMMU */
>>   	}
>>
>>   	return -ENXIO;
>> @@ -225,6 +328,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
>>   		switch (attr->attr) {
>>   		case KVM_DEV_VFIO_GROUP_ADD:
>>   		case KVM_DEV_VFIO_GROUP_DEL:
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN:
>> +#endif
>>   			return 0;
>>   		}
>>
>


-- 
Alexey

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH kernel 3/9] KVM: PPC: Use preregistered memory API to access TCE list
  2016-03-09  8:55           ` Alexey Kardashevskiy
@ 2016-03-09 23:46             ` David Gibson
  -1 siblings, 0 replies; 92+ messages in thread
From: David Gibson @ 2016-03-09 23:46 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Paul Mackerras, Alex Williamson, kvm-ppc, kvm

[-- Attachment #1: Type: text/plain, Size: 6674 bytes --]

On Wed, Mar 09, 2016 at 07:55:53PM +1100, Alexey Kardashevskiy wrote:
> On 03/08/2016 05:30 PM, David Gibson wrote:
> >On Tue, Mar 08, 2016 at 04:47:20PM +1100, Alexey Kardashevskiy wrote:
> >>On 03/07/2016 05:00 PM, David Gibson wrote:
> >>>On Mon, Mar 07, 2016 at 02:41:11PM +1100, Alexey Kardashevskiy wrote:
> >>>>VFIO on sPAPR already implements guest memory pre-registration
> >>>>when the entire guest RAM gets pinned. This can be used to translate
> >>>>the physical address of a guest page containing the TCE list
> >>>>from H_PUT_TCE_INDIRECT.
> >>>>
> >>>>This makes use of the pre-registered memory API to access TCE list
> >>>>pages in order to avoid unnecessary locking on the KVM memory
> >>>>reverse map.
> >>>>
> >>>>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>
> >>>Ok.. so, what's the benefit of not having to lock the rmap?
> >>
> >>Less locking -> less racing == good, no?
> >
> >Well.. maybe.  The increased difficulty in verifying that the code is
> >correct isn't always a good price to pay.
> >
> >>>>---
> >>>>  arch/powerpc/kvm/book3s_64_vio_hv.c | 86 ++++++++++++++++++++++++++++++-------
> >>>>  1 file changed, 70 insertions(+), 16 deletions(-)
> >>>>
> >>>>diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> >>>>index 44be73e..af155f6 100644
> >>>>--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> >>>>+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> >>>>@@ -180,6 +180,38 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
> >>>>  EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
> >>>>
> >>>>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> >>>>+static mm_context_t *kvmppc_mm_context(struct kvm_vcpu *vcpu)
> >>>>+{
> >>>>+	struct task_struct *task;
> >>>>+
> >>>>+	task = vcpu->arch.run_task;
> >>>>+	if (unlikely(!task || !task->mm))
> >>>>+		return NULL;
> >>>>+
> >>>>+	return &task->mm->context;
> >>>>+}
> >>>>+
> >>>>+static inline bool kvmppc_preregistered(struct kvm_vcpu *vcpu)
> >>>>+{
> >>>>+	mm_context_t *mm = kvmppc_mm_context(vcpu);
> >>>>+
> >>>>+	if (unlikely(!mm))
> >>>>+		return false;
> >>>>+
> >>>>+	return mm_iommu_preregistered(mm);
> >>>>+}
> >>>>+
> >>>>+static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
> >>>>+		struct kvm_vcpu *vcpu, unsigned long ua, unsigned long size)
> >>>>+{
> >>>>+	mm_context_t *mm = kvmppc_mm_context(vcpu);
> >>>>+
> >>>>+	if (unlikely(!mm))
> >>>>+		return NULL;
> >>>>+
> >>>>+	return mm_iommu_lookup_rm(mm, ua, size);
> >>>>+}
> >>>>+
> >>>>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> >>>>  		      unsigned long ioba, unsigned long tce)
> >>>>  {
> >>>>@@ -261,23 +293,44 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>>>  	if (ret != H_SUCCESS)
> >>>>  		return ret;
> >>>>
> >>>>-	if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
> >>>>-		return H_TOO_HARD;
> >>>>+	if (kvmppc_preregistered(vcpu)) {
> >>>>+		/*
> >>>>+		 * We get here if guest memory was pre-registered which
> >>>>+		 * is normally VFIO case and gpa->hpa translation does not
> >>>>+		 * depend on hpt.
> >>>>+		 */
> >>>>+		struct mm_iommu_table_group_mem_t *mem;
> >>>>
> >>>>-	rmap = (void *) vmalloc_to_phys(rmap);
> >>>>+		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
> >>>>+			return H_TOO_HARD;
> >>>>
> >>>>-	/*
> >>>>-	 * Synchronize with the MMU notifier callbacks in
> >>>>-	 * book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
> >>>>-	 * While we have the rmap lock, code running on other CPUs
> >>>>-	 * cannot finish unmapping the host real page that backs
> >>>>-	 * this guest real page, so we are OK to access the host
> >>>>-	 * real page.
> >>>>-	 */
> >>>>-	lock_rmap(rmap);
> >>>>-	if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
> >>>>-		ret = H_TOO_HARD;
> >>>>-		goto unlock_exit;
> >>>>+		mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K);
> >>>>+		if (!mem || mm_iommu_rm_ua_to_hpa(mem, ua, &tces))
> >>>>+			return H_TOO_HARD;
> >>>>+	} else {
> >>>>+		/*
> >>>>+		 * This is emulated devices case.
> >>>>+		 * We do not require memory to be preregistered in this case
> >>>>+		 * so lock rmap and do __find_linux_pte_or_hugepte().
> >>>>+		 */
> >>>>+		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
> >>>>+			return H_TOO_HARD;
> >>>>+
> >>>>+		rmap = (void *) vmalloc_to_phys(rmap);
> >>>>+
> >>>>+		/*
> >>>>+		 * Synchronize with the MMU notifier callbacks in
> >>>>+		 * book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
> >>>>+		 * While we have the rmap lock, code running on other CPUs
> >>>>+		 * cannot finish unmapping the host real page that backs
> >>>>+		 * this guest real page, so we are OK to access the host
> >>>>+		 * real page.
> >>>>+		 */
> >>>>+		lock_rmap(rmap);
> >>>>+		if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
> >>>>+			ret = H_TOO_HARD;
> >>>>+			goto unlock_exit;
> >>>>+		}
> >>>>  	}
> >>>>
> >>>>  	for (i = 0; i < npages; ++i) {
> >>>>@@ -291,7 +344,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>>>  	}
> >>>>
> >>>>  unlock_exit:
> >>>>-	unlock_rmap(rmap);
> >>>>+	if (rmap)
> >>>
> >>>I don't see where rmap is initialized to NULL in the case where it's
> >>>not being used.
> >>
> >>@rmap is not new to this function, and it has always been initialized to
> >>NULL as it was returned via a pointer from kvmppc_gpa_to_ua().
> >
> >This comment confuses me.  Looking closer at the code I see you're
> >right, and it's initialized to NULL where defined, which I missed.
> >
> >But that has nothing to do with being returned by pointer from
> >kvmppc_gpa_to_ua(), since one of your branches in the new code no
> >longer passes &rmap to that function.
> 
> 
> So? The code is still correct - the "preregistered" branch does not touch
> the NULL pointer, @rmap remains NULL and unlock_rmap() is not called. I agree
> the patch is not the easiest to read, but how can I improve it to get your
> "rb"? Replace "if (rmap)" with "if (kvmppc_preregistered(vcpu))"? Move the
> loop between lock_rmap()/unlock_rmap() to a helper?
> kvmppc_rm_h_put_tce_indirect() is not big enough to justify splitting,
> though the comments inside it are...

Sorry, I wasn't clear.  I no longer have a specific objection.  I left
out the R-b, since there have been enough other comments on the series
that I was expecting a respin, so I was planning to re-review in the
context of an updated series.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH kernel 8/9] KVM: PPC: Add in-kernel handling for VFIO
  2016-03-09  8:46       ` Alexey Kardashevskiy
@ 2016-03-10  5:18         ` David Gibson
  -1 siblings, 0 replies; 92+ messages in thread
From: David Gibson @ 2016-03-10  5:18 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Paul Mackerras, Alex Williamson, kvm-ppc, kvm

[-- Attachment #1: Type: text/plain, Size: 7212 bytes --]

On Wed, Mar 09, 2016 at 07:46:47PM +1100, Alexey Kardashevskiy wrote:
> On 03/08/2016 10:08 PM, David Gibson wrote:
> >On Mon, Mar 07, 2016 at 02:41:16PM +1100, Alexey Kardashevskiy wrote:
> >>This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> >>and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
> >>without passing them to user space, which saves time on switching
> >>to user space and back.
> >>
> >>Both real and virtual modes are supported. The kernel tries to
> >>handle a TCE request in real mode; if that fails, it passes the request
> >>to virtual mode to complete the operation. If the virtual mode handler
> >>also fails, the request is passed to user space; this is not expected
> >>to ever happen though.
> >
> >Well... not expected to happen with a qemu which uses this.  Presumably
> >it will fall back to userspace routinely if you have an old qemu that
> >doesn't add the liobn mappings.
> 
> 
> Ah. Ok, thanks, I'll add this to the commit log.

Ok.

> >>The first user of this is VFIO on POWER. Trampolines to the VFIO external
> >>user API functions are required for this patch.
> >
> >I'm not sure what you mean by "trampoline" here.
> 
> For example, look at kvm_vfio_group_get_external_user. It calls
> symbol_get(vfio_group_get_external_user) and then calls a function via the
> returned pointer.
> 
> Is there a better word for this?

Uh.. probably, although I don't immediately know what.  "Trampoline"
usually refers to code on the stack used for bouncing between places, which
isn't what this resembles.
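
For reference, the construct in question is roughly this - a sketch of the 
wrapper in virt/kvm/vfio.c:

static struct vfio_group *kvm_vfio_group_get_external_user(struct file *filep)
{
	struct vfio_group *vfio_group;
	struct vfio_group *(*fn)(struct file *);

	/* symbol_get() takes a reference on the module providing the symbol */
	fn = symbol_get(vfio_group_get_external_user);
	if (!fn)
		return ERR_PTR(-EINVAL);

	vfio_group = fn(filep);

	symbol_put(vfio_group_get_external_user);

	return vfio_group;
}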

> >>This uses a VFIO KVM device to associate a logical bus number (LIOBN)
> >>with a VFIO IOMMU group fd and enable in-kernel handling of map/unmap
> >>requests.
> >
> >Group fd?  Or container fd?  The group fd wouldn't make a lot of
> >sense.
> 
> 
> Group. KVM has no idea about containers.

That's not going to fly.  Having a liobn registered against just one
group in a container makes no sense at all.  Conceptually, if not
physically, the container shares a single set of TCE tables.  If
handling that means teaching KVM the concept of containers, then so be
it.

Btw, I'm not sure yet if extending the existing vfio kvm device to
make the vfio<->kvm linkages makes sense.  I think the reason some x86
machines need that is quite different from how we're using it for
Power.  I haven't got a clear enough picture yet to be sure either
way.

The other option that would seem likely to me would be a "bind VFIO
container" ioctl() on the fd associated with a kernel accelerated TCE table.

> >>To make use of the feature, the user space has to create a guest view
> >>of the TCE table via KVM_CAP_SPAPR_TCE/KVM_CAP_SPAPR_TCE_64 and
> >>then associate a LIOBN with this table via VFIO KVM device,
> >>a KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN property (which is added in
> >>the next patch).
> >>
> >>Tests show that this patch increases transmission speed from 220MB/s
> >>to 750..1020MB/s on 10Gb network (Chelsio CXGB3 10Gb ethernet card).
> >
> >Is that with or without DDW (i.e. with or without a 64-bit DMA window)?
> 
> 
> Without DDW, I should have mentioned this. The patch is from the times when
> there was no DDW :(

Ok.

> >>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>---
> >>  arch/powerpc/kvm/book3s_64_vio.c    | 184 +++++++++++++++++++++++++++++++++++
> >>  arch/powerpc/kvm/book3s_64_vio_hv.c | 186 ++++++++++++++++++++++++++++++++++++
> >>  2 files changed, 370 insertions(+)
> >>
> >>diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> >>index 7965fc7..9417d12 100644
> >>--- a/arch/powerpc/kvm/book3s_64_vio.c
> >>+++ b/arch/powerpc/kvm/book3s_64_vio.c
> >>@@ -33,6 +33,7 @@
> >>  #include <asm/kvm_ppc.h>
> >>  #include <asm/kvm_book3s.h>
> >>  #include <asm/mmu-hash64.h>
> >>+#include <asm/mmu_context.h>
> >>  #include <asm/hvcall.h>
> >>  #include <asm/synch.h>
> >>  #include <asm/ppc-opcode.h>
> >>@@ -317,11 +318,161 @@ fail:
> >>  	return ret;
> >>  }
> >>
> >>+static long kvmppc_tce_iommu_mapped_dec(struct iommu_table *tbl,
> >>+		unsigned long entry)
> >>+{
> >>+	struct mm_iommu_table_group_mem_t *mem = NULL;
> >>+	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> >>+	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> >>+
> >>+	if (!pua)
> >>+		return H_HARDWARE;
> >>+
> >>+	mem = mm_iommu_lookup(*pua, pgsize);
> >>+	if (!mem)
> >>+		return H_HARDWARE;
> >>+
> >>+	mm_iommu_mapped_dec(mem);
> >>+
> >>+	*pua = 0;
> >>+
> >>+	return H_SUCCESS;
> >>+}
> >>+
> >>+static long kvmppc_tce_iommu_unmap(struct iommu_table *tbl,
> >>+		unsigned long entry)
> >>+{
> >>+	enum dma_data_direction dir = DMA_NONE;
> >>+	unsigned long hpa = 0;
> >>+
> >>+	if (iommu_tce_xchg(tbl, entry, &hpa, &dir))
> >>+		return H_HARDWARE;
> >>+
> >>+	if (dir == DMA_NONE)
> >>+		return H_SUCCESS;
> >>+
> >>+	return kvmppc_tce_iommu_mapped_dec(tbl, entry);
> >>+}
> >>+
> >>+long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
> >>+		unsigned long entry, unsigned long gpa,
> >>+		enum dma_data_direction dir)
> >>+{
> >>+	long ret;
> >>+	unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> >>+	struct mm_iommu_table_group_mem_t *mem;
> >>+
> >>+	if (!pua)
> >>+		return H_HARDWARE;
> >
> >H_HARDWARE?  Or H_PARAMETER?  This essentially means the guest has
> >supplied a bad physical address, doesn't it?
> 
> Well, maybe. I'll change. If it is not H_TOO_HARD, it does not make any
> difference after all :)
> 
> 
> 
> >>+	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
> >>+		return H_HARDWARE;
> >>+
> >>+	mem = mm_iommu_lookup(ua, 1ULL << tbl->it_page_shift);
> >>+	if (!mem)
> >>+		return H_HARDWARE;
> >>+
> >>+	if (mm_iommu_ua_to_hpa(mem, ua, &hpa))
> >>+		return H_HARDWARE;
> >>+
> >>+	if (mm_iommu_mapped_inc(mem))
> >>+		return H_HARDWARE;
> >>+
> >>+	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
> >>+	if (ret) {
> >>+		mm_iommu_mapped_dec(mem);
> >>+		return H_TOO_HARD;
> >>+	}
> >>+
> >>+	if (dir != DMA_NONE)
> >>+		kvmppc_tce_iommu_mapped_dec(tbl, entry);
> >>+
> >>+	*pua = ua;
> >
> >IIUC this means you have a copy of the UA for every group attached to
> >the TCE table, but they'll all be the same. Any way to avoid that
> >duplication?
> 
> It is for every container, not a group. On P8, I allow multiple groups to go
> to the same container, which means that a container has one or two
> iommu_tables, and each iommu_table has this "ua" list; but since the tables
> are different (window size, page size, content), these "ua" arrays are also
> different.

Erm.. but h_put_tce iterates h_put_tce_iommu through all the groups
attached to the stt, and each one seems to update pua.

Or is that what the if (kg->tbl == tbltmp) continue; is supposed to
avoid?  In which case what ensures that the stt->groups list is
ordered by tbl pointer?

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH kernel 9/9] KVM: PPC: VFIO device: support SPAPR TCE
  2016-03-09  9:20       ` Alexey Kardashevskiy
@ 2016-03-10  5:21         ` David Gibson
  -1 siblings, 0 replies; 92+ messages in thread
From: David Gibson @ 2016-03-10  5:21 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Paul Mackerras, Alex Williamson, kvm-ppc, kvm

[-- Attachment #1: Type: text/plain, Size: 12653 bytes --]

On Wed, Mar 09, 2016 at 08:20:12PM +1100, Alexey Kardashevskiy wrote:
> On 03/09/2016 04:45 PM, David Gibson wrote:
> >On Mon, Mar 07, 2016 at 02:41:17PM +1100, Alexey Kardashevskiy wrote:
> >>sPAPR TCE IOMMU is para-virtualized and the guest does map/unmap
> >>via hypercalls which take a logical bus id (LIOBN) as a target IOMMU
> >>identifier. LIOBNs are made up, advertised to guest systems and
> >>linked to IOMMU groups by the user space.
> >>In order to enable acceleration for IOMMU operations in KVM, we need
> >>to tell KVM the information about the LIOBN-to-group mapping.
> >>
> >>For that, a new KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN parameter
> >>is added which accepts:
> >>- a VFIO group fd and IO base address to find the actual hardware
> >>TCE table;
> >>- a LIOBN to assign to the found table.
> >>
> >>Before notifying KVM about the new link, this checks that the group is
> >>registered with the KVM device so that it can be released at unexpected
> >>KVM finish.
> >>
> >>This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> >>space.
> >>
> >>While we are here, this also fixes VFIO KVM device compiling to let it
> >>link to a KVM module.
> >>
> >>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>---
> >>  Documentation/virtual/kvm/devices/vfio.txt |  21 +++++-
> >>  arch/powerpc/kvm/Kconfig                   |   1 +
> >>  arch/powerpc/kvm/Makefile                  |   5 +-
> >>  arch/powerpc/kvm/powerpc.c                 |   1 +
> >>  include/uapi/linux/kvm.h                   |   9 +++
> >>  virt/kvm/vfio.c                            | 106 +++++++++++++++++++++++++++++
> >>  6 files changed, 140 insertions(+), 3 deletions(-)
> >>
> >>diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> >>index ef51740..c0d3eb7 100644
> >>--- a/Documentation/virtual/kvm/devices/vfio.txt
> >>+++ b/Documentation/virtual/kvm/devices/vfio.txt
> >>@@ -16,7 +16,24 @@ Groups:
> >>
> >>  KVM_DEV_VFIO_GROUP attributes:
> >>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> >>+	kvm_device_attr.addr points to an int32_t file descriptor
> >>+	for the VFIO group.
> >
> >AFAICT these changes are accurate for VFIO as it is already, in which
> >case it might be clearer to put them in a separate patch.
> >
> >>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
> >>+	kvm_device_attr.addr points to an int32_t file descriptor
> >>+	for the VFIO group.
> >>
> >>-For each, kvm_device_attr.addr points to an int32_t file descriptor
> >>-for the VFIO group.
> >>+  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: sets a liobn for a VFIO group
> >>+	kvm_device_attr.addr points to a struct:
> >>+		struct kvm_vfio_spapr_tce_liobn {
> >>+			__u32	argsz;
> >>+			__s32	fd;
> >>+			__u32	liobn;
> >>+			__u8	pad[4];
> >>+			__u64	start_addr;
> >>+		};
> >>+		where
> >>+		@argsz is the size of kvm_vfio_spapr_tce_liobn;
> >>+		@fd is a file descriptor for a VFIO group;
> >>+		@liobn is a logical bus id to be associated with the group;
> >>+		@start_addr is a DMA window offset on the IO (PCI) bus
> >
> >For the case of DDW and multiple windows, I'm assuming you can call
> >this multiple times with different LIOBNs and the same IOMMU group?
> 
> 
> Yes. It is called twice per group (when DDW is activated) - once for the
> 32-bit window and once for the 64-bit window; this is why @start_addr is there.
> 
> 
> >>diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
> >>index 1059846..dfa3488 100644
> >>--- a/arch/powerpc/kvm/Kconfig
> >>+++ b/arch/powerpc/kvm/Kconfig
> >>@@ -65,6 +65,7 @@ config KVM_BOOK3S_64
> >>  	select KVM
> >>  	select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
> >>  	select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
> >>+	select KVM_VFIO if VFIO
> >>  	---help---
> >>  	  Support running unmodified book3s_64 and book3s_32 guest kernels
> >>  	  in virtual machines on book3s_64 host processors.
> >>diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
> >>index 7f7b6d8..71f577c 100644
> >>--- a/arch/powerpc/kvm/Makefile
> >>+++ b/arch/powerpc/kvm/Makefile
> >>@@ -8,7 +8,7 @@ ccflags-y := -Ivirt/kvm -Iarch/powerpc/kvm
> >>  KVM := ../../../virt/kvm
> >>
> >>  common-objs-y = $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
> >>-		$(KVM)/eventfd.o $(KVM)/vfio.o
> >>+		$(KVM)/eventfd.o
> >
> >Please don't disable the VFIO device for the non-book3s case.  I added
> >it (even though it didn't do anything until now) so that libvirt
> >wouldn't choke when it finds it's not available.  Obviously the new
> >ioctl needs to be only for the right IOMMU setup, but the device
> >itself should be available always.
> 
> Ah. Ok, I'll fix this. I just wanted to be able to compile kvm as a module.
> 
> 
> >>  CFLAGS_e500_mmu.o := -I.
> >>  CFLAGS_e500_mmu_host.o := -I.
> >>@@ -87,6 +87,9 @@ endif
> >>  kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \
> >>  	book3s_xics.o
> >>
> >>+kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) += \
> >>+	$(KVM)/vfio.o \
> >>+
> >>  kvm-book3s_64-module-objs += \
> >>  	$(KVM)/kvm_main.o \
> >>  	$(KVM)/eventfd.o \
> >>diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> >>index 19aa59b..63f188d 100644
> >>--- a/arch/powerpc/kvm/powerpc.c
> >>+++ b/arch/powerpc/kvm/powerpc.c
> >>@@ -521,6 +521,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> >>  #ifdef CONFIG_PPC_BOOK3S_64
> >>  	case KVM_CAP_SPAPR_TCE:
> >>  	case KVM_CAP_SPAPR_TCE_64:
> >>+	case KVM_CAP_SPAPR_TCE_VFIO:
> >>  	case KVM_CAP_PPC_ALLOC_HTAB:
> >>  	case KVM_CAP_PPC_RTAS:
> >>  	case KVM_CAP_PPC_FIXUP_HCALL:
> >>diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >>index 080ffbf..f1abbea 100644
> >>--- a/include/uapi/linux/kvm.h
> >>+++ b/include/uapi/linux/kvm.h
> >>@@ -1056,6 +1056,7 @@ struct kvm_device_attr {
> >>  #define  KVM_DEV_VFIO_GROUP			1
> >>  #define   KVM_DEV_VFIO_GROUP_ADD			1
> >>  #define   KVM_DEV_VFIO_GROUP_DEL			2
> >>+#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN	3
> >>
> >>  enum kvm_device_type {
> >>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
> >>@@ -1075,6 +1076,14 @@ enum kvm_device_type {
> >>  	KVM_DEV_TYPE_MAX,
> >>  };
> >>
> >>+struct kvm_vfio_spapr_tce_liobn {
> >>+	__u32	argsz;
> >>+	__s32	fd;
> >>+	__u32	liobn;
> >>+	__u8	pad[4];
> >>+	__u64	start_addr;
> >>+};
> >>+
> >>  /*
> >>   * ioctls for VM fds
> >>   */
> >>diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> >>index 1dd087d..87c771e 100644
> >>--- a/virt/kvm/vfio.c
> >>+++ b/virt/kvm/vfio.c
> >>@@ -20,6 +20,10 @@
> >>  #include <linux/vfio.h>
> >>  #include "vfio.h"
> >>
> >>+#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>+#include <asm/kvm_ppc.h>
> >>+#endif
> >>+
> >>  struct kvm_vfio_group {
> >>  	struct list_head node;
> >>  	struct vfio_group *vfio_group;
> >>@@ -60,6 +64,22 @@ static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
> >>  	symbol_put(vfio_group_put_external_user);
> >>  }
> >>
> >>+static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> >>+{
> >>+	int (*fn)(struct vfio_group *);
> >>+	int ret = -1;
> >
> >Should this be -ESOMETHING?
> >
> >>+	fn = symbol_get(vfio_external_user_iommu_id);
> >>+	if (!fn)
> >>+		return ret;
> >>+
> >>+	ret = fn(vfio_group);
> >>+
> >>+	symbol_put(vfio_external_user_iommu_id);
> >>+
> >>+	return ret;
> >>+}
> >>+
> >>  static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
> >>  {
> >>  	long (*fn)(struct vfio_group *, unsigned long);
> >>@@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct kvm_device *dev)
> >>  	mutex_unlock(&kv->lock);
> >>  }
> >>
> >>+#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>+static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
> >>+		struct vfio_group *vfio_group)
> >
> >
> >Shouldn't this go in the same patch that introduced the attach
> >function?
> 
> Having fewer patches which touch different maintainers' areas is better. I
> cannot avoid touching both PPC KVM and VFIO in this patch but I can in
> "[PATCH kernel 6/9] KVM: PPC: Associate IOMMU group with guest view of TCE
> table".
> 
> 
> >
> >>+{
> >>+	int group_id;
> >>+	struct iommu_group *grp;
> >>+
> >>+	group_id = kvm_vfio_external_user_iommu_id(vfio_group);
> >>+	grp = iommu_group_get_by_id(group_id);
> >>+	if (grp) {
> >>+		kvm_spapr_tce_detach_iommu_group(kvm, grp);
> >>+		iommu_group_put(grp);
> >>+	}
> >>+}
> >>+#endif
> >>+
> >>  static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >>  {
> >>  	struct kvm_vfio *kv = dev->private;
> >>@@ -186,6 +222,10 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >>  				continue;
> >>
> >>  			list_del(&kvg->node);
> >>+#ifdef CONFIG_SPAPR_TCE_IOMMU
> >
> >Better to make a no-op version of the call than have to #ifdef at the
> >callsite.
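> >
> >Something like the following (just a sketch, not part of this patch)
> >would keep the callsite clean:
> >
> >	#ifdef CONFIG_SPAPR_TCE_IOMMU
> >	static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
> >			struct vfio_group *vfio_group);
> >	#else
> >	static inline void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
> >			struct vfio_group *vfio_group)
> >	{
> >		/* no sPAPR TCE IOMMU in this config: nothing to detach */
> >	}
> >	#endif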
> 
> It is questionable. An x86 reader may decide that
> KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN is implemented for x86 and get
> confused.
> 
> 
> >
> >>+			kvm_vfio_spapr_detach_iommu_group(dev->kvm,
> >>+					kvg->vfio_group);
> >>+#endif
> >>  			kvm_vfio_group_put_external_user(kvg->vfio_group);
> >>  			kfree(kvg);
> >>  			ret = 0;
> >>@@ -201,6 +241,69 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >>  		kvm_vfio_update_coherency(dev);
> >>
> >>  		return ret;
> >>+
> >>+#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>+	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: {
> >>+		struct kvm_vfio_spapr_tce_liobn param;
> >>+		unsigned long minsz;
> >>+		struct kvm_vfio *kv = dev->private;
> >>+		struct vfio_group *vfio_group;
> >>+		struct kvm_vfio_group *kvg;
> >>+		struct fd f;
> >>+
> >>+		minsz = offsetofend(struct kvm_vfio_spapr_tce_liobn,
> >>+				start_addr);
> >>+
> >>+		if (copy_from_user(&param, (void __user *)arg, minsz))
> >>+			return -EFAULT;
> >>+
> >>+		if (param.argsz < minsz)
> >>+			return -EINVAL;
> >>+
> >>+		f = fdget(param.fd);
> >>+		if (!f.file)
> >>+			return -EBADF;
> >>+
> >>+		vfio_group = kvm_vfio_group_get_external_user(f.file);
> >>+		fdput(f);
> >>+
> >>+		if (IS_ERR(vfio_group))
> >>+			return PTR_ERR(vfio_group);
> >>+
> >>+		ret = -ENOENT;
> >
> >Shouldn't there be some runtime test for the type of the IOMMU?  It's
> >possible a kernel could be built for a platform supporting multiple
> >IOMMU types.
> 
> Well, it may make sense but I do not know how to test that. The IOMMU type
> is a VFIO container property, not a group property, and here (in KVM) we
> only have groups.

Which, as mentioned previously, is broken.

> And calling iommu_group_get_iommudata() is not much help as it returns a
> void pointer... I could probably check that the release() callback is the
> one I set via iommu_group_set_iommudata(), but there is no API to get it
> from a group.
> 
> And I cannot really imagine a kernel with CONFIG_PPC_BOOK3S_64 (and
> therefore KVM_CAP_SPAPR_TCE_VFIO enabled) with different IOMMU types. Can
> the same kernel binary image work on both BOOK3S and embedded PPC? Where
> would these other types come from?
> 
> 
> >
> >>+		mutex_lock(&kv->lock);
> >>+
> >>+		list_for_each_entry(kvg, &kv->group_list, node) {
> >>+			int group_id;
> >>+			struct iommu_group *grp;
> >>+
> >>+			if (kvg->vfio_group != vfio_group)
> >>+				continue;
> >>+
> >>+			group_id = kvm_vfio_external_user_iommu_id(
> >>+					kvg->vfio_group);
> >>+			grp = iommu_group_get_by_id(group_id);
> >>+			if (!grp) {
> >>+				ret = -EFAULT;
> >>+				break;
> >>+			}
> >>+
> >>+			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
> >>+					param.liobn, param.start_addr,
> >>+					grp);
> >>+			if (ret)
> >>+				iommu_group_put(grp);
> >>+			break;
> >>+		}
> >>+
> >>+		mutex_unlock(&kv->lock);
> >>+
> >>+		kvm_vfio_group_put_external_user(vfio_group);
> >>+
> >>+		return ret;
> >>+	}
> >>+#endif /* CONFIG_SPAPR_TCE_IOMMU */
> >>  	}
> >>
> >>  	return -ENXIO;
> >>@@ -225,6 +328,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
> >>  		switch (attr->attr) {
> >>  		case KVM_DEV_VFIO_GROUP_ADD:
> >>  		case KVM_DEV_VFIO_GROUP_DEL:
> >>+#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>+		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN:
> >>+#endif
> >>  			return 0;
> >>  		}
> >>
> >
> 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH kernel 3/9] KVM: PPC: Use preregistered memory API to access TCE list
  2016-03-07  6:00     ` David Gibson
@ 2016-03-10  8:33       ` Paul Mackerras
  -1 siblings, 0 replies; 92+ messages in thread
From: Paul Mackerras @ 2016-03-10  8:33 UTC (permalink / raw)
  To: David Gibson
  Cc: Alexey Kardashevskiy, linuxppc-dev, Alex Williamson, kvm-ppc, kvm

On Mon, Mar 07, 2016 at 05:00:14PM +1100, David Gibson wrote:
> On Mon, Mar 07, 2016 at 02:41:11PM +1100, Alexey Kardashevskiy wrote:
> > VFIO on sPAPR already implements guest memory pre-registration
> > when the entire guest RAM gets pinned. This can be used to translate
> > the physical address of a guest page containing the TCE list
> > from H_PUT_TCE_INDIRECT.
> > 
> > This makes use of the pre-registrered memory API to access TCE list
> > pages in order to avoid unnecessary locking on the KVM memory
> > reverse map.
> > 
> > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> 
> Ok.. so, what's the benefit of not having to lock the rmap?

It's not primarily about locking or not locking the rmap.  The point
is that when memory is pre-registered, we know that all of guest
memory is pinned and we have a flat array mapping GPA to HPA.  It's
simpler and quicker to index into that array (even with looking up the
kernel page tables in vmalloc_to_phys) than it is to find the memslot,
lock the rmap entry, look up the user page tables, and unlock the rmap
entry.  We were only locking the rmap entry to stop the page being
unmapped and reallocated to something else, but if memory is
pre-registered, it's all pinned, so it can't be reallocated.
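
To make the difference concrete, here is a rough sketch of the
pre-registered lookup (the types and names below are illustrative, not
the exact kernel ones):

	/*
	 * All of pre-registered guest RAM is pinned and described by one
	 * flat array, so GPA->HPA translation is a bounds check plus an
	 * array index - no memslot search, no rmap locking.
	 */
	struct prereg_region {
		unsigned long gpa_base;	/* guest physical base address */
		unsigned long npages;	/* number of pinned pages */
		unsigned long *hpas;	/* flat page-frame array */
	};

	static unsigned long prereg_gpa_to_hpa(struct prereg_region *r,
					       unsigned long gpa)
	{
		unsigned long idx = (gpa - r->gpa_base) >> PAGE_SHIFT;

		if (gpa < r->gpa_base || idx >= r->npages)
			return -1UL;	/* not covered by this region */
		return r->hpas[idx] | (gpa & ~PAGE_MASK);
	}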

Paul.





^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH kernel 4/9] powerpc/powernv/iommu: Add real mode version of xchg()
  2016-03-07  3:41   ` Alexey Kardashevskiy
@ 2016-03-10  8:43     ` Paul Mackerras
  -1 siblings, 0 replies; 92+ messages in thread
From: Paul Mackerras @ 2016-03-10  8:43 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Alex Williamson, David Gibson, kvm-ppc, kvm

On Mon, Mar 07, 2016 at 02:41:12PM +1100, Alexey Kardashevskiy wrote:
> In real mode, TCE tables are invalidated using different
> cache-inhibited store instructions which is different from
> the virtual mode.
> 
> This defines and implements exchange_rm() callback. This does not
> define set_rm/clear_rm/flush_rm callbacks as there is no user for those -
> exchange/exchange_rm are only to be used by KVM for VFIO.
> 
> The exchange_rm callback is defined for IODA1/IODA2 powernv platforms.
> 
> This replaces list_for_each_entry_rcu with its lockless version as
> from now on pnv_pci_ioda2_tce_invalidate() can be called in
> the real mode too.

[snip]

> @@ -1062,6 +1062,21 @@ void iommu_release_ownership(struct iommu_table *tbl)
>  }
>  EXPORT_SYMBOL_GPL(iommu_release_ownership);
>  
> +long iommu_tce_xchg_rm(struct iommu_table *tbl, unsigned long entry,
> +		unsigned long *hpa, enum dma_data_direction *direction)
> +{
> +	long ret;
> +
> +	ret = tbl->it_ops->exchange_rm(tbl, entry, hpa, direction);
> +
> +	if (!ret && ((*direction == DMA_FROM_DEVICE) ||
> +			(*direction == DMA_BIDIRECTIONAL)))
> +		SetPageDirty(realmode_pfn_to_page(*hpa >> PAGE_SHIFT));

realmode_pfn_to_page can fail and return NULL, can't it?  You need to
handle that situation somehow.
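
E.g. something along these lines (a sketch only):

	if (!ret && ((*direction == DMA_FROM_DEVICE) ||
			(*direction == DMA_BIDIRECTIONAL))) {
		struct page *pg = realmode_pfn_to_page(*hpa >> PAGE_SHIFT);

		if (!pg)
			return -EFAULT;	/* or fall back to virtual mode */
		SetPageDirty(pg);
	}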

Paul.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH kernel 4/9] powerpc/powernv/iommu: Add real mode version of xchg()
  2016-03-07  3:41   ` Alexey Kardashevskiy
@ 2016-03-10  8:46     ` Paul Mackerras
  -1 siblings, 0 replies; 92+ messages in thread
From: Paul Mackerras @ 2016-03-10  8:46 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Alex Williamson, David Gibson, kvm-ppc, kvm

On Mon, Mar 07, 2016 at 02:41:12PM +1100, Alexey Kardashevskiy wrote:
> In real mode, TCE tables are invalidated using different
> cache-inhibited store instructions which is different from
> the virtual mode.

I suggest "In real mode, TCE tables are invalidated using special
cache-inhibited store instructions which are not available in
virtual mode".

Also, the subject could make people think it's about the kernel xchg()
function defined in <asm/cmpxchg.h>.

Paul.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH kernel 9/9] KVM: PPC: VFIO device: support SPAPR TCE
  2016-03-10  5:21         ` David Gibson
@ 2016-03-10 23:09           ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 92+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-10 23:09 UTC (permalink / raw)
  To: David Gibson; +Cc: linuxppc-dev, Paul Mackerras, Alex Williamson, kvm-ppc, kvm

On 03/10/2016 04:21 PM, David Gibson wrote:
> On Wed, Mar 09, 2016 at 08:20:12PM +1100, Alexey Kardashevskiy wrote:
>> On 03/09/2016 04:45 PM, David Gibson wrote:
>>> On Mon, Mar 07, 2016 at 02:41:17PM +1100, Alexey Kardashevskiy wrote:
>>>> sPAPR TCE IOMMU is para-virtualized and the guest does map/unmap
>>>> via hypercalls which take a logical bus id (LIOBN) as a target IOMMU
>>>> identifier. LIOBNs are made up, advertised to guest systems and
>>>> linked to IOMMU groups by the user space.
>>>> In order to enable acceleration for IOMMU operations in KVM, we need
>>>> to tell KVM about the LIOBN-to-group mapping.
>>>>
>>>> For that, a new KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN parameter
>>>> is added which accepts:
>>>> - a VFIO group fd and IO base address to find the actual hardware
>>>> TCE table;
>>>> - a LIOBN to assign to the found table.
>>>>
>>>> Before notifying KVM about a new link, this checks that the group is
>>>> registered with the KVM device so that it can be released if KVM
>>>> finishes unexpectedly.
>>>>
>>>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
>>>> space.
>>>>
>>>> While we are here, this also fixes the VFIO KVM device build so that
>>>> it can link into a KVM module.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> ---
>>>>   Documentation/virtual/kvm/devices/vfio.txt |  21 +++++-
>>>>   arch/powerpc/kvm/Kconfig                   |   1 +
>>>>   arch/powerpc/kvm/Makefile                  |   5 +-
>>>>   arch/powerpc/kvm/powerpc.c                 |   1 +
>>>>   include/uapi/linux/kvm.h                   |   9 +++
>>>>   virt/kvm/vfio.c                            | 106 +++++++++++++++++++++++++++++
>>>>   6 files changed, 140 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
>>>> index ef51740..c0d3eb7 100644
>>>> --- a/Documentation/virtual/kvm/devices/vfio.txt
>>>> +++ b/Documentation/virtual/kvm/devices/vfio.txt
>>>> @@ -16,7 +16,24 @@ Groups:
>>>>
>>>>   KVM_DEV_VFIO_GROUP attributes:
>>>>     KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
>>>> +	kvm_device_attr.addr points to an int32_t file descriptor
>>>> +	for the VFIO group.
>>>
>>> AFAICT these changes are accurate for VFIO as it is already, in which
>>> case it might be clearer to put them in a separate patch.
>>>
>>>>     KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
>>>> +	kvm_device_attr.addr points to an int32_t file descriptor
>>>> +	for the VFIO group.
>>>>
>>>> -For each, kvm_device_attr.addr points to an int32_t file descriptor
>>>> -for the VFIO group.
>>>> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: sets a liobn for a VFIO group
>>>> +	kvm_device_attr.addr points to a struct:
>>>> +		struct kvm_vfio_spapr_tce_liobn {
>>>> +			__u32	argsz;
>>>> +			__s32	fd;
>>>> +			__u32	liobn;
>>>> +			__u8	pad[4];
>>>> +			__u64	start_addr;
>>>> +		};
>>>> +		where
>>>> +		@argsz is the size of kvm_vfio_spapr_tce_liobn;
>>>> +		@fd is a file descriptor for a VFIO group;
>>>> +		@liobn is a logical bus id to be associated with the group;
>>>> +		@start_addr is a DMA window offset on the IO (PCI) bus
>>>
>>> For the case of DDW and multiple windows, I'm assuming you can call
>>> this multiple times with different LIOBNs and the same IOMMU group?
>>
>>
>> Yes. It is called twice for each group (when DDW is activated) - once for
>> the 32bit window and once for the 64bit window; this is why @start_addr
>> is there.
>>
>>
>>>> diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
>>>> index 1059846..dfa3488 100644
>>>> --- a/arch/powerpc/kvm/Kconfig
>>>> +++ b/arch/powerpc/kvm/Kconfig
>>>> @@ -65,6 +65,7 @@ config KVM_BOOK3S_64
>>>>   	select KVM
>>>>   	select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
>>>>   	select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
>>>> +	select KVM_VFIO if VFIO
>>>>   	---help---
>>>>   	  Support running unmodified book3s_64 and book3s_32 guest kernels
>>>>   	  in virtual machines on book3s_64 host processors.
>>>> diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
>>>> index 7f7b6d8..71f577c 100644
>>>> --- a/arch/powerpc/kvm/Makefile
>>>> +++ b/arch/powerpc/kvm/Makefile
>>>> @@ -8,7 +8,7 @@ ccflags-y := -Ivirt/kvm -Iarch/powerpc/kvm
>>>>   KVM := ../../../virt/kvm
>>>>
>>>>   common-objs-y = $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
>>>> -		$(KVM)/eventfd.o $(KVM)/vfio.o
>>>> +		$(KVM)/eventfd.o
>>>
>>> Please don't disable the VFIO device for the non-book3s case.  I added
>>> it (even though it didn't do anything until now) so that libvirt
>>> wouldn't choke when it finds it's not available.  Obviously the new
>>> ioctl needs to be only for the right IOMMU setup, but the device
>>> itself should be available always.
>>
>> Ah. Ok, I'll fix this. I just wanted to be able to compile kvm as a module.
>>
>>
>>>>   CFLAGS_e500_mmu.o := -I.
>>>>   CFLAGS_e500_mmu_host.o := -I.
>>>> @@ -87,6 +87,9 @@ endif
>>>>   kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \
>>>>   	book3s_xics.o
>>>>
>>>> +kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) += \
>>>> +	$(KVM)/vfio.o \
>>>> +
>>>>   kvm-book3s_64-module-objs += \
>>>>   	$(KVM)/kvm_main.o \
>>>>   	$(KVM)/eventfd.o \
>>>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>>>> index 19aa59b..63f188d 100644
>>>> --- a/arch/powerpc/kvm/powerpc.c
>>>> +++ b/arch/powerpc/kvm/powerpc.c
>>>> @@ -521,6 +521,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>>>>   #ifdef CONFIG_PPC_BOOK3S_64
>>>>   	case KVM_CAP_SPAPR_TCE:
>>>>   	case KVM_CAP_SPAPR_TCE_64:
>>>> +	case KVM_CAP_SPAPR_TCE_VFIO:
>>>>   	case KVM_CAP_PPC_ALLOC_HTAB:
>>>>   	case KVM_CAP_PPC_RTAS:
>>>>   	case KVM_CAP_PPC_FIXUP_HCALL:
>>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>>>> index 080ffbf..f1abbea 100644
>>>> --- a/include/uapi/linux/kvm.h
>>>> +++ b/include/uapi/linux/kvm.h
>>>> @@ -1056,6 +1056,7 @@ struct kvm_device_attr {
>>>>   #define  KVM_DEV_VFIO_GROUP			1
>>>>   #define   KVM_DEV_VFIO_GROUP_ADD			1
>>>>   #define   KVM_DEV_VFIO_GROUP_DEL			2
>>>> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN	3
>>>>
>>>>   enum kvm_device_type {
>>>>   	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
>>>> @@ -1075,6 +1076,14 @@ enum kvm_device_type {
>>>>   	KVM_DEV_TYPE_MAX,
>>>>   };
>>>>
>>>> +struct kvm_vfio_spapr_tce_liobn {
>>>> +	__u32	argsz;
>>>> +	__s32	fd;
>>>> +	__u32	liobn;
>>>> +	__u8	pad[4];
>>>> +	__u64	start_addr;
>>>> +};
>>>> +
>>>>   /*
>>>>    * ioctls for VM fds
>>>>    */
>>>> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
>>>> index 1dd087d..87c771e 100644
>>>> --- a/virt/kvm/vfio.c
>>>> +++ b/virt/kvm/vfio.c
>>>> @@ -20,6 +20,10 @@
>>>>   #include <linux/vfio.h>
>>>>   #include "vfio.h"
>>>>
>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>> +#include <asm/kvm_ppc.h>
>>>> +#endif
>>>> +
>>>>   struct kvm_vfio_group {
>>>>   	struct list_head node;
>>>>   	struct vfio_group *vfio_group;
>>>> @@ -60,6 +64,22 @@ static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
>>>>   	symbol_put(vfio_group_put_external_user);
>>>>   }
>>>>
>>>> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
>>>> +{
>>>> +	int (*fn)(struct vfio_group *);
>>>> +	int ret = -1;
>>>
>>> Should this be -ESOMETHING?
>>>
>>>> +	fn = symbol_get(vfio_external_user_iommu_id);
>>>> +	if (!fn)
>>>> +		return ret;
>>>> +
>>>> +	ret = fn(vfio_group);
>>>> +
>>>> +	symbol_put(vfio_external_user_iommu_id);
>>>> +
>>>> +	return ret;
>>>> +}
>>>> +
>>>>   static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
>>>>   {
>>>>   	long (*fn)(struct vfio_group *, unsigned long);
>>>> @@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct kvm_device *dev)
>>>>   	mutex_unlock(&kv->lock);
>>>>   }
>>>>
>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>> +static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
>>>> +		struct vfio_group *vfio_group)
>>>
>>>
>>> Shouldn't this go in the same patch that introduced the attach
>>> function?
>>
>> Having fewer patches which touch different maintainers' areas is better. I
>> cannot avoid touching both PPC KVM and VFIO in this patch but I can in
>> "[PATCH kernel 6/9] KVM: PPC: Associate IOMMU group with guest view of TCE
>> table".
>>
>>
>>>
>>>> +{
>>>> +	int group_id;
>>>> +	struct iommu_group *grp;
>>>> +
>>>> +	group_id = kvm_vfio_external_user_iommu_id(vfio_group);
>>>> +	grp = iommu_group_get_by_id(group_id);
>>>> +	if (grp) {
>>>> +		kvm_spapr_tce_detach_iommu_group(kvm, grp);
>>>> +		iommu_group_put(grp);
>>>> +	}
>>>> +}
>>>> +#endif
>>>> +
>>>>   static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>>>>   {
>>>>   	struct kvm_vfio *kv = dev->private;
>>>> @@ -186,6 +222,10 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>>>>   				continue;
>>>>
>>>>   			list_del(&kvg->node);
>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>
>>> Better to make a no-op version of the call than have to #ifdef at the
>>> callsite.
>>
>> It is questionable. An x86 reader may decide that
>> KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN is implemented for x86 and get
>> confused.
>>
>>
>>>
>>>> +			kvm_vfio_spapr_detach_iommu_group(dev->kvm,
>>>> +					kvg->vfio_group);
>>>> +#endif
>>>>   			kvm_vfio_group_put_external_user(kvg->vfio_group);
>>>>   			kfree(kvg);
>>>>   			ret = 0;
>>>> @@ -201,6 +241,69 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>>>>   		kvm_vfio_update_coherency(dev);
>>>>
>>>>   		return ret;
>>>> +
>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: {
>>>> +		struct kvm_vfio_spapr_tce_liobn param;
>>>> +		unsigned long minsz;
>>>> +		struct kvm_vfio *kv = dev->private;
>>>> +		struct vfio_group *vfio_group;
>>>> +		struct kvm_vfio_group *kvg;
>>>> +		struct fd f;
>>>> +
>>>> +		minsz = offsetofend(struct kvm_vfio_spapr_tce_liobn,
>>>> +				start_addr);
>>>> +
>>>> +		if (copy_from_user(&param, (void __user *)arg, minsz))
>>>> +			return -EFAULT;
>>>> +
>>>> +		if (param.argsz < minsz)
>>>> +			return -EINVAL;
>>>> +
>>>> +		f = fdget(param.fd);
>>>> +		if (!f.file)
>>>> +			return -EBADF;
>>>> +
>>>> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
>>>> +		fdput(f);
>>>> +
>>>> +		if (IS_ERR(vfio_group))
>>>> +			return PTR_ERR(vfio_group);
>>>> +
>>>> +		ret = -ENOENT;
>>>
>>> Shouldn't there be some runtime test for the type of the IOMMU?  It's
>>> possible a kernel could be built for a platform supporting multiple
>>> IOMMU types.
>>
>> Well, it may make sense but I do not know how to test that. The IOMMU type
>> is a VFIO container property, not a group property, and here (in KVM) we
>> only have groups.
>
> Which, as mentioned previously, is broken.

I am failing to follow you on this.

What I am trying to achieve here is pretty much referencing a group so it
cannot be reused, plus LIOBNs. Passing a container fd does not make much
sense here as the VFIO device would walk through the groups, reference them,
and that is it; there is no locking on VFIO containers, and so far there has
been no need to teach KVM about containers.

What am I missing now?



>
>> And calling iommu_group_get_iommudata() is not much help as it returns a
>> void pointer... I could probably check that the release() callback is the
>> one I set via iommu_group_set_iommudata(), but there is no API to get it
>> from a group.
>>
>> And I cannot really imagine a kernel with CONFIG_PPC_BOOK3S_64 (and
>> therefore KVM_CAP_SPAPR_TCE_VFIO enabled) with different IOMMU types. Can
>> the same kernel binary image work on both BOOK3S and embedded PPC? Where
>> would these other types come from?
>>
>>
>>>
>>>> +		mutex_lock(&kv->lock);
>>>> +
>>>> +		list_for_each_entry(kvg, &kv->group_list, node) {
>>>> +			int group_id;
>>>> +			struct iommu_group *grp;
>>>> +
>>>> +			if (kvg->vfio_group != vfio_group)
>>>> +				continue;
>>>> +
>>>> +			group_id = kvm_vfio_external_user_iommu_id(
>>>> +					kvg->vfio_group);
>>>> +			grp = iommu_group_get_by_id(group_id);
>>>> +			if (!grp) {
>>>> +				ret = -EFAULT;
>>>> +				break;
>>>> +			}
>>>> +
>>>> +			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
>>>> +					param.liobn, param.start_addr,
>>>> +					grp);
>>>> +			if (ret)
>>>> +				iommu_group_put(grp);
>>>> +			break;
>>>> +		}
>>>> +
>>>> +		mutex_unlock(&kv->lock);
>>>> +
>>>> +		kvm_vfio_group_put_external_user(vfio_group);
>>>> +
>>>> +		return ret;
>>>> +	}
>>>> +#endif /* CONFIG_SPAPR_TCE_IOMMU */
>>>>   	}
>>>>
>>>>   	return -ENXIO;
>>>> @@ -225,6 +328,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
>>>>   		switch (attr->attr) {
>>>>   		case KVM_DEV_VFIO_GROUP_ADD:
>>>>   		case KVM_DEV_VFIO_GROUP_DEL:
>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>> +		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN:
>>>> +#endif
>>>>   			return 0;
>>>>   		}
>>>>
>>>
>>
>>
>


-- 
Alexey

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH kernel 3/9] KVM: PPC: Use preregistered memory API to access TCE list
  2016-03-10  8:33       ` Paul Mackerras
@ 2016-03-10 23:42         ` David Gibson
  -1 siblings, 0 replies; 92+ messages in thread
From: David Gibson @ 2016-03-10 23:42 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Alexey Kardashevskiy, linuxppc-dev, Alex Williamson, kvm-ppc, kvm

[-- Attachment #1: Type: text/plain, Size: 1689 bytes --]

On Thu, Mar 10, 2016 at 07:33:05PM +1100, Paul Mackerras wrote:
> On Mon, Mar 07, 2016 at 05:00:14PM +1100, David Gibson wrote:
> > On Mon, Mar 07, 2016 at 02:41:11PM +1100, Alexey Kardashevskiy wrote:
> > > VFIO on sPAPR already implements guest memory pre-registration
> > > when the entire guest RAM gets pinned. This can be used to translate
> > > the physical address of a guest page containing the TCE list
> > > from H_PUT_TCE_INDIRECT.
> > > 
> > > This makes use of the pre-registered memory API to access TCE list
> > > pages in order to avoid unnecessary locking on the KVM memory
> > > reverse map.
> > > 
> > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > 
> > Ok.. so, what's the benefit of not having to lock the rmap?
> 
> It's not primarily about locking or not locking the rmap.  The point
> is that when memory is pre-registered, we know that all of guest
> memory is pinned and we have a flat array mapping GPA to HPA.  It's
> simpler and quicker to index into that array (even with looking up the
> kernel page tables in vmalloc_to_phys) than it is to find the memslot,
> lock the rmap entry, look up the user page tables, and unlock the rmap
> entry.  We were only locking the rmap entry to stop the page being
> unmapped and reallocated to something else, but if memory is
> pre-registered, it's all pinned, so it can't be reallocated.

Ok, that makes sense.

Alexey, can you fold some of that rationale into the commit message?
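
In case it helps future readers, that fast path boils down to something
like this (purely illustrative, with made-up names - not the actual
mm_iommu code):

/* With pre-registered memory every page is already pinned, so
 * UA->HPA translation is an array index plus a page offset. */
struct prereg_region {
	unsigned long ua;	/* userspace address the region starts at */
	unsigned long entries;	/* region size in system pages */
	unsigned long *hpas;	/* flat array: page index -> host phys addr */
};

static unsigned long prereg_ua_to_hpa(struct prereg_region *r,
				      unsigned long ua)
{
	if (ua < r->ua || ua >= r->ua + (r->entries << PAGE_SHIFT))
		return 0;	/* not covered by this region */

	/* No memslot lookup and no rmap lock: the page cannot be
	 * unmapped and reallocated while it is pinned. */
	return r->hpas[(ua - r->ua) >> PAGE_SHIFT] | (ua & ~PAGE_MASK);
}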

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH kernel 8/9] KVM: PPC: Add in-kernel handling for VFIO
  2016-03-10  5:18         ` David Gibson
@ 2016-03-11  2:15           ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 92+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-11  2:15 UTC (permalink / raw)
  To: David Gibson; +Cc: linuxppc-dev, Paul Mackerras, Alex Williamson, kvm-ppc, kvm

On 03/10/2016 04:18 PM, David Gibson wrote:
> On Wed, Mar 09, 2016 at 07:46:47PM +1100, Alexey Kardashevskiy wrote:
>> On 03/08/2016 10:08 PM, David Gibson wrote:
>>> On Mon, Mar 07, 2016 at 02:41:16PM +1100, Alexey Kardashevskiy wrote:
>>>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
>>>> and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
>>>> without passing them to user space, which saves time on switching
>>>> to user space and back.
>>>>
>>>> Both real and virtual modes are supported. The kernel tries to
>>>> handle a TCE request in real mode; if that fails, it passes the request
>>>> to the virtual mode handler to complete the operation. If the virtual
>>>> mode handler also fails, the request is passed to user space; this is
>>>> not expected to ever happen, though.
>>>
>>> Well... not expected to happen with a qemu which uses this.  Presumably
>>> it will fall back to userspace routinely if you have an old qemu that
>>> doesn't add the liobn mappings.
>>
>>
>> Ah. Ok, thanks, I'll add this to the commit log.
>
> Ok.
>
>>>> The first user of this is VFIO on POWER. Trampolines to the VFIO external
>>>> user API functions are required for this patch.
>>>
>>> I'm not sure what you mean by "trampoline" here.
>>
>> For example, look at kvm_vfio_group_get_external_user. It calls
>> symbol_get(vfio_group_get_external_user) and then calls a function via the
>> returned pointer.
>>
>> Is there a better word for this?
>
> Uh.. probably although I don't immediately know what.  "Trampoline"
> usually refers to code on the stack used for bouncing places, which
> isn't what this resembles.

"Dynamic wrapper"?



>>>> This uses a VFIO KVM device to associate a logical bus number (LIOBN)
>>>> with a VFIO IOMMU group fd and enable in-kernel handling of map/unmap
>>>> requests.
>>>
>>> Group fd?  Or container fd?  The group fd wouldn't make a lot of
>>> sense.
>>
>>
>> Group. KVM has no idea about containers.
>
> That's not going to fly.  Having a liobn registered against just one
> group in a container makes no sense at all.  Conceptually, if not
> physically, the container shares a single set of TCE tables.  If
> handling that means teaching KVM the concept of containers, then so be
> it.
>
> Btw, I'm not sure yet if extending the existing vfio kvm device to
> make the vfio<->kvm linkages makes sense.  I think the reason some x86
> machines need that is quite different from how we're using it for
> Power.  I haven't got a clear enough picture yet to be sure either
> way.
>
> The other option that would seem likely to me would be a "bind VFIO
> container" ioctl() on the fd associated with a kernel accelerated TCE table.


Oh, I just noticed this response. I need to digest it. Looks like this is 
going to take another 2 years to upstream...


>>>> To make use of the feature, the user space has to create a guest view
>>>> of the TCE table via KVM_CAP_SPAPR_TCE/KVM_CAP_SPAPR_TCE_64 and
>>>> then associate a LIOBN with this table via VFIO KVM device,
>>>> a KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN property (which is added in
>>>> the next patch).
>>>>
>>>> Tests show that this patch increases transmission speed from 220MB/s
>>>> to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb ethernet card).
>>>
>>> Is that with or without DDW (i.e. with or without a 64-bit DMA window)?
>>
>>
>> Without DDW, I should have mentioned this. The patch is from the time when
>> there was no DDW :(
>
> Ok.
>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> ---
>>>>   arch/powerpc/kvm/book3s_64_vio.c    | 184 +++++++++++++++++++++++++++++++++++
>>>>   arch/powerpc/kvm/book3s_64_vio_hv.c | 186 ++++++++++++++++++++++++++++++++++++
>>>>   2 files changed, 370 insertions(+)
>>>>
>>>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
>>>> index 7965fc7..9417d12 100644
>>>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>>>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>>>> @@ -33,6 +33,7 @@
>>>>   #include <asm/kvm_ppc.h>
>>>>   #include <asm/kvm_book3s.h>
>>>>   #include <asm/mmu-hash64.h>
>>>> +#include <asm/mmu_context.h>
>>>>   #include <asm/hvcall.h>
>>>>   #include <asm/synch.h>
>>>>   #include <asm/ppc-opcode.h>
>>>> @@ -317,11 +318,161 @@ fail:
>>>>   	return ret;
>>>>   }
>>>>
>>>> +static long kvmppc_tce_iommu_mapped_dec(struct iommu_table *tbl,
>>>> +		unsigned long entry)
>>>> +{
>>>> +	struct mm_iommu_table_group_mem_t *mem = NULL;
>>>> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
>>>> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>>>> +
>>>> +	if (!pua)
>>>> +		return H_HARDWARE;
>>>> +
>>>> +	mem = mm_iommu_lookup(*pua, pgsize);
>>>> +	if (!mem)
>>>> +		return H_HARDWARE;
>>>> +
>>>> +	mm_iommu_mapped_dec(mem);
>>>> +
>>>> +	*pua = 0;
>>>> +
>>>> +	return H_SUCCESS;
>>>> +}
>>>> +
>>>> +static long kvmppc_tce_iommu_unmap(struct iommu_table *tbl,
>>>> +		unsigned long entry)
>>>> +{
>>>> +	enum dma_data_direction dir = DMA_NONE;
>>>> +	unsigned long hpa = 0;
>>>> +
>>>> +	if (iommu_tce_xchg(tbl, entry, &hpa, &dir))
>>>> +		return H_HARDWARE;
>>>> +
>>>> +	if (dir == DMA_NONE)
>>>> +		return H_SUCCESS;
>>>> +
>>>> +	return kvmppc_tce_iommu_mapped_dec(tbl, entry);
>>>> +}
>>>> +
>>>> +long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
>>>> +		unsigned long entry, unsigned long gpa,
>>>> +		enum dma_data_direction dir)
>>>> +{
>>>> +	long ret;
>>>> +	unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>>>> +	struct mm_iommu_table_group_mem_t *mem;
>>>> +
>>>> +	if (!pua)
>>>> +		return H_HARDWARE;
>>>
>>> H_HARDWARE?  Or H_PARAMETER?  This essentially means the guest has
>>> supplied a bad physical address, doesn't it?
>>
>> Well, maybe. I'll change it. If it is not H_TOO_HARD, it does not make any
>> difference after all :)
>>
>>
>>
>>>> +	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
>>>> +		return H_HARDWARE;
>>>> +
>>>> +	mem = mm_iommu_lookup(ua, 1ULL << tbl->it_page_shift);
>>>> +	if (!mem)
>>>> +		return H_HARDWARE;
>>>> +
>>>> +	if (mm_iommu_ua_to_hpa(mem, ua, &hpa))
>>>> +		return H_HARDWARE;
>>>> +
>>>> +	if (mm_iommu_mapped_inc(mem))
>>>> +		return H_HARDWARE;
>>>> +
>>>> +	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
>>>> +	if (ret) {
>>>> +		mm_iommu_mapped_dec(mem);
>>>> +		return H_TOO_HARD;
>>>> +	}
>>>> +
>>>> +	if (dir != DMA_NONE)
>>>> +		kvmppc_tce_iommu_mapped_dec(tbl, entry);
>>>> +
>>>> +	*pua = ua;
>>>
>>> IIUC this means you have a copy of the UA for every group attached to
>>> the TCE table, but they'll all be the same. Any way to avoid that
>>> duplication?
>>
>> It is for every container, not a group. On P8, I allow multiple groups to go
>> to the same container, which means that a container has one or two
>> iommu_tables, and each iommu_table has this "ua" list; but since the
>> tables are different (window size, page size, content), these "ua" arrays
>> are also different.
>
> Erm.. but h_put_tce iterates h_put_tce_iommu through all the groups
> attached to the stt, and each one seems to update pua.
>
> Or is that what the if (kg->tbl == tbltmp) continue; is supposed to
> avoid?  In which case what ensures that the stt->groups list is
> ordered by tbl pointer?


Nothing. In the normal case (POWER8 IODA2) all groups on the same liobn 
have the same iommu_table, so only the first group's copy gets updated and 
the others do not, but that is OK as they all use the same table.

In a bad case (POWER7 IODA1, multiple containers per LIOBN) the same @ua 
can be updated more than once. Well, not a huge loss.
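
For reference, the loop being discussed is shaped roughly like this
(simplified; "kvmppc_spapr_tce_group" is my shorthand for the patch's
group-list element type):

/* The tbltmp check only skips a group whose table matches the
 * previously visited one, i.e. it dedups *adjacent* duplicates. */
struct iommu_table *tbltmp = NULL;
struct kvmppc_spapr_tce_group *kg;
long ret = H_SUCCESS;

list_for_each_entry(kg, &stt->groups, node) {
	if (kg->tbl == tbltmp)
		continue;	/* same table as the previous group */
	tbltmp = kg->tbl;
	ret = kvmppc_tce_iommu_map(kvm, kg->tbl, entry, gpa, dir);
	if (ret != H_SUCCESS)
		break;
}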


-- 
Alexey

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH kernel 8/9] KVM: PPC: Add in-kernel handling for VFIO
  2016-03-11  2:15           ` Alexey Kardashevskiy
@ 2016-03-15  6:00             ` David Gibson
  -1 siblings, 0 replies; 92+ messages in thread
From: David Gibson @ 2016-03-15  6:00 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Paul Mackerras, Alex Williamson, kvm-ppc, kvm

[-- Attachment #1: Type: text/plain, Size: 8639 bytes --]

On Fri, Mar 11, 2016 at 01:15:20PM +1100, Alexey Kardashevskiy wrote:
> On 03/10/2016 04:18 PM, David Gibson wrote:
> >On Wed, Mar 09, 2016 at 07:46:47PM +1100, Alexey Kardashevskiy wrote:
> >>On 03/08/2016 10:08 PM, David Gibson wrote:
> >>>On Mon, Mar 07, 2016 at 02:41:16PM +1100, Alexey Kardashevskiy wrote:
> >>>>This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> >>>>and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
> >>>>without passing them to user space, which saves time on switching
> >>>>to user space and back.
> >>>>
> >>>>Both real and virtual modes are supported. The kernel tries to
> >>>>handle a TCE request in real mode; if that fails, it passes the request
> >>>>to the virtual mode handler to complete the operation. If the virtual
> >>>>mode handler also fails, the request is passed to user space; this is
> >>>>not expected to ever happen, though.
> >>>
> >>>Well... not expected to happen with a qemu which uses this.  Presumably
> >>>it will fall back to userspace routinely if you have an old qemu that
> >>>doesn't add the liobn mappings.
> >>
> >>
> >>Ah. Ok, thanks, I'll add this to the commit log.
> >
> >Ok.
> >
> >>>>The first user of this is VFIO on POWER. Trampolines to the VFIO external
> >>>>user API functions are required for this patch.
> >>>
> >>>I'm not sure what you mean by "trampoline" here.
> >>
> >>For example, look at kvm_vfio_group_get_external_user. It calls
> >>symbol_get(vfio_group_get_external_user) and then calls a function via the
> >>returned pointer.
> >>
> >>Is there a better word for this?
> >
> >Uh.. probably although I don't immediately know what.  "Trampoline"
> >usually refers to code on the stack used for bouncing places, which
> >isn't what this resembles.
> 
> "Dynamic wrapper"?

Sure, that'll do.

> >>>>This uses a VFIO KVM device to associate a logical bus number (LIOBN)
> >>>>with a VFIO IOMMU group fd and enable in-kernel handling of map/unmap
> >>>>requests.
> >>>
> >>>Group fd?  Or container fd?  The group fd wouldn't make a lot of
> >>>sense.
> >>
> >>
> >>Group. KVM has no idea about containers.
> >
> >That's not going to fly.  Having a liobn registered against just one
> >group in a container makes no sense at all.  Conceptually, if not
> >physically, the container shares a single set of TCE tables.  If
> >handling that means teaching KVM the concept of containers, then so be
> >it.
> >
> >Btw, I'm not sure yet if extending the existing vfio kvm device to
> >make the vfio<->kvm linkages makes sense.  I think the reason some x86
> >machines need that is quite different from how we're using it for
> >Power.  I haven't got a clear enough picture yet to be sure either
> >way.
> >
> >The other option that would seem likely to me would be a "bind VFIO
> >container" ioctl() on the fd associated with a kernel accelerated TCE table.
> 
> 
> Oh, I just noticed this response. I need to digest it. Looks like this is
> going to take another 2 years to upstream...
> 
> 
> >>>>To make use of the feature, the user space has to create a guest view
> >>>>of the TCE table via KVM_CAP_SPAPR_TCE/KVM_CAP_SPAPR_TCE_64 and
> >>>>then associate a LIOBN with this table via VFIO KVM device,
> >>>>a KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN property (which is added in
> >>>>the next patch).
> >>>>
> >>>>Tests show that this patch increases transmission speed from 220MB/s
> >>>>to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb ethernet card).
> >>>
> >>>Is that with or without DDW (i.e. with or without a 64-bit DMA window)?
> >>
> >>
> >>Without DDW, I should have mentioned this. The patch is from the time when
> >>there was no DDW :(
> >
> >Ok.
> >
> >>>>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>>---
> >>>>  arch/powerpc/kvm/book3s_64_vio.c    | 184 +++++++++++++++++++++++++++++++++++
> >>>>  arch/powerpc/kvm/book3s_64_vio_hv.c | 186 ++++++++++++++++++++++++++++++++++++
> >>>>  2 files changed, 370 insertions(+)
> >>>>
> >>>>diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> >>>>index 7965fc7..9417d12 100644
> >>>>--- a/arch/powerpc/kvm/book3s_64_vio.c
> >>>>+++ b/arch/powerpc/kvm/book3s_64_vio.c
> >>>>@@ -33,6 +33,7 @@
> >>>>  #include <asm/kvm_ppc.h>
> >>>>  #include <asm/kvm_book3s.h>
> >>>>  #include <asm/mmu-hash64.h>
> >>>>+#include <asm/mmu_context.h>
> >>>>  #include <asm/hvcall.h>
> >>>>  #include <asm/synch.h>
> >>>>  #include <asm/ppc-opcode.h>
> >>>>@@ -317,11 +318,161 @@ fail:
> >>>>  	return ret;
> >>>>  }
> >>>>
> >>>>+static long kvmppc_tce_iommu_mapped_dec(struct iommu_table *tbl,
> >>>>+		unsigned long entry)
> >>>>+{
> >>>>+	struct mm_iommu_table_group_mem_t *mem = NULL;
> >>>>+	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> >>>>+	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> >>>>+
> >>>>+	if (!pua)
> >>>>+		return H_HARDWARE;
> >>>>+
> >>>>+	mem = mm_iommu_lookup(*pua, pgsize);
> >>>>+	if (!mem)
> >>>>+		return H_HARDWARE;
> >>>>+
> >>>>+	mm_iommu_mapped_dec(mem);
> >>>>+
> >>>>+	*pua = 0;
> >>>>+
> >>>>+	return H_SUCCESS;
> >>>>+}
> >>>>+
> >>>>+static long kvmppc_tce_iommu_unmap(struct iommu_table *tbl,
> >>>>+		unsigned long entry)
> >>>>+{
> >>>>+	enum dma_data_direction dir = DMA_NONE;
> >>>>+	unsigned long hpa = 0;
> >>>>+
> >>>>+	if (iommu_tce_xchg(tbl, entry, &hpa, &dir))
> >>>>+		return H_HARDWARE;
> >>>>+
> >>>>+	if (dir == DMA_NONE)
> >>>>+		return H_SUCCESS;
> >>>>+
> >>>>+	return kvmppc_tce_iommu_mapped_dec(tbl, entry);
> >>>>+}
> >>>>+
> >>>>+long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
> >>>>+		unsigned long entry, unsigned long gpa,
> >>>>+		enum dma_data_direction dir)
> >>>>+{
> >>>>+	long ret;
> >>>>+	unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> >>>>+	struct mm_iommu_table_group_mem_t *mem;
> >>>>+
> >>>>+	if (!pua)
> >>>>+		return H_HARDWARE;
> >>>
> >>>H_HARDWARE?  Or H_PARAMETER?  This essentially means the guest has
> >>>supplied a bad physical address, doesn't it?
> >>
> >>Well, maybe. I'll change it. If it is not H_TOO_HARD, it does not make any
> >>difference after all :)
> >>
> >>
> >>
> >>>>+	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
> >>>>+		return H_HARDWARE;
> >>>>+
> >>>>+	mem = mm_iommu_lookup(ua, 1ULL << tbl->it_page_shift);
> >>>>+	if (!mem)
> >>>>+		return H_HARDWARE;
> >>>>+
> >>>>+	if (mm_iommu_ua_to_hpa(mem, ua, &hpa))
> >>>>+		return H_HARDWARE;
> >>>>+
> >>>>+	if (mm_iommu_mapped_inc(mem))
> >>>>+		return H_HARDWARE;
> >>>>+
> >>>>+	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
> >>>>+	if (ret) {
> >>>>+		mm_iommu_mapped_dec(mem);
> >>>>+		return H_TOO_HARD;
> >>>>+	}
> >>>>+
> >>>>+	if (dir != DMA_NONE)
> >>>>+		kvmppc_tce_iommu_mapped_dec(tbl, entry);
> >>>>+
> >>>>+	*pua = ua;
> >>>
> >>>IIUC this means you have a copy of the UA for every group attached to
> >>>the TCE table, but they'll all be the same. Any way to avoid that
> >>>duplication?
> >>
> >>It is for every container, not a group. On P8, I allow multiple groups to go
> >>to the same container, which means that a container has one or two
> >>iommu_tables, and each iommu_table has this "ua" list; but since the
> >>tables are different (window size, page size, content), these "ua" arrays
> >>are also different.
> >
> >Erm.. but h_put_tce iterates h_put_tce_iommu through all the groups
> >attached to the stt, and each one seems to update pua.
> >
> >Or is that what the if (kg->tbl == tbltmp) continue; is supposed to
> >avoid?  In which case what ensures that the stt->groups list is
> >ordered by tbl pointer?
> 
> 
> Nothing. In the normal case (POWER8 IODA2) all groups on the same liobn have
> the same iommu_table, so only the first group's copy gets updated and the
> others do not, but that is OK as they all use the same table.

Right, which is another indication that group is the wrong concept to
use here.

> In a bad case (POWER7 IODA1, multiple containers per LIOBN) the same @ua can
> be updated more than once. Well, not a huge loss.

Ugh.. this really seems to be based on knowing the specific cases we
have in practice, rather than writing code that's correct based only
on the properties that the objects are defined to have.  The latter
approach will make for much more robust and extensible code.
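
To make the contrast concrete, an ordering-independent version would look
something like this (sketch only; 'seen' and its helpers are hypothetical):

/* Update each distinct iommu_table exactly once, no matter how the
 * group list happens to be ordered.  'seen' is some small set keyed
 * by table pointer; table_seen()/table_mark_seen() are hypothetical. */
list_for_each_entry(kg, &stt->groups, node) {
	if (table_seen(&seen, kg->tbl))
		continue;
	table_mark_seen(&seen, kg->tbl);
	ret = kvmppc_tce_iommu_map(kvm, kg->tbl, entry, gpa, dir);
	if (ret != H_SUCCESS)
		break;
}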

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH kernel 9/9] KVM: PPC: VFIO device: support SPAPR TCE
  2016-03-10 23:09           ` Alexey Kardashevskiy
@ 2016-03-15  6:04             ` David Gibson
  -1 siblings, 0 replies; 92+ messages in thread
From: David Gibson @ 2016-03-15  6:04 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Paul Mackerras, Alex Williamson, kvm-ppc, kvm

[-- Attachment #1: Type: text/plain, Size: 12570 bytes --]

On Fri, Mar 11, 2016 at 10:09:50AM +1100, Alexey Kardashevskiy wrote:
> On 03/10/2016 04:21 PM, David Gibson wrote:
> >On Wed, Mar 09, 2016 at 08:20:12PM +1100, Alexey Kardashevskiy wrote:
> >>On 03/09/2016 04:45 PM, David Gibson wrote:
> >>>On Mon, Mar 07, 2016 at 02:41:17PM +1100, Alexey Kardashevskiy wrote:
> >>>>sPAPR TCE IOMMU is para-virtualized and the guest does map/unmap
> >>>>via hypercalls which take a logical bus id (LIOBN) as a target IOMMU
> >>>>identifier. LIOBNs are made up, advertised to guest systems and
>>>>linked to IOMMU groups by user space.
> >>>>In order to enable acceleration for IOMMU operations in KVM, we need
>>>>to tell KVM about the LIOBN-to-group mapping.
> >>>>
> >>>>For that, a new KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN parameter
> >>>>is added which accepts:
> >>>>- a VFIO group fd and IO base address to find the actual hardware
> >>>>TCE table;
> >>>>- a LIOBN to assign to the found table.
> >>>>
>>>>Before notifying KVM about the new link, this checks that the group is
>>>>registered with the KVM device so that it can be released if KVM
>>>>finishes unexpectedly.
> >>>>
> >>>>This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> >>>>space.
> >>>>
>>>>While we are here, this also fixes the VFIO KVM device build to let it
>>>>link into the KVM module.
> >>>>
> >>>>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>>---
> >>>>  Documentation/virtual/kvm/devices/vfio.txt |  21 +++++-
> >>>>  arch/powerpc/kvm/Kconfig                   |   1 +
> >>>>  arch/powerpc/kvm/Makefile                  |   5 +-
> >>>>  arch/powerpc/kvm/powerpc.c                 |   1 +
> >>>>  include/uapi/linux/kvm.h                   |   9 +++
> >>>>  virt/kvm/vfio.c                            | 106 +++++++++++++++++++++++++++++
> >>>>  6 files changed, 140 insertions(+), 3 deletions(-)
> >>>>
> >>>>diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> >>>>index ef51740..c0d3eb7 100644
> >>>>--- a/Documentation/virtual/kvm/devices/vfio.txt
> >>>>+++ b/Documentation/virtual/kvm/devices/vfio.txt
> >>>>@@ -16,7 +16,24 @@ Groups:
> >>>>
> >>>>  KVM_DEV_VFIO_GROUP attributes:
> >>>>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> >>>>+	kvm_device_attr.addr points to an int32_t file descriptor
> >>>>+	for the VFIO group.
> >>>
> >>>AFAICT these changes are accurate for VFIO as it is already, in which
> >>>case it might be clearer to put them in a separate patch.
> >>>
> >>>>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
> >>>>+	kvm_device_attr.addr points to an int32_t file descriptor
> >>>>+	for the VFIO group.
> >>>>
> >>>>-For each, kvm_device_attr.addr points to an int32_t file descriptor
> >>>>-for the VFIO group.
> >>>>+  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: sets a liobn for a VFIO group
> >>>>+	kvm_device_attr.addr points to a struct:
> >>>>+		struct kvm_vfio_spapr_tce_liobn {
> >>>>+			__u32	argsz;
> >>>>+			__s32	fd;
> >>>>+			__u32	liobn;
> >>>>+			__u8	pad[4];
> >>>>+			__u64	start_addr;
> >>>>+		};
> >>>>+		where
> >>>>+		@argsz is the size of kvm_vfio_spapr_tce_liobn;
> >>>>+		@fd is a file descriptor for a VFIO group;
> >>>>+		@liobn is a logical bus id to be associated with the group;
> >>>>+		@start_addr is a DMA window offset on the IO (PCI) bus
> >>>
>>>For the case of DDW and multiple windows, I'm assuming you can call
> >>>this multiple times with different LIOBNs and the same IOMMU group?
> >>
> >>
> >>Yes. It is called twice for each group (when DDW is activated) - for the
> >>32bit and 64bit windows; this is why @start_addr is there.
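
For concreteness, from userspace that pair of calls would look something
like this (illustrative only; the fds, the LIOBN values and the 64-bit
window offset are made up):

/* group_fd, liobn32, liobn64 and vfio_kvm_device_fd set up elsewhere */
struct kvm_vfio_spapr_tce_liobn param = {
	.argsz = sizeof(param),
	.fd = group_fd,
	.liobn = liobn32,
	.start_addr = 0,		/* default 32-bit window */
};
struct kvm_device_attr attr = {
	.group = KVM_DEV_VFIO_GROUP,
	.attr = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN,
	.addr = (__u64)(unsigned long)&param,
};

ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr);

param.liobn = liobn64;			/* 64-bit DDW window */
param.start_addr = 1ULL << 59;		/* example bus offset */
ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr);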
> >>
> >>
> >>>>diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
> >>>>index 1059846..dfa3488 100644
> >>>>--- a/arch/powerpc/kvm/Kconfig
> >>>>+++ b/arch/powerpc/kvm/Kconfig
> >>>>@@ -65,6 +65,7 @@ config KVM_BOOK3S_64
> >>>>  	select KVM
> >>>>  	select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
> >>>>  	select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
> >>>>+	select KVM_VFIO if VFIO
> >>>>  	---help---
> >>>>  	  Support running unmodified book3s_64 and book3s_32 guest kernels
> >>>>  	  in virtual machines on book3s_64 host processors.
> >>>>diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
> >>>>index 7f7b6d8..71f577c 100644
> >>>>--- a/arch/powerpc/kvm/Makefile
> >>>>+++ b/arch/powerpc/kvm/Makefile
> >>>>@@ -8,7 +8,7 @@ ccflags-y := -Ivirt/kvm -Iarch/powerpc/kvm
> >>>>  KVM := ../../../virt/kvm
> >>>>
> >>>>  common-objs-y = $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
> >>>>-		$(KVM)/eventfd.o $(KVM)/vfio.o
> >>>>+		$(KVM)/eventfd.o
> >>>
> >>>Please don't disable the VFIO device for the non-book3s case.  I added
> >>>it (even though it didn't do anything until now) so that libvirt
> >>>wouldn't choke when it finds it's not available.  Obviously the new
> >>>ioctl needs to be only for the right IOMMU setup, but the device
> >>>itself should be available always.
> >>
> >>Ah. Ok, I'll fix this. I just wanted to be able to compile kvm as a module.
> >>
> >>
> >>>>  CFLAGS_e500_mmu.o := -I.
> >>>>  CFLAGS_e500_mmu_host.o := -I.
> >>>>@@ -87,6 +87,9 @@ endif
> >>>>  kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \
> >>>>  	book3s_xics.o
> >>>>
> >>>>+kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) += \
> >>>>+	$(KVM)/vfio.o \
> >>>>+
> >>>>  kvm-book3s_64-module-objs += \
> >>>>  	$(KVM)/kvm_main.o \
> >>>>  	$(KVM)/eventfd.o \
> >>>>diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> >>>>index 19aa59b..63f188d 100644
> >>>>--- a/arch/powerpc/kvm/powerpc.c
> >>>>+++ b/arch/powerpc/kvm/powerpc.c
> >>>>@@ -521,6 +521,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> >>>>  #ifdef CONFIG_PPC_BOOK3S_64
> >>>>  	case KVM_CAP_SPAPR_TCE:
> >>>>  	case KVM_CAP_SPAPR_TCE_64:
> >>>>+	case KVM_CAP_SPAPR_TCE_VFIO:
> >>>>  	case KVM_CAP_PPC_ALLOC_HTAB:
> >>>>  	case KVM_CAP_PPC_RTAS:
> >>>>  	case KVM_CAP_PPC_FIXUP_HCALL:
> >>>>diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >>>>index 080ffbf..f1abbea 100644
> >>>>--- a/include/uapi/linux/kvm.h
> >>>>+++ b/include/uapi/linux/kvm.h
> >>>>@@ -1056,6 +1056,7 @@ struct kvm_device_attr {
> >>>>  #define  KVM_DEV_VFIO_GROUP			1
> >>>>  #define   KVM_DEV_VFIO_GROUP_ADD			1
> >>>>  #define   KVM_DEV_VFIO_GROUP_DEL			2
> >>>>+#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN	3
> >>>>
> >>>>  enum kvm_device_type {
> >>>>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
> >>>>@@ -1075,6 +1076,14 @@ enum kvm_device_type {
> >>>>  	KVM_DEV_TYPE_MAX,
> >>>>  };
> >>>>
> >>>>+struct kvm_vfio_spapr_tce_liobn {
> >>>>+	__u32	argsz;
> >>>>+	__s32	fd;
> >>>>+	__u32	liobn;
> >>>>+	__u8	pad[4];
> >>>>+	__u64	start_addr;
> >>>>+};
> >>>>+
> >>>>  /*
> >>>>   * ioctls for VM fds
> >>>>   */
> >>>>diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> >>>>index 1dd087d..87c771e 100644
> >>>>--- a/virt/kvm/vfio.c
> >>>>+++ b/virt/kvm/vfio.c
> >>>>@@ -20,6 +20,10 @@
> >>>>  #include <linux/vfio.h>
> >>>>  #include "vfio.h"
> >>>>
> >>>>+#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>>+#include <asm/kvm_ppc.h>
> >>>>+#endif
> >>>>+
> >>>>  struct kvm_vfio_group {
> >>>>  	struct list_head node;
> >>>>  	struct vfio_group *vfio_group;
> >>>>@@ -60,6 +64,22 @@ static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
> >>>>  	symbol_put(vfio_group_put_external_user);
> >>>>  }
> >>>>
> >>>>+static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> >>>>+{
> >>>>+	int (*fn)(struct vfio_group *);
> >>>>+	int ret = -1;
> >>>
> >>>Should this be -ESOMETHING?
> >>>
> >>>>+	fn = symbol_get(vfio_external_user_iommu_id);
> >>>>+	if (!fn)
> >>>>+		return ret;
> >>>>+
> >>>>+	ret = fn(vfio_group);
> >>>>+
> >>>>+	symbol_put(vfio_external_user_iommu_id);
> >>>>+
> >>>>+	return ret;
> >>>>+}
> >>>>+
> >>>>  static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
> >>>>  {
> >>>>  	long (*fn)(struct vfio_group *, unsigned long);
> >>>>@@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct kvm_device *dev)
> >>>>  	mutex_unlock(&kv->lock);
> >>>>  }
> >>>>
> >>>>+#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>>+static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
> >>>>+		struct vfio_group *vfio_group)
> >>>
> >>>
> >>>Shouldn't this go in the same patch that introduced the attach
> >>>function?
> >>
> >>Having fewer patches which touch different maintainers' areas is better. I
> >>cannot avoid touching both PPC KVM and VFIO in this patch but I can in
> >>"[PATCH kernel 6/9] KVM: PPC: Associate IOMMU group with guest view of TCE
> >>table".
> >>
> >>
> >>>
> >>>>+{
> >>>>+	int group_id;
> >>>>+	struct iommu_group *grp;
> >>>>+
> >>>>+	group_id = kvm_vfio_external_user_iommu_id(vfio_group);
> >>>>+	grp = iommu_group_get_by_id(group_id);
> >>>>+	if (grp) {
> >>>>+		kvm_spapr_tce_detach_iommu_group(kvm, grp);
> >>>>+		iommu_group_put(grp);
> >>>>+	}
> >>>>+}
> >>>>+#endif
> >>>>+
> >>>>  static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >>>>  {
> >>>>  	struct kvm_vfio *kv = dev->private;
> >>>>@@ -186,6 +222,10 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >>>>  				continue;
> >>>>
> >>>>  			list_del(&kvg->node);
> >>>>+#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>
> >>>Better to make a no-op version of the call than have to #ifdef at the
> >>>callsite.
> >>
> >>It is questionable. An x86 reader may decide that
> >>KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN is implemented for x86 and get
> >>confused.
> >>
> >>
> >>>
> >>>>+			kvm_vfio_spapr_detach_iommu_group(dev->kvm,
> >>>>+					kvg->vfio_group);
> >>>>+#endif
> >>>>  			kvm_vfio_group_put_external_user(kvg->vfio_group);
> >>>>  			kfree(kvg);
> >>>>  			ret = 0;
> >>>>@@ -201,6 +241,69 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >>>>  		kvm_vfio_update_coherency(dev);
> >>>>
> >>>>  		return ret;
> >>>>+
> >>>>+#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>>+	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: {
> >>>>+		struct kvm_vfio_spapr_tce_liobn param;
> >>>>+		unsigned long minsz;
> >>>>+		struct kvm_vfio *kv = dev->private;
> >>>>+		struct vfio_group *vfio_group;
> >>>>+		struct kvm_vfio_group *kvg;
> >>>>+		struct fd f;
> >>>>+
> >>>>+		minsz = offsetofend(struct kvm_vfio_spapr_tce_liobn,
> >>>>+				start_addr);
> >>>>+
> >>>>+		if (copy_from_user(&param, (void __user *)arg, minsz))
> >>>>+			return -EFAULT;
> >>>>+
> >>>>+		if (param.argsz < minsz)
> >>>>+			return -EINVAL;
> >>>>+
> >>>>+		f = fdget(param.fd);
> >>>>+		if (!f.file)
> >>>>+			return -EBADF;
> >>>>+
> >>>>+		vfio_group = kvm_vfio_group_get_external_user(f.file);
> >>>>+		fdput(f);
> >>>>+
> >>>>+		if (IS_ERR(vfio_group))
> >>>>+			return PTR_ERR(vfio_group);
> >>>>+
> >>>>+		ret = -ENOENT;
> >>>
> >>>Shouldn't there be some runtime test for the type of the IOMMU?  It's
> >>>possible a kernel could be built for a platform supporting multiple
> >>>IOMMU types.
> >>
> >>Well, it may make sense but I do not know how to test that. The IOMMU type
> >>is a VFIO container property, not a group property, and here (KVM) we only
> >>have groups.
> >
> >Which, as mentioned previously, is broken.
> 
> This is where I fail to follow you.
> 
> What I am trying to achieve here is pretty much referencing a group so it
> cannot be reused. Plus LIOBNs.

"Plus LIOBNs" is not a trivial change.  You are establishing a linkage
from LIOBNs to groups.  But that doesn't make sense; if mapping in one
(guest) LIOBN affects a group it must affect all groups in the
container.  i.e. LIOBN->container is the natural mapping, *not* LIOBN
to group.

> Passing a container fd does not make much
> sense here as the VFIO device would walk through groups, reference them and
> that is it; there is no locking on VFIO containers and so far there has been
> no need to teach KVM about containers.
> 
> What do I miss now?

Referencing the groups is essentially just a useful side effect.  The
important functionality is informing VFIO of the guest LIOBNs; and
LIOBNs map to containers, not groups.
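
In data-model terms the natural binding is therefore (a conceptual sketch
only, not real code):

/* One guest LIOBN binds to one VFIO container; the container owns
 * the TCE table(s) shared by every group inside it. */
struct liobn_binding {
	u32 liobn;			/* guest logical bus number */
	struct vfio_container *cont;	/* never an individual group */
};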

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH kernel 9/9] KVM: PPC: VFIO device: support SPAPR TCE
@ 2016-03-15  6:04             ` David Gibson
  0 siblings, 0 replies; 92+ messages in thread
From: David Gibson @ 2016-03-15  6:04 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Paul Mackerras, Alex Williamson, kvm-ppc, kvm

[-- Attachment #1: Type: text/plain, Size: 12570 bytes --]

On Fri, Mar 11, 2016 at 10:09:50AM +1100, Alexey Kardashevskiy wrote:
> On 03/10/2016 04:21 PM, David Gibson wrote:
> >On Wed, Mar 09, 2016 at 08:20:12PM +1100, Alexey Kardashevskiy wrote:
> >>On 03/09/2016 04:45 PM, David Gibson wrote:
> >>>On Mon, Mar 07, 2016 at 02:41:17PM +1100, Alexey Kardashevskiy wrote:
> >>>>sPAPR TCE IOMMU is para-virtualized and the guest does map/unmap
> >>>>via hypercalls which take a logical bus id (LIOBN) as a target IOMMU
> >>>>identifier. LIOBNs are made up, advertised to guest systems and
> >>>>linked to IOMMU groups by the user space.
> >>>>In order to enable acceleration for IOMMU operations in KVM, we need
> >>>>to tell KVM the information about the LIOBN-to-group mapping.
> >>>>
> >>>>For that, a new KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN parameter
> >>>>is added which accepts:
> >>>>- a VFIO group fd and IO base address to find the actual hardware
> >>>>TCE table;
> >>>>- a LIOBN to assign to the found table.
> >>>>
> >>>>Before notifying KVM about new link, this check the group for being
> >>>>registered with KVM device in order to release them at unexpected KVM
> >>>>finish.
> >>>>
> >>>>This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> >>>>space.
> >>>>
> >>>>While we are here, this also fixes VFIO KVM device compiling to let it
> >>>>link to a KVM module.
> >>>>
> >>>>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>>---
> >>>>  Documentation/virtual/kvm/devices/vfio.txt |  21 +++++-
> >>>>  arch/powerpc/kvm/Kconfig                   |   1 +
> >>>>  arch/powerpc/kvm/Makefile                  |   5 +-
> >>>>  arch/powerpc/kvm/powerpc.c                 |   1 +
> >>>>  include/uapi/linux/kvm.h                   |   9 +++
> >>>>  virt/kvm/vfio.c                            | 106 +++++++++++++++++++++++++++++
> >>>>  6 files changed, 140 insertions(+), 3 deletions(-)
> >>>>
> >>>>diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> >>>>index ef51740..c0d3eb7 100644
> >>>>--- a/Documentation/virtual/kvm/devices/vfio.txt
> >>>>+++ b/Documentation/virtual/kvm/devices/vfio.txt
> >>>>@@ -16,7 +16,24 @@ Groups:
> >>>>
> >>>>  KVM_DEV_VFIO_GROUP attributes:
> >>>>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> >>>>+	kvm_device_attr.addr points to an int32_t file descriptor
> >>>>+	for the VFIO group.
> >>>
> >>>AFAICT these changes are accurate for VFIO as it is already, in which
> >>>case it might be clearer to put them in a separate patch.
> >>>
> >>>>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
> >>>>+	kvm_device_attr.addr points to an int32_t file descriptor
> >>>>+	for the VFIO group.
> >>>>
> >>>>-For each, kvm_device_attr.addr points to an int32_t file descriptor
> >>>>-for the VFIO group.
> >>>>+  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: sets a liobn for a VFIO group
> >>>>+	kvm_device_attr.addr points to a struct:
> >>>>+		struct kvm_vfio_spapr_tce_liobn {
> >>>>+			__u32	argsz;
> >>>>+			__s32	fd;
> >>>>+			__u32	liobn;
> >>>>+			__u8	pad[4];
> >>>>+			__u64	start_addr;
> >>>>+		};
> >>>>+		where
> >>>>+		@argsz is the size of kvm_vfio_spapr_tce_liobn;
> >>>>+		@fd is a file descriptor for a VFIO group;
> >>>>+		@liobn is a logical bus id to be associated with the group;
> >>>>+		@start_addr is a DMA window offset on the IO (PCI) bus
> >>>
> >>>For the cause of DDW and multiple windows, I'm assuming you can call
> >>>this multiple times with different LIOBNs and the same IOMMU group?
> >>
> >>
> >>Yes. It is called twice per each group (when DDW is activated) - for 32bit
> >>and 64bit windows, this is why @start_addr is there.
> >>
> >>
> >>>>diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
> >>>>index 1059846..dfa3488 100644
> >>>>--- a/arch/powerpc/kvm/Kconfig
> >>>>+++ b/arch/powerpc/kvm/Kconfig
> >>>>@@ -65,6 +65,7 @@ config KVM_BOOK3S_64
> >>>>  	select KVM
> >>>>  	select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
> >>>>  	select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
> >>>>+	select KVM_VFIO if VFIO
> >>>>  	---help---
> >>>>  	  Support running unmodified book3s_64 and book3s_32 guest kernels
> >>>>  	  in virtual machines on book3s_64 host processors.
> >>>>diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
> >>>>index 7f7b6d8..71f577c 100644
> >>>>--- a/arch/powerpc/kvm/Makefile
> >>>>+++ b/arch/powerpc/kvm/Makefile
> >>>>@@ -8,7 +8,7 @@ ccflags-y := -Ivirt/kvm -Iarch/powerpc/kvm
> >>>>  KVM := ../../../virt/kvm
> >>>>
> >>>>  common-objs-y = $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
> >>>>-		$(KVM)/eventfd.o $(KVM)/vfio.o
> >>>>+		$(KVM)/eventfd.o
> >>>
> >>>Please don't disable the VFIO device for the non-book3s case.  I added
> >>>it (even though it didn't do anything until now) so that libvirt
> >>>wouldn't choke when it finds it's not available.  Obviously the new
> >>>ioctl needs to be only for the right IOMMU setup, but the device
> >>>itself should be available always.
> >>
> >>Ah. Ok, I'll fix this. I just wanted to be able to compile kvm as a module.
> >>
> >>
> >>>>  CFLAGS_e500_mmu.o := -I.
> >>>>  CFLAGS_e500_mmu_host.o := -I.
> >>>>@@ -87,6 +87,9 @@ endif
> >>>>  kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \
> >>>>  	book3s_xics.o
> >>>>
> >>>>+kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) += \
> >>>>+	$(KVM)/vfio.o \
> >>>>+
> >>>>  kvm-book3s_64-module-objs += \
> >>>>  	$(KVM)/kvm_main.o \
> >>>>  	$(KVM)/eventfd.o \
> >>>>diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> >>>>index 19aa59b..63f188d 100644
> >>>>--- a/arch/powerpc/kvm/powerpc.c
> >>>>+++ b/arch/powerpc/kvm/powerpc.c
> >>>>@@ -521,6 +521,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> >>>>  #ifdef CONFIG_PPC_BOOK3S_64
> >>>>  	case KVM_CAP_SPAPR_TCE:
> >>>>  	case KVM_CAP_SPAPR_TCE_64:
> >>>>+	case KVM_CAP_SPAPR_TCE_VFIO:
> >>>>  	case KVM_CAP_PPC_ALLOC_HTAB:
> >>>>  	case KVM_CAP_PPC_RTAS:
> >>>>  	case KVM_CAP_PPC_FIXUP_HCALL:
> >>>>diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >>>>index 080ffbf..f1abbea 100644
> >>>>--- a/include/uapi/linux/kvm.h
> >>>>+++ b/include/uapi/linux/kvm.h
> >>>>@@ -1056,6 +1056,7 @@ struct kvm_device_attr {
> >>>>  #define  KVM_DEV_VFIO_GROUP			1
> >>>>  #define   KVM_DEV_VFIO_GROUP_ADD			1
> >>>>  #define   KVM_DEV_VFIO_GROUP_DEL			2
> >>>>+#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN	3
> >>>>
> >>>>  enum kvm_device_type {
> >>>>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
> >>>>@@ -1075,6 +1076,14 @@ enum kvm_device_type {
> >>>>  	KVM_DEV_TYPE_MAX,
> >>>>  };
> >>>>
> >>>>+struct kvm_vfio_spapr_tce_liobn {
> >>>>+	__u32	argsz;
> >>>>+	__s32	fd;
> >>>>+	__u32	liobn;
> >>>>+	__u8	pad[4];
> >>>>+	__u64	start_addr;
> >>>>+};
> >>>>+
> >>>>  /*
> >>>>   * ioctls for VM fds
> >>>>   */
> >>>>diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> >>>>index 1dd087d..87c771e 100644
> >>>>--- a/virt/kvm/vfio.c
> >>>>+++ b/virt/kvm/vfio.c
> >>>>@@ -20,6 +20,10 @@
> >>>>  #include <linux/vfio.h>
> >>>>  #include "vfio.h"
> >>>>
> >>>>+#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>>+#include <asm/kvm_ppc.h>
> >>>>+#endif
> >>>>+
> >>>>  struct kvm_vfio_group {
> >>>>  	struct list_head node;
> >>>>  	struct vfio_group *vfio_group;
> >>>>@@ -60,6 +64,22 @@ static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
> >>>>  	symbol_put(vfio_group_put_external_user);
> >>>>  }
> >>>>
> >>>>+static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> >>>>+{
> >>>>+	int (*fn)(struct vfio_group *);
> >>>>+	int ret = -1;
> >>>
> >>>Should this be -ESOMETHING?
> >>>
> >>>>+	fn = symbol_get(vfio_external_user_iommu_id);
> >>>>+	if (!fn)
> >>>>+		return ret;
> >>>>+
> >>>>+	ret = fn(vfio_group);
> >>>>+
> >>>>+	symbol_put(vfio_external_user_iommu_id);
> >>>>+
> >>>>+	return ret;
> >>>>+}
> >>>>+
> >>>>  static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
> >>>>  {
> >>>>  	long (*fn)(struct vfio_group *, unsigned long);
> >>>>@@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct kvm_device *dev)
> >>>>  	mutex_unlock(&kv->lock);
> >>>>  }
> >>>>
> >>>>+#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>>+static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
> >>>>+		struct vfio_group *vfio_group)
> >>>
> >>>
> >>>Shouldn't this go in the same patch that introduced the attach
> >>>function?
> >>
> >>Having fewer patches which touch different maintainers' areas is better. I
> >>cannot avoid touching both PPC KVM and VFIO in this patch but I can in
> >>"[PATCH kernel 6/9] KVM: PPC: Associate IOMMU group with guest view of TCE
> >>table".
> >>
> >>
> >>>
> >>>>+{
> >>>>+	int group_id;
> >>>>+	struct iommu_group *grp;
> >>>>+
> >>>>+	group_id = kvm_vfio_external_user_iommu_id(vfio_group);
> >>>>+	grp = iommu_group_get_by_id(group_id);
> >>>>+	if (grp) {
> >>>>+		kvm_spapr_tce_detach_iommu_group(kvm, grp);
> >>>>+		iommu_group_put(grp);
> >>>>+	}
> >>>>+}
> >>>>+#endif
> >>>>+
> >>>>  static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >>>>  {
> >>>>  	struct kvm_vfio *kv = dev->private;
> >>>>@@ -186,6 +222,10 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >>>>  				continue;
> >>>>
> >>>>  			list_del(&kvg->node);
> >>>>+#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>
> >>>Better to make a no-op version of the call than have to #ifdef at the
> >>>callsite.
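> >>>
> >>>i.e. keep the real definition under the #ifdef and provide something
> >>>like (sketch):
> >>>
> >>>#ifndef CONFIG_SPAPR_TCE_IOMMU
> >>>static inline void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
> >>>		struct vfio_group *vfio_group)
> >>>{
> >>>}
> >>>#endif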
> >>
> >>It is questionable. An x86 reader may decide that
> >>KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN is implemented for x86 and get
> >>confused.
> >>
> >>
> >>>
> >>>>+			kvm_vfio_spapr_detach_iommu_group(dev->kvm,
> >>>>+					kvg->vfio_group);
> >>>>+#endif
> >>>>  			kvm_vfio_group_put_external_user(kvg->vfio_group);
> >>>>  			kfree(kvg);
> >>>>  			ret = 0;
> >>>>@@ -201,6 +241,69 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >>>>  		kvm_vfio_update_coherency(dev);
> >>>>
> >>>>  		return ret;
> >>>>+
> >>>>+#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>>+	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: {
> >>>>+		struct kvm_vfio_spapr_tce_liobn param;
> >>>>+		unsigned long minsz;
> >>>>+		struct kvm_vfio *kv = dev->private;
> >>>>+		struct vfio_group *vfio_group;
> >>>>+		struct kvm_vfio_group *kvg;
> >>>>+		struct fd f;
> >>>>+
> >>>>+		minsz = offsetofend(struct kvm_vfio_spapr_tce_liobn,
> >>>>+				start_addr);
> >>>>+
> >>>>+		if (copy_from_user(&param, (void __user *)arg, minsz))
> >>>>+			return -EFAULT;
> >>>>+
> >>>>+		if (param.argsz < minsz)
> >>>>+			return -EINVAL;
> >>>>+
> >>>>+		f = fdget(param.fd);
> >>>>+		if (!f.file)
> >>>>+			return -EBADF;
> >>>>+
> >>>>+		vfio_group = kvm_vfio_group_get_external_user(f.file);
> >>>>+		fdput(f);
> >>>>+
> >>>>+		if (IS_ERR(vfio_group))
> >>>>+			return PTR_ERR(vfio_group);
> >>>>+
> >>>>+		ret = -ENOENT;
> >>>
> >>>Shouldn't there be some runtime test for the type of the IOMMU?  It's
> >>>possible a kernel could be built for a platform supporting multiple
> >>>IOMMU types.
> >>
> >>Well, it may make sense but I do not know how to test that. The IOMMU type is a
> >>VFIO container property, not a group property and here (KVM) we only have
> >>groups.
> >
> >Which, as mentioned previously, is broken.
> 
> Which I am failing to follow you on.
> 
> What I am trying to achieve here is pretty much referencing a group so it
> cannot be reused. Plus LIOBNs.

"Plus LIOBNs" is not a trivial change.  You are establishing a linkage
from LIOBNs to groups.  But that doesn't make sense; if mapping in one
(guest) LIOBN affects a group it must affect all groups in the
container.  i.e. LIOBN->container is the natural mapping, *not* LIOBN
to group.

> Passing a container fd does not make much
> sense here as the VFIO device would walk through groups, reference them and
> that is it; there is no locking on VFIO containers and so far there was no
> need to teach KVM about containers.
> 
> What do I miss now?

Referencing the groups is essentially just a useful side effect.  The
important functionality is informing VFIO of the guest LIOBNs; and
LIOBNs map to containers, not groups.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH kernel 9/9] KVM: PPC: VFIO device: support SPAPR TCE
       [not found]               ` <20160321051932.GJ23586@voom.redhat.com>
@ 2016-03-22  0:34                   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 92+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-22  0:34 UTC (permalink / raw)
  To: David Gibson; +Cc: linuxppc-dev, Paul Mackerras, Alex Williamson, kvm-ppc, kvm

Uff, lost cc: list. Added back. Some comments below.


On 03/21/2016 04:19 PM, David Gibson wrote:
> On Fri, Mar 18, 2016 at 11:12:26PM +1100, Alexey Kardashevskiy wrote:
>> On March 15, 2016 17:29:26 David Gibson <david@gibson.dropbear.id.au> wrote:
>>
>>> On Fri, Mar 11, 2016 at 10:09:50AM +1100, Alexey Kardashevskiy wrote:
>>>> On 03/10/2016 04:21 PM, David Gibson wrote:
>>>>> On Wed, Mar 09, 2016 at 08:20:12PM +1100, Alexey Kardashevskiy wrote:
>>>>>> On 03/09/2016 04:45 PM, David Gibson wrote:
>>>>>>> On Mon, Mar 07, 2016 at 02:41:17PM +1100, Alexey Kardashevskiy wrote:
>>>>>>>> sPAPR TCE IOMMU is para-virtualized and the guest does map/unmap
>>>>>>>> via hypercalls which take a logical bus id (LIOBN) as a target IOMMU
>>>>>>>> identifier. LIOBNs are made up, advertised to guest systems and
>>>>>>>> linked to IOMMU groups by the user space.
>>>>>>>> In order to enable acceleration for IOMMU operations in KVM, we need
>>>>>>>> to tell KVM the information about the LIOBN-to-group mapping.
>>>>>>>>
>>>>>>>> For that, a new KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN parameter
>>>>>>>> is added which accepts:
>>>>>>>> - a VFIO group fd and IO base address to find the actual hardware
>>>>>>>> TCE table;
>>>>>>>> - a LIOBN to assign to the found table.
>>>>>>>>
>>>>>>>> Before notifying KVM about a new link, this checks the group for being
>>>>>>>> registered with the KVM device in order to release it at unexpected KVM
>>>>>>>> finish.
>>>>>>>>
>>>>>>>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
>>>>>>>> space.
>>>>>>>>
>>>>>>>> While we are here, this also fixes VFIO KVM device compiling to let it
>>>>>>>> link to a KVM module.
>>>>>>>>
>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>> ---
>>>>>>>>   Documentation/virtual/kvm/devices/vfio.txt |  21 +++++-
>>>>>>>>   arch/powerpc/kvm/Kconfig                   |   1 +
>>>>>>>>   arch/powerpc/kvm/Makefile                  |   5 +-
>>>>>>>>   arch/powerpc/kvm/powerpc.c                 |   1 +
>>>>>>>>   include/uapi/linux/kvm.h                   |   9 +++
>>>>>>>>   virt/kvm/vfio.c                            | 106
>>>> +++++++++++++++++++++++++++++
>>>>>>>>   6 files changed, 140 insertions(+), 3 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt
>>>> b/Documentation/virtual/kvm/devices/vfio.txt
>>>>>>>> index ef51740..c0d3eb7 100644
>>>>>>>> --- a/Documentation/virtual/kvm/devices/vfio.txt
>>>>>>>> +++ b/Documentation/virtual/kvm/devices/vfio.txt
>>>>>>>> @@ -16,7 +16,24 @@ Groups:
>>>>>>>>
>>>>>>>>   KVM_DEV_VFIO_GROUP attributes:
>>>>>>>>     KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
>>>>>>>> +	kvm_device_attr.addr points to an int32_t file descriptor
>>>>>>>> +	for the VFIO group.
>>>>>>>
>>>>>>> AFAICT these changes are accurate for VFIO as it is already, in which
>>>>>>> case it might be clearer to put them in a separate patch.
>>>>>>>
>>>>>>>>     KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device
>>>> tracking
>>>>>>>> +	kvm_device_attr.addr points to an int32_t file descriptor
>>>>>>>> +	for the VFIO group.
>>>>>>>>
>>>>>>>> -For each, kvm_device_attr.addr points to an int32_t file descriptor
>>>>>>>> -for the VFIO group.
>>>>>>>> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: sets a liobn for a VFIO group
>>>>>>>> +	kvm_device_attr.addr points to a struct:
>>>>>>>> +		struct kvm_vfio_spapr_tce_liobn {
>>>>>>>> +			__u32	argsz;
>>>>>>>> +			__s32	fd;
>>>>>>>> +			__u32	liobn;
>>>>>>>> +			__u8	pad[4];
>>>>>>>> +			__u64	start_addr;
>>>>>>>> +		};
>>>>>>>> +		where
>>>>>>>> +		@argsz is the size of kvm_vfio_spapr_tce_liobn;
>>>>>>>> +		@fd is a file descriptor for a VFIO group;
>>>>>>>> +		@liobn is a logical bus id to be associated with the group;
>>>>>>>> +		@start_addr is a DMA window offset on the IO (PCI) bus
>>>>>>>
>>>>>>> For the case of DDW and multiple windows, I'm assuming you can call
>>>>>>> this multiple times with different LIOBNs and the same IOMMU group?
>>>>>>
>>>>>>
>>>>>> Yes. It is called twice per group (when DDW is activated) - for the 32bit
>>>>>> and 64bit windows; this is why @start_addr is there.
>>>>>>
>>>>>>
>>>>>>>> diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
>>>>>>>> index 1059846..dfa3488 100644
>>>>>>>> --- a/arch/powerpc/kvm/Kconfig
>>>>>>>> +++ b/arch/powerpc/kvm/Kconfig
>>>>>>>> @@ -65,6 +65,7 @@ config KVM_BOOK3S_64
>>>>>>>>   	select KVM
>>>>>>>>   	select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
>>>>>>>>   	select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
>>>>>>>> +	select KVM_VFIO if VFIO
>>>>>>>>   	---help---
>>>>>>>>   	  Support running unmodified book3s_64 and book3s_32 guest kernels
>>>>>>>>   	  in virtual machines on book3s_64 host processors.
>>>>>>>> diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
>>>>>>>> index 7f7b6d8..71f577c 100644
>>>>>>>> --- a/arch/powerpc/kvm/Makefile
>>>>>>>> +++ b/arch/powerpc/kvm/Makefile
>>>>>>>> @@ -8,7 +8,7 @@ ccflags-y := -Ivirt/kvm -Iarch/powerpc/kvm
>>>>>>>>   KVM := ../../../virt/kvm
>>>>>>>>
>>>>>>>>   common-objs-y = $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
>>>>>>>> -		$(KVM)/eventfd.o $(KVM)/vfio.o
>>>>>>>> +		$(KVM)/eventfd.o
>>>>>>>
>>>>>>> Please don't disable the VFIO device for the non-book3s case.  I added
>>>>>>> it (even though it didn't do anything until now) so that libvirt
>>>>>>> wouldn't choke when it finds it's not available.  Obviously the new
>>>>>>> ioctl needs to be only for the right IOMMU setup, but the device
>>>>>>> itself should be available always.
>>>>>>
>>>>>> Ah. Ok, I'll fix this. I just wanted to be able to compile kvm as a module.
>>>>>>
>>>>>>
>>>>>>>>   CFLAGS_e500_mmu.o := -I.
>>>>>>>>   CFLAGS_e500_mmu_host.o := -I.
>>>>>>>> @@ -87,6 +87,9 @@ endif
>>>>>>>>   kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \
>>>>>>>>   	book3s_xics.o
>>>>>>>>
>>>>>>>> +kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) += \
>>>>>>>> +	$(KVM)/vfio.o \
>>>>>>>> +
>>>>>>>>   kvm-book3s_64-module-objs += \
>>>>>>>>   	$(KVM)/kvm_main.o \
>>>>>>>>   	$(KVM)/eventfd.o \
>>>>>>>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>>>>>>>> index 19aa59b..63f188d 100644
>>>>>>>> --- a/arch/powerpc/kvm/powerpc.c
>>>>>>>> +++ b/arch/powerpc/kvm/powerpc.c
>>>>>>>> @@ -521,6 +521,7 @@ int kvm_vm_ioctl_check_extension(struct kvm
>>>> *kvm, long ext)
>>>>>>>>   #ifdef CONFIG_PPC_BOOK3S_64
>>>>>>>>   	case KVM_CAP_SPAPR_TCE:
>>>>>>>>   	case KVM_CAP_SPAPR_TCE_64:
>>>>>>>> +	case KVM_CAP_SPAPR_TCE_VFIO:
>>>>>>>>   	case KVM_CAP_PPC_ALLOC_HTAB:
>>>>>>>>   	case KVM_CAP_PPC_RTAS:
>>>>>>>>   	case KVM_CAP_PPC_FIXUP_HCALL:
>>>>>>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>>>>>>>> index 080ffbf..f1abbea 100644
>>>>>>>> --- a/include/uapi/linux/kvm.h
>>>>>>>> +++ b/include/uapi/linux/kvm.h
>>>>>>>> @@ -1056,6 +1056,7 @@ struct kvm_device_attr {
>>>>>>>>   #define  KVM_DEV_VFIO_GROUP			1
>>>>>>>>   #define   KVM_DEV_VFIO_GROUP_ADD			1
>>>>>>>>   #define   KVM_DEV_VFIO_GROUP_DEL			2
>>>>>>>> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN	3
>>>>>>>>
>>>>>>>>   enum kvm_device_type {
>>>>>>>>   	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
>>>>>>>> @@ -1075,6 +1076,14 @@ enum kvm_device_type {
>>>>>>>>   	KVM_DEV_TYPE_MAX,
>>>>>>>>   };
>>>>>>>>
>>>>>>>> +struct kvm_vfio_spapr_tce_liobn {
>>>>>>>> +	__u32	argsz;
>>>>>>>> +	__s32	fd;
>>>>>>>> +	__u32	liobn;
>>>>>>>> +	__u8	pad[4];
>>>>>>>> +	__u64	start_addr;
>>>>>>>> +};
>>>>>>>> +
>>>>>>>>   /*
>>>>>>>>    * ioctls for VM fds
>>>>>>>>    */
>>>>>>>> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
>>>>>>>> index 1dd087d..87c771e 100644
>>>>>>>> --- a/virt/kvm/vfio.c
>>>>>>>> +++ b/virt/kvm/vfio.c
>>>>>>>> @@ -20,6 +20,10 @@
>>>>>>>>   #include <linux/vfio.h>
>>>>>>>>   #include "vfio.h"
>>>>>>>>
>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>>>>>> +#include <asm/kvm_ppc.h>
>>>>>>>> +#endif
>>>>>>>> +
>>>>>>>>   struct kvm_vfio_group {
>>>>>>>>   	struct list_head node;
>>>>>>>>   	struct vfio_group *vfio_group;
>>>>>>>> @@ -60,6 +64,22 @@ static void
>>>> kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
>>>>>>>>   	symbol_put(vfio_group_put_external_user);
>>>>>>>>   }
>>>>>>>>
>>>>>>>> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
>>>>>>>> +{
>>>>>>>> +	int (*fn)(struct vfio_group *);
>>>>>>>> +	int ret = -1;
>>>>>>>
>>>>>>> Should this be -ESOMETHING?
>>>>>>>
>>>>>>>> +	fn = symbol_get(vfio_external_user_iommu_id);
>>>>>>>> +	if (!fn)
>>>>>>>> +		return ret;
>>>>>>>> +
>>>>>>>> +	ret = fn(vfio_group);
>>>>>>>> +
>>>>>>>> +	symbol_put(vfio_external_user_iommu_id);
>>>>>>>> +
>>>>>>>> +	return ret;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>   static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
>>>>>>>>   {
>>>>>>>>   	long (*fn)(struct vfio_group *, unsigned long);
>>>>>>>> @@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct
>>>> kvm_device *dev)
>>>>>>>>   	mutex_unlock(&kv->lock);
>>>>>>>>   }
>>>>>>>>
>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>>>>>> +static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
>>>>>>>> +		struct vfio_group *vfio_group)
>>>>>>>
>>>>>>>
>>>>>>> Shouldn't this go in the same patch that introduced the attach
>>>>>>> function?
>>>>>>
>>>>>> Having fewer patches which touch different maintainers' areas is better. I
>>>>>> cannot avoid touching both PPC KVM and VFIO in this patch but I can in
>>>>>> "[PATCH kernel 6/9] KVM: PPC: Associate IOMMU group with guest view of TCE
>>>>>> table".
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>> +{
>>>>>>>> +	int group_id;
>>>>>>>> +	struct iommu_group *grp;
>>>>>>>> +
>>>>>>>> +	group_id = kvm_vfio_external_user_iommu_id(vfio_group);
>>>>>>>> +	grp = iommu_group_get_by_id(group_id);
>>>>>>>> +	if (grp) {
>>>>>>>> +		kvm_spapr_tce_detach_iommu_group(kvm, grp);
>>>>>>>> +		iommu_group_put(grp);
>>>>>>>> +	}
>>>>>>>> +}
>>>>>>>> +#endif
>>>>>>>> +
>>>>>>>>   static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>>>>>>>>   {
>>>>>>>>   	struct kvm_vfio *kv = dev->private;
>>>>>>>> @@ -186,6 +222,10 @@ static int kvm_vfio_set_group(struct kvm_device
>>>> *dev, long attr, u64 arg)
>>>>>>>>   				continue;
>>>>>>>>
>>>>>>>>   			list_del(&kvg->node);
>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>>>>>
>>>>>>> Better to make a no-op version of the call than have to #ifdef at the
>>>>>>> callsite.
>>>>>>
>>>>>> It is questionable. An x86 reader may decide that
>>>>>> KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN is implemented for x86 and get
>>>>>> confused.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>> +			kvm_vfio_spapr_detach_iommu_group(dev->kvm,
>>>>>>>> +					kvg->vfio_group);
>>>>>>>> +#endif
>>>>>>>>   			kvm_vfio_group_put_external_user(kvg->vfio_group);
>>>>>>>>   			kfree(kvg);
>>>>>>>>   			ret = 0;
>>>>>>>> @@ -201,6 +241,69 @@ static int kvm_vfio_set_group(struct kvm_device
>>>> *dev, long attr, u64 arg)
>>>>>>>>   		kvm_vfio_update_coherency(dev);
>>>>>>>>
>>>>>>>>   		return ret;
>>>>>>>> +
>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>>>>>> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: {
>>>>>>>> +		struct kvm_vfio_spapr_tce_liobn param;
>>>>>>>> +		unsigned long minsz;
>>>>>>>> +		struct kvm_vfio *kv = dev->private;
>>>>>>>> +		struct vfio_group *vfio_group;
>>>>>>>> +		struct kvm_vfio_group *kvg;
>>>>>>>> +		struct fd f;
>>>>>>>> +
>>>>>>>> +		minsz = offsetofend(struct kvm_vfio_spapr_tce_liobn,
>>>>>>>> +				start_addr);
>>>>>>>> +
>>>>>>>> +		if (copy_from_user(&param, (void __user *)arg, minsz))
>>>>>>>> +			return -EFAULT;
>>>>>>>> +
>>>>>>>> +		if (param.argsz < minsz)
>>>>>>>> +			return -EINVAL;
>>>>>>>> +
>>>>>>>> +		f = fdget(param.fd);
>>>>>>>> +		if (!f.file)
>>>>>>>> +			return -EBADF;
>>>>>>>> +
>>>>>>>> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
>>>>>>>> +		fdput(f);
>>>>>>>> +
>>>>>>>> +		if (IS_ERR(vfio_group))
>>>>>>>> +			return PTR_ERR(vfio_group);
>>>>>>>> +
>>>>>>>> +		ret = -ENOENT;
>>>>>>>
>>>>>>> Shouldn't there be some runtime test for the type of the IOMMU?  It's
>>>>>>> possible a kernel could be built for a platform supporting multiple
>>>>>>> IOMMU types.
>>>>>>
>>>>>> Well, it may make sense but I do not know how to test that. The IOMMU type is a
>>>>>> VFIO container property, not a group property and here (KVM) we only have
>>>>>> groups.
>>>>>
>>>>> Which, as mentioned previously, is broken.
>>>>
>>>> Which I am failing to follow you on.
>>>>
>>>> What I am trying to achieve here is pretty much referencing a group so it
>>>> cannot be reused. Plus LIOBNs.
>>>
>>> "Plus LIOBNs" is not a trivial change.  You are establishing a linkage
>>> from LIOBNs to groups.  But that doesn't make sense; if mapping in one
>>> (guest) LIOBN affects a group it must affect all groups in the
>>> container.  i.e. LIOBN->container is the natural mapping, *not* LIOBN
>>> to group.
>>
>> I can see your point but I don't see how to proceed now; I'm totally stuck.
>> Pass container fd and then implement new api to lock containers somehow and
>
> I'm not really understanding what the question is about locking containers.
>
>> enumerate groups when updating TCE table (including real mode)?
>
> Why do you need to enumerate groups?  The groups within the container
> share a TCE table, so can't you just update that once?

Well, they share a TCE table but they do not share the TCE Kill (TCE cache 
invalidate) register address - that is still per PE. This does not matter 
here (pnv_pci_link_table_and_group() handles that); I just mentioned it to 
complete the picture.
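
The invalidate side just walks the links which pnv_pci_link_table_and_group() 
maintains, roughly like this (a sketch - iommu_table_group_link and 
it_group_list are real, the invalidate helper name is made up):

	struct iommu_table_group_link *tgl;

	list_for_each_entry_rcu(tgl, &tbl->it_group_list, next)
		/* one link per PE (table_group) sharing this table */
		invalidate_pe_tce_cache(tgl->table_group);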


>> Plus new API when we remove a group from a container as the result of guest
>> PCI hot unplug?
>
> I assume you mean a kernel internal API, since it shouldn't need
> anything else visible to userspace.  Won't this happen naturally?
> When the group is removed from the container, it will get its own TCE
> table instead of the previously shared one.
> >
>>>> Passing a container fd does not make much
>>>> sense here as the VFIO device would walk through groups, reference them and
>>>> that is it, there is no locking on VFIO containters and so far there was no
>>>> need to teach KVM about containers.
>>>>
>>>> What do I miss now?
>>>
>>> Referencing the groups is essentially just a useful side effect.  The
>>> important functionality is informing VFIO of the guest LIOBNs; and
>>> LIOBNs map to containers, not groups.
>>
>> No. One liobn maps to one KVM-allocated TCE table, not a container. There
>> can be one, many, or no containers per liobn.
>
> Ah, true.

So I need to add a new kernel API for KVM to get table(s) from VFIO 
container(s), and invent some locking mechanism to prevent the table(s) (or 
associated container(s)) from going away, like 
vfio_group_get_external_user/vfio_group_put_external_user but for 
containers. Correct?
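
Roughly this shape (all names invented, just to illustrate):

	/* pin/unpin a container from KVM, like the group variants */
	struct vfio_container *vfio_container_get_external_user(struct file *filep);
	void vfio_container_put_external_user(struct vfio_container *container);

	/* let KVM look up the iommu_table(s) backing a container */
	long vfio_container_get_iommu_tables(struct vfio_container *container,
			struct iommu_table **tbls, unsigned int max_tbls);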



-- 
Alexey

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH kernel 9/9] KVM: PPC: VFIO device: support SPAPR TCE
  2016-03-22  0:34                   ` Alexey Kardashevskiy
@ 2016-03-23  3:03                     ` David Gibson
  -1 siblings, 0 replies; 92+ messages in thread
From: David Gibson @ 2016-03-23  3:03 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Paul Mackerras, Alex Williamson, kvm-ppc, kvm

[-- Attachment #1: Type: text/plain, Size: 16020 bytes --]

On Tue, Mar 22, 2016 at 11:34:55AM +1100, Alexey Kardashevskiy wrote:
> Uff, lost cc: list. Added back. Some comments below.
> 
> 
> On 03/21/2016 04:19 PM, David Gibson wrote:
> >On Fri, Mar 18, 2016 at 11:12:26PM +1100, Alexey Kardashevskiy wrote:
> >>On March 15, 2016 17:29:26 David Gibson <david@gibson.dropbear.id.au> wrote:
> >>
> >>>On Fri, Mar 11, 2016 at 10:09:50AM +1100, Alexey Kardashevskiy wrote:
> >>>>On 03/10/2016 04:21 PM, David Gibson wrote:
> >>>>>On Wed, Mar 09, 2016 at 08:20:12PM +1100, Alexey Kardashevskiy wrote:
> >>>>>>On 03/09/2016 04:45 PM, David Gibson wrote:
> >>>>>>>On Mon, Mar 07, 2016 at 02:41:17PM +1100, Alexey Kardashevskiy wrote:
> >>>>>>>>sPAPR TCE IOMMU is para-virtualized and the guest does map/unmap
> >>>>>>>>via hypercalls which take a logical bus id (LIOBN) as a target IOMMU
> >>>>>>>>identifier. LIOBNs are made up, advertised to guest systems and
> >>>>>>>>linked to IOMMU groups by the user space.
> >>>>>>>>In order to enable acceleration for IOMMU operations in KVM, we need
> >>>>>>>>to tell KVM the information about the LIOBN-to-group mapping.
> >>>>>>>>
> >>>>>>>>For that, a new KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN parameter
> >>>>>>>>is added which accepts:
> >>>>>>>>- a VFIO group fd and IO base address to find the actual hardware
> >>>>>>>>TCE table;
> >>>>>>>>- a LIOBN to assign to the found table.
> >>>>>>>>
> >>>>>>>>Before notifying KVM about a new link, this checks the group for being
> >>>>>>>>registered with the KVM device in order to release it at unexpected KVM
> >>>>>>>>finish.
> >>>>>>>>
> >>>>>>>>This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> >>>>>>>>space.
> >>>>>>>>
> >>>>>>>>While we are here, this also fixes VFIO KVM device compiling to let it
> >>>>>>>>link to a KVM module.
> >>>>>>>>
> >>>>>>>>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>>>>>>---
> >>>>>>>>  Documentation/virtual/kvm/devices/vfio.txt |  21 +++++-
> >>>>>>>>  arch/powerpc/kvm/Kconfig                   |   1 +
> >>>>>>>>  arch/powerpc/kvm/Makefile                  |   5 +-
> >>>>>>>>  arch/powerpc/kvm/powerpc.c                 |   1 +
> >>>>>>>>  include/uapi/linux/kvm.h                   |   9 +++
> >>>>>>>>  virt/kvm/vfio.c                            | 106
> >>>>+++++++++++++++++++++++++++++
> >>>>>>>>  6 files changed, 140 insertions(+), 3 deletions(-)
> >>>>>>>>
> >>>>>>>>diff --git a/Documentation/virtual/kvm/devices/vfio.txt
> >>>>b/Documentation/virtual/kvm/devices/vfio.txt
> >>>>>>>>index ef51740..c0d3eb7 100644
> >>>>>>>>--- a/Documentation/virtual/kvm/devices/vfio.txt
> >>>>>>>>+++ b/Documentation/virtual/kvm/devices/vfio.txt
> >>>>>>>>@@ -16,7 +16,24 @@ Groups:
> >>>>>>>>
> >>>>>>>>  KVM_DEV_VFIO_GROUP attributes:
> >>>>>>>>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> >>>>>>>>+	kvm_device_attr.addr points to an int32_t file descriptor
> >>>>>>>>+	for the VFIO group.
> >>>>>>>
> >>>>>>>AFAICT these changes are accurate for VFIO as it is already, in which
> >>>>>>>case it might be clearer to put them in a separate patch.
> >>>>>>>
> >>>>>>>>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device
> >>>>tracking
> >>>>>>>>+	kvm_device_attr.addr points to an int32_t file descriptor
> >>>>>>>>+	for the VFIO group.
> >>>>>>>>
> >>>>>>>>-For each, kvm_device_attr.addr points to an int32_t file descriptor
> >>>>>>>>-for the VFIO group.
> >>>>>>>>+  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: sets a liobn for a VFIO group
> >>>>>>>>+	kvm_device_attr.addr points to a struct:
> >>>>>>>>+		struct kvm_vfio_spapr_tce_liobn {
> >>>>>>>>+			__u32	argsz;
> >>>>>>>>+			__s32	fd;
> >>>>>>>>+			__u32	liobn;
> >>>>>>>>+			__u8	pad[4];
> >>>>>>>>+			__u64	start_addr;
> >>>>>>>>+		};
> >>>>>>>>+		where
> >>>>>>>>+		@argsz is the size of kvm_vfio_spapr_tce_liobn;
> >>>>>>>>+		@fd is a file descriptor for a VFIO group;
> >>>>>>>>+		@liobn is a logical bus id to be associated with the group;
> >>>>>>>>+		@start_addr is a DMA window offset on the IO (PCI) bus
> >>>>>>>
> >>>>>>>For the case of DDW and multiple windows, I'm assuming you can call
> >>>>>>>this multiple times with different LIOBNs and the same IOMMU group?
> >>>>>>
> >>>>>>
> >>>>>>Yes. It is called twice per group (when DDW is activated) - for the 32bit
> >>>>>>and 64bit windows; this is why @start_addr is there.
> >>>>>>
> >>>>>>
> >>>>>>>>diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
> >>>>>>>>index 1059846..dfa3488 100644
> >>>>>>>>--- a/arch/powerpc/kvm/Kconfig
> >>>>>>>>+++ b/arch/powerpc/kvm/Kconfig
> >>>>>>>>@@ -65,6 +65,7 @@ config KVM_BOOK3S_64
> >>>>>>>>  	select KVM
> >>>>>>>>  	select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
> >>>>>>>>  	select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
> >>>>>>>>+	select KVM_VFIO if VFIO
> >>>>>>>>  	---help---
> >>>>>>>>  	  Support running unmodified book3s_64 and book3s_32 guest kernels
> >>>>>>>>  	  in virtual machines on book3s_64 host processors.
> >>>>>>>>diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
> >>>>>>>>index 7f7b6d8..71f577c 100644
> >>>>>>>>--- a/arch/powerpc/kvm/Makefile
> >>>>>>>>+++ b/arch/powerpc/kvm/Makefile
> >>>>>>>>@@ -8,7 +8,7 @@ ccflags-y := -Ivirt/kvm -Iarch/powerpc/kvm
> >>>>>>>>  KVM := ../../../virt/kvm
> >>>>>>>>
> >>>>>>>>  common-objs-y = $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
> >>>>>>>>-		$(KVM)/eventfd.o $(KVM)/vfio.o
> >>>>>>>>+		$(KVM)/eventfd.o
> >>>>>>>
> >>>>>>>Please don't disable the VFIO device for the non-book3s case.  I added
> >>>>>>>it (even though it didn't do anything until now) so that libvirt
> >>>>>>>wouldn't choke when it finds it's not available.  Obviously the new
> >>>>>>>ioctl needs to be only for the right IOMMU setup, but the device
> >>>>>>>itself should be available always.
> >>>>>>
> >>>>>>Ah. Ok, I'll fix this. I just wanted to be able to compile kvm as a module.
> >>>>>>
> >>>>>>
> >>>>>>>>  CFLAGS_e500_mmu.o := -I.
> >>>>>>>>  CFLAGS_e500_mmu_host.o := -I.
> >>>>>>>>@@ -87,6 +87,9 @@ endif
> >>>>>>>>  kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \
> >>>>>>>>  	book3s_xics.o
> >>>>>>>>
> >>>>>>>>+kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) += \
> >>>>>>>>+	$(KVM)/vfio.o \
> >>>>>>>>+
> >>>>>>>>  kvm-book3s_64-module-objs += \
> >>>>>>>>  	$(KVM)/kvm_main.o \
> >>>>>>>>  	$(KVM)/eventfd.o \
> >>>>>>>>diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> >>>>>>>>index 19aa59b..63f188d 100644
> >>>>>>>>--- a/arch/powerpc/kvm/powerpc.c
> >>>>>>>>+++ b/arch/powerpc/kvm/powerpc.c
> >>>>>>>>@@ -521,6 +521,7 @@ int kvm_vm_ioctl_check_extension(struct kvm
> >>>>*kvm, long ext)
> >>>>>>>>  #ifdef CONFIG_PPC_BOOK3S_64
> >>>>>>>>  	case KVM_CAP_SPAPR_TCE:
> >>>>>>>>  	case KVM_CAP_SPAPR_TCE_64:
> >>>>>>>>+	case KVM_CAP_SPAPR_TCE_VFIO:
> >>>>>>>>  	case KVM_CAP_PPC_ALLOC_HTAB:
> >>>>>>>>  	case KVM_CAP_PPC_RTAS:
> >>>>>>>>  	case KVM_CAP_PPC_FIXUP_HCALL:
> >>>>>>>>diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >>>>>>>>index 080ffbf..f1abbea 100644
> >>>>>>>>--- a/include/uapi/linux/kvm.h
> >>>>>>>>+++ b/include/uapi/linux/kvm.h
> >>>>>>>>@@ -1056,6 +1056,7 @@ struct kvm_device_attr {
> >>>>>>>>  #define  KVM_DEV_VFIO_GROUP			1
> >>>>>>>>  #define   KVM_DEV_VFIO_GROUP_ADD			1
> >>>>>>>>  #define   KVM_DEV_VFIO_GROUP_DEL			2
> >>>>>>>>+#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN	3
> >>>>>>>>
> >>>>>>>>  enum kvm_device_type {
> >>>>>>>>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
> >>>>>>>>@@ -1075,6 +1076,14 @@ enum kvm_device_type {
> >>>>>>>>  	KVM_DEV_TYPE_MAX,
> >>>>>>>>  };
> >>>>>>>>
> >>>>>>>>+struct kvm_vfio_spapr_tce_liobn {
> >>>>>>>>+	__u32	argsz;
> >>>>>>>>+	__s32	fd;
> >>>>>>>>+	__u32	liobn;
> >>>>>>>>+	__u8	pad[4];
> >>>>>>>>+	__u64	start_addr;
> >>>>>>>>+};
> >>>>>>>>+
> >>>>>>>>  /*
> >>>>>>>>   * ioctls for VM fds
> >>>>>>>>   */
> >>>>>>>>diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> >>>>>>>>index 1dd087d..87c771e 100644
> >>>>>>>>--- a/virt/kvm/vfio.c
> >>>>>>>>+++ b/virt/kvm/vfio.c
> >>>>>>>>@@ -20,6 +20,10 @@
> >>>>>>>>  #include <linux/vfio.h>
> >>>>>>>>  #include "vfio.h"
> >>>>>>>>
> >>>>>>>>+#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>>>>>>+#include <asm/kvm_ppc.h>
> >>>>>>>>+#endif
> >>>>>>>>+
> >>>>>>>>  struct kvm_vfio_group {
> >>>>>>>>  	struct list_head node;
> >>>>>>>>  	struct vfio_group *vfio_group;
> >>>>>>>>@@ -60,6 +64,22 @@ static void
> >>>>kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
> >>>>>>>>  	symbol_put(vfio_group_put_external_user);
> >>>>>>>>  }
> >>>>>>>>
> >>>>>>>>+static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> >>>>>>>>+{
> >>>>>>>>+	int (*fn)(struct vfio_group *);
> >>>>>>>>+	int ret = -1;
> >>>>>>>
> >>>>>>>Should this be -ESOMETHING?
> >>>>>>>
> >>>>>>>>+	fn = symbol_get(vfio_external_user_iommu_id);
> >>>>>>>>+	if (!fn)
> >>>>>>>>+		return ret;
> >>>>>>>>+
> >>>>>>>>+	ret = fn(vfio_group);
> >>>>>>>>+
> >>>>>>>>+	symbol_put(vfio_external_user_iommu_id);
> >>>>>>>>+
> >>>>>>>>+	return ret;
> >>>>>>>>+}
> >>>>>>>>+
> >>>>>>>>  static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
> >>>>>>>>  {
> >>>>>>>>  	long (*fn)(struct vfio_group *, unsigned long);
> >>>>>>>>@@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct
> >>>>kvm_device *dev)
> >>>>>>>>  	mutex_unlock(&kv->lock);
> >>>>>>>>  }
> >>>>>>>>
> >>>>>>>>+#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>>>>>>+static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
> >>>>>>>>+		struct vfio_group *vfio_group)
> >>>>>>>
> >>>>>>>
> >>>>>>>Shouldn't this go in the same patch that introduced the attach
> >>>>>>>function?
> >>>>>>
> >>>>>>Having fewer patches which touch different maintainers' areas is better. I
> >>>>>>cannot avoid touching both PPC KVM and VFIO in this patch but I can in
> >>>>>>"[PATCH kernel 6/9] KVM: PPC: Associate IOMMU group with guest view of TCE
> >>>>>>table".
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>>>+{
> >>>>>>>>+	int group_id;
> >>>>>>>>+	struct iommu_group *grp;
> >>>>>>>>+
> >>>>>>>>+	group_id = kvm_vfio_external_user_iommu_id(vfio_group);
> >>>>>>>>+	grp = iommu_group_get_by_id(group_id);
> >>>>>>>>+	if (grp) {
> >>>>>>>>+		kvm_spapr_tce_detach_iommu_group(kvm, grp);
> >>>>>>>>+		iommu_group_put(grp);
> >>>>>>>>+	}
> >>>>>>>>+}
> >>>>>>>>+#endif
> >>>>>>>>+
> >>>>>>>>  static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >>>>>>>>  {
> >>>>>>>>  	struct kvm_vfio *kv = dev->private;
> >>>>>>>>@@ -186,6 +222,10 @@ static int kvm_vfio_set_group(struct kvm_device
> >>>>*dev, long attr, u64 arg)
> >>>>>>>>  				continue;
> >>>>>>>>
> >>>>>>>>  			list_del(&kvg->node);
> >>>>>>>>+#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>>>>>
> >>>>>>>Better to make a no-op version of the call than have to #ifdef at the
> >>>>>>>callsite.
> >>>>>>
> >>>>>>It is questionable. An x86 reader may decide that
> >>>>>>KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN is implemented for x86 and get
> >>>>>>confused.
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>>>+			kvm_vfio_spapr_detach_iommu_group(dev->kvm,
> >>>>>>>>+					kvg->vfio_group);
> >>>>>>>>+#endif
> >>>>>>>>  			kvm_vfio_group_put_external_user(kvg->vfio_group);
> >>>>>>>>  			kfree(kvg);
> >>>>>>>>  			ret = 0;
> >>>>>>>>@@ -201,6 +241,69 @@ static int kvm_vfio_set_group(struct kvm_device
> >>>>*dev, long attr, u64 arg)
> >>>>>>>>  		kvm_vfio_update_coherency(dev);
> >>>>>>>>
> >>>>>>>>  		return ret;
> >>>>>>>>+
> >>>>>>>>+#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>>>>>>+	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: {
> >>>>>>>>+		struct kvm_vfio_spapr_tce_liobn param;
> >>>>>>>>+		unsigned long minsz;
> >>>>>>>>+		struct kvm_vfio *kv = dev->private;
> >>>>>>>>+		struct vfio_group *vfio_group;
> >>>>>>>>+		struct kvm_vfio_group *kvg;
> >>>>>>>>+		struct fd f;
> >>>>>>>>+
> >>>>>>>>+		minsz = offsetofend(struct kvm_vfio_spapr_tce_liobn,
> >>>>>>>>+				start_addr);
> >>>>>>>>+
> >>>>>>>>+		if (copy_from_user(&param, (void __user *)arg, minsz))
> >>>>>>>>+			return -EFAULT;
> >>>>>>>>+
> >>>>>>>>+		if (param.argsz < minsz)
> >>>>>>>>+			return -EINVAL;
> >>>>>>>>+
> >>>>>>>>+		f = fdget(param.fd);
> >>>>>>>>+		if (!f.file)
> >>>>>>>>+			return -EBADF;
> >>>>>>>>+
> >>>>>>>>+		vfio_group = kvm_vfio_group_get_external_user(f.file);
> >>>>>>>>+		fdput(f);
> >>>>>>>>+
> >>>>>>>>+		if (IS_ERR(vfio_group))
> >>>>>>>>+			return PTR_ERR(vfio_group);
> >>>>>>>>+
> >>>>>>>>+		ret = -ENOENT;
> >>>>>>>
> >>>>>>>Shouldn't there be some runtime test for the type of the IOMMU?  It's
> >>>>>>>possible a kernel could be built for a platform supporting multiple
> >>>>>>>IOMMU types.
> >>>>>>
> >>>>>>Well, it may make sense but I do not know how to test that. The IOMMU type is a
> >>>>>>VFIO container property, not a group property and here (KVM) we only have
> >>>>>>groups.
> >>>>>
> >>>>>Which, as mentioned previously, is broken.
> >>>>
> >>>>Which I am failing to follow you on.
> >>>>
> >>>>What I am trying to achieve here is pretty much referencing a group so it
> >>>>cannot be reused. Plus LIOBNs.
> >>>
> >>>"Plus LIOBNs" is not a trivial change.  You are establishing a linkage
> >>>from LIOBNs to groups.  But that doesn't make sense; if mapping in one
> >>>(guest) LIOBN affects a group it must affect all groups in the
> >>>container.  i.e. LIOBN->container is the natural mapping, *not* LIOBN
> >>>to group.
> >>
> >>I can see your point but I don't see how to proceed now; I'm totally stuck.
> >>Pass container fd and then implement new api to lock containers somehow and
> >
> >I'm not really understanding what the question is about locking containers.
> >
> >>enumerate groups when updating TCE table (including real mode)?
> >
> >Why do you need to enumerate groups?  The groups within the container
> >share a TCE table, so can't you just update that once?
> 
> Well, they share a TCE table but they do not share the TCE Kill (TCE cache
> invalidate) register address - that is still per PE. This does not matter
> here (pnv_pci_link_table_and_group() handles that); I just mentioned it to
> complete the picture.

True, you'll need to enumerate the groups for invalidates.  But you
need that already, right?

> >>Plus new API when we remove a group from a container as the result of guest
> >>PCI hot unplug?
> >
> >I assume you mean a kernel internal API, since it shouldn't need
> >anything else visible to userspace.  Won't this happen naturally?
> >When the group is removed from the container, it will get its own TCE
> >table instead of the previously shared one.
> >
> >>>>Passing a container fd does not make much
> >>>>sense here as the VFIO device would walk through groups, reference them and
> >>>>that is it, there is no locking on VFIO containters and so far there was no
> >>>>need to teach KVM about containers.
> >>>>
> >>>>What do I miss now?
> >>>
> >>>Referencing the groups is essentially just a useful side effect.  The
> >>>important functionality is informing VFIO of the guest LIOBNs; and
> >>>LIOBNs map to containers, not groups.
> >>
> >>No. One liobn maps to one KVM-allocated TCE table, not a container. There
> >>can be one, many, or no containers per liobn.
> >
> >Ah, true.
> 
> So I need to add a new kernel API for KVM to get table(s) from VFIO
> container(s), and invent some locking mechanism to prevent the table(s) (or
> associated container(s)) from going away, like
> vfio_group_get_external_user/vfio_group_put_external_user but for
> containers. Correct?

Well, the container is attached to an fd, so if you get a reference on
the file* that should do it.
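
Roughly (sketch):

	struct fd f = fdget(container_fd);
	struct file *filp;

	if (!f.file)
		return -EBADF;

	filp = get_file(f.file);	/* held until the matching fput() */
	fdput(f);
	...
	fput(filp);	/* when the liobn is unset or the device is destroyed */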

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 92+ messages in thread

> >>>>>>>>index 080ffbf..f1abbea 100644
> >>>>>>>>--- a/include/uapi/linux/kvm.h
> >>>>>>>>+++ b/include/uapi/linux/kvm.h
> >>>>>>>>@@ -1056,6 +1056,7 @@ struct kvm_device_attr {
> >>>>>>>>  #define  KVM_DEV_VFIO_GROUP			1
> >>>>>>>>  #define   KVM_DEV_VFIO_GROUP_ADD			1
> >>>>>>>>  #define   KVM_DEV_VFIO_GROUP_DEL			2
> >>>>>>>>+#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN	3
> >>>>>>>>
> >>>>>>>>  enum kvm_device_type {
> >>>>>>>>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
> >>>>>>>>@@ -1075,6 +1076,14 @@ enum kvm_device_type {
> >>>>>>>>  	KVM_DEV_TYPE_MAX,
> >>>>>>>>  };
> >>>>>>>>
> >>>>>>>>+struct kvm_vfio_spapr_tce_liobn {
> >>>>>>>>+	__u32	argsz;
> >>>>>>>>+	__s32	fd;
> >>>>>>>>+	__u32	liobn;
> >>>>>>>>+	__u8	pad[4];
> >>>>>>>>+	__u64	start_addr;
> >>>>>>>>+};
> >>>>>>>>+
> >>>>>>>>  /*
> >>>>>>>>   * ioctls for VM fds
> >>>>>>>>   */
> >>>>>>>>diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> >>>>>>>>index 1dd087d..87c771e 100644
> >>>>>>>>--- a/virt/kvm/vfio.c
> >>>>>>>>+++ b/virt/kvm/vfio.c
> >>>>>>>>@@ -20,6 +20,10 @@
> >>>>>>>>  #include <linux/vfio.h>
> >>>>>>>>  #include "vfio.h"
> >>>>>>>>
> >>>>>>>>+#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>>>>>>+#include <asm/kvm_ppc.h>
> >>>>>>>>+#endif
> >>>>>>>>+
> >>>>>>>>  struct kvm_vfio_group {
> >>>>>>>>  	struct list_head node;
> >>>>>>>>  	struct vfio_group *vfio_group;
> >>>>>>>>@@ -60,6 +64,22 @@ static void
> >>>>kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
> >>>>>>>>  	symbol_put(vfio_group_put_external_user);
> >>>>>>>>  }
> >>>>>>>>
> >>>>>>>>+static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> >>>>>>>>+{
> >>>>>>>>+	int (*fn)(struct vfio_group *);
> >>>>>>>>+	int ret = -1;
> >>>>>>>
> >>>>>>>Should this be -ESOMETHING?
> >>>>>>>
> >>>>>>>>+	fn = symbol_get(vfio_external_user_iommu_id);
> >>>>>>>>+	if (!fn)
> >>>>>>>>+		return ret;
> >>>>>>>>+
> >>>>>>>>+	ret = fn(vfio_group);
> >>>>>>>>+
> >>>>>>>>+	symbol_put(vfio_external_user_iommu_id);
> >>>>>>>>+
> >>>>>>>>+	return ret;
> >>>>>>>>+}
> >>>>>>>>+
> >>>>>>>>  static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
> >>>>>>>>  {
> >>>>>>>>  	long (*fn)(struct vfio_group *, unsigned long);
> >>>>>>>>@@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct
> >>>>kvm_device *dev)
> >>>>>>>>  	mutex_unlock(&kv->lock);
> >>>>>>>>  }
> >>>>>>>>
> >>>>>>>>+#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>>>>>>+static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
> >>>>>>>>+		struct vfio_group *vfio_group)
> >>>>>>>
> >>>>>>>
> >>>>>>>Shouldn't this go in the same patch that introduced the attach
> >>>>>>>function?
> >>>>>>
> >>>>>>Having less patches which touch different maintainers areas is better. I
> >>>>>>cannot avoid touching both PPC KVM and VFIO in this patch but I can in
> >>>>>>"[PATCH kernel 6/9] KVM: PPC: Associate IOMMU group with guest view of TCE
> >>>>>>table".
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>>>+{
> >>>>>>>>+	int group_id;
> >>>>>>>>+	struct iommu_group *grp;
> >>>>>>>>+
> >>>>>>>>+	group_id = kvm_vfio_external_user_iommu_id(vfio_group);
> >>>>>>>>+	grp = iommu_group_get_by_id(group_id);
> >>>>>>>>+	if (grp) {
> >>>>>>>>+		kvm_spapr_tce_detach_iommu_group(kvm, grp);
> >>>>>>>>+		iommu_group_put(grp);
> >>>>>>>>+	}
> >>>>>>>>+}
> >>>>>>>>+#endif
> >>>>>>>>+
> >>>>>>>>  static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >>>>>>>>  {
> >>>>>>>>  	struct kvm_vfio *kv = dev->private;
> >>>>>>>>@@ -186,6 +222,10 @@ static int kvm_vfio_set_group(struct kvm_device
> >>>>*dev, long attr, u64 arg)
> >>>>>>>>  				continue;
> >>>>>>>>
> >>>>>>>>  			list_del(&kvg->node);
> >>>>>>>>+#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>>>>>
> >>>>>>>Better to make a no-op version of the call than have to #ifdef at the
> >>>>>>>callsite.
> >>>>>>
> >>>>>>It is questionable. A x86 reader may decide that
> >>>>>>KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN is implemented for x86 and get
> >>>>>>confused.
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>>>+			kvm_vfio_spapr_detach_iommu_group(dev->kvm,
> >>>>>>>>+					kvg->vfio_group);
> >>>>>>>>+#endif
> >>>>>>>>  			kvm_vfio_group_put_external_user(kvg->vfio_group);
> >>>>>>>>  			kfree(kvg);
> >>>>>>>>  			ret = 0;
> >>>>>>>>@@ -201,6 +241,69 @@ static int kvm_vfio_set_group(struct kvm_device
> >>>>*dev, long attr, u64 arg)
> >>>>>>>>  		kvm_vfio_update_coherency(dev);
> >>>>>>>>
> >>>>>>>>  		return ret;
> >>>>>>>>+
> >>>>>>>>+#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>>>>>>+	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: {
> >>>>>>>>+		struct kvm_vfio_spapr_tce_liobn param;
> >>>>>>>>+		unsigned long minsz;
> >>>>>>>>+		struct kvm_vfio *kv = dev->private;
> >>>>>>>>+		struct vfio_group *vfio_group;
> >>>>>>>>+		struct kvm_vfio_group *kvg;
> >>>>>>>>+		struct fd f;
> >>>>>>>>+
> >>>>>>>>+		minsz = offsetofend(struct kvm_vfio_spapr_tce_liobn,
> >>>>>>>>+				start_addr);
> >>>>>>>>+
> >>>>>>>>+		if (copy_from_user(&param, (void __user *)arg, minsz))
> >>>>>>>>+			return -EFAULT;
> >>>>>>>>+
> >>>>>>>>+		if (param.argsz < minsz)
> >>>>>>>>+			return -EINVAL;
> >>>>>>>>+
> >>>>>>>>+		f = fdget(param.fd);
> >>>>>>>>+		if (!f.file)
> >>>>>>>>+			return -EBADF;
> >>>>>>>>+
> >>>>>>>>+		vfio_group = kvm_vfio_group_get_external_user(f.file);
> >>>>>>>>+		fdput(f);
> >>>>>>>>+
> >>>>>>>>+		if (IS_ERR(vfio_group))
> >>>>>>>>+			return PTR_ERR(vfio_group);
> >>>>>>>>+
> >>>>>>>>+		ret = -ENOENT;
> >>>>>>>
> >>>>>>>Shouldn't there be some runtime test for the type of the IOMMU?  It's
> >>>>>>>possible a kernel could be built for a platform supporting multiple
> >>>>>>>IOMMU types.
> >>>>>>
> >>>>>>Well, may make sense but I do not know to test that. The IOMMU type is a
> >>>>>>VFIO container property, not a group property and here (KVM) we only have
> >>>>>>groups.
> >>>>>
> >>>>>Which, as mentioned previously, is broken.
> >>>>
> >>>>Which I am failing to follow you on this.
> >>>>
> >>>>What I am trying to achieve here is pretty much referencing a group so it
> >>>>cannot be reused. Plus LIOBNs.
> >>>
> >>>"Plus LIOBNs" is not a trivial change.  You are establishing a linkage
> >>>from LIOBNs to groups.  But that doesn't make sense; if mapping in one
> >>>(guest) LIOBN affects a group it must affect all groups in the
> >>>container.  i.e. LIOBN->container is the natural mapping, *not* LIOBN
> >>>to group.
> >>
> >>I can see your point but i don't see how to proceed now, I'm totally stuck.
> >>Pass container fd and then implement new api to lock containers somehow and
> >
> >I'm not really understanding what the question is about locking containers.
> >
> >>enumerate groups when updating TCE table (including real mode)?
> >
> >Why do you need to enumerate groups?  The groups within the container
> >share a TCE table, so can't you just update that once?
> 
> Well, they share a TCE table but they do not share TCE Kill (TCE cache
> invalidate) register address, it is still per PE but this does not matter
> here (pnv_pci_link_table_and_group() does that), just mentioned to complete
> the picture.

True, you'll need to enumerate the groups for invalidates.  But you
need that already, right?
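
Roughly, I'd expect that walk to look much like the existing
pnv_pci_ioda2_tce_invalidate() does it - via the iommu_table_group_link
list that pnv_pci_link_table_and_group() maintains (invalidate_pe() here
is only a placeholder for the per-PE TCE Kill write, not an existing
symbol):

	static void tce_invalidate_all_pes(struct iommu_table *tbl)
	{
		struct iommu_table_group_link *tgl;

		/* one link per group sharing this hardware table */
		list_for_each_entry_rcu(tgl, &tbl->it_group_list, next)
			invalidate_pe(tgl->table_group);
	}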

> >>Plus new API when we remove a group from a container as the result of guest
> >>PCI hot unplug?
> >
> >I assume you mean a kernel internal API, since it shouldn't need
> >anything else visible to userspace.  Won't this happen naturally?
> >When the group is removed from the container, it will get its own TCE
> >table instead of the previously shared one.
> >
> >>>>Passing a container fd does not make much
> >>>>sense here as the VFIO device would walk through groups, reference them and
> >>>>that is it, there is no locking on VFIO containters and so far there was no
> >>>>need to teach KVM about containers.
> >>>>
> >>>>What do I miss now?
> >>>
> >>>Referencing the groups is essentially just a useful side effect.  The
> >>>important functionality is informing VFIO of the guest LIOBNs; and
> >>>LIOBNs map to containers, not groups.
> >>
> >>No. One liobn maps to one KVM-allocated TCE table, not a container. There
> >>can be one or many or none containers per liobn.
> >
> >Ah, true.
> 
> So I need to add new kernel API for KVM to get table(s) from VFIO
> container(s). And invent some locking mechanism to prevent table(s) (or
> associated container(s)) from going away, like
> vfio_group_get_external_user/vfio_group_put_external_user but for
> containers. Correct?

Well, the container is attached to an fd, so if you get a reference on
the file* that should do it.
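
Something like this minimal sketch, say (kvm_spapr_pin_container() is a
made-up name, but fdget()/get_file() are the standard helpers):

	static struct file *kvm_spapr_pin_container(int container_fd)
	{
		struct fd f = fdget(container_fd);
		struct file *filp;

		if (!f.file)
			return ERR_PTR(-EBADF);

		filp = f.file;
		get_file(filp);		/* long-term reference held by KVM */
		fdput(f);		/* drop the temporary fdget() ref */

		return filp;		/* fput(filp) on KVM teardown */
	}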

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH kernel 9/9] KVM: PPC: VFIO device: support SPAPR TCE
  2016-03-09  5:45     ` David Gibson
@ 2016-04-08  9:13       ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 92+ messages in thread
From: Alexey Kardashevskiy @ 2016-04-08  9:13 UTC (permalink / raw)
  To: David Gibson; +Cc: linuxppc-dev, Paul Mackerras, Alex Williamson, kvm-ppc, kvm

On 03/09/2016 04:45 PM, David Gibson wrote:

>> diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
>> index 7f7b6d8..71f577c 100644
>> --- a/arch/powerpc/kvm/Makefile
>> +++ b/arch/powerpc/kvm/Makefile
>> @@ -8,7 +8,7 @@ ccflags-y := -Ivirt/kvm -Iarch/powerpc/kvm
>>   KVM := ../../../virt/kvm
>>
>>   common-objs-y = $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
>> -		$(KVM)/eventfd.o $(KVM)/vfio.o
>> +		$(KVM)/eventfd.o
>
> Please don't disable the VFIO device for the non-book3s case.  I added
> it (even though it didn't do anything until now) so that libvirt
> wouldn't choke when it finds it's not available.  Obviously the new
> ioctl needs to be only for the right IOMMU setup, but the device
> itself should be available always.


After having a closer look, the statement above does not enable the VFIO
KVM device on book3s_64 but does for everything else:


common-objs-$(CONFIG_KVM_VFIO) += $(KVM)/vfio.o
[...]
kvm-e500-objs := \
         $(common-objs-y) \
[...]
kvm-objs-$(CONFIG_KVM_E500V2) := $(kvm-e500-objs)
[...]
kvm-e500mc-objs := \
         $(common-objs-y) \
[...]
kvm-objs-$(CONFIG_KVM_E500MC) := $(kvm-e500mc-objs)
[...]
kvm-book3s_32-objs := \
         $(common-objs-y) \
[...]
kvm-objs-$(CONFIG_KVM_BOOK3S_32) := $(kvm-book3s_32-objs)


This is because CONFIG_KVM_BOOK3S_64 does not use "common-objs-y":

kvm-objs-$(CONFIG_KVM_BOOK3S_64) := $(kvm-book3s_64-module-objs)


So I will keep vfio.o in the "common-objs-y" list and add:

+kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) += \
+	$(KVM)/vfio.o


-- 
Alexey

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH kernel 9/9] KVM: PPC: VFIO device: support SPAPR TCE
  2016-04-08  9:13       ` Alexey Kardashevskiy
@ 2016-04-11  3:36         ` David Gibson
  -1 siblings, 0 replies; 92+ messages in thread
From: David Gibson @ 2016-04-11  3:36 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Paul Mackerras, Alex Williamson, kvm-ppc, kvm

[-- Attachment #1: Type: text/plain, Size: 1954 bytes --]

On Fri, Apr 08, 2016 at 07:13:06PM +1000, Alexey Kardashevskiy wrote:
> On 03/09/2016 04:45 PM, David Gibson wrote:
> 
> >>diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
> >>index 7f7b6d8..71f577c 100644
> >>--- a/arch/powerpc/kvm/Makefile
> >>+++ b/arch/powerpc/kvm/Makefile
> >>@@ -8,7 +8,7 @@ ccflags-y := -Ivirt/kvm -Iarch/powerpc/kvm
> >>  KVM := ../../../virt/kvm
> >>
> >>  common-objs-y = $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
> >>-		$(KVM)/eventfd.o $(KVM)/vfio.o
> >>+		$(KVM)/eventfd.o
> >
> >Please don't disable the VFIO device for the non-book3s case.  I added
> >it (even though it didn't do anything until now) so that libvirt
> >wouldn't choke when it finds it's not available.  Obviously the new
> >ioctl needs to be only for the right IOMMU setup, but the device
> >itself should be available always.
> 
> 
> After having a closer look, the statement above does not enable VFIO KVM
> device on book3s but does for everything else:
> 
> 
> common-objs-$(CONFIG_KVM_VFIO) += $(KVM)/vfio.o
> [...]
> kvm-e500-objs := \
>         $(common-objs-y) \
> [...]
> kvm-objs-$(CONFIG_KVM_E500V2) := $(kvm-e500-objs)
> [...]
> kvm-e500mc-objs := \
>         $(common-objs-y) \
> [...]
> kvm-objs-$(CONFIG_KVM_E500MC) := $(kvm-e500mc-objs)
> [...]
> kvm-book3s_32-objs := \
>         $(common-objs-y) \
> [...]
> kvm-objs-$(CONFIG_KVM_BOOK3S_32) := $(kvm-book3s_32-objs)
> 
> 
> This is becaise CONFIG_KVM_BOOK3S_64 does not use "common-objs-y":

Oh, good grief.

> kvm-objs-$(CONFIG_KVM_BOOK3S_64) := $(kvm-book3s_64-module-objs)
> 
> 
> So I will keep vfio.o in the "common-objs-y" list and add:
> 
> +kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) += \
> +	$(KVM)/vfio.o

Ok.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH kernel 9/9] KVM: PPC: VFIO device: support SPAPR TCE
  2016-03-23  3:03                     ` David Gibson
@ 2016-06-09  6:47                       ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 92+ messages in thread
From: Alexey Kardashevskiy @ 2016-06-09  6:47 UTC (permalink / raw)
  To: David Gibson; +Cc: linuxppc-dev, Paul Mackerras, Alex Williamson, kvm-ppc, kvm

On 23/03/16 14:03, David Gibson wrote:
> On Tue, Mar 22, 2016 at 11:34:55AM +1100, Alexey Kardashevskiy wrote:
>> Uff, lost cc: list. Added back. Some comments below.
>>
>>
>> On 03/21/2016 04:19 PM, David Gibson wrote:
>>> On Fri, Mar 18, 2016 at 11:12:26PM +1100, Alexey Kardashevskiy wrote:
>>>> On March 15, 2016 17:29:26 David Gibson <david@gibson.dropbear.id.au> wrote:
>>>>
>>>>> On Fri, Mar 11, 2016 at 10:09:50AM +1100, Alexey Kardashevskiy wrote:
>>>>>> On 03/10/2016 04:21 PM, David Gibson wrote:
>>>>>>> On Wed, Mar 09, 2016 at 08:20:12PM +1100, Alexey Kardashevskiy wrote:
>>>>>>>> On 03/09/2016 04:45 PM, David Gibson wrote:
>>>>>>>>> On Mon, Mar 07, 2016 at 02:41:17PM +1100, Alexey Kardashevskiy wrote:
>>>>>>>>>> sPAPR TCE IOMMU is para-virtualized and the guest does map/unmap
>>>>>>>>>> via hypercalls which take a logical bus id (LIOBN) as a target IOMMU
>>>>>>>>>> identifier. LIOBNs are made up, advertised to guest systems and
>>>>>>>>>> linked to IOMMU groups by the user space.
>>>>>>>>>> In order to enable acceleration for IOMMU operations in KVM, we need
>>>>>>>>>> to tell KVM the information about the LIOBN-to-group mapping.
>>>>>>>>>>
>>>>>>>>>> For that, a new KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN parameter
>>>>>>>>>> is added which accepts:
>>>>>>>>>> - a VFIO group fd and IO base address to find the actual hardware
>>>>>>>>>> TCE table;
>>>>>>>>>> - a LIOBN to assign to the found table.
>>>>>>>>>>
>>>>>>>>>> Before notifying KVM about new link, this check the group for being
>>>>>>>>>> registered with KVM device in order to release them at unexpected KVM
>>>>>>>>>> finish.
>>>>>>>>>>
>>>>>>>>>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
>>>>>>>>>> space.
>>>>>>>>>>
>>>>>>>>>> While we are here, this also fixes VFIO KVM device compiling to let it
>>>>>>>>>> link to a KVM module.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>>>> ---
>>>>>>>>>>  Documentation/virtual/kvm/devices/vfio.txt |  21 +++++-
>>>>>>>>>>  arch/powerpc/kvm/Kconfig                   |   1 +
>>>>>>>>>>  arch/powerpc/kvm/Makefile                  |   5 +-
>>>>>>>>>>  arch/powerpc/kvm/powerpc.c                 |   1 +
>>>>>>>>>>  include/uapi/linux/kvm.h                   |   9 +++
>>>>>>>>>>  virt/kvm/vfio.c                            | 106
>>>>>> +++++++++++++++++++++++++++++
>>>>>>>>>>  6 files changed, 140 insertions(+), 3 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt
>>>>>> b/Documentation/virtual/kvm/devices/vfio.txt
>>>>>>>>>> index ef51740..c0d3eb7 100644
>>>>>>>>>> --- a/Documentation/virtual/kvm/devices/vfio.txt
>>>>>>>>>> +++ b/Documentation/virtual/kvm/devices/vfio.txt
>>>>>>>>>> @@ -16,7 +16,24 @@ Groups:
>>>>>>>>>>
>>>>>>>>>>  KVM_DEV_VFIO_GROUP attributes:
>>>>>>>>>>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
>>>>>>>>>> +	kvm_device_attr.addr points to an int32_t file descriptor
>>>>>>>>>> +	for the VFIO group.
>>>>>>>>>
>>>>>>>>> AFAICT these changes are accurate for VFIO as it is already, in which
>>>>>>>>> case it might be clearer to put them in a separate patch.
>>>>>>>>>
>>>>>>>>>>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device
>>>>>> tracking
>>>>>>>>>> +	kvm_device_attr.addr points to an int32_t file descriptor
>>>>>>>>>> +	for the VFIO group.
>>>>>>>>>>
>>>>>>>>>> -For each, kvm_device_attr.addr points to an int32_t file descriptor
>>>>>>>>>> -for the VFIO group.
>>>>>>>>>> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: sets a liobn for a VFIO group
>>>>>>>>>> +	kvm_device_attr.addr points to a struct:
>>>>>>>>>> +		struct kvm_vfio_spapr_tce_liobn {
>>>>>>>>>> +			__u32	argsz;
>>>>>>>>>> +			__s32	fd;
>>>>>>>>>> +			__u32	liobn;
>>>>>>>>>> +			__u8	pad[4];
>>>>>>>>>> +			__u64	start_addr;
>>>>>>>>>> +		};
>>>>>>>>>> +		where
>>>>>>>>>> +		@argsz is the size of kvm_vfio_spapr_tce_liobn;
>>>>>>>>>> +		@fd is a file descriptor for a VFIO group;
>>>>>>>>>> +		@liobn is a logical bus id to be associated with the group;
>>>>>>>>>> +		@start_addr is a DMA window offset on the IO (PCI) bus
>>>>>>>>>
>>>>>>>>> For the cause of DDW and multiple windows, I'm assuming you can call
>>>>>>>>> this multiple times with different LIOBNs and the same IOMMU group?
>>>>>>>>
>>>>>>>>
>>>>>>>> Yes. It is called twice per each group (when DDW is activated) - for 32bit
>>>>>>>> and 64bit windows, this is why @start_addr is there.
>>>>>>>>
>>>>>>>>
>>>>>>>>>> diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
>>>>>>>>>> index 1059846..dfa3488 100644
>>>>>>>>>> --- a/arch/powerpc/kvm/Kconfig
>>>>>>>>>> +++ b/arch/powerpc/kvm/Kconfig
>>>>>>>>>> @@ -65,6 +65,7 @@ config KVM_BOOK3S_64
>>>>>>>>>>  	select KVM
>>>>>>>>>>  	select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
>>>>>>>>>>  	select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
>>>>>>>>>> +	select KVM_VFIO if VFIO
>>>>>>>>>>  	---help---
>>>>>>>>>>  	  Support running unmodified book3s_64 and book3s_32 guest kernels
>>>>>>>>>>  	  in virtual machines on book3s_64 host processors.
>>>>>>>>>> diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
>>>>>>>>>> index 7f7b6d8..71f577c 100644
>>>>>>>>>> --- a/arch/powerpc/kvm/Makefile
>>>>>>>>>> +++ b/arch/powerpc/kvm/Makefile
>>>>>>>>>> @@ -8,7 +8,7 @@ ccflags-y := -Ivirt/kvm -Iarch/powerpc/kvm
>>>>>>>>>>  KVM := ../../../virt/kvm
>>>>>>>>>>
>>>>>>>>>>  common-objs-y = $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
>>>>>>>>>> -		$(KVM)/eventfd.o $(KVM)/vfio.o
>>>>>>>>>> +		$(KVM)/eventfd.o
>>>>>>>>>
>>>>>>>>> Please don't disable the VFIO device for the non-book3s case.  I added
>>>>>>>>> it (even though it didn't do anything until now) so that libvirt
>>>>>>>>> wouldn't choke when it finds it's not available.  Obviously the new
>>>>>>>>> ioctl needs to be only for the right IOMMU setup, but the device
>>>>>>>>> itself should be available always.
>>>>>>>>
>>>>>>>> Ah. Ok, I'll fix this. I just wanted to be able to compile kvm as a module.
>>>>>>>>
>>>>>>>>
>>>>>>>>>>  CFLAGS_e500_mmu.o := -I.
>>>>>>>>>>  CFLAGS_e500_mmu_host.o := -I.
>>>>>>>>>> @@ -87,6 +87,9 @@ endif
>>>>>>>>>>  kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \
>>>>>>>>>>  	book3s_xics.o
>>>>>>>>>>
>>>>>>>>>> +kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) += \
>>>>>>>>>> +	$(KVM)/vfio.o \
>>>>>>>>>> +
>>>>>>>>>>  kvm-book3s_64-module-objs += \
>>>>>>>>>>  	$(KVM)/kvm_main.o \
>>>>>>>>>>  	$(KVM)/eventfd.o \
>>>>>>>>>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>>>>>>>>>> index 19aa59b..63f188d 100644
>>>>>>>>>> --- a/arch/powerpc/kvm/powerpc.c
>>>>>>>>>> +++ b/arch/powerpc/kvm/powerpc.c
>>>>>>>>>> @@ -521,6 +521,7 @@ int kvm_vm_ioctl_check_extension(struct kvm
>>>>>> *kvm, long ext)
>>>>>>>>>>  #ifdef CONFIG_PPC_BOOK3S_64
>>>>>>>>>>  	case KVM_CAP_SPAPR_TCE:
>>>>>>>>>>  	case KVM_CAP_SPAPR_TCE_64:
>>>>>>>>>> +	case KVM_CAP_SPAPR_TCE_VFIO:
>>>>>>>>>>  	case KVM_CAP_PPC_ALLOC_HTAB:
>>>>>>>>>>  	case KVM_CAP_PPC_RTAS:
>>>>>>>>>>  	case KVM_CAP_PPC_FIXUP_HCALL:
>>>>>>>>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>>>>>>>>>> index 080ffbf..f1abbea 100644
>>>>>>>>>> --- a/include/uapi/linux/kvm.h
>>>>>>>>>> +++ b/include/uapi/linux/kvm.h
>>>>>>>>>> @@ -1056,6 +1056,7 @@ struct kvm_device_attr {
>>>>>>>>>>  #define  KVM_DEV_VFIO_GROUP			1
>>>>>>>>>>  #define   KVM_DEV_VFIO_GROUP_ADD			1
>>>>>>>>>>  #define   KVM_DEV_VFIO_GROUP_DEL			2
>>>>>>>>>> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN	3
>>>>>>>>>>
>>>>>>>>>>  enum kvm_device_type {
>>>>>>>>>>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
>>>>>>>>>> @@ -1075,6 +1076,14 @@ enum kvm_device_type {
>>>>>>>>>>  	KVM_DEV_TYPE_MAX,
>>>>>>>>>>  };
>>>>>>>>>>
>>>>>>>>>> +struct kvm_vfio_spapr_tce_liobn {
>>>>>>>>>> +	__u32	argsz;
>>>>>>>>>> +	__s32	fd;
>>>>>>>>>> +	__u32	liobn;
>>>>>>>>>> +	__u8	pad[4];
>>>>>>>>>> +	__u64	start_addr;
>>>>>>>>>> +};
>>>>>>>>>> +
>>>>>>>>>>  /*
>>>>>>>>>>   * ioctls for VM fds
>>>>>>>>>>   */
>>>>>>>>>> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
>>>>>>>>>> index 1dd087d..87c771e 100644
>>>>>>>>>> --- a/virt/kvm/vfio.c
>>>>>>>>>> +++ b/virt/kvm/vfio.c
>>>>>>>>>> @@ -20,6 +20,10 @@
>>>>>>>>>>  #include <linux/vfio.h>
>>>>>>>>>>  #include "vfio.h"
>>>>>>>>>>
>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>>>>>>>> +#include <asm/kvm_ppc.h>
>>>>>>>>>> +#endif
>>>>>>>>>> +
>>>>>>>>>>  struct kvm_vfio_group {
>>>>>>>>>>  	struct list_head node;
>>>>>>>>>>  	struct vfio_group *vfio_group;
>>>>>>>>>> @@ -60,6 +64,22 @@ static void
>>>>>> kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
>>>>>>>>>>  	symbol_put(vfio_group_put_external_user);
>>>>>>>>>>  }
>>>>>>>>>>
>>>>>>>>>> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
>>>>>>>>>> +{
>>>>>>>>>> +	int (*fn)(struct vfio_group *);
>>>>>>>>>> +	int ret = -1;
>>>>>>>>>
>>>>>>>>> Should this be -ESOMETHING?
>>>>>>>>>
>>>>>>>>>> +	fn = symbol_get(vfio_external_user_iommu_id);
>>>>>>>>>> +	if (!fn)
>>>>>>>>>> +		return ret;
>>>>>>>>>> +
>>>>>>>>>> +	ret = fn(vfio_group);
>>>>>>>>>> +
>>>>>>>>>> +	symbol_put(vfio_external_user_iommu_id);
>>>>>>>>>> +
>>>>>>>>>> +	return ret;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>>  static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
>>>>>>>>>>  {
>>>>>>>>>>  	long (*fn)(struct vfio_group *, unsigned long);
>>>>>>>>>> @@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct
>>>>>> kvm_device *dev)
>>>>>>>>>>  	mutex_unlock(&kv->lock);
>>>>>>>>>>  }
>>>>>>>>>>
>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>>>>>>>> +static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
>>>>>>>>>> +		struct vfio_group *vfio_group)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Shouldn't this go in the same patch that introduced the attach
>>>>>>>>> function?
>>>>>>>>
>>>>>>>> Having less patches which touch different maintainers areas is better. I
>>>>>>>> cannot avoid touching both PPC KVM and VFIO in this patch but I can in
>>>>>>>> "[PATCH kernel 6/9] KVM: PPC: Associate IOMMU group with guest view of TCE
>>>>>>>> table".
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> +{
>>>>>>>>>> +	int group_id;
>>>>>>>>>> +	struct iommu_group *grp;
>>>>>>>>>> +
>>>>>>>>>> +	group_id = kvm_vfio_external_user_iommu_id(vfio_group);
>>>>>>>>>> +	grp = iommu_group_get_by_id(group_id);
>>>>>>>>>> +	if (grp) {
>>>>>>>>>> +		kvm_spapr_tce_detach_iommu_group(kvm, grp);
>>>>>>>>>> +		iommu_group_put(grp);
>>>>>>>>>> +	}
>>>>>>>>>> +}
>>>>>>>>>> +#endif
>>>>>>>>>> +
>>>>>>>>>>  static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>>>>>>>>>>  {
>>>>>>>>>>  	struct kvm_vfio *kv = dev->private;
>>>>>>>>>> @@ -186,6 +222,10 @@ static int kvm_vfio_set_group(struct kvm_device
>>>>>> *dev, long attr, u64 arg)
>>>>>>>>>>  				continue;
>>>>>>>>>>
>>>>>>>>>>  			list_del(&kvg->node);
>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>>>>>>>
>>>>>>>>> Better to make a no-op version of the call than have to #ifdef at the
>>>>>>>>> callsite.
>>>>>>>>
>>>>>>>> It is questionable. A x86 reader may decide that
>>>>>>>> KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN is implemented for x86 and get
>>>>>>>> confused.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> +			kvm_vfio_spapr_detach_iommu_group(dev->kvm,
>>>>>>>>>> +					kvg->vfio_group);
>>>>>>>>>> +#endif
>>>>>>>>>>  			kvm_vfio_group_put_external_user(kvg->vfio_group);
>>>>>>>>>>  			kfree(kvg);
>>>>>>>>>>  			ret = 0;
>>>>>>>>>> @@ -201,6 +241,69 @@ static int kvm_vfio_set_group(struct kvm_device
>>>>>> *dev, long attr, u64 arg)
>>>>>>>>>>  		kvm_vfio_update_coherency(dev);
>>>>>>>>>>
>>>>>>>>>>  		return ret;
>>>>>>>>>> +
>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>>>>>>>> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: {
>>>>>>>>>> +		struct kvm_vfio_spapr_tce_liobn param;
>>>>>>>>>> +		unsigned long minsz;
>>>>>>>>>> +		struct kvm_vfio *kv = dev->private;
>>>>>>>>>> +		struct vfio_group *vfio_group;
>>>>>>>>>> +		struct kvm_vfio_group *kvg;
>>>>>>>>>> +		struct fd f;
>>>>>>>>>> +
>>>>>>>>>> +		minsz = offsetofend(struct kvm_vfio_spapr_tce_liobn,
>>>>>>>>>> +				start_addr);
>>>>>>>>>> +
>>>>>>>>>> +		if (copy_from_user(&param, (void __user *)arg, minsz))
>>>>>>>>>> +			return -EFAULT;
>>>>>>>>>> +
>>>>>>>>>> +		if (param.argsz < minsz)
>>>>>>>>>> +			return -EINVAL;
>>>>>>>>>> +
>>>>>>>>>> +		f = fdget(param.fd);
>>>>>>>>>> +		if (!f.file)
>>>>>>>>>> +			return -EBADF;
>>>>>>>>>> +
>>>>>>>>>> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
>>>>>>>>>> +		fdput(f);
>>>>>>>>>> +
>>>>>>>>>> +		if (IS_ERR(vfio_group))
>>>>>>>>>> +			return PTR_ERR(vfio_group);
>>>>>>>>>> +
>>>>>>>>>> +		ret = -ENOENT;
>>>>>>>>>
>>>>>>>>> Shouldn't there be some runtime test for the type of the IOMMU?  It's
>>>>>>>>> possible a kernel could be built for a platform supporting multiple
>>>>>>>>> IOMMU types.
>>>>>>>>
>>>>>>>> Well, may make sense but I do not know to test that. The IOMMU type is a
>>>>>>>> VFIO container property, not a group property and here (KVM) we only have
>>>>>>>> groups.
>>>>>>>
>>>>>>> Which, as mentioned previously, is broken.
>>>>>>
>>>>>> Which I am failing to follow you on this.
>>>>>>
>>>>>> What I am trying to achieve here is pretty much referencing a group so it
>>>>>> cannot be reused. Plus LIOBNs.
>>>>>
>>>>> "Plus LIOBNs" is not a trivial change.  You are establishing a linkage
>>>>> from LIOBNs to groups.  But that doesn't make sense; if mapping in one
>>>>> (guest) LIOBN affects a group it must affect all groups in the
>>>>> container.  i.e. LIOBN->container is the natural mapping, *not* LIOBN
>>>>> to group.
>>>>
>>>> I can see your point but i don't see how to proceed now, I'm totally stuck.
>>>> Pass container fd and then implement new api to lock containers somehow and
>>>
>>> I'm not really understanding what the question is about locking containers.
>>>
>>>> enumerate groups when updating TCE table (including real mode)?
>>>
>>> Why do you need to enumerate groups?  The groups within the container
>>> share a TCE table, so can't you just update that once?
>>
>> Well, they share a TCE table but they do not share TCE Kill (TCE cache
>> invalidate) register address, it is still per PE but this does not matter
>> here (pnv_pci_link_table_and_group() does that), just mentioned to complete
>> the picture.
> 
> True, you'll need to enumerate the groups for invalidates.  But you
> need that already, right.
> 
>>>> Plus new API when we remove a group from a container as the result of guest
>>>> PCI hot unplug?
>>>
>>> I assume you mean a kernel internal API, since it shouldn't need
>>> anything else visible to userspace.  Won't this happen naturally?
>>> When the group is removed from the container, it will get its own TCE
>>> table instead of the previously shared one.
>>>
>>>>>> Passing a container fd does not make much
>>>>>> sense here as the VFIO device would walk through groups, reference them and
>>>>>> that is it, there is no locking on VFIO containters and so far there was no
>>>>>> need to teach KVM about containers.
>>>>>>
>>>>>> What do I miss now?
>>>>>
>>>>> Referencing the groups is essentially just a useful side effect.  The
>>>>> important functionality is informing VFIO of the guest LIOBNs; and
>>>>> LIOBNs map to containers, not groups.
>>>>
>>>> No. One liobn maps to one KVM-allocated TCE table, not a container. There
>>>> can be one or many or none containers per liobn.
>>>
>>> Ah, true.
>>
>> So I need to add new kernel API for KVM to get table(s) from VFIO
>> container(s). And invent some locking mechanism to prevent table(s) (or
>> associated container(s)) from going away, like
>> vfio_group_get_external_user/vfio_group_put_external_user but for
>> containers. Correct?
> 
> Well, the container is attached to an fd, so if you get a reference on
> the file* that should do it.

I am still trying to think of how to implement this suggestion.

I need a way to tell KVM about IOMMU groups. The vfio-pci driver is not the
right interface as it knows nothing about KVM. There is the VFIO-KVM device,
but it has no idea about containers.

So I have to:

Whenever a container is created or removed, notify the VFIO-KVM device by
passing it the container fd. OK.

Then the VFIO-KVM device needs to tell KVM which iommu_table belongs to
which LIOBN so the real mode handlers can do the job. The real mode TCE
handlers take a LIOBN, find the guest view table and update it. Now I also
want to update the hardware table, which is the iommu_table attached to
that LIOBN.
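
To make that concrete, the real mode path I have in mind looks roughly
like this (kvmppc_find_table() is the existing guest-view lookup in
book3s_64_vio_hv.c; the "mirror" step is the missing piece):

	long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
			unsigned long ioba, unsigned long tce)
	{
		struct kvmppc_spapr_tce_table *stt;

		stt = kvmppc_find_table(vcpu, liobn);	/* guest view */
		if (!stt)
			return H_TOO_HARD;

		/* existing code: validate ioba/tce, update guest view */

		/* missing piece: mirror the update into every hardware
		 * iommu_table attached to this LIOBN and write the
		 * per-PE TCE Kill register */

		return H_SUCCESS;
	}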

I did pass an IOMMU group fd to the VFIO-KVM device; you suggested a
container fd instead.

Now the VFIO-KVM device needs to extract the iommu_table pointers from the
container. These pointers are stored in "struct tce_container" which is
local to drivers/vfio/vfio_iommu_spapr_tce.c and not exported in any way,
so I cannot use it.

The other way to go would be adding an API to VFIO to enumerate the IOMMU
groups in a container and use the iommu_table pointers stored in the
iommu_table_group of each group (in fact the very first group is enough as
multiple groups in a container share the table). Adding
vfio_container_get_groups() when only the first one is needed is quite
tricky in terms of maintainer approval.
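
In other words, the consumer would be as trivial as this
(vfio_container_get_groups() is only the shape I am proposing, it does
not exist in the tree; iommu_group_get_iommudata() does exist):

	static struct iommu_table_group *container_table_group(
			struct file *filp)
	{
		struct iommu_group *grp;

		/* the first group is enough - all groups in a container
		 * share the same hardware tables */
		if (vfio_container_get_groups(filp, &grp, 1) < 1)
			return NULL;

		return iommu_group_get_iommudata(grp);
	}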

So what would be the right course of action? Thanks.


-- 
Alexey

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH kernel 9/9] KVM: PPC: VFIO device: support SPAPR TCE
  2016-06-09  6:47                       ` Alexey Kardashevskiy
@ 2016-06-10  6:50                         ` David Gibson
  -1 siblings, 0 replies; 92+ messages in thread
From: David Gibson @ 2016-06-10  6:50 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Paul Mackerras, Alex Williamson, kvm-ppc, kvm

[-- Attachment #1: Type: text/plain, Size: 20685 bytes --]

On Thu, Jun 09, 2016 at 04:47:59PM +1000, Alexey Kardashevskiy wrote:
> On 23/03/16 14:03, David Gibson wrote:
> > On Tue, Mar 22, 2016 at 11:34:55AM +1100, Alexey Kardashevskiy wrote:
> >> Uff, lost cc: list. Added back. Some comments below.
> >>
> >>
> >> On 03/21/2016 04:19 PM, David Gibson wrote:
> >>> On Fri, Mar 18, 2016 at 11:12:26PM +1100, Alexey Kardashevskiy wrote:
> >>>> On March 15, 2016 17:29:26 David Gibson <david@gibson.dropbear.id.au> wrote:
> >>>>
> >>>>> On Fri, Mar 11, 2016 at 10:09:50AM +1100, Alexey Kardashevskiy wrote:
> >>>>>> On 03/10/2016 04:21 PM, David Gibson wrote:
> >>>>>>> On Wed, Mar 09, 2016 at 08:20:12PM +1100, Alexey Kardashevskiy wrote:
> >>>>>>>> On 03/09/2016 04:45 PM, David Gibson wrote:
> >>>>>>>>> On Mon, Mar 07, 2016 at 02:41:17PM +1100, Alexey Kardashevskiy wrote:
> >>>>>>>>>> sPAPR TCE IOMMU is para-virtualized and the guest does map/unmap
> >>>>>>>>>> via hypercalls which take a logical bus id (LIOBN) as a target IOMMU
> >>>>>>>>>> identifier. LIOBNs are made up, advertised to guest systems and
> >>>>>>>>>> linked to IOMMU groups by the user space.
> >>>>>>>>>> In order to enable acceleration for IOMMU operations in KVM, we need
> >>>>>>>>>> to tell KVM about the LIOBN-to-group mapping.
> >>>>>>>>>>
> >>>>>>>>>> For that, a new KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN parameter
> >>>>>>>>>> is added which accepts:
> >>>>>>>>>> - a VFIO group fd and IO base address to find the actual hardware
> >>>>>>>>>> TCE table;
> >>>>>>>>>> - a LIOBN to assign to the found table.
> >>>>>>>>>>
> >>>>>>>>>> Before notifying KVM about the new link, this checks that the group is
> >>>>>>>>>> registered with the KVM device in order to release it at unexpected KVM
> >>>>>>>>>> finish.
> >>>>>>>>>>
> >>>>>>>>>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> >>>>>>>>>> space.
> >>>>>>>>>>
> >>>>>>>>>> While we are here, this also fixes VFIO KVM device compiling to let it
> >>>>>>>>>> link to a KVM module.
> >>>>>>>>>>
> >>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>>>>>>>> ---
> >>>>>>>>>>  Documentation/virtual/kvm/devices/vfio.txt |  21 +++++-
> >>>>>>>>>>  arch/powerpc/kvm/Kconfig                   |   1 +
> >>>>>>>>>>  arch/powerpc/kvm/Makefile                  |   5 +-
> >>>>>>>>>>  arch/powerpc/kvm/powerpc.c                 |   1 +
> >>>>>>>>>>  include/uapi/linux/kvm.h                   |   9 +++
> >>>>>>>>>>  virt/kvm/vfio.c                            | 106
> >>>>>> +++++++++++++++++++++++++++++
> >>>>>>>>>>  6 files changed, 140 insertions(+), 3 deletions(-)
> >>>>>>>>>>
> >>>>>>>>>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt
> >>>>>> b/Documentation/virtual/kvm/devices/vfio.txt
> >>>>>>>>>> index ef51740..c0d3eb7 100644
> >>>>>>>>>> --- a/Documentation/virtual/kvm/devices/vfio.txt
> >>>>>>>>>> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> >>>>>>>>>> @@ -16,7 +16,24 @@ Groups:
> >>>>>>>>>>
> >>>>>>>>>>  KVM_DEV_VFIO_GROUP attributes:
> >>>>>>>>>>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> >>>>>>>>>> +	kvm_device_attr.addr points to an int32_t file descriptor
> >>>>>>>>>> +	for the VFIO group.
> >>>>>>>>>
> >>>>>>>>> AFAICT these changes are accurate for VFIO as it is already, in which
> >>>>>>>>> case it might be clearer to put them in a separate patch.
> >>>>>>>>>
> >>>>>>>>>>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device
> >>>>>> tracking
> >>>>>>>>>> +	kvm_device_attr.addr points to an int32_t file descriptor
> >>>>>>>>>> +	for the VFIO group.
> >>>>>>>>>>
> >>>>>>>>>> -For each, kvm_device_attr.addr points to an int32_t file descriptor
> >>>>>>>>>> -for the VFIO group.
> >>>>>>>>>> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: sets a liobn for a VFIO group
> >>>>>>>>>> +	kvm_device_attr.addr points to a struct:
> >>>>>>>>>> +		struct kvm_vfio_spapr_tce_liobn {
> >>>>>>>>>> +			__u32	argsz;
> >>>>>>>>>> +			__s32	fd;
> >>>>>>>>>> +			__u32	liobn;
> >>>>>>>>>> +			__u8	pad[4];
> >>>>>>>>>> +			__u64	start_addr;
> >>>>>>>>>> +		};
> >>>>>>>>>> +		where
> >>>>>>>>>> +		@argsz is the size of kvm_vfio_spapr_tce_liobn;
> >>>>>>>>>> +		@fd is a file descriptor for a VFIO group;
> >>>>>>>>>> +		@liobn is a logical bus id to be associated with the group;
> >>>>>>>>>> +		@start_addr is a DMA window offset on the IO (PCI) bus
> >>>>>>>>>
> >>>>>>>>> For the case of DDW and multiple windows, I'm assuming you can call
> >>>>>>>>> this multiple times with different LIOBNs and the same IOMMU group?
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Yes. It is called twice per each group (when DDW is activated) - for 32bit
> >>>>>>>> and 64bit windows, this is why @start_addr is there.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>>> diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
> >>>>>>>>>> index 1059846..dfa3488 100644
> >>>>>>>>>> --- a/arch/powerpc/kvm/Kconfig
> >>>>>>>>>> +++ b/arch/powerpc/kvm/Kconfig
> >>>>>>>>>> @@ -65,6 +65,7 @@ config KVM_BOOK3S_64
> >>>>>>>>>>  	select KVM
> >>>>>>>>>>  	select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
> >>>>>>>>>>  	select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
> >>>>>>>>>> +	select KVM_VFIO if VFIO
> >>>>>>>>>>  	---help---
> >>>>>>>>>>  	  Support running unmodified book3s_64 and book3s_32 guest kernels
> >>>>>>>>>>  	  in virtual machines on book3s_64 host processors.
> >>>>>>>>>> diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
> >>>>>>>>>> index 7f7b6d8..71f577c 100644
> >>>>>>>>>> --- a/arch/powerpc/kvm/Makefile
> >>>>>>>>>> +++ b/arch/powerpc/kvm/Makefile
> >>>>>>>>>> @@ -8,7 +8,7 @@ ccflags-y := -Ivirt/kvm -Iarch/powerpc/kvm
> >>>>>>>>>>  KVM := ../../../virt/kvm
> >>>>>>>>>>
> >>>>>>>>>>  common-objs-y = $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
> >>>>>>>>>> -		$(KVM)/eventfd.o $(KVM)/vfio.o
> >>>>>>>>>> +		$(KVM)/eventfd.o
> >>>>>>>>>
> >>>>>>>>> Please don't disable the VFIO device for the non-book3s case.  I added
> >>>>>>>>> it (even though it didn't do anything until now) so that libvirt
> >>>>>>>>> wouldn't choke when it finds it's not available.  Obviously the new
> >>>>>>>>> ioctl needs to be only for the right IOMMU setup, but the device
> >>>>>>>>> itself should be available always.
> >>>>>>>>
> >>>>>>>> Ah. Ok, I'll fix this. I just wanted to be able to compile kvm as a module.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>>>  CFLAGS_e500_mmu.o := -I.
> >>>>>>>>>>  CFLAGS_e500_mmu_host.o := -I.
> >>>>>>>>>> @@ -87,6 +87,9 @@ endif
> >>>>>>>>>>  kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \
> >>>>>>>>>>  	book3s_xics.o
> >>>>>>>>>>
> >>>>>>>>>> +kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) += \
> >>>>>>>>>> +	$(KVM)/vfio.o \
> >>>>>>>>>> +
> >>>>>>>>>>  kvm-book3s_64-module-objs += \
> >>>>>>>>>>  	$(KVM)/kvm_main.o \
> >>>>>>>>>>  	$(KVM)/eventfd.o \
> >>>>>>>>>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> >>>>>>>>>> index 19aa59b..63f188d 100644
> >>>>>>>>>> --- a/arch/powerpc/kvm/powerpc.c
> >>>>>>>>>> +++ b/arch/powerpc/kvm/powerpc.c
> >>>>>>>>>> @@ -521,6 +521,7 @@ int kvm_vm_ioctl_check_extension(struct kvm
> >>>>>> *kvm, long ext)
> >>>>>>>>>>  #ifdef CONFIG_PPC_BOOK3S_64
> >>>>>>>>>>  	case KVM_CAP_SPAPR_TCE:
> >>>>>>>>>>  	case KVM_CAP_SPAPR_TCE_64:
> >>>>>>>>>> +	case KVM_CAP_SPAPR_TCE_VFIO:
> >>>>>>>>>>  	case KVM_CAP_PPC_ALLOC_HTAB:
> >>>>>>>>>>  	case KVM_CAP_PPC_RTAS:
> >>>>>>>>>>  	case KVM_CAP_PPC_FIXUP_HCALL:
> >>>>>>>>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >>>>>>>>>> index 080ffbf..f1abbea 100644
> >>>>>>>>>> --- a/include/uapi/linux/kvm.h
> >>>>>>>>>> +++ b/include/uapi/linux/kvm.h
> >>>>>>>>>> @@ -1056,6 +1056,7 @@ struct kvm_device_attr {
> >>>>>>>>>>  #define  KVM_DEV_VFIO_GROUP			1
> >>>>>>>>>>  #define   KVM_DEV_VFIO_GROUP_ADD			1
> >>>>>>>>>>  #define   KVM_DEV_VFIO_GROUP_DEL			2
> >>>>>>>>>> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN	3
> >>>>>>>>>>
> >>>>>>>>>>  enum kvm_device_type {
> >>>>>>>>>>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
> >>>>>>>>>> @@ -1075,6 +1076,14 @@ enum kvm_device_type {
> >>>>>>>>>>  	KVM_DEV_TYPE_MAX,
> >>>>>>>>>>  };
> >>>>>>>>>>
> >>>>>>>>>> +struct kvm_vfio_spapr_tce_liobn {
> >>>>>>>>>> +	__u32	argsz;
> >>>>>>>>>> +	__s32	fd;
> >>>>>>>>>> +	__u32	liobn;
> >>>>>>>>>> +	__u8	pad[4];
> >>>>>>>>>> +	__u64	start_addr;
> >>>>>>>>>> +};
> >>>>>>>>>> +
> >>>>>>>>>>  /*
> >>>>>>>>>>   * ioctls for VM fds
> >>>>>>>>>>   */
> >>>>>>>>>> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> >>>>>>>>>> index 1dd087d..87c771e 100644
> >>>>>>>>>> --- a/virt/kvm/vfio.c
> >>>>>>>>>> +++ b/virt/kvm/vfio.c
> >>>>>>>>>> @@ -20,6 +20,10 @@
> >>>>>>>>>>  #include <linux/vfio.h>
> >>>>>>>>>>  #include "vfio.h"
> >>>>>>>>>>
> >>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>>>>>>>> +#include <asm/kvm_ppc.h>
> >>>>>>>>>> +#endif
> >>>>>>>>>> +
> >>>>>>>>>>  struct kvm_vfio_group {
> >>>>>>>>>>  	struct list_head node;
> >>>>>>>>>>  	struct vfio_group *vfio_group;
> >>>>>>>>>> @@ -60,6 +64,22 @@ static void
> >>>>>> kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
> >>>>>>>>>>  	symbol_put(vfio_group_put_external_user);
> >>>>>>>>>>  }
> >>>>>>>>>>
> >>>>>>>>>> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> >>>>>>>>>> +{
> >>>>>>>>>> +	int (*fn)(struct vfio_group *);
> >>>>>>>>>> +	int ret = -1;
> >>>>>>>>>
> >>>>>>>>> Should this be -ESOMETHING?
> >>>>>>>>>
> >>>>>>>>>> +	fn = symbol_get(vfio_external_user_iommu_id);
> >>>>>>>>>> +	if (!fn)
> >>>>>>>>>> +		return ret;
> >>>>>>>>>> +
> >>>>>>>>>> +	ret = fn(vfio_group);
> >>>>>>>>>> +
> >>>>>>>>>> +	symbol_put(vfio_external_user_iommu_id);
> >>>>>>>>>> +
> >>>>>>>>>> +	return ret;
> >>>>>>>>>> +}
> >>>>>>>>>> +
> >>>>>>>>>>  static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
> >>>>>>>>>>  {
> >>>>>>>>>>  	long (*fn)(struct vfio_group *, unsigned long);
> >>>>>>>>>> @@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct
> >>>>>> kvm_device *dev)
> >>>>>>>>>>  	mutex_unlock(&kv->lock);
> >>>>>>>>>>  }
> >>>>>>>>>>
> >>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>>>>>>>> +static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
> >>>>>>>>>> +		struct vfio_group *vfio_group)
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Shouldn't this go in the same patch that introduced the attach
> >>>>>>>>> function?
> >>>>>>>>
> >>>>>>>> Having fewer patches which touch different maintainers' areas is better. I
> >>>>>>>> cannot avoid touching both PPC KVM and VFIO in this patch but I can in
> >>>>>>>> "[PATCH kernel 6/9] KVM: PPC: Associate IOMMU group with guest view of TCE
> >>>>>>>> table".
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> +{
> >>>>>>>>>> +	int group_id;
> >>>>>>>>>> +	struct iommu_group *grp;
> >>>>>>>>>> +
> >>>>>>>>>> +	group_id = kvm_vfio_external_user_iommu_id(vfio_group);
> >>>>>>>>>> +	grp = iommu_group_get_by_id(group_id);
> >>>>>>>>>> +	if (grp) {
> >>>>>>>>>> +		kvm_spapr_tce_detach_iommu_group(kvm, grp);
> >>>>>>>>>> +		iommu_group_put(grp);
> >>>>>>>>>> +	}
> >>>>>>>>>> +}
> >>>>>>>>>> +#endif
> >>>>>>>>>> +
> >>>>>>>>>>  static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >>>>>>>>>>  {
> >>>>>>>>>>  	struct kvm_vfio *kv = dev->private;
> >>>>>>>>>> @@ -186,6 +222,10 @@ static int kvm_vfio_set_group(struct kvm_device
> >>>>>> *dev, long attr, u64 arg)
> >>>>>>>>>>  				continue;
> >>>>>>>>>>
> >>>>>>>>>>  			list_del(&kvg->node);
> >>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>>>>>>>
> >>>>>>>>> Better to make a no-op version of the call than have to #ifdef at the
> >>>>>>>>> callsite.
> >>>>>>>>
> >>>>>>>> It is questionable. An x86 reader may decide that
> >>>>>>>> KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN is implemented for x86 and get
> >>>>>>>> confused.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> +			kvm_vfio_spapr_detach_iommu_group(dev->kvm,
> >>>>>>>>>> +					kvg->vfio_group);
> >>>>>>>>>> +#endif
> >>>>>>>>>>  			kvm_vfio_group_put_external_user(kvg->vfio_group);
> >>>>>>>>>>  			kfree(kvg);
> >>>>>>>>>>  			ret = 0;
> >>>>>>>>>> @@ -201,6 +241,69 @@ static int kvm_vfio_set_group(struct kvm_device
> >>>>>> *dev, long attr, u64 arg)
> >>>>>>>>>>  		kvm_vfio_update_coherency(dev);
> >>>>>>>>>>
> >>>>>>>>>>  		return ret;
> >>>>>>>>>> +
> >>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>>>>>>>> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: {
> >>>>>>>>>> +		struct kvm_vfio_spapr_tce_liobn param;
> >>>>>>>>>> +		unsigned long minsz;
> >>>>>>>>>> +		struct kvm_vfio *kv = dev->private;
> >>>>>>>>>> +		struct vfio_group *vfio_group;
> >>>>>>>>>> +		struct kvm_vfio_group *kvg;
> >>>>>>>>>> +		struct fd f;
> >>>>>>>>>> +
> >>>>>>>>>> +		minsz = offsetofend(struct kvm_vfio_spapr_tce_liobn,
> >>>>>>>>>> +				start_addr);
> >>>>>>>>>> +
> >>>>>>>>>> +		if (copy_from_user(&param, (void __user *)arg, minsz))
> >>>>>>>>>> +			return -EFAULT;
> >>>>>>>>>> +
> >>>>>>>>>> +		if (param.argsz < minsz)
> >>>>>>>>>> +			return -EINVAL;
> >>>>>>>>>> +
> >>>>>>>>>> +		f = fdget(param.fd);
> >>>>>>>>>> +		if (!f.file)
> >>>>>>>>>> +			return -EBADF;
> >>>>>>>>>> +
> >>>>>>>>>> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
> >>>>>>>>>> +		fdput(f);
> >>>>>>>>>> +
> >>>>>>>>>> +		if (IS_ERR(vfio_group))
> >>>>>>>>>> +			return PTR_ERR(vfio_group);
> >>>>>>>>>> +
> >>>>>>>>>> +		ret = -ENOENT;
> >>>>>>>>>
> >>>>>>>>> Shouldn't there be some runtime test for the type of the IOMMU?  It's
> >>>>>>>>> possible a kernel could be built for a platform supporting multiple
> >>>>>>>>> IOMMU types.
> >>>>>>>>
> >>>>>>>> Well, it may make sense but I do not know how to test that. The IOMMU type is a
> >>>>>>>> VFIO container property, not a group property and here (KVM) we only have
> >>>>>>>> groups.
> >>>>>>>
> >>>>>>> Which, as mentioned previously, is broken.
> >>>>>>
> >>>>>> Which I am failing to follow you on this.
> >>>>>>
> >>>>>> What I am trying to achieve here is pretty much referencing a group so it
> >>>>>> cannot be reused. Plus LIOBNs.
> >>>>>
> >>>>> "Plus LIOBNs" is not a trivial change.  You are establishing a linkage
> >>>>> from LIOBNs to groups.  But that doesn't make sense; if mapping in one
> >>>>> (guest) LIOBN affects a group it must affect all groups in the
> >>>>> container.  i.e. LIOBN->container is the natural mapping, *not* LIOBN
> >>>>> to group.
> >>>>
> >>>> I can see your point but I don't see how to proceed now, I'm totally stuck.
> >>>> Pass container fd and then implement new api to lock containers somehow and
> >>>
> >>> I'm not really understanding what the question is about locking containers.
> >>>
> >>>> enumerate groups when updating TCE table (including real mode)?
> >>>
> >>> Why do you need to enumerate groups?  The groups within the container
> >>> share a TCE table, so can't you just update that once?
> >>
> >> Well, they share a TCE table but they do not share TCE Kill (TCE cache
> >> invalidate) register address, it is still per PE but this does not matter
> >> here (pnv_pci_link_table_and_group() does that), just mentioned to complete
> >> the picture.
> > 
> > True, you'll need to enumerate the groups for invalidates.  But you
> > need that already, right.
> > 
> >>>> Plus new API when we remove a group from a container as the result of guest
> >>>> PCI hot unplug?
> >>>
> >>> I assume you mean a kernel internal API, since it shouldn't need
> >>> anything else visible to userspace.  Won't this happen naturally?
> >>> When the group is removed from the container, it will get its own TCE
> >>> table instead of the previously shared one.
> >>>
> >>>>>> Passing a container fd does not make much
> >>>>>> sense here as the VFIO device would walk through groups, reference them and
> >>>>>> that is it, there is no locking on VFIO containers and so far there was no
> >>>>>> need to teach KVM about containers.
> >>>>>>
> >>>>>> What do I miss now?
> >>>>>
> >>>>> Referencing the groups is essentially just a useful side effect.  The
> >>>>> important functionality is informing VFIO of the guest LIOBNs; and
> >>>>> LIOBNs map to containers, not groups.
> >>>>
> >>>> No. One liobn maps to one KVM-allocated TCE table, not a container. There
> >>>> can be one or many or none containers per liobn.
> >>>
> >>> Ah, true.
> >>
> >> So I need to add new kernel API for KVM to get table(s) from VFIO
> >> container(s). And invent some locking mechanism to prevent table(s) (or
> >> associated container(s)) from going away, like
> >> vfio_group_get_external_user/vfio_group_put_external_user but for
> >> containers. Correct?
> > 
> > Well, the container is attached to an fd, so if you get a reference on
> > the file* that should do it.
> 
> I am still trying to think of how to implement this suggestion.
> 
> I need a way to tell KVM about IOMMU groups. The vfio-pci driver is not the
> right interface as it knows nothing about KVM. There is the VFIO-KVM device
> but it has no idea about containers.
> 
> So I have to:
> 
> Whenever a container is created or removed, notify the VFIO-KVM device by
> passing it the container fd. OK.

Actually, I don't think the vfio-kvm device is really useful here.  It
was designed as a hack for a particular problem on x86.  It certainly
could be extended to cover the information we need here, but I don't
think it's a particularly natural way of doing so.

The problem is that conveying the information from the vfio-kvm device
to the real mode H_PUT_TCE handler, which is what really needs it,
isn't particularly simpler than conveying that information from
anywhere else.

> Then the VFIO-KVM device needs to tell KVM which iommu_table belongs to
> which LIOBN so the real mode handlers can do the job. The real mode TCE
> handlers take a LIOBN, find the guest view table and update it. Now I also
> want to update the hardware table, which is the iommu_table attached to the
> LIOBN.
> 
> I did pass an IOMMU group fd to the VFIO-KVM device. You suggested a container fd instead.
> 
> Now the VFIO-KVM device needs to extract the iommu_table's from the container.
> These iommu_table pointers are stored in "struct tce_container" which is
> local to drivers/vfio/vfio_iommu_spapr_tce.c and not exported in any way,
> so I cannot use it from here.
> 
> The other way to go would be adding an API to VFIO to enumerate the IOMMU
> groups in a container and use the iommu_table pointers stored in the
> iommu_table_group of each group (in fact the very first group is enough as
> multiple groups in a container share the table). Adding
> vfio_container_get_groups() when only the first one is needed is quite
> tricky in terms of maintainer approval.
> 
> So what would be the right course of action? Thanks.

So, from the user side, you need to be able to bind a VFIO backend to
a particular guest IOMMU.  This suggests a new ioctl() used in
conjunction with KVM_CREATE_SPAPR_TCE.  Let's call it
KVM_SPAPR_TCE_BIND_VFIO.  You'd use KVM_CREATE_SPAPR_TCE to make the
kernel aware of a LIOBN in the first place, then use
KVM_SPAPR_TCE_BIND_VFIO to associate a VFIO container with that LIOBN.
So it would be a VM ioctl, taking a LIOBN and a container fd.  You'd
need a capability to go with it, and some way to unbind as well.
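
To make that concrete, the uapi side might look something like this (the
names, layout and ioctl number are purely illustrative, not a worked-out
proposal):

/* VM ioctl, illustrative only */
struct kvm_spapr_tce_bind_vfio {
        __u64   liobn;          /* guest logical bus id */
        __s32   container_fd;   /* VFIO container backing the LIOBN */
        __u32   flags;          /* could carry a BIND/UNBIND bit */
};
#define KVM_SPAPR_TCE_BIND_VFIO _IOW(KVMIO, 0xaf, struct kvm_spapr_tce_bind_vfio)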

To implement that, the ioctl() would need to use a new vfio (kernel
internal) interface - which can be specific to only the spapr TCE
type.  That would take the container fd, and return the list of
iommu_tables in some form or other (or various error conditions,
obviously).
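
Again just as a sketch, that internal interface could be a single export
from the spapr TCE backend (the name and signature below are made up):

/* Hypothetical: fill @tbls with the iommu_table(s) backing the DMA
 * windows of the container behind @container_filp, return how many,
 * or a negative error if it is not a spapr TCE container. */
extern long vfio_spapr_tce_get_tables(struct file *container_filp,
                                      struct iommu_table **tbls,
                                      unsigned int max_tables);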

So, when qemu creates the PHB, it uses KVM_CREATE_SPAPR_TCE to inform
the kernel of the LIOBN.  When the VFIO device is attached to the PHB,
it uses KVM_SPAPR_TCE_BIND_VFIO to connect the VFIO container to the
LIOBN.  The ioctl() implementation uses the new special interface into
the spapr_tce vfio backend to get the list of iommu tables, which it
stores in some private format.  The H_PUT_TCE implementation uses that
stored list of iommu tables to translate H_PUT_TCEs from the guest
into actions on the host IOMMU tables.
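
From qemu's side the sequence would then be roughly as follows (a sketch
reusing the hypothetical KVM_SPAPR_TCE_BIND_VFIO from above; liobn,
page_shift, bus_offset, window_size, vmfd and container_fd are assumed to
come from the PHB and VFIO setup code):

/* when the PHB is created: tell the kernel about the LIOBN */
struct kvm_create_spapr_tce_64 create = {
        .liobn = liobn,
        .page_shift = page_shift,
        .offset = bus_offset >> page_shift,     /* in pages */
        .size = window_size >> page_shift,      /* in pages */
};
tablefd = ioctl(vmfd, KVM_CREATE_SPAPR_TCE_64, &create);

/* when a VFIO device is attached: bind the container to the LIOBN */
struct kvm_spapr_tce_bind_vfio bind = {
        .liobn = liobn,
        .container_fd = container_fd,
};
ioctl(vmfd, KVM_SPAPR_TCE_BIND_VFIO, &bind);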

And, yes, the special interface to the spapr TCE vfio back end is kind
of a hack.  That's what you get when you need to link to separate
kernel subsystems for performance reasons.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH kernel 9/9] KVM: PPC: VFIO device: support SPAPR TCE
  2016-06-10  6:50                         ` David Gibson
@ 2016-06-14  3:30                           ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 92+ messages in thread
From: Alexey Kardashevskiy @ 2016-06-14  3:30 UTC (permalink / raw)
  To: David Gibson; +Cc: linuxppc-dev, Paul Mackerras, Alex Williamson, kvm-ppc, kvm


[-- Attachment #1.1: Type: text/plain, Size: 21081 bytes --]

On 10/06/16 16:50, David Gibson wrote:
> On Thu, Jun 09, 2016 at 04:47:59PM +1000, Alexey Kardashevskiy wrote:
>> On 23/03/16 14:03, David Gibson wrote:
>>> On Tue, Mar 22, 2016 at 11:34:55AM +1100, Alexey Kardashevskiy wrote:
>>>> Uff, lost cc: list. Added back. Some comments below.
>>>>
>>>>
>>>> On 03/21/2016 04:19 PM, David Gibson wrote:
>>>>> On Fri, Mar 18, 2016 at 11:12:26PM +1100, Alexey Kardashevskiy wrote:
>>>>>> On March 15, 2016 17:29:26 David Gibson <david@gibson.dropbear.id.au> wrote:
>>>>>>
>>>>>>> On Fri, Mar 11, 2016 at 10:09:50AM +1100, Alexey Kardashevskiy wrote:
>>>>>>>> On 03/10/2016 04:21 PM, David Gibson wrote:
>>>>>>>>> On Wed, Mar 09, 2016 at 08:20:12PM +1100, Alexey Kardashevskiy wrote:
>>>>>>>>>> On 03/09/2016 04:45 PM, David Gibson wrote:
>>>>>>>>>>> On Mon, Mar 07, 2016 at 02:41:17PM +1100, Alexey Kardashevskiy wrote:
>>>>>>>>>>>> sPAPR TCE IOMMU is para-virtualized and the guest does map/unmap
>>>>>>>>>>>> via hypercalls which take a logical bus id (LIOBN) as a target IOMMU
>>>>>>>>>>>> identifier. LIOBNs are made up, advertised to guest systems and
>>>>>>>>>>>> linked to IOMMU groups by the user space.
>>>>>>>>>>>> In order to enable acceleration for IOMMU operations in KVM, we need
>>>>>>>>>>>> to tell KVM about the LIOBN-to-group mapping.
>>>>>>>>>>>>
>>>>>>>>>>>> For that, a new KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN parameter
>>>>>>>>>>>> is added which accepts:
>>>>>>>>>>>> - a VFIO group fd and IO base address to find the actual hardware
>>>>>>>>>>>> TCE table;
>>>>>>>>>>>> - a LIOBN to assign to the found table.
>>>>>>>>>>>>
>>>>>>>>>>>> Before notifying KVM about the new link, this checks that the group is
>>>>>>>>>>>> registered with the KVM device in order to release it at unexpected KVM
>>>>>>>>>>>> finish.
>>>>>>>>>>>>
>>>>>>>>>>>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
>>>>>>>>>>>> space.
>>>>>>>>>>>>
>>>>>>>>>>>> While we are here, this also fixes VFIO KVM device compiling to let it
>>>>>>>>>>>> link to a KVM module.
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>>>>>> ---
>>>>>>>>>>>>  Documentation/virtual/kvm/devices/vfio.txt |  21 +++++-
>>>>>>>>>>>>  arch/powerpc/kvm/Kconfig                   |   1 +
>>>>>>>>>>>>  arch/powerpc/kvm/Makefile                  |   5 +-
>>>>>>>>>>>>  arch/powerpc/kvm/powerpc.c                 |   1 +
>>>>>>>>>>>>  include/uapi/linux/kvm.h                   |   9 +++
>>>>>>>>>>>>  virt/kvm/vfio.c                            | 106
>>>>>>>> +++++++++++++++++++++++++++++
>>>>>>>>>>>>  6 files changed, 140 insertions(+), 3 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt
>>>>>>>> b/Documentation/virtual/kvm/devices/vfio.txt
>>>>>>>>>>>> index ef51740..c0d3eb7 100644
>>>>>>>>>>>> --- a/Documentation/virtual/kvm/devices/vfio.txt
>>>>>>>>>>>> +++ b/Documentation/virtual/kvm/devices/vfio.txt
>>>>>>>>>>>> @@ -16,7 +16,24 @@ Groups:
>>>>>>>>>>>>
>>>>>>>>>>>>  KVM_DEV_VFIO_GROUP attributes:
>>>>>>>>>>>>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
>>>>>>>>>>>> +	kvm_device_attr.addr points to an int32_t file descriptor
>>>>>>>>>>>> +	for the VFIO group.
>>>>>>>>>>>
>>>>>>>>>>> AFAICT these changes are accurate for VFIO as it is already, in which
>>>>>>>>>>> case it might be clearer to put them in a separate patch.
>>>>>>>>>>>
>>>>>>>>>>>>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device
>>>>>>>> tracking
>>>>>>>>>>>> +	kvm_device_attr.addr points to an int32_t file descriptor
>>>>>>>>>>>> +	for the VFIO group.
>>>>>>>>>>>>
>>>>>>>>>>>> -For each, kvm_device_attr.addr points to an int32_t file descriptor
>>>>>>>>>>>> -for the VFIO group.
>>>>>>>>>>>> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: sets a liobn for a VFIO group
>>>>>>>>>>>> +	kvm_device_attr.addr points to a struct:
>>>>>>>>>>>> +		struct kvm_vfio_spapr_tce_liobn {
>>>>>>>>>>>> +			__u32	argsz;
>>>>>>>>>>>> +			__s32	fd;
>>>>>>>>>>>> +			__u32	liobn;
>>>>>>>>>>>> +			__u8	pad[4];
>>>>>>>>>>>> +			__u64	start_addr;
>>>>>>>>>>>> +		};
>>>>>>>>>>>> +		where
>>>>>>>>>>>> +		@argsz is the size of kvm_vfio_spapr_tce_liobn;
>>>>>>>>>>>> +		@fd is a file descriptor for a VFIO group;
>>>>>>>>>>>> +		@liobn is a logical bus id to be associated with the group;
>>>>>>>>>>>> +		@start_addr is a DMA window offset on the IO (PCI) bus
>>>>>>>>>>>
>>>>>>>>>>> For the case of DDW and multiple windows, I'm assuming you can call
>>>>>>>>>>> this multiple times with different LIOBNs and the same IOMMU group?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Yes. It is called twice per each group (when DDW is activated) - for 32bit
>>>>>>>>>> and 64bit windows, this is why @start_addr is there.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>> diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
>>>>>>>>>>>> index 1059846..dfa3488 100644
>>>>>>>>>>>> --- a/arch/powerpc/kvm/Kconfig
>>>>>>>>>>>> +++ b/arch/powerpc/kvm/Kconfig
>>>>>>>>>>>> @@ -65,6 +65,7 @@ config KVM_BOOK3S_64
>>>>>>>>>>>>  	select KVM
>>>>>>>>>>>>  	select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
>>>>>>>>>>>>  	select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
>>>>>>>>>>>> +	select KVM_VFIO if VFIO
>>>>>>>>>>>>  	---help---
>>>>>>>>>>>>  	  Support running unmodified book3s_64 and book3s_32 guest kernels
>>>>>>>>>>>>  	  in virtual machines on book3s_64 host processors.
>>>>>>>>>>>> diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
>>>>>>>>>>>> index 7f7b6d8..71f577c 100644
>>>>>>>>>>>> --- a/arch/powerpc/kvm/Makefile
>>>>>>>>>>>> +++ b/arch/powerpc/kvm/Makefile
>>>>>>>>>>>> @@ -8,7 +8,7 @@ ccflags-y := -Ivirt/kvm -Iarch/powerpc/kvm
>>>>>>>>>>>>  KVM := ../../../virt/kvm
>>>>>>>>>>>>
>>>>>>>>>>>>  common-objs-y = $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
>>>>>>>>>>>> -		$(KVM)/eventfd.o $(KVM)/vfio.o
>>>>>>>>>>>> +		$(KVM)/eventfd.o
>>>>>>>>>>>
>>>>>>>>>>> Please don't disable the VFIO device for the non-book3s case.  I added
>>>>>>>>>>> it (even though it didn't do anything until now) so that libvirt
>>>>>>>>>>> wouldn't choke when it finds it's not available.  Obviously the new
>>>>>>>>>>> ioctl needs to be only for the right IOMMU setup, but the device
>>>>>>>>>>> itself should be available always.
>>>>>>>>>>
>>>>>>>>>> Ah. Ok, I'll fix this. I just wanted to be able to compile kvm as a module.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>>  CFLAGS_e500_mmu.o := -I.
>>>>>>>>>>>>  CFLAGS_e500_mmu_host.o := -I.
>>>>>>>>>>>> @@ -87,6 +87,9 @@ endif
>>>>>>>>>>>>  kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \
>>>>>>>>>>>>  	book3s_xics.o
>>>>>>>>>>>>
>>>>>>>>>>>> +kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) += \
>>>>>>>>>>>> +	$(KVM)/vfio.o \
>>>>>>>>>>>> +
>>>>>>>>>>>>  kvm-book3s_64-module-objs += \
>>>>>>>>>>>>  	$(KVM)/kvm_main.o \
>>>>>>>>>>>>  	$(KVM)/eventfd.o \
>>>>>>>>>>>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>>>>>>>>>>>> index 19aa59b..63f188d 100644
>>>>>>>>>>>> --- a/arch/powerpc/kvm/powerpc.c
>>>>>>>>>>>> +++ b/arch/powerpc/kvm/powerpc.c
>>>>>>>>>>>> @@ -521,6 +521,7 @@ int kvm_vm_ioctl_check_extension(struct kvm
>>>>>>>> *kvm, long ext)
>>>>>>>>>>>>  #ifdef CONFIG_PPC_BOOK3S_64
>>>>>>>>>>>>  	case KVM_CAP_SPAPR_TCE:
>>>>>>>>>>>>  	case KVM_CAP_SPAPR_TCE_64:
>>>>>>>>>>>> +	case KVM_CAP_SPAPR_TCE_VFIO:
>>>>>>>>>>>>  	case KVM_CAP_PPC_ALLOC_HTAB:
>>>>>>>>>>>>  	case KVM_CAP_PPC_RTAS:
>>>>>>>>>>>>  	case KVM_CAP_PPC_FIXUP_HCALL:
>>>>>>>>>>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>>>>>>>>>>>> index 080ffbf..f1abbea 100644
>>>>>>>>>>>> --- a/include/uapi/linux/kvm.h
>>>>>>>>>>>> +++ b/include/uapi/linux/kvm.h
>>>>>>>>>>>> @@ -1056,6 +1056,7 @@ struct kvm_device_attr {
>>>>>>>>>>>>  #define  KVM_DEV_VFIO_GROUP			1
>>>>>>>>>>>>  #define   KVM_DEV_VFIO_GROUP_ADD			1
>>>>>>>>>>>>  #define   KVM_DEV_VFIO_GROUP_DEL			2
>>>>>>>>>>>> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN	3
>>>>>>>>>>>>
>>>>>>>>>>>>  enum kvm_device_type {
>>>>>>>>>>>>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
>>>>>>>>>>>> @@ -1075,6 +1076,14 @@ enum kvm_device_type {
>>>>>>>>>>>>  	KVM_DEV_TYPE_MAX,
>>>>>>>>>>>>  };
>>>>>>>>>>>>
>>>>>>>>>>>> +struct kvm_vfio_spapr_tce_liobn {
>>>>>>>>>>>> +	__u32	argsz;
>>>>>>>>>>>> +	__s32	fd;
>>>>>>>>>>>> +	__u32	liobn;
>>>>>>>>>>>> +	__u8	pad[4];
>>>>>>>>>>>> +	__u64	start_addr;
>>>>>>>>>>>> +};
>>>>>>>>>>>> +
>>>>>>>>>>>>  /*
>>>>>>>>>>>>   * ioctls for VM fds
>>>>>>>>>>>>   */
>>>>>>>>>>>> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
>>>>>>>>>>>> index 1dd087d..87c771e 100644
>>>>>>>>>>>> --- a/virt/kvm/vfio.c
>>>>>>>>>>>> +++ b/virt/kvm/vfio.c
>>>>>>>>>>>> @@ -20,6 +20,10 @@
>>>>>>>>>>>>  #include <linux/vfio.h>
>>>>>>>>>>>>  #include "vfio.h"
>>>>>>>>>>>>
>>>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>>>>>>>>>> +#include <asm/kvm_ppc.h>
>>>>>>>>>>>> +#endif
>>>>>>>>>>>> +
>>>>>>>>>>>>  struct kvm_vfio_group {
>>>>>>>>>>>>  	struct list_head node;
>>>>>>>>>>>>  	struct vfio_group *vfio_group;
>>>>>>>>>>>> @@ -60,6 +64,22 @@ static void
>>>>>>>> kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
>>>>>>>>>>>>  	symbol_put(vfio_group_put_external_user);
>>>>>>>>>>>>  }
>>>>>>>>>>>>
>>>>>>>>>>>> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
>>>>>>>>>>>> +{
>>>>>>>>>>>> +	int (*fn)(struct vfio_group *);
>>>>>>>>>>>> +	int ret = -1;
>>>>>>>>>>>
>>>>>>>>>>> Should this be -ESOMETHING?
>>>>>>>>>>>
>>>>>>>>>>>> +	fn = symbol_get(vfio_external_user_iommu_id);
>>>>>>>>>>>> +	if (!fn)
>>>>>>>>>>>> +		return ret;
>>>>>>>>>>>> +
>>>>>>>>>>>> +	ret = fn(vfio_group);
>>>>>>>>>>>> +
>>>>>>>>>>>> +	symbol_put(vfio_external_user_iommu_id);
>>>>>>>>>>>> +
>>>>>>>>>>>> +	return ret;
>>>>>>>>>>>> +}
>>>>>>>>>>>> +
>>>>>>>>>>>>  static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
>>>>>>>>>>>>  {
>>>>>>>>>>>>  	long (*fn)(struct vfio_group *, unsigned long);
>>>>>>>>>>>> @@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct
>>>>>>>> kvm_device *dev)
>>>>>>>>>>>>  	mutex_unlock(&kv->lock);
>>>>>>>>>>>>  }
>>>>>>>>>>>>
>>>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>>>>>>>>>> +static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
>>>>>>>>>>>> +		struct vfio_group *vfio_group)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Shouldn't this go in the same patch that introduced the attach
>>>>>>>>>>> function?
>>>>>>>>>>
>>>>>>>>>> Having fewer patches which touch different maintainers' areas is better. I
>>>>>>>>>> cannot avoid touching both PPC KVM and VFIO in this patch but I can in
>>>>>>>>>> "[PATCH kernel 6/9] KVM: PPC: Associate IOMMU group with guest view of TCE
>>>>>>>>>> table".
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> +{
>>>>>>>>>>>> +	int group_id;
>>>>>>>>>>>> +	struct iommu_group *grp;
>>>>>>>>>>>> +
>>>>>>>>>>>> +	group_id = kvm_vfio_external_user_iommu_id(vfio_group);
>>>>>>>>>>>> +	grp = iommu_group_get_by_id(group_id);
>>>>>>>>>>>> +	if (grp) {
>>>>>>>>>>>> +		kvm_spapr_tce_detach_iommu_group(kvm, grp);
>>>>>>>>>>>> +		iommu_group_put(grp);
>>>>>>>>>>>> +	}
>>>>>>>>>>>> +}
>>>>>>>>>>>> +#endif
>>>>>>>>>>>> +
>>>>>>>>>>>>  static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>>>>>>>>>>>>  {
>>>>>>>>>>>>  	struct kvm_vfio *kv = dev->private;
>>>>>>>>>>>> @@ -186,6 +222,10 @@ static int kvm_vfio_set_group(struct kvm_device
>>>>>>>> *dev, long attr, u64 arg)
>>>>>>>>>>>>  				continue;
>>>>>>>>>>>>
>>>>>>>>>>>>  			list_del(&kvg->node);
>>>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>>>>>>>>>
>>>>>>>>>>> Better to make a no-op version of the call than have to #ifdef at the
>>>>>>>>>>> callsite.
>>>>>>>>>>
>>>>>>>>>> It is questionable. An x86 reader may decide that
>>>>>>>>>> KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN is implemented for x86 and get
>>>>>>>>>> confused.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> +			kvm_vfio_spapr_detach_iommu_group(dev->kvm,
>>>>>>>>>>>> +					kvg->vfio_group);
>>>>>>>>>>>> +#endif
>>>>>>>>>>>>  			kvm_vfio_group_put_external_user(kvg->vfio_group);
>>>>>>>>>>>>  			kfree(kvg);
>>>>>>>>>>>>  			ret = 0;
>>>>>>>>>>>> @@ -201,6 +241,69 @@ static int kvm_vfio_set_group(struct kvm_device
>>>>>>>> *dev, long attr, u64 arg)
>>>>>>>>>>>>  		kvm_vfio_update_coherency(dev);
>>>>>>>>>>>>
>>>>>>>>>>>>  		return ret;
>>>>>>>>>>>> +
>>>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>>>>>>>>>> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: {
>>>>>>>>>>>> +		struct kvm_vfio_spapr_tce_liobn param;
>>>>>>>>>>>> +		unsigned long minsz;
>>>>>>>>>>>> +		struct kvm_vfio *kv = dev->private;
>>>>>>>>>>>> +		struct vfio_group *vfio_group;
>>>>>>>>>>>> +		struct kvm_vfio_group *kvg;
>>>>>>>>>>>> +		struct fd f;
>>>>>>>>>>>> +
>>>>>>>>>>>> +		minsz = offsetofend(struct kvm_vfio_spapr_tce_liobn,
>>>>>>>>>>>> +				start_addr);
>>>>>>>>>>>> +
>>>>>>>>>>>> +		if (copy_from_user(&param, (void __user *)arg, minsz))
>>>>>>>>>>>> +			return -EFAULT;
>>>>>>>>>>>> +
>>>>>>>>>>>> +		if (param.argsz < minsz)
>>>>>>>>>>>> +			return -EINVAL;
>>>>>>>>>>>> +
>>>>>>>>>>>> +		f = fdget(param.fd);
>>>>>>>>>>>> +		if (!f.file)
>>>>>>>>>>>> +			return -EBADF;
>>>>>>>>>>>> +
>>>>>>>>>>>> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
>>>>>>>>>>>> +		fdput(f);
>>>>>>>>>>>> +
>>>>>>>>>>>> +		if (IS_ERR(vfio_group))
>>>>>>>>>>>> +			return PTR_ERR(vfio_group);
>>>>>>>>>>>> +
>>>>>>>>>>>> +		ret = -ENOENT;
>>>>>>>>>>>
>>>>>>>>>>> Shouldn't there be some runtime test for the type of the IOMMU?  It's
>>>>>>>>>>> possible a kernel could be built for a platform supporting multiple
>>>>>>>>>>> IOMMU types.
>>>>>>>>>>
>>>>>>>>>> Well, it may make sense but I do not know how to test that. The IOMMU type is a
>>>>>>>>>> VFIO container property, not a group property and here (KVM) we only have
>>>>>>>>>> groups.
>>>>>>>>>
>>>>>>>>> Which, as mentioned previously, is broken.
>>>>>>>>
>>>>>>>> Which I am failing to follow you on this.
>>>>>>>>
>>>>>>>> What I am trying to achieve here is pretty much referencing a group so it
>>>>>>>> cannot be reused. Plus LIOBNs.
>>>>>>>
>>>>>>> "Plus LIOBNs" is not a trivial change.  You are establishing a linkage
>>>>>>> from LIOBNs to groups.  But that doesn't make sense; if mapping in one
>>>>>>> (guest) LIOBN affects a group it must affect all groups in the
>>>>>>> container.  i.e. LIOBN->container is the natural mapping, *not* LIOBN
>>>>>>> to group.
>>>>>>
>>>>>> I can see your point but I don't see how to proceed now, I'm totally stuck.
>>>>>> Pass container fd and then implement new api to lock containers somehow and
>>>>>
>>>>> I'm not really understanding what the question is about locking containers.
>>>>>
>>>>>> enumerate groups when updating TCE table (including real mode)?
>>>>>
>>>>> Why do you need to enumerate groups?  The groups within the container
>>>>> share a TCE table, so can't you just update that once?
>>>>
>>>> Well, they share a TCE table but they do not share TCE Kill (TCE cache
>>>> invalidate) register address, it is still per PE but this does not matter
>>>> here (pnv_pci_link_table_and_group() does that), just mentioned to complete
>>>> the picture.
>>>
>>> True, you'll need to enumerate the groups for invalidates.  But you
>>> need that already, right.
>>>
>>>>>> Plus new API when we remove a group from a container as the result of guest
>>>>>> PCI hot unplug?
>>>>>
>>>>> I assume you mean a kernel internal API, since it shouldn't need
>>>>> anything else visible to userspace.  Won't this happen naturally?
>>>>> When the group is removed from the container, it will get its own TCE
>>>>> table instead of the previously shared one.
>>>>>
>>>>>>>> Passing a container fd does not make much
>>>>>>>> sense here as the VFIO device would walk through groups, reference them, and
>>>>>>>> that is it; there is no locking on VFIO containers and so far there has been no
>>>>>>>> need to teach KVM about containers.
>>>>>>>>
>>>>>>>> What do I miss now?
>>>>>>>
>>>>>>> Referencing the groups is essentially just a useful side effect.  The
>>>>>>> important functionality is informing VFIO of the guest LIOBNs; and
>>>>>>> LIOBNs map to containers, not groups.
>>>>>>
>>>>>> No. One LIOBN maps to one KVM-allocated TCE table, not a container. There
>>>>>> can be one, many, or no containers per LIOBN.
>>>>>
>>>>> Ah, true.
>>>>
>>>> So I need to add a new kernel API for KVM to get the table(s) from VFIO
>>>> container(s), and invent some locking mechanism to prevent the table(s) (or
>>>> associated container(s)) from going away, like
>>>> vfio_group_get_external_user()/vfio_group_put_external_user() but for
>>>> containers. Correct?
>>>
>>> Well, the container is attached to an fd, so if you get a reference on
>>> the file* that should do it.
>>
>> I am still trying to think of how to implement this suggestion.
>>
>> I need a way to tell KVM about IOMMU groups. The vfio-pci driver is not the
>> right interface as it knows nothing about KVM. There is the VFIO-KVM device
>> but it has no idea about containers.
>>
>> So I have to:
>>
>> Whenever a container is created or removed, notify the VFIO-KVM device by
>> passing it a container fd. OK.
> 
> Actually, I don't think the vfio-kvm device is really useful here.  It
> was designed as a hack for a particular problem on x86.  It certainly
> could be extended to cover the information we need here, but I don't
> think it's a particularly natural way of doing so.
> 
> The problem is that conveying the information from the vfio-kvm device
> to the real mode H_PUT_TCE handler, which is what really needs it,
> isn't particularly simpler than conveying that information from
> anywhere else.
> 
>> Then the VFIO-KVM device needs to tell KVM which iommu_table belongs to
>> which LIOBN so the real mode handlers can do the job. The real mode TCE
>> handlers get a LIOBN, find the guest view table and update it. Now I want to
>> update the hardware table, which is the iommu_table attached to the LIOBN.
>>
>> I did pass an IOMMU group fd to the VFIO-KVM device. You suggested a container fd.
>>
>> Now the VFIO-KVM device needs to extract iommu_tables from the container.
>> These iommu_table pointers are stored in "struct tce_container", which is
>> local to drivers/vfio/vfio_iommu_spapr_tce.c and not exported in any way, so
>> I cannot use it from KVM.
>>
>> The other way to go would be adding an API to VFIO to enumerate the IOMMU
>> groups in a container and use the iommu_table pointers stored in the
>> iommu_table_group of each group (in fact the very first group will be enough
>> as multiple groups in a container share the table). Adding
>> vfio_container_get_groups() when only the first one is needed is quite
>> tricky in terms of maintainer approvals.
>>
>> So what would be the right course of action? Thanks.
> 
> So, from the user side, you need to be able to bind a VFIO backend to
> a particular guest IOMMU.  This suggests a new ioctl() used in
> conjunction with KVM_CREATE_SPAPR_TCE.  Let's call it
> KVM_SPAPR_TCE_BIND_VFIO.  You'd use KVM_CREATE_SPAPR_TCE to make the
> kernel aware of a LIOBN in the first place, then use
> KVM_SPAPR_TCE_BIND_VFIO to associate a VFIO container with that LIOBN.
> So it would be a VM ioctl, taking a LIOBN and a container fd.  You'd
> need a capability to go with it, and some way to unbind as well.


This is what I had in the first place some years ago, and after 5-6 reviews
I was told that there is a VFIO-KVM device and I should use it.
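
For reference, a rough sketch of what that uapi could look like. This is
purely illustrative: the structure name, layout and ioctl number below are
made up, loosely mirroring the existing kvm_vfio_spapr_tce_liobn:

/* Hypothetical sketch only, not a posted patch. */
struct kvm_spapr_tce_bind_vfio {
	__u32	argsz;		/* size of this structure */
	__u32	flags;		/* room for a BIND/UNBIND flag */
	__u32	liobn;		/* guest logical bus id */
	__s32	container_fd;	/* VFIO container file descriptor */
};

/* VM ioctl; the request number is invented for illustration. */
#define KVM_SPAPR_TCE_BIND_VFIO \
	_IOW(KVMIO, 0xaf, struct kvm_spapr_tce_bind_vfio)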


> To implement that, the ioctl() would need to use a new vfio (kernel
> internal) interface - which can be specific to only the spapr TCE
> type.  That would take the container fd, and return the list of
> iommu_tables in some form or other (or various error conditions,
> obviously).
> 
> So, when qemu creates the PHB, it uses KVM_CREATE_SPAPR_TCE to inform
> the kernel of the LIOBN.  When the VFIO device is attached to the PHB,
> it uses KVM_SPAPR_TCE_BIND_VFIO to connect the VFIO container to the
> LIOBN.  The ioctl() implementation uses the new special interface into
> the spapr_tce vfio backend to get the list of iommu tables, which it
> stores in some private format.

Getting just a list of IOMMU groups would do too. Pushing such an API through
is a problem; this is how I ended up with the current design.
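
Either way, the kernel-internal hook being discussed would look roughly like
this; a sketch only, assuming the spapr TCE backend is the sole implementor,
and no such function exists in the tree:

/*
 * Hypothetical kernel-internal interface into the spapr TCE VFIO
 * backend: given a reference to the container's file, hand back the
 * iommu_table pointers so KVM can associate them with a LIOBN.
 */
extern long vfio_spapr_tce_get_iommu_tables(struct file *container_filp,
		struct iommu_table **tbls, unsigned int max_tbls);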


> The H_PUT_TCE implementation uses that
> stored list of iommu tables to translate H_PUT_TCEs from the guest
> into actions on the host IOMMU tables.
> 
> And, yes, the special interface to the spapr TCE vfio back end is kind
> of a hack.  That's what you get when you need to link to separate
> kernel subsystems for performance reasons.

One can argue whether it is a hack, but how is this hack better than the
existing approach? :)

Alex, could you please comment on David's suggestion? Thanks!


-- 
Alexey


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH kernel 9/9] KVM: PPC: VFIO device: support SPAPR TCE
  2016-06-14  3:30                           ` Alexey Kardashevskiy
@ 2016-06-15  4:43                             ` David Gibson
  -1 siblings, 0 replies; 92+ messages in thread
From: David Gibson @ 2016-06-15  4:43 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Paul Mackerras, Alex Williamson, kvm-ppc, kvm

[-- Attachment #1: Type: text/plain, Size: 22788 bytes --]

On Tue, Jun 14, 2016 at 01:30:53PM +1000, Alexey Kardashevskiy wrote:
> On 10/06/16 16:50, David Gibson wrote:
> > On Thu, Jun 09, 2016 at 04:47:59PM +1000, Alexey Kardashevskiy wrote:
> >> On 23/03/16 14:03, David Gibson wrote:
> >>> On Tue, Mar 22, 2016 at 11:34:55AM +1100, Alexey Kardashevskiy wrote:
> >>>> Uff, lost cc: list. Added back. Some comments below.
> >>>>
> >>>>
> >>>> On 03/21/2016 04:19 PM, David Gibson wrote:
> >>>>> On Fri, Mar 18, 2016 at 11:12:26PM +1100, Alexey Kardashevskiy wrote:
> >>>>>> On March 15, 2016 17:29:26 David Gibson <david@gibson.dropbear.id.au> wrote:
> >>>>>>
> >>>>>>> On Fri, Mar 11, 2016 at 10:09:50AM +1100, Alexey Kardashevskiy wrote:
> >>>>>>>> On 03/10/2016 04:21 PM, David Gibson wrote:
> >>>>>>>>> On Wed, Mar 09, 2016 at 08:20:12PM +1100, Alexey Kardashevskiy wrote:
> >>>>>>>>>> On 03/09/2016 04:45 PM, David Gibson wrote:
> >>>>>>>>>>> On Mon, Mar 07, 2016 at 02:41:17PM +1100, Alexey Kardashevskiy wrote:
> >>>>>>>>>>>> sPAPR TCE IOMMU is para-virtualized and the guest does map/unmap
> >>>>>>>>>>>> via hypercalls which take a logical bus id (LIOBN) as a target IOMMU
> >>>>>>>>>>>> identifier. LIOBNs are made up, advertised to guest systems and
> >>>>>>>>>>>> linked to IOMMU groups by the user space.
> >>>>>>>>>>>> In order to enable acceleration for IOMMU operations in KVM, we need
> >>>>>>>>>>>> to tell KVM the information about the LIOBN-to-group mapping.
> >>>>>>>>>>>>
> >>>>>>>>>>>> For that, a new KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN parameter
> >>>>>>>>>>>> is added which accepts:
> >>>>>>>>>>>> - a VFIO group fd and IO base address to find the actual hardware
> >>>>>>>>>>>> TCE table;
> >>>>>>>>>>>> - a LIOBN to assign to the found table.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Before notifying KVM about new link, this check the group for being
> >>>>>>>>>>>> registered with KVM device in order to release them at unexpected KVM
> >>>>>>>>>>>> finish.
> >>>>>>>>>>>>
> >>>>>>>>>>>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> >>>>>>>>>>>> space.
> >>>>>>>>>>>>
> >>>>>>>>>>>> While we are here, this also fixes VFIO KVM device compiling to let it
> >>>>>>>>>>>> link to a KVM module.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>>>>>>>>>> ---
> >>>>>>>>>>>>  Documentation/virtual/kvm/devices/vfio.txt |  21 +++++-
> >>>>>>>>>>>>  arch/powerpc/kvm/Kconfig                   |   1 +
> >>>>>>>>>>>>  arch/powerpc/kvm/Makefile                  |   5 +-
> >>>>>>>>>>>>  arch/powerpc/kvm/powerpc.c                 |   1 +
> >>>>>>>>>>>>  include/uapi/linux/kvm.h                   |   9 +++
> >>>>>>>>>>>>  virt/kvm/vfio.c                            | 106
> >>>>>>>> +++++++++++++++++++++++++++++
> >>>>>>>>>>>>  6 files changed, 140 insertions(+), 3 deletions(-)
> >>>>>>>>>>>>
> >>>>>>>>>>>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt
> >>>>>>>> b/Documentation/virtual/kvm/devices/vfio.txt
> >>>>>>>>>>>> index ef51740..c0d3eb7 100644
> >>>>>>>>>>>> --- a/Documentation/virtual/kvm/devices/vfio.txt
> >>>>>>>>>>>> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> >>>>>>>>>>>> @@ -16,7 +16,24 @@ Groups:
> >>>>>>>>>>>>
> >>>>>>>>>>>>  KVM_DEV_VFIO_GROUP attributes:
> >>>>>>>>>>>>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> >>>>>>>>>>>> +	kvm_device_attr.addr points to an int32_t file descriptor
> >>>>>>>>>>>> +	for the VFIO group.
> >>>>>>>>>>>
> >>>>>>>>>>> AFAICT these changes are accurate for VFIO as it is already, in which
> >>>>>>>>>>> case it might be clearer to put them in a separate patch.
> >>>>>>>>>>>
> >>>>>>>>>>>>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device
> >>>>>>>> tracking
> >>>>>>>>>>>> +	kvm_device_attr.addr points to an int32_t file descriptor
> >>>>>>>>>>>> +	for the VFIO group.
> >>>>>>>>>>>>
> >>>>>>>>>>>> -For each, kvm_device_attr.addr points to an int32_t file descriptor
> >>>>>>>>>>>> -for the VFIO group.
> >>>>>>>>>>>> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: sets a liobn for a VFIO group
> >>>>>>>>>>>> +	kvm_device_attr.addr points to a struct:
> >>>>>>>>>>>> +		struct kvm_vfio_spapr_tce_liobn {
> >>>>>>>>>>>> +			__u32	argsz;
> >>>>>>>>>>>> +			__s32	fd;
> >>>>>>>>>>>> +			__u32	liobn;
> >>>>>>>>>>>> +			__u8	pad[4];
> >>>>>>>>>>>> +			__u64	start_addr;
> >>>>>>>>>>>> +		};
> >>>>>>>>>>>> +		where
> >>>>>>>>>>>> +		@argsz is the size of kvm_vfio_spapr_tce_liobn;
> >>>>>>>>>>>> +		@fd is a file descriptor for a VFIO group;
> >>>>>>>>>>>> +		@liobn is a logical bus id to be associated with the group;
> >>>>>>>>>>>> +		@start_addr is a DMA window offset on the IO (PCI) bus
> >>>>>>>>>>>
> >>>>>>>>>>> For the cause of DDW and multiple windows, I'm assuming you can call
> >>>>>>>>>>> this multiple times with different LIOBNs and the same IOMMU group?
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Yes. It is called twice per each group (when DDW is activated) - for 32bit
> >>>>>>>>>> and 64bit windows, this is why @start_addr is there.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>>> diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
> >>>>>>>>>>>> index 1059846..dfa3488 100644
> >>>>>>>>>>>> --- a/arch/powerpc/kvm/Kconfig
> >>>>>>>>>>>> +++ b/arch/powerpc/kvm/Kconfig
> >>>>>>>>>>>> @@ -65,6 +65,7 @@ config KVM_BOOK3S_64
> >>>>>>>>>>>>  	select KVM
> >>>>>>>>>>>>  	select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
> >>>>>>>>>>>>  	select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
> >>>>>>>>>>>> +	select KVM_VFIO if VFIO
> >>>>>>>>>>>>  	---help---
> >>>>>>>>>>>>  	  Support running unmodified book3s_64 and book3s_32 guest kernels
> >>>>>>>>>>>>  	  in virtual machines on book3s_64 host processors.
> >>>>>>>>>>>> diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
> >>>>>>>>>>>> index 7f7b6d8..71f577c 100644
> >>>>>>>>>>>> --- a/arch/powerpc/kvm/Makefile
> >>>>>>>>>>>> +++ b/arch/powerpc/kvm/Makefile
> >>>>>>>>>>>> @@ -8,7 +8,7 @@ ccflags-y := -Ivirt/kvm -Iarch/powerpc/kvm
> >>>>>>>>>>>>  KVM := ../../../virt/kvm
> >>>>>>>>>>>>
> >>>>>>>>>>>>  common-objs-y = $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
> >>>>>>>>>>>> -		$(KVM)/eventfd.o $(KVM)/vfio.o
> >>>>>>>>>>>> +		$(KVM)/eventfd.o
> >>>>>>>>>>>
> >>>>>>>>>>> Please don't disable the VFIO device for the non-book3s case.  I added
> >>>>>>>>>>> it (even though it didn't do anything until now) so that libvirt
> >>>>>>>>>>> wouldn't choke when it finds it's not available.  Obviously the new
> >>>>>>>>>>> ioctl needs to be only for the right IOMMU setup, but the device
> >>>>>>>>>>> itself should be available always.
> >>>>>>>>>>
> >>>>>>>>>> Ah. Ok, I'll fix this. I just wanted to be able to compile kvm as a module.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>>>  CFLAGS_e500_mmu.o := -I.
> >>>>>>>>>>>>  CFLAGS_e500_mmu_host.o := -I.
> >>>>>>>>>>>> @@ -87,6 +87,9 @@ endif
> >>>>>>>>>>>>  kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \
> >>>>>>>>>>>>  	book3s_xics.o
> >>>>>>>>>>>>
> >>>>>>>>>>>> +kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) += \
> >>>>>>>>>>>> +	$(KVM)/vfio.o \
> >>>>>>>>>>>> +
> >>>>>>>>>>>>  kvm-book3s_64-module-objs += \
> >>>>>>>>>>>>  	$(KVM)/kvm_main.o \
> >>>>>>>>>>>>  	$(KVM)/eventfd.o \
> >>>>>>>>>>>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> >>>>>>>>>>>> index 19aa59b..63f188d 100644
> >>>>>>>>>>>> --- a/arch/powerpc/kvm/powerpc.c
> >>>>>>>>>>>> +++ b/arch/powerpc/kvm/powerpc.c
> >>>>>>>>>>>> @@ -521,6 +521,7 @@ int kvm_vm_ioctl_check_extension(struct kvm
> >>>>>>>> *kvm, long ext)
> >>>>>>>>>>>>  #ifdef CONFIG_PPC_BOOK3S_64
> >>>>>>>>>>>>  	case KVM_CAP_SPAPR_TCE:
> >>>>>>>>>>>>  	case KVM_CAP_SPAPR_TCE_64:
> >>>>>>>>>>>> +	case KVM_CAP_SPAPR_TCE_VFIO:
> >>>>>>>>>>>>  	case KVM_CAP_PPC_ALLOC_HTAB:
> >>>>>>>>>>>>  	case KVM_CAP_PPC_RTAS:
> >>>>>>>>>>>>  	case KVM_CAP_PPC_FIXUP_HCALL:
> >>>>>>>>>>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >>>>>>>>>>>> index 080ffbf..f1abbea 100644
> >>>>>>>>>>>> --- a/include/uapi/linux/kvm.h
> >>>>>>>>>>>> +++ b/include/uapi/linux/kvm.h
> >>>>>>>>>>>> @@ -1056,6 +1056,7 @@ struct kvm_device_attr {
> >>>>>>>>>>>>  #define  KVM_DEV_VFIO_GROUP			1
> >>>>>>>>>>>>  #define   KVM_DEV_VFIO_GROUP_ADD			1
> >>>>>>>>>>>>  #define   KVM_DEV_VFIO_GROUP_DEL			2
> >>>>>>>>>>>> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN	3
> >>>>>>>>>>>>
> >>>>>>>>>>>>  enum kvm_device_type {
> >>>>>>>>>>>>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
> >>>>>>>>>>>> @@ -1075,6 +1076,14 @@ enum kvm_device_type {
> >>>>>>>>>>>>  	KVM_DEV_TYPE_MAX,
> >>>>>>>>>>>>  };
> >>>>>>>>>>>>
> >>>>>>>>>>>> +struct kvm_vfio_spapr_tce_liobn {
> >>>>>>>>>>>> +	__u32	argsz;
> >>>>>>>>>>>> +	__s32	fd;
> >>>>>>>>>>>> +	__u32	liobn;
> >>>>>>>>>>>> +	__u8	pad[4];
> >>>>>>>>>>>> +	__u64	start_addr;
> >>>>>>>>>>>> +};
> >>>>>>>>>>>> +
> >>>>>>>>>>>>  /*
> >>>>>>>>>>>>   * ioctls for VM fds
> >>>>>>>>>>>>   */
> >>>>>>>>>>>> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> >>>>>>>>>>>> index 1dd087d..87c771e 100644
> >>>>>>>>>>>> --- a/virt/kvm/vfio.c
> >>>>>>>>>>>> +++ b/virt/kvm/vfio.c
> >>>>>>>>>>>> @@ -20,6 +20,10 @@
> >>>>>>>>>>>>  #include <linux/vfio.h>
> >>>>>>>>>>>>  #include "vfio.h"
> >>>>>>>>>>>>
> >>>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>>>>>>>>>> +#include <asm/kvm_ppc.h>
> >>>>>>>>>>>> +#endif
> >>>>>>>>>>>> +
> >>>>>>>>>>>>  struct kvm_vfio_group {
> >>>>>>>>>>>>  	struct list_head node;
> >>>>>>>>>>>>  	struct vfio_group *vfio_group;
> >>>>>>>>>>>> @@ -60,6 +64,22 @@ static void
> >>>>>>>> kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
> >>>>>>>>>>>>  	symbol_put(vfio_group_put_external_user);
> >>>>>>>>>>>>  }
> >>>>>>>>>>>>
> >>>>>>>>>>>> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> >>>>>>>>>>>> +{
> >>>>>>>>>>>> +	int (*fn)(struct vfio_group *);
> >>>>>>>>>>>> +	int ret = -1;
> >>>>>>>>>>>
> >>>>>>>>>>> Should this be -ESOMETHING?
> >>>>>>>>>>>
> >>>>>>>>>>>> +	fn = symbol_get(vfio_external_user_iommu_id);
> >>>>>>>>>>>> +	if (!fn)
> >>>>>>>>>>>> +		return ret;
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +	ret = fn(vfio_group);
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +	symbol_put(vfio_external_user_iommu_id);
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +	return ret;
> >>>>>>>>>>>> +}
> >>>>>>>>>>>> +
> >>>>>>>>>>>>  static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
> >>>>>>>>>>>>  {
> >>>>>>>>>>>>  	long (*fn)(struct vfio_group *, unsigned long);
> >>>>>>>>>>>> @@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct
> >>>>>>>> kvm_device *dev)
> >>>>>>>>>>>>  	mutex_unlock(&kv->lock);
> >>>>>>>>>>>>  }
> >>>>>>>>>>>>
> >>>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>>>>>>>>>> +static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
> >>>>>>>>>>>> +		struct vfio_group *vfio_group)
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Shouldn't this go in the same patch that introduced the attach
> >>>>>>>>>>> function?
> >>>>>>>>>>
> >>>>>>>>>> Having less patches which touch different maintainers areas is better. I
> >>>>>>>>>> cannot avoid touching both PPC KVM and VFIO in this patch but I can in
> >>>>>>>>>> "[PATCH kernel 6/9] KVM: PPC: Associate IOMMU group with guest view of TCE
> >>>>>>>>>> table".
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>> +{
> >>>>>>>>>>>> +	int group_id;
> >>>>>>>>>>>> +	struct iommu_group *grp;
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +	group_id = kvm_vfio_external_user_iommu_id(vfio_group);
> >>>>>>>>>>>> +	grp = iommu_group_get_by_id(group_id);
> >>>>>>>>>>>> +	if (grp) {
> >>>>>>>>>>>> +		kvm_spapr_tce_detach_iommu_group(kvm, grp);
> >>>>>>>>>>>> +		iommu_group_put(grp);
> >>>>>>>>>>>> +	}
> >>>>>>>>>>>> +}
> >>>>>>>>>>>> +#endif
> >>>>>>>>>>>> +
> >>>>>>>>>>>>  static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >>>>>>>>>>>>  {
> >>>>>>>>>>>>  	struct kvm_vfio *kv = dev->private;
> >>>>>>>>>>>> @@ -186,6 +222,10 @@ static int kvm_vfio_set_group(struct kvm_device
> >>>>>>>> *dev, long attr, u64 arg)
> >>>>>>>>>>>>  				continue;
> >>>>>>>>>>>>
> >>>>>>>>>>>>  			list_del(&kvg->node);
> >>>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>>>>>>>>>
> >>>>>>>>>>> Better to make a no-op version of the call than have to #ifdef at the
> >>>>>>>>>>> callsite.
> >>>>>>>>>>
> >>>>>>>>>> It is questionable. A x86 reader may decide that
> >>>>>>>>>> KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN is implemented for x86 and get
> >>>>>>>>>> confused.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>> +			kvm_vfio_spapr_detach_iommu_group(dev->kvm,
> >>>>>>>>>>>> +					kvg->vfio_group);
> >>>>>>>>>>>> +#endif
> >>>>>>>>>>>>  			kvm_vfio_group_put_external_user(kvg->vfio_group);
> >>>>>>>>>>>>  			kfree(kvg);
> >>>>>>>>>>>>  			ret = 0;
> >>>>>>>>>>>> @@ -201,6 +241,69 @@ static int kvm_vfio_set_group(struct kvm_device
> >>>>>>>> *dev, long attr, u64 arg)
> >>>>>>>>>>>>  		kvm_vfio_update_coherency(dev);
> >>>>>>>>>>>>
> >>>>>>>>>>>>  		return ret;
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>>>>>>>>>> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: {
> >>>>>>>>>>>> +		struct kvm_vfio_spapr_tce_liobn param;
> >>>>>>>>>>>> +		unsigned long minsz;
> >>>>>>>>>>>> +		struct kvm_vfio *kv = dev->private;
> >>>>>>>>>>>> +		struct vfio_group *vfio_group;
> >>>>>>>>>>>> +		struct kvm_vfio_group *kvg;
> >>>>>>>>>>>> +		struct fd f;
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +		minsz = offsetofend(struct kvm_vfio_spapr_tce_liobn,
> >>>>>>>>>>>> +				start_addr);
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +		if (copy_from_user(&param, (void __user *)arg, minsz))
> >>>>>>>>>>>> +			return -EFAULT;
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +		if (param.argsz < minsz)
> >>>>>>>>>>>> +			return -EINVAL;
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +		f = fdget(param.fd);
> >>>>>>>>>>>> +		if (!f.file)
> >>>>>>>>>>>> +			return -EBADF;
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
> >>>>>>>>>>>> +		fdput(f);
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +		if (IS_ERR(vfio_group))
> >>>>>>>>>>>> +			return PTR_ERR(vfio_group);
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +		ret = -ENOENT;
> >>>>>>>>>>>
> >>>>>>>>>>> Shouldn't there be some runtime test for the type of the IOMMU?  It's
> >>>>>>>>>>> possible a kernel could be built for a platform supporting multiple
> >>>>>>>>>>> IOMMU types.
> >>>>>>>>>>
> >>>>>>>>>> Well, it may make sense but I do not know how to test that. The IOMMU type is a
> >>>>>>>>>> VFIO container property, not a group property, and here (KVM) we only have
> >>>>>>>>>> groups.
> >>>>>>>>>
> >>>>>>>>> Which, as mentioned previously, is broken.
> >>>>>>>>
> >>>>>>>> Which I am failing to follow you on.
> >>>>>>>>
> >>>>>>>> What I am trying to achieve here is pretty much referencing a group so it
> >>>>>>>> cannot be reused. Plus LIOBNs.
> >>>>>>>
> >>>>>>> "Plus LIOBNs" is not a trivial change.  You are establishing a linkage
> >>>>>>> from LIOBNs to groups.  But that doesn't make sense; if mapping in one
> >>>>>>> (guest) LIOBN affects a group it must affect all groups in the
> >>>>>>> container.  i.e. LIOBN->container is the natural mapping, *not* LIOBN
> >>>>>>> to group.
> >>>>>>
> >>>>>> I can see your point but I don't see how to proceed now; I'm totally stuck.
> >>>>>> Pass a container fd and then implement a new API to lock containers somehow and
> >>>>>
> >>>>> I'm not really understanding what the question is about locking containers.
> >>>>>
> >>>>>> enumerate groups when updating the TCE table (including real mode)?
> >>>>>
> >>>>> Why do you need to enumerate groups?  The groups within the container
> >>>>> share a TCE table, so can't you just update that once?
> >>>>
> >>>> Well, they share a TCE table but they do not share the TCE Kill (TCE cache
> >>>> invalidate) register address; that is still per PE. It does not matter
> >>>> here (pnv_pci_link_table_and_group() handles that), I just mentioned it to
> >>>> complete the picture.
> >>>
> >>> True, you'll need to enumerate the groups for invalidates.  But you
> >>> need that already, right?
> >>>
> >>>>>> Plus a new API for when we remove a group from a container as the result of
> >>>>>> guest PCI hot unplug?
> >>>>>
> >>>>> I assume you mean a kernel internal API, since it shouldn't need
> >>>>> anything else visible to userspace.  Won't this happen naturally?
> >>>>> When the group is removed from the container, it will get its own TCE
> >>>>> table instead of the previously shared one.
> >>>>>
> >>>>>>>> Passing a container fd does not make much
> >>>>>>>> sense here as the VFIO device would walk through groups, reference them, and
> >>>>>>>> that is it; there is no locking on VFIO containers and so far there has been no
> >>>>>>>> need to teach KVM about containers.
> >>>>>>>>
> >>>>>>>> What do I miss now?
> >>>>>>>
> >>>>>>> Referencing the groups is essentially just a useful side effect.  The
> >>>>>>> important functionality is informing VFIO of the guest LIOBNs; and
> >>>>>>> LIOBNs map to containers, not groups.
> >>>>>>
> >>>>>> No. One LIOBN maps to one KVM-allocated TCE table, not a container. There
> >>>>>> can be one, many, or no containers per LIOBN.
> >>>>>
> >>>>> Ah, true.
> >>>>
> >>>> So I need to add a new kernel API for KVM to get the table(s) from VFIO
> >>>> container(s), and invent some locking mechanism to prevent the table(s) (or
> >>>> associated container(s)) from going away, like
> >>>> vfio_group_get_external_user()/vfio_group_put_external_user() but for
> >>>> containers. Correct?
> >>>
> >>> Well, the container is attached to an fd, so if you get a reference on
> >>> the file* that should do it.
> >>
> >> I am still trying to think of how to implement this suggestion.
> >>
> >> I need a way to tell KVM about IOMMU groups. The vfio-pci driver is not the
> >> right interface as it knows nothing about KVM. There is the VFIO-KVM device
> >> but it has no idea about containers.
> >>
> >> So I have to:
> >>
> >> Whenever a container is created or removed, notify the VFIO-KVM device by
> >> passing it a container fd. OK.
> > 
> > Actually, I don't think the vfio-kvm device is really useful here.  It
> > was designed as a hack for a particular problem on x86.  It certainly
> > could be extended to cover the information we need here, but I don't
> > think it's a particularly natural way of doing so.
> > 
> > The problem is that conveying the information from the vfio-kvm device
> > to the real mode H_PUT_TCE handler, which is what really needs it,
> > isn't particularly simpler than conveying that information from
> > anywhere else.
> > 
> >> Then the VFIO-KVM device needs to tell KVM which iommu_table belongs to
> >> which LIOBN so the real mode handlers can do the job. The real mode TCE
> >> handlers get a LIOBN, find the guest view table and update it. Now I want to
> >> update the hardware table, which is the iommu_table attached to the LIOBN.
> >>
> >> I did pass an IOMMU group fd to the VFIO-KVM device. You suggested a container fd.
> >>
> >> Now the VFIO-KVM device needs to extract iommu_tables from the container.
> >> These iommu_table pointers are stored in "struct tce_container", which is
> >> local to drivers/vfio/vfio_iommu_spapr_tce.c and not exported in any way, so
> >> I cannot use it from KVM.
> >>
> >> The other way to go would be adding an API to VFIO to enumerate the IOMMU
> >> groups in a container and use the iommu_table pointers stored in the
> >> iommu_table_group of each group (in fact the very first group will be enough
> >> as multiple groups in a container share the table). Adding
> >> vfio_container_get_groups() when only the first one is needed is quite
> >> tricky in terms of maintainer approvals.
> >>
> >> So what would be the right course of action? Thanks.
> > 
> > So, from the user side, you need to be able to bind a VFIO backend to
> > a particular guest IOMMU.  This suggests a new ioctl() used in
> > conjunction with KVM_CREATE_SPAPR_TCE.  Let's call it
> > KVM_SPAPR_TCE_BIND_VFIO.  You'd use KVM_CREATE_SPAPR_TCE to make the
> > kernel aware of a LIOBN in the first place, then use
> > KVM_SPAPR_TCE_BIND_VFIO to associate a VFIO container with that LIOBN.
> > So it would be a VM ioctl, taking a LIOBN and a container fd.  You'd
> > need a capability to go with it, and some way to unbind as well.
> 
> This is what I had in the first place some years ago, and after 5-6 reviews
> I was told that there is a VFIO-KVM device and I should use it.

I suspect that's because Alex didn't fully understand what we required
here.  The primary thing is that we need to link guest-visible
LIOBNs to host-visible VFIO containers.  Your comments tended to
emphasise the fact of giving KVM a list of VFIO groups, which is a
side effect of the above but not really the main point; it does,
however, sound misleadingly like what the kvm-vfio device already does.
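
To restate the intended userspace flow as code, here is a very rough
sketch; KVM_CREATE_SPAPR_TCE_64 is the existing ioctl, while
KVM_SPAPR_TCE_BIND_VFIO and its structure are only the hypothetical
proposal from earlier in the thread:

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Illustrative only: KVM_SPAPR_TCE_BIND_VFIO does not exist. */
static int bind_liobn_to_container(int vmfd, __u32 liobn, int container_fd)
{
	struct kvm_create_spapr_tce_64 create = {
		.liobn = liobn,
		.page_shift = 12,		/* 4K IOMMU pages */
		.offset = 0,
		.size = (1ULL << 30) >> 12,	/* e.g. a 1G DMA window */
	};
	int tablefd = ioctl(vmfd, KVM_CREATE_SPAPR_TCE_64, &create);

	if (tablefd < 0)
		return tablefd;

	struct kvm_spapr_tce_bind_vfio bind = {
		.argsz = sizeof(bind),
		.liobn = liobn,
		.container_fd = container_fd,
	};
	return ioctl(vmfd, KVM_SPAPR_TCE_BIND_VFIO, &bind);
}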


> > To implement that, the ioctl() would need to use a new vfio (kernel
> > internal) interface - which can be specific to only the spapr TCE
> > type.  That would take the container fd, and return the list of
> > iommu_tables in some form or other (or various error conditions,
> > obviously).
> > 
> > So, when qemu creates the PHB, it uses KVM_CREATE_SPAPR_TCE to inform
> > the kernel of the LIOBN.  When the VFIO device is attached to the PHB,
> > it uses KVM_SPAPR_TCE_BIND_VFIO to connect the VFIO container to the
> > LIOBN.  The ioctl() implementation uses the new special interface into
> > the spapr_tce vfio backend to get the list of iommu tables, which it
> > stores in some private format.
> 
> Getting just a list of IOMMU groups would do too. Pushing such an API through
> is a problem; this is how I ended up with the current design.
> 
> 
> > The H_PUT_TCE implementation uses that
> > stored list of iommu tables to translate H_PUT_TCEs from the guest
> > into actions on the host IOMMU tables.
> > 
> > And, yes, the special interface to the spapr TCE vfio back end is kind
> > of a hack.  That's what you get when you need to link to separate
> > kernel subsystems for performance reasons.
> 
> One can argue whether it is a hack, but how is this hack better than the
> existing approach? :)

Because this hack is only on a kernel-internal interface, rather than
impacting the user-visible interface.

> Alex, could you please comment on David's suggestion? Thanks!

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH kernel 9/9] KVM: PPC: VFIO device: support SPAPR TCE
@ 2016-06-15  4:43                             ` David Gibson
  0 siblings, 0 replies; 92+ messages in thread
From: David Gibson @ 2016-06-15  4:43 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Paul Mackerras, Alex Williamson, kvm-ppc, kvm

[-- Attachment #1: Type: text/plain, Size: 22788 bytes --]

On Tue, Jun 14, 2016 at 01:30:53PM +1000, Alexey Kardashevskiy wrote:
> On 10/06/16 16:50, David Gibson wrote:
> > On Thu, Jun 09, 2016 at 04:47:59PM +1000, Alexey Kardashevskiy wrote:
> >> On 23/03/16 14:03, David Gibson wrote:
> >>> On Tue, Mar 22, 2016 at 11:34:55AM +1100, Alexey Kardashevskiy wrote:
> >>>> Uff, lost cc: list. Added back. Some comments below.
> >>>>
> >>>>
> >>>> On 03/21/2016 04:19 PM, David Gibson wrote:
> >>>>> On Fri, Mar 18, 2016 at 11:12:26PM +1100, Alexey Kardashevskiy wrote:
> >>>>>> On March 15, 2016 17:29:26 David Gibson <david@gibson.dropbear.id.au> wrote:
> >>>>>>
> >>>>>>> On Fri, Mar 11, 2016 at 10:09:50AM +1100, Alexey Kardashevskiy wrote:
> >>>>>>>> On 03/10/2016 04:21 PM, David Gibson wrote:
> >>>>>>>>> On Wed, Mar 09, 2016 at 08:20:12PM +1100, Alexey Kardashevskiy wrote:
> >>>>>>>>>> On 03/09/2016 04:45 PM, David Gibson wrote:
> >>>>>>>>>>> On Mon, Mar 07, 2016 at 02:41:17PM +1100, Alexey Kardashevskiy wrote:
> >>>>>>>>>>>> sPAPR TCE IOMMU is para-virtualized and the guest does map/unmap
> >>>>>>>>>>>> via hypercalls which take a logical bus id (LIOBN) as a target IOMMU
> >>>>>>>>>>>> identifier. LIOBNs are made up, advertised to guest systems and
> >>>>>>>>>>>> linked to IOMMU groups by the user space.
> >>>>>>>>>>>> In order to enable acceleration for IOMMU operations in KVM, we need
> >>>>>>>>>>>> to tell KVM the information about the LIOBN-to-group mapping.
> >>>>>>>>>>>>
> >>>>>>>>>>>> For that, a new KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN parameter
> >>>>>>>>>>>> is added which accepts:
> >>>>>>>>>>>> - a VFIO group fd and IO base address to find the actual hardware
> >>>>>>>>>>>> TCE table;
> >>>>>>>>>>>> - a LIOBN to assign to the found table.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Before notifying KVM about new link, this check the group for being
> >>>>>>>>>>>> registered with KVM device in order to release them at unexpected KVM
> >>>>>>>>>>>> finish.
> >>>>>>>>>>>>
> >>>>>>>>>>>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> >>>>>>>>>>>> space.
> >>>>>>>>>>>>
> >>>>>>>>>>>> While we are here, this also fixes VFIO KVM device compiling to let it
> >>>>>>>>>>>> link to a KVM module.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>>>>>>>>>> ---
> >>>>>>>>>>>>  Documentation/virtual/kvm/devices/vfio.txt |  21 +++++-
> >>>>>>>>>>>>  arch/powerpc/kvm/Kconfig                   |   1 +
> >>>>>>>>>>>>  arch/powerpc/kvm/Makefile                  |   5 +-
> >>>>>>>>>>>>  arch/powerpc/kvm/powerpc.c                 |   1 +
> >>>>>>>>>>>>  include/uapi/linux/kvm.h                   |   9 +++
> >>>>>>>>>>>>  virt/kvm/vfio.c                            | 106
> >>>>>>>> +++++++++++++++++++++++++++++
> >>>>>>>>>>>>  6 files changed, 140 insertions(+), 3 deletions(-)
> >>>>>>>>>>>>
> >>>>>>>>>>>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt
> >>>>>>>> b/Documentation/virtual/kvm/devices/vfio.txt
> >>>>>>>>>>>> index ef51740..c0d3eb7 100644
> >>>>>>>>>>>> --- a/Documentation/virtual/kvm/devices/vfio.txt
> >>>>>>>>>>>> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> >>>>>>>>>>>> @@ -16,7 +16,24 @@ Groups:
> >>>>>>>>>>>>
> >>>>>>>>>>>>  KVM_DEV_VFIO_GROUP attributes:
> >>>>>>>>>>>>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> >>>>>>>>>>>> +	kvm_device_attr.addr points to an int32_t file descriptor
> >>>>>>>>>>>> +	for the VFIO group.
> >>>>>>>>>>>
> >>>>>>>>>>> AFAICT these changes are accurate for VFIO as it is already, in which
> >>>>>>>>>>> case it might be clearer to put them in a separate patch.
> >>>>>>>>>>>
> >>>>>>>>>>>>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device
> >>>>>>>> tracking
> >>>>>>>>>>>> +	kvm_device_attr.addr points to an int32_t file descriptor
> >>>>>>>>>>>> +	for the VFIO group.
> >>>>>>>>>>>>
> >>>>>>>>>>>> -For each, kvm_device_attr.addr points to an int32_t file descriptor
> >>>>>>>>>>>> -for the VFIO group.
> >>>>>>>>>>>> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: sets a liobn for a VFIO group
> >>>>>>>>>>>> +	kvm_device_attr.addr points to a struct:
> >>>>>>>>>>>> +		struct kvm_vfio_spapr_tce_liobn {
> >>>>>>>>>>>> +			__u32	argsz;
> >>>>>>>>>>>> +			__s32	fd;
> >>>>>>>>>>>> +			__u32	liobn;
> >>>>>>>>>>>> +			__u8	pad[4];
> >>>>>>>>>>>> +			__u64	start_addr;
> >>>>>>>>>>>> +		};
> >>>>>>>>>>>> +		where
> >>>>>>>>>>>> +		@argsz is the size of kvm_vfio_spapr_tce_liobn;
> >>>>>>>>>>>> +		@fd is a file descriptor for a VFIO group;
> >>>>>>>>>>>> +		@liobn is a logical bus id to be associated with the group;
> >>>>>>>>>>>> +		@start_addr is a DMA window offset on the IO (PCI) bus
> >>>>>>>>>>>
> >>>>>>>>>>> For the cause of DDW and multiple windows, I'm assuming you can call
> >>>>>>>>>>> this multiple times with different LIOBNs and the same IOMMU group?
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Yes. It is called twice per each group (when DDW is activated) - for 32bit
> >>>>>>>>>> and 64bit windows, this is why @start_addr is there.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>>> diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
> >>>>>>>>>>>> index 1059846..dfa3488 100644
> >>>>>>>>>>>> --- a/arch/powerpc/kvm/Kconfig
> >>>>>>>>>>>> +++ b/arch/powerpc/kvm/Kconfig
> >>>>>>>>>>>> @@ -65,6 +65,7 @@ config KVM_BOOK3S_64
> >>>>>>>>>>>>  	select KVM
> >>>>>>>>>>>>  	select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
> >>>>>>>>>>>>  	select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
> >>>>>>>>>>>> +	select KVM_VFIO if VFIO
> >>>>>>>>>>>>  	---help---
> >>>>>>>>>>>>  	  Support running unmodified book3s_64 and book3s_32 guest kernels
> >>>>>>>>>>>>  	  in virtual machines on book3s_64 host processors.
> >>>>>>>>>>>> diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
> >>>>>>>>>>>> index 7f7b6d8..71f577c 100644
> >>>>>>>>>>>> --- a/arch/powerpc/kvm/Makefile
> >>>>>>>>>>>> +++ b/arch/powerpc/kvm/Makefile
> >>>>>>>>>>>> @@ -8,7 +8,7 @@ ccflags-y := -Ivirt/kvm -Iarch/powerpc/kvm
> >>>>>>>>>>>>  KVM := ../../../virt/kvm
> >>>>>>>>>>>>
> >>>>>>>>>>>>  common-objs-y = $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
> >>>>>>>>>>>> -		$(KVM)/eventfd.o $(KVM)/vfio.o
> >>>>>>>>>>>> +		$(KVM)/eventfd.o
> >>>>>>>>>>>
> >>>>>>>>>>> Please don't disable the VFIO device for the non-book3s case.  I added
> >>>>>>>>>>> it (even though it didn't do anything until now) so that libvirt
> >>>>>>>>>>> wouldn't choke when it finds it's not available.  Obviously the new
> >>>>>>>>>>> ioctl needs to be only for the right IOMMU setup, but the device
> >>>>>>>>>>> itself should be available always.
> >>>>>>>>>>
> >>>>>>>>>> Ah. Ok, I'll fix this. I just wanted to be able to compile kvm as a module.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>>>  CFLAGS_e500_mmu.o := -I.
> >>>>>>>>>>>>  CFLAGS_e500_mmu_host.o := -I.
> >>>>>>>>>>>> @@ -87,6 +87,9 @@ endif
> >>>>>>>>>>>>  kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \
> >>>>>>>>>>>>  	book3s_xics.o
> >>>>>>>>>>>>
> >>>>>>>>>>>> +kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) += \
> >>>>>>>>>>>> +	$(KVM)/vfio.o \
> >>>>>>>>>>>> +
> >>>>>>>>>>>>  kvm-book3s_64-module-objs += \
> >>>>>>>>>>>>  	$(KVM)/kvm_main.o \
> >>>>>>>>>>>>  	$(KVM)/eventfd.o \
> >>>>>>>>>>>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> >>>>>>>>>>>> index 19aa59b..63f188d 100644
> >>>>>>>>>>>> --- a/arch/powerpc/kvm/powerpc.c
> >>>>>>>>>>>> +++ b/arch/powerpc/kvm/powerpc.c
> >>>>>>>>>>>> @@ -521,6 +521,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> >>>>>>>>>>>>  #ifdef CONFIG_PPC_BOOK3S_64
> >>>>>>>>>>>>  	case KVM_CAP_SPAPR_TCE:
> >>>>>>>>>>>>  	case KVM_CAP_SPAPR_TCE_64:
> >>>>>>>>>>>> +	case KVM_CAP_SPAPR_TCE_VFIO:
> >>>>>>>>>>>>  	case KVM_CAP_PPC_ALLOC_HTAB:
> >>>>>>>>>>>>  	case KVM_CAP_PPC_RTAS:
> >>>>>>>>>>>>  	case KVM_CAP_PPC_FIXUP_HCALL:
> >>>>>>>>>>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >>>>>>>>>>>> index 080ffbf..f1abbea 100644
> >>>>>>>>>>>> --- a/include/uapi/linux/kvm.h
> >>>>>>>>>>>> +++ b/include/uapi/linux/kvm.h
> >>>>>>>>>>>> @@ -1056,6 +1056,7 @@ struct kvm_device_attr {
> >>>>>>>>>>>>  #define  KVM_DEV_VFIO_GROUP			1
> >>>>>>>>>>>>  #define   KVM_DEV_VFIO_GROUP_ADD			1
> >>>>>>>>>>>>  #define   KVM_DEV_VFIO_GROUP_DEL			2
> >>>>>>>>>>>> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN	3
> >>>>>>>>>>>>
> >>>>>>>>>>>>  enum kvm_device_type {
> >>>>>>>>>>>>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
> >>>>>>>>>>>> @@ -1075,6 +1076,14 @@ enum kvm_device_type {
> >>>>>>>>>>>>  	KVM_DEV_TYPE_MAX,
> >>>>>>>>>>>>  };
> >>>>>>>>>>>>
> >>>>>>>>>>>> +struct kvm_vfio_spapr_tce_liobn {
> >>>>>>>>>>>> +	__u32	argsz;
> >>>>>>>>>>>> +	__s32	fd;
> >>>>>>>>>>>> +	__u32	liobn;
> >>>>>>>>>>>> +	__u8	pad[4];
> >>>>>>>>>>>> +	__u64	start_addr;
> >>>>>>>>>>>> +};
> >>>>>>>>>>>> +
> >>>>>>>>>>>>  /*
> >>>>>>>>>>>>   * ioctls for VM fds
> >>>>>>>>>>>>   */
> >>>>>>>>>>>> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> >>>>>>>>>>>> index 1dd087d..87c771e 100644
> >>>>>>>>>>>> --- a/virt/kvm/vfio.c
> >>>>>>>>>>>> +++ b/virt/kvm/vfio.c
> >>>>>>>>>>>> @@ -20,6 +20,10 @@
> >>>>>>>>>>>>  #include <linux/vfio.h>
> >>>>>>>>>>>>  #include "vfio.h"
> >>>>>>>>>>>>
> >>>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>>>>>>>>>> +#include <asm/kvm_ppc.h>
> >>>>>>>>>>>> +#endif
> >>>>>>>>>>>> +
> >>>>>>>>>>>>  struct kvm_vfio_group {
> >>>>>>>>>>>>  	struct list_head node;
> >>>>>>>>>>>>  	struct vfio_group *vfio_group;
> >>>>>>>>>>>> @@ -60,6 +64,22 @@ static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
> >>>>>>>>>>>>  	symbol_put(vfio_group_put_external_user);
> >>>>>>>>>>>>  }
> >>>>>>>>>>>>
> >>>>>>>>>>>> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> >>>>>>>>>>>> +{
> >>>>>>>>>>>> +	int (*fn)(struct vfio_group *);
> >>>>>>>>>>>> +	int ret = -1;
> >>>>>>>>>>>
> >>>>>>>>>>> Should this be -ESOMETHING?
> >>>>>>>>>>>
> >>>>>>>>>>>> +	fn = symbol_get(vfio_external_user_iommu_id);
> >>>>>>>>>>>> +	if (!fn)
> >>>>>>>>>>>> +		return ret;
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +	ret = fn(vfio_group);
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +	symbol_put(vfio_external_user_iommu_id);
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +	return ret;
> >>>>>>>>>>>> +}
> >>>>>>>>>>>> +
> >>>>>>>>>>>>  static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
> >>>>>>>>>>>>  {
> >>>>>>>>>>>>  	long (*fn)(struct vfio_group *, unsigned long);
> >>>>>>>>>>>> @@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct kvm_device *dev)
> >>>>>>>>>>>>  	mutex_unlock(&kv->lock);
> >>>>>>>>>>>>  }
> >>>>>>>>>>>>
> >>>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>>>>>>>>>> +static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
> >>>>>>>>>>>> +		struct vfio_group *vfio_group)
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Shouldn't this go in the same patch that introduced the attach
> >>>>>>>>>>> function?
> >>>>>>>>>>
> >>>>>>>>>> Having fewer patches which touch different maintainers' areas is better. I
> >>>>>>>>>> cannot avoid touching both PPC KVM and VFIO in this patch but I can in
> >>>>>>>>>> "[PATCH kernel 6/9] KVM: PPC: Associate IOMMU group with guest view of TCE
> >>>>>>>>>> table".
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>> +{
> >>>>>>>>>>>> +	int group_id;
> >>>>>>>>>>>> +	struct iommu_group *grp;
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +	group_id = kvm_vfio_external_user_iommu_id(vfio_group);
> >>>>>>>>>>>> +	grp = iommu_group_get_by_id(group_id);
> >>>>>>>>>>>> +	if (grp) {
> >>>>>>>>>>>> +		kvm_spapr_tce_detach_iommu_group(kvm, grp);
> >>>>>>>>>>>> +		iommu_group_put(grp);
> >>>>>>>>>>>> +	}
> >>>>>>>>>>>> +}
> >>>>>>>>>>>> +#endif
> >>>>>>>>>>>> +
> >>>>>>>>>>>>  static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >>>>>>>>>>>>  {
> >>>>>>>>>>>>  	struct kvm_vfio *kv = dev->private;
> >>>>>>>>>>>> @@ -186,6 +222,10 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >>>>>>>>>>>>  				continue;
> >>>>>>>>>>>>
> >>>>>>>>>>>>  			list_del(&kvg->node);
> >>>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>>>>>>>>>
> >>>>>>>>>>> Better to make a no-op version of the call than have to #ifdef at the
> >>>>>>>>>>> callsite.
> >>>>>>>>>>
> >>>>>>>>>> It is questionable. An x86 reader may decide that
> >>>>>>>>>> KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN is implemented for x86 and get
> >>>>>>>>>> confused.
> >>>>>>>>>>
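> >>>>>>>>>> (For reference, the no-op variant suggested above would just add an
> >>>>>>>>>> #else branch to the existing #ifdef in virt/kvm/vfio.c - a sketch:
> >>>>>>>>>>
> >>>>>>>>>> 	#else
> >>>>>>>>>> 	static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
> >>>>>>>>>> 			struct vfio_group *vfio_group)
> >>>>>>>>>> 	{
> >>>>>>>>>> 	}
> >>>>>>>>>> 	#endif
> >>>>>>>>>>
> >>>>>>>>>> which would keep the callsite free of #ifdefs, at the cost of the
> >>>>>>>>>> ambiguity mentioned above.)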
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>> +			kvm_vfio_spapr_detach_iommu_group(dev->kvm,
> >>>>>>>>>>>> +					kvg->vfio_group);
> >>>>>>>>>>>> +#endif
> >>>>>>>>>>>>  			kvm_vfio_group_put_external_user(kvg->vfio_group);
> >>>>>>>>>>>>  			kfree(kvg);
> >>>>>>>>>>>>  			ret = 0;
> >>>>>>>>>>>> @@ -201,6 +241,69 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >>>>>>>>>>>>  		kvm_vfio_update_coherency(dev);
> >>>>>>>>>>>>
> >>>>>>>>>>>>  		return ret;
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>>>>>>>>>> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: {
> >>>>>>>>>>>> +		struct kvm_vfio_spapr_tce_liobn param;
> >>>>>>>>>>>> +		unsigned long minsz;
> >>>>>>>>>>>> +		struct kvm_vfio *kv = dev->private;
> >>>>>>>>>>>> +		struct vfio_group *vfio_group;
> >>>>>>>>>>>> +		struct kvm_vfio_group *kvg;
> >>>>>>>>>>>> +		struct fd f;
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +		minsz = offsetofend(struct kvm_vfio_spapr_tce_liobn,
> >>>>>>>>>>>> +				start_addr);
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +		if (copy_from_user(&param, (void __user *)arg, minsz))
> >>>>>>>>>>>> +			return -EFAULT;
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +		if (param.argsz < minsz)
> >>>>>>>>>>>> +			return -EINVAL;
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +		f = fdget(param.fd);
> >>>>>>>>>>>> +		if (!f.file)
> >>>>>>>>>>>> +			return -EBADF;
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
> >>>>>>>>>>>> +		fdput(f);
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +		if (IS_ERR(vfio_group))
> >>>>>>>>>>>> +			return PTR_ERR(vfio_group);
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +		ret = -ENOENT;
> >>>>>>>>>>>
> >>>>>>>>>>> Shouldn't there be some runtime test for the type of the IOMMU?  It's
> >>>>>>>>>>> possible a kernel could be built for a platform supporting multiple
> >>>>>>>>>>> IOMMU types.
> >>>>>>>>>>
> >>>>>>>>>> Well, it may make sense but I do not know how to test that. The IOMMU
> >>>>>>>>>> type is a VFIO container property, not a group property, and here (in
> >>>>>>>>>> KVM) we only have groups.
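> >>>>>>>>>>
> >>>>>>>>>> (From userspace the type is indeed a container-level query, e.g.:
> >>>>>>>>>>
> >>>>>>>>>> 	if (ioctl(container_fd, VFIO_CHECK_EXTENSION,
> >>>>>>>>>> 			VFIO_SPAPR_TCE_IOMMU) > 0)
> >>>>>>>>>> 		/* the sPAPR TCE backend is present */;
> >>>>>>>>>>
> >>>>>>>>>> but KVM never sees the container fd in the current design.)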
> >>>>>>>>>
> >>>>>>>>> Which, as mentioned previously, is broken.
> >>>>>>>>
> >>>>>>>> This is where I am failing to follow you.
> >>>>>>>>
> >>>>>>>> What I am trying to achieve here is pretty much referencing a group so it
> >>>>>>>> cannot be reused. Plus LIOBNs.
> >>>>>>>
> >>>>>>> "Plus LIOBNs" is not a trivial change.  You are establishing a linkage
> >>>>>>> from LIOBNs to groups.  But that doesn't make sense; if mapping in one
> >>>>>>> (guest) LIOBN affects a group it must affect all groups in the
> >>>>>>> container.  i.e. LIOBN->container is the natural mapping, *not* LIOBN
> >>>>>>> to group.
> >>>>>>
> >>>>>> I can see your point but I don't see how to proceed now; I'm totally stuck.
> >>>>>> Pass a container fd and then implement a new API to lock containers somehow and
> >>>>>
> >>>>> I'm not really understanding the question about locking containers.
> >>>>>
> >>>>>> enumerate groups when updating the TCE table (including real mode)?
> >>>>>
> >>>>> Why do you need to enumerate groups?  The groups within the container
> >>>>> share a TCE table, so can't you just update that once?
> >>>>
> >>>> Well, they share a TCE table but they do not share the TCE Kill (TCE
> >>>> cache invalidate) register address - that is still per PE, but it does
> >>>> not matter here (pnv_pci_link_table_and_group() takes care of that);
> >>>> just mentioned to complete the picture.
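> >>>>
> >>>> (The invalidate side can already walk the groups linked to a table the
> >>>> way the existing powernv code does, roughly:
> >>>>
> >>>> 	struct iommu_table_group_link *tgl;
> >>>>
> >>>> 	list_for_each_entry_rcu(tgl, &tbl->it_group_list, next) {
> >>>> 		/* issue the per-PE TCE Kill via tgl->table_group */
> >>>> 	}
> >>>>
> >>>> so nothing new should be needed there.)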
> >>>
> >>> True, you'll need to enumerate the groups for invalidates.  But you
> >>> need that already, right?
> >>>
> >>>>>> Plus a new API for when we remove a group from a container as the result of guest
> >>>>>> PCI hot unplug?
> >>>>>
> >>>>> I assume you mean a kernel internal API, since it shouldn't need
> >>>>> anything else visible to userspace.  Won't this happen naturally?
> >>>>> When the group is removed from the container, it will get its own TCE
> >>>>> table instead of the previously shared one.
> >>>>>
> >>>>>>>> Passing a container fd does not make much
> >>>>>>>> sense here as the VFIO device would walk through the groups, reference
> >>>>>>>> them and that is it; there is no locking on VFIO containers and so far
> >>>>>>>> there was no need to teach KVM about containers.
> >>>>>>>>
> >>>>>>>> What do I miss now?
> >>>>>>>
> >>>>>>> Referencing the groups is essentially just a useful side effect.  The
> >>>>>>> important functionality is informing VFIO of the guest LIOBNs; and
> >>>>>>> LIOBNs map to containers, not groups.
> >>>>>>
> >>>>>> No. One liobn maps to one KVM-allocated TCE table, not a container. There
> >>>>>> can be one, many, or no containers per liobn.
> >>>>>
> >>>>> Ah, true.
> >>>>
> >>>> So I need to add a new kernel API for KVM to get table(s) from VFIO
> >>>> container(s), and invent some locking mechanism to prevent the table(s)
> >>>> (or associated container(s)) from going away, like
> >>>> vfio_group_get_external_user/vfio_group_put_external_user but for
> >>>> containers. Correct?
> >>>
> >>> Well, the container is attached to an fd, so if you get a reference on
> >>> the file* that should do it.
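> >>>
> >>> Something like this sketch, with error paths omitted:
> >>>
> >>> 	struct fd f = fdget(container_fd);
> >>> 	struct file *filp;
> >>>
> >>> 	if (!f.file)
> >>> 		return -EBADF;
> >>> 	filp = f.file;
> >>> 	get_file(filp);		/* hold the container while KVM uses it */
> >>> 	fdput(f);
> >>> 	/* ... the container itself sits in filp->private_data ... */
> >>> 	fput(filp);		/* drop on unbind or device release */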
> >>
> >> I am still trying to think of how to implement this suggestion.
> >>
> >> I need a way to tell KVM about IOMMU groups. The vfio-pci driver is not the
> >> right interface as it knows nothing about KVM. There is the VFIO-KVM device
> >> but it has no idea about containers.
> >>
> >> So I have to:
> >>
> >> Whenever a container is created or removed, notify the VFIO-KVM device by
> >> passing it a container fd. OK.
> > 
> > Actually, I don't think the vfio-kvm device is really useful here.  It
> > was designed as a hack for a particular problem on x86.  It certainly
> > could be extended to cover the information we need here, but I don't
> > think it's a particularly natural way of doing so.
> > 
> > The problem is that conveying the information from the vfio-kvm device
> > to the real mode H_PUT_TCE handler, which is what really needs it,
> > isn't particularly simpler than conveying that information from
> > anywhere else.
> > 
> >> Then the VFIO-KVM device needs to tell KVM which iommu_table belongs to
> >> which LIOBN so the real mode handlers can do the job. The real mode TCE
> >> handlers get a LIOBN, find the guest view table and update it. Now I want
> >> to update the hardware table, which is the iommu_table attached to the LIOBN.
> >>
> >> I did pass an IOMMU group fd to the VFIO-KVM device. You suggested a container fd.
> >>
> >> Now the VFIO-KVM device needs to extract the iommu_tables from the container.
> >> These iommu_table pointers are stored in "struct tce_container", which is
> >> local to drivers/vfio/vfio_iommu_spapr_tce.c and not exported in any way,
> >> so I cannot simply export and use that.
> >>
> >> The other way to go would be adding an API to VFIO to enumerate the IOMMU
> >> groups in a container and using the iommu_table pointers stored in the
> >> iommu_table_group of each group (in fact the very first group is enough,
> >> as multiple groups in a container share the table). Adding
> >> vfio_container_get_groups() when only the first one is needed is quite
> >> tricky in terms of maintainer approval.
> >>
> >> So what would be the right course of action? Thanks.
> > 
> > So, from the user side, you need to be able to bind a VFIO backend to
> > a particular guest IOMMU.  This suggests a new ioctl() used in
> > conjunction with KVM_CREATE_SPAPR_TCE.  Let's call it
> > KVM_SPAPR_TCE_BIND_VFIO.  You'd use KVM_CREATE_SPAPR_TCE to make the
> > kernel aware of a LIOBN in the first place, then use
> > KVM_SPAPR_TCE_BIND_VFIO to associate a VFIO container with that LIOBN.
> > So it would be a VM ioctl, taking a LIOBN and a container fd.  You'd
> > need a capability to go with it, and some way to unbind as well.
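> >
> > A hypothetical uapi sketch - all the names and the layout below are made
> > up, just to make the shape concrete:
> >
> > 	/* VM ioctl, guarded by a new capability */
> > 	struct kvm_spapr_tce_bind_vfio {
> > 		__u32	flags;		/* say, 0 = bind, 1 = unbind */
> > 		__s32	container_fd;	/* VFIO container */
> > 		__u64	liobn;		/* from KVM_CREATE_SPAPR_TCE */
> > 	};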
> 
> This is what I had in the first place some years ago, and after 5-6 reviews
> I was told that there is a VFIO-KVM device and I should use it.

I suspect that's because Alex didn't fully understand what we required
here.  The primary thing here is that we need to link guest visible
LIOBNs to host-visible VFIO containers.  Your comments tended to
emphasise the fact of giving KVM a list of VFIO groups, which is a
side effect of the above, but not really the main point - it does,
however, sound misleadingly like what the kvm-vfio device already does.


> > To implement that, the ioctl() would need to use a new vfio (kernel
> > internal) interface - which can be specific to only the spapr TCE
> > type.  That would take the container fd, and return the list of
> > iommu_tables in some form or other (or various error conditions,
> > obviously).
> > 
> > So, when qemu creates the PHB, it uses KVM_CREATE_SPAPR_TCE to inform
> > the kernel of the LIOBN.  When the VFIO device is attached to the PHB,
> > it uses KVM_SPAPR_TCE_BIND_VFIO to connect the VFIO container to the
> > LIOBN.  The ioctl() implementation uses the new special interface into
> > the spapr_tce vfio backend to get the list of iommu tables, which it
> > stores in some private format.
> 
> Getting just a list of IOMMU groups would do too. Pushing such an API is a
> problem; this is how I ended up with the current design.
> 
> 
> > The H_PUT_TCE implementation uses that
> > stored list of iommu tables to translate H_PUT_TCEs from the guest
> > into actions on the host IOMMU tables.
> > 
> > And, yes, the special interface to the spapr TCE vfio back end is kind
> > of a hack.  That's what you get when you need to link two separate
> > kernel subsystems for performance reasons.
> 
> One can argue whether it is a hack - but how is this hack better than the
> existing approach? :)

Because this hack is only on a kernel-internal interface, rather than
impacting the user-visible interface.
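
To make this concrete, the internal interface could be as small as the
following - hypothetical name and signature, a sketch only:

	/* spapr TCE backend only: the iommu_tables backing a container */
	extern long vfio_spapr_tce_get_iommu_tables(struct file *container_filp,
			struct iommu_table **tbls, unsigned int max_tbls);

Only the book3s KVM code would ever call it, so the hack stays invisible
to everyone else.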

> Alex, could you please comment on David's suggestion? Thanks!

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

Thread overview: 92+ messages
2016-03-07  3:41 [PATCH kernel 0/9] KVM, PPC, VFIO: Enable in-kernel acceleration Alexey Kardashevskiy
2016-03-07  3:41 ` [PATCH kernel 1/9] KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number Alexey Kardashevskiy
2016-03-07  4:58   ` David Gibson
2016-03-07  3:41 ` [PATCH kernel 2/9] powerpc/mmu: Add real mode support for IOMMU preregistered memory Alexey Kardashevskiy
2016-03-07  5:30   ` David Gibson
2016-03-07  3:41 ` [PATCH kernel 3/9] KVM: PPC: Use preregistered memory API to access TCE list Alexey Kardashevskiy
2016-03-07  6:00   ` David Gibson
2016-03-08  5:47     ` Alexey Kardashevskiy
2016-03-08  6:30       ` David Gibson
2016-03-09  8:55         ` Alexey Kardashevskiy
2016-03-09 23:46           ` David Gibson
2016-03-10  8:33     ` Paul Mackerras
2016-03-10 23:42       ` David Gibson
2016-03-07  3:41 ` [PATCH kernel 4/9] powerpc/powernv/iommu: Add real mode version of xchg() Alexey Kardashevskiy
2016-03-07  6:05   ` David Gibson
2016-03-07  7:32     ` Alexey Kardashevskiy
2016-03-08  4:50       ` David Gibson
2016-03-10  8:43   ` Paul Mackerras
2016-03-10  8:46   ` Paul Mackerras
2016-03-07  3:41 ` [PATCH kernel 5/9] KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently Alexey Kardashevskiy
2016-03-07  3:41 ` [PATCH kernel 6/9] KVM: PPC: Associate IOMMU group with guest view of TCE table Alexey Kardashevskiy
2016-03-07  6:25   ` David Gibson
2016-03-07  9:38     ` Alexey Kardashevskiy
2016-03-08  4:55       ` David Gibson
2016-03-07  3:41 ` [PATCH kernel 7/9] KVM: PPC: Create a virtual-mode only TCE table handlers Alexey Kardashevskiy
2016-03-08  6:32   ` David Gibson
2016-03-07  3:41 ` [PATCH kernel 8/9] KVM: PPC: Add in-kernel handling for VFIO Alexey Kardashevskiy
2016-03-08 11:08   ` David Gibson
2016-03-09  8:46     ` Alexey Kardashevskiy
2016-03-10  5:18       ` David Gibson
2016-03-11  2:15         ` Alexey Kardashevskiy
2016-03-15  6:00           ` David Gibson
2016-03-07  3:41 ` [PATCH kernel 9/9] KVM: PPC: VFIO device: support SPAPR TCE Alexey Kardashevskiy
2016-03-09  5:45   ` David Gibson
2016-03-09  9:20     ` Alexey Kardashevskiy
2016-03-10  5:21       ` David Gibson
2016-03-10 23:09         ` Alexey Kardashevskiy
2016-03-15  6:04           ` David Gibson
     [not found]             ` <15389a41428.27cb.1ca38dd7e845b990cd13d431eb58563d@ozlabs.ru>
     [not found]               ` <20160321051932.GJ23586@voom.redhat.com>
2016-03-22  0:34                 ` Alexey Kardashevskiy
2016-03-23  3:03                   ` David Gibson
2016-06-09  6:47                     ` Alexey Kardashevskiy
2016-06-10  6:50                       ` David Gibson
2016-06-14  3:30                         ` Alexey Kardashevskiy
2016-06-15  4:43                           ` David Gibson
2016-04-08  9:13     ` Alexey Kardashevskiy
2016-04-11  3:36       ` David Gibson
