* [PATCH kernel 00/15] powerpc/kvm/vfio: Enable in-kernel acceleration
@ 2016-08-03  8:40 Alexey Kardashevskiy
  2016-08-03  8:40 ` [PATCH kernel 01/15] Revert "iommu: Add a function to find an iommu group by id" Alexey Kardashevskiy
                   ` (14 more replies)
  0 siblings, 15 replies; 60+ messages in thread
From: Alexey Kardashevskiy @ 2016-08-03  8:40 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, Alex Williamson, Paul Mackerras


This is my current queue of patches to add acceleration of TCE
updates in KVM. This work has a long history and has been rewritten
pretty much from scratch again; this time I am teaching KVM about
VFIO containers. Some patches (such as 01/15) could be posted
separately, but I keep them all here to make review easier
(even if the concept turns out to be wrong, I might still want
to keep 01/15).

Please comment. Thanks.


Alexey Kardashevskiy (15):
  Revert "iommu: Add a function to find an iommu group by id"
  KVM: PPC: Finish enabling VFIO KVM device on POWER
  KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number
  powerpc/powernv/ioda: Fix TCE invalidate to work in real mode again
  powerpc/iommu: Stop using @current in mm_iommu_xxx
  powerpc/mm/iommu: Put pages on process exit
  powerpc/iommu: Cleanup iommu_table disposal
  powerpc/vfio_spapr_tce: Add reference counting to iommu_table
  powerpc/mmu: Add real mode support for IOMMU preregistered memory
  KVM: PPC: Use preregistered memory API to access TCE list
  powerpc/powernv/iommu: Add real mode version of
    iommu_table_ops::exchange()
  KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently
  KVM: PPC: Pass kvm* to kvmppc_find_table()
  vfio/spapr_tce: Export container API for external users
  KVM: PPC: Add in-kernel acceleration for VFIO

 arch/powerpc/include/asm/iommu.h          |  12 +-
 arch/powerpc/include/asm/kvm_host.h       |   8 +
 arch/powerpc/include/asm/kvm_ppc.h        |   2 +-
 arch/powerpc/include/asm/mmu_context.h    |  23 +-
 arch/powerpc/include/uapi/asm/kvm.h       |  12 +
 arch/powerpc/kernel/iommu.c               |  49 +++-
 arch/powerpc/kernel/setup-common.c        |   2 +-
 arch/powerpc/kernel/vio.c                 |   2 +-
 arch/powerpc/kvm/Kconfig                  |   2 +
 arch/powerpc/kvm/Makefile                 |   3 +
 arch/powerpc/kvm/book3s_64_vio.c          | 410 +++++++++++++++++++++++++++++-
 arch/powerpc/kvm/book3s_64_vio_hv.c       | 251 ++++++++++++++++--
 arch/powerpc/kvm/powerpc.c                |   2 +
 arch/powerpc/mm/mmu_context_book3s64.c    |   6 +-
 arch/powerpc/mm/mmu_context_iommu.c       |  96 ++++---
 arch/powerpc/platforms/powernv/pci-ioda.c |  46 +++-
 arch/powerpc/platforms/powernv/pci.c      |   1 +
 arch/powerpc/platforms/pseries/iommu.c    |   3 +-
 drivers/iommu/iommu.c                     |  29 ---
 drivers/vfio/vfio.c                       |  30 +++
 drivers/vfio/vfio_iommu_spapr_tce.c       | 107 ++++++--
 include/linux/iommu.h                     |   1 -
 include/linux/vfio.h                      |   6 +
 include/uapi/linux/kvm.h                  |   1 +
 24 files changed, 959 insertions(+), 145 deletions(-)

-- 
2.5.0.rc3

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH kernel 01/15] Revert "iommu: Add a function to find an iommu group by id"
  2016-08-03  8:40 [PATCH kernel 00/15] powerpc/kvm/vfio: Enable in-kernel acceleration Alexey Kardashevskiy
@ 2016-08-03  8:40 ` Alexey Kardashevskiy
  2016-08-15  4:58   ` Paul Mackerras
  2016-08-03  8:40 ` [PATCH kernel 02/15] KVM: PPC: Finish enabling VFIO KVM device on POWER Alexey Kardashevskiy
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 60+ messages in thread
From: Alexey Kardashevskiy @ 2016-08-03  8:40 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, Alex Williamson, Paul Mackerras

This reverts commit aa16bea929ae
("iommu: Add a function to find an iommu group by id")
as the iommu_group_get_by_id() helper has never been used
and is unlikely to be used in the foreseeable future. Dead code
is broken code.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 drivers/iommu/iommu.c | 29 -----------------------------
 include/linux/iommu.h |  1 -
 2 files changed, 30 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index b06d935..d2f5efe 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -217,35 +217,6 @@ struct iommu_group *iommu_group_alloc(void)
 }
 EXPORT_SYMBOL_GPL(iommu_group_alloc);
 
-struct iommu_group *iommu_group_get_by_id(int id)
-{
-	struct kobject *group_kobj;
-	struct iommu_group *group;
-	const char *name;
-
-	if (!iommu_group_kset)
-		return NULL;
-
-	name = kasprintf(GFP_KERNEL, "%d", id);
-	if (!name)
-		return NULL;
-
-	group_kobj = kset_find_obj(iommu_group_kset, name);
-	kfree(name);
-
-	if (!group_kobj)
-		return NULL;
-
-	group = container_of(group_kobj, struct iommu_group, kobj);
-	BUG_ON(group->id != id);
-
-	kobject_get(group->devices_kobj);
-	kobject_put(&group->kobj);
-
-	return group;
-}
-EXPORT_SYMBOL_GPL(iommu_group_get_by_id);
-
 /**
  * iommu_group_get_iommudata - retrieve iommu_data registered for a group
  * @group: the group
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index a35fb8b..93c69fa 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -215,7 +215,6 @@ extern int bus_set_iommu(struct bus_type *bus, const struct iommu_ops *ops);
 extern bool iommu_present(struct bus_type *bus);
 extern bool iommu_capable(struct bus_type *bus, enum iommu_cap cap);
 extern struct iommu_domain *iommu_domain_alloc(struct bus_type *bus);
-extern struct iommu_group *iommu_group_get_by_id(int id);
 extern void iommu_domain_free(struct iommu_domain *domain);
 extern int iommu_attach_device(struct iommu_domain *domain,
 			       struct device *dev);
-- 
2.5.0.rc3

* [PATCH kernel 02/15] KVM: PPC: Finish enabling VFIO KVM device on POWER
  2016-08-03  8:40 [PATCH kernel 00/15] powerpc/kvm/vfio: Enable in-kernel acceleration Alexey Kardashevskiy
  2016-08-03  8:40 ` [PATCH kernel 01/15] Revert "iommu: Add a function to find an iommu group by id" Alexey Kardashevskiy
@ 2016-08-03  8:40 ` Alexey Kardashevskiy
  2016-08-04  5:21   ` David Gibson
  2016-08-03  8:40 ` [PATCH kernel 03/15] KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number Alexey Kardashevskiy
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 60+ messages in thread
From: Alexey Kardashevskiy @ 2016-08-03  8:40 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, Alex Williamson, Paul Mackerras

Commit 178a787502 ("vfio: Enable VFIO device for powerpc") made
an attempt to enable the VFIO KVM device on POWER.

However, as CONFIG_KVM_BOOK3S_64 does not use "common-objs-y",
the VFIO KVM device was never enabled for Book3s KVM. This adds
$(KVM)/vfio.o to the kvm-book3s_64-objs-y list.

While we are here, select KVM_VFIO from KVM_BOOK3S_64 as other
platforms already do.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/kvm/Kconfig  | 1 +
 arch/powerpc/kvm/Makefile | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index c2024ac..b7c494b 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -64,6 +64,7 @@ config KVM_BOOK3S_64
 	select KVM_BOOK3S_64_HANDLER
 	select KVM
 	select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
+	select KVM_VFIO if VFIO
 	---help---
 	  Support running unmodified book3s_64 and book3s_32 guest kernels
 	  in virtual machines on book3s_64 host processors.
diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
index 1f9e552..8907af9 100644
--- a/arch/powerpc/kvm/Makefile
+++ b/arch/powerpc/kvm/Makefile
@@ -88,6 +88,9 @@ endif
 kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \
 	book3s_xics.o
 
+kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) += \
+	$(KVM)/vfio.o
+
 kvm-book3s_64-module-objs += \
 	$(KVM)/kvm_main.o \
 	$(KVM)/eventfd.o \
-- 
2.5.0.rc3

* [PATCH kernel 03/15] KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number
  2016-08-03  8:40 [PATCH kernel 00/15] powerpc/kvm/vfio: Enable in-kernel acceleration Alexey Kardashevskiy
  2016-08-03  8:40 ` [PATCH kernel 01/15] Revert "iommu: Add a function to find an iommu group by id" Alexey Kardashevskiy
  2016-08-03  8:40 ` [PATCH kernel 02/15] KVM: PPC: Finish enabling VFIO KVM device on POWER Alexey Kardashevskiy
@ 2016-08-03  8:40 ` Alexey Kardashevskiy
  2016-08-03  8:40 ` [PATCH kernel 04/15] powerpc/powernv/ioda: Fix TCE invalidate to work in real mode again Alexey Kardashevskiy
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 60+ messages in thread
From: Alexey Kardashevskiy @ 2016-08-03  8:40 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, Alex Williamson, Paul Mackerras

This adds a capability number for in-kernel support for VFIO on
the sPAPR platform.

The capability tells the user space whether the in-kernel H_PUT_TCE
handlers can serve VFIO-targeted requests. If they cannot, the user
space must not attempt to allocate a TCE table in the host kernel via
the KVM_CREATE_SPAPR_TCE ioctl, because with such a table in place
TCE requests would not be passed to the user space, and passing them
to the user space is exactly what is needed in that situation.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 include/uapi/linux/kvm.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index e98bb4c..3b4b723 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -870,6 +870,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_S390_USER_INSTR0 130
 #define KVM_CAP_MSI_DEVID 131
 #define KVM_CAP_PPC_HTM 132
+#define KVM_CAP_SPAPR_TCE_VFIO 133
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.5.0.rc3

* [PATCH kernel 04/15] powerpc/powernv/ioda: Fix TCE invalidate to work in real mode again
  2016-08-03  8:40 [PATCH kernel 00/15] powerpc/kvm/vfio: Enable in-kernel acceleration Alexey Kardashevskiy
                   ` (2 preceding siblings ...)
  2016-08-03  8:40 ` [PATCH kernel 03/15] KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number Alexey Kardashevskiy
@ 2016-08-03  8:40 ` Alexey Kardashevskiy
  2016-08-04  5:23   ` David Gibson
  2016-08-09 11:26   ` [kernel, " Michael Ellerman
  2016-08-03  8:40 ` [PATCH kernel 05/15] powerpc/iommu: Stop using @current in mm_iommu_xxx Alexey Kardashevskiy
                   ` (10 subsequent siblings)
  14 siblings, 2 replies; 60+ messages in thread
From: Alexey Kardashevskiy @ 2016-08-03  8:40 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, Alex Williamson, Paul Mackerras

Commit fd141d1a99a3 ("powerpc/powernv/pci: Rework accessing the TCE
invalidate register") broke TCE invalidation on IODA2/PHB3 in real
mode as pnv_pci_phb3_tce_invalidate() always looked the invalidate
register up with rm=false.

This makes invalidation work in real mode again by passing the actual
@rm flag down to pnv_ioda_get_inval_reg().

Fixes: fd141d1a99a3 ("powerpc/powernv/pci: Rework accessing the TCE invalidate register")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 53b56c0..59c7e7d 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1877,7 +1877,7 @@ static void pnv_pci_phb3_tce_invalidate(struct pnv_ioda_pe *pe, bool rm,
 					unsigned shift, unsigned long index,
 					unsigned long npages)
 {
-	__be64 __iomem *invalidate = pnv_ioda_get_inval_reg(pe->phb, false);
+	__be64 __iomem *invalidate = pnv_ioda_get_inval_reg(pe->phb, rm);
 	unsigned long start, end, inc;
 
 	/* We'll invalidate DMA address in PE scope */
@@ -1935,10 +1935,12 @@ static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
 			pnv_pci_phb3_tce_invalidate(pe, rm, shift,
 						    index, npages);
 		else if (rm)
+		{
 			opal_rm_pci_tce_kill(phb->opal_id,
 					     OPAL_PCI_TCE_KILL_PAGES,
 					     pe->pe_number, 1u << shift,
 					     index << shift, npages);
+		}
 		else
 			opal_pci_tce_kill(phb->opal_id,
 					  OPAL_PCI_TCE_KILL_PAGES,
-- 
2.5.0.rc3

* [PATCH kernel 05/15] powerpc/iommu: Stop using @current in mm_iommu_xxx
  2016-08-03  8:40 [PATCH kernel 00/15] powerpc/kvm/vfio: Enable in-kernel acceleration Alexey Kardashevskiy
                   ` (3 preceding siblings ...)
  2016-08-03  8:40 ` [PATCH kernel 04/15] powerpc/powernv/ioda: Fix TCE invalidate to work in real mode again Alexey Kardashevskiy
@ 2016-08-03  8:40 ` Alexey Kardashevskiy
  2016-08-03 10:10   ` Nicholas Piggin
                     ` (3 more replies)
  2016-08-03  8:40 ` [PATCH kernel 06/15] powerpc/mm/iommu: Put pages on process exit Alexey Kardashevskiy
                   ` (9 subsequent siblings)
  14 siblings, 4 replies; 60+ messages in thread
From: Alexey Kardashevskiy @ 2016-08-03  8:40 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, Alex Williamson, Paul Mackerras

In some situations the userspace memory context may live longer than
the userspace process itself, so if we need to do proper memory
context cleanup, we had better cache @mm and use it later, when
the process is gone (@current and @current->mm are NULL).

This changes the mm_iommu_xxx API to receive an mm_struct instead of
using the one from @current.

This is needed by the following patch to do proper cleanup in time;
it depends on the "powerpc/powernv/ioda: Fix endianness when reading
TCEs" patch to do proper cleanup via tce_iommu_clear().

To keep the API consistent, this replaces mm_context_t with mm_struct;
we stick to mm_struct as the mm_iommu_adjust_locked_vm() helper needs
access to &mm->mmap_sem.

This should cause no behavioural change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/mmu_context.h | 20 +++++++------
 arch/powerpc/kernel/setup-common.c     |  2 +-
 arch/powerpc/mm/mmu_context_book3s64.c |  4 +--
 arch/powerpc/mm/mmu_context_iommu.c    | 54 ++++++++++++++--------------------
 drivers/vfio/vfio_iommu_spapr_tce.c    | 41 ++++++++++++++++----------
 5 files changed, 62 insertions(+), 59 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index 9d2cd0c..b85cc7b 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -18,16 +18,18 @@ extern void destroy_context(struct mm_struct *mm);
 #ifdef CONFIG_SPAPR_TCE_IOMMU
 struct mm_iommu_table_group_mem_t;
 
-extern bool mm_iommu_preregistered(void);
-extern long mm_iommu_get(unsigned long ua, unsigned long entries,
+extern bool mm_iommu_preregistered(struct mm_struct *mm);
+extern long mm_iommu_get(struct mm_struct *mm,
+		unsigned long ua, unsigned long entries,
 		struct mm_iommu_table_group_mem_t **pmem);
-extern long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem);
-extern void mm_iommu_init(mm_context_t *ctx);
-extern void mm_iommu_cleanup(mm_context_t *ctx);
-extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
-		unsigned long size);
-extern struct mm_iommu_table_group_mem_t *mm_iommu_find(unsigned long ua,
-		unsigned long entries);
+extern long mm_iommu_put(struct mm_struct *mm,
+		struct mm_iommu_table_group_mem_t *mem);
+extern void mm_iommu_init(struct mm_struct *mm);
+extern void mm_iommu_cleanup(struct mm_struct *mm);
+extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
+		unsigned long ua, unsigned long size);
+extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
+		unsigned long ua, unsigned long entries);
 extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
 		unsigned long ua, unsigned long *hpa);
 extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem);
diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index 714b4ba..e90b68a 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -905,7 +905,7 @@ void __init setup_arch(char **cmdline_p)
 	init_mm.context.pte_frag = NULL;
 #endif
 #ifdef CONFIG_SPAPR_TCE_IOMMU
-	mm_iommu_init(&init_mm.context);
+	mm_iommu_init(&init_mm);
 #endif
 	irqstack_early_init();
 	exc_lvl_early_init();
diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c
index b114f8b..ad82735 100644
--- a/arch/powerpc/mm/mmu_context_book3s64.c
+++ b/arch/powerpc/mm/mmu_context_book3s64.c
@@ -115,7 +115,7 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
 	mm->context.pte_frag = NULL;
 #endif
 #ifdef CONFIG_SPAPR_TCE_IOMMU
-	mm_iommu_init(&mm->context);
+	mm_iommu_init(mm);
 #endif
 	return 0;
 }
@@ -160,7 +160,7 @@ static inline void destroy_pagetable_page(struct mm_struct *mm)
 void destroy_context(struct mm_struct *mm)
 {
 #ifdef CONFIG_SPAPR_TCE_IOMMU
-	mm_iommu_cleanup(&mm->context);
+	mm_iommu_cleanup(mm);
 #endif
 
 #ifdef CONFIG_PPC_ICSWX
diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index da6a216..ee6685b 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -53,7 +53,7 @@ static long mm_iommu_adjust_locked_vm(struct mm_struct *mm,
 	}
 
 	pr_debug("[%d] RLIMIT_MEMLOCK HASH64 %c%ld %ld/%ld\n",
-			current->pid,
+			current ? current->pid : 0,
 			incr ? '+' : '-',
 			npages << PAGE_SHIFT,
 			mm->locked_vm << PAGE_SHIFT,
@@ -63,28 +63,22 @@ static long mm_iommu_adjust_locked_vm(struct mm_struct *mm,
 	return ret;
 }
 
-bool mm_iommu_preregistered(void)
+bool mm_iommu_preregistered(struct mm_struct *mm)
 {
-	if (!current || !current->mm)
-		return false;
-
-	return !list_empty(&current->mm->context.iommu_group_mem_list);
+	return !list_empty(&mm->context.iommu_group_mem_list);
 }
 EXPORT_SYMBOL_GPL(mm_iommu_preregistered);
 
-long mm_iommu_get(unsigned long ua, unsigned long entries,
+long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
 		struct mm_iommu_table_group_mem_t **pmem)
 {
 	struct mm_iommu_table_group_mem_t *mem;
 	long i, j, ret = 0, locked_entries = 0;
 	struct page *page = NULL;
 
-	if (!current || !current->mm)
-		return -ESRCH; /* process exited */
-
 	mutex_lock(&mem_list_mutex);
 
-	list_for_each_entry_rcu(mem, &current->mm->context.iommu_group_mem_list,
+	list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list,
 			next) {
 		if ((mem->ua == ua) && (mem->entries == entries)) {
 			++mem->used;
@@ -102,7 +96,7 @@ long mm_iommu_get(unsigned long ua, unsigned long entries,
 
 	}
 
-	ret = mm_iommu_adjust_locked_vm(current->mm, entries, true);
+	ret = mm_iommu_adjust_locked_vm(mm, entries, true);
 	if (ret)
 		goto unlock_exit;
 
@@ -142,11 +136,11 @@ long mm_iommu_get(unsigned long ua, unsigned long entries,
 	mem->entries = entries;
 	*pmem = mem;
 
-	list_add_rcu(&mem->next, &current->mm->context.iommu_group_mem_list);
+	list_add_rcu(&mem->next, &mm->context.iommu_group_mem_list);
 
 unlock_exit:
 	if (locked_entries && ret)
-		mm_iommu_adjust_locked_vm(current->mm, locked_entries, false);
+		mm_iommu_adjust_locked_vm(mm, locked_entries, false);
 
 	mutex_unlock(&mem_list_mutex);
 
@@ -191,16 +185,13 @@ static void mm_iommu_free(struct rcu_head *head)
 static void mm_iommu_release(struct mm_iommu_table_group_mem_t *mem)
 {
 	list_del_rcu(&mem->next);
-	mm_iommu_adjust_locked_vm(current->mm, mem->entries, false);
 	call_rcu(&mem->rcu, mm_iommu_free);
 }
 
-long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem)
+long mm_iommu_put(struct mm_struct *mm, struct mm_iommu_table_group_mem_t *mem)
 {
 	long ret = 0;
 
-	if (!current || !current->mm)
-		return -ESRCH; /* process exited */
 
 	mutex_lock(&mem_list_mutex);
 
@@ -224,6 +215,8 @@ long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem)
 	/* @mapped became 0 so now mappings are disabled, release the region */
 	mm_iommu_release(mem);
 
+	mm_iommu_adjust_locked_vm(mm, mem->entries, false);
+
 unlock_exit:
 	mutex_unlock(&mem_list_mutex);
 
@@ -231,14 +224,12 @@ unlock_exit:
 }
 EXPORT_SYMBOL_GPL(mm_iommu_put);
 
-struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
-		unsigned long size)
+struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
+		unsigned long ua, unsigned long size)
 {
 	struct mm_iommu_table_group_mem_t *mem, *ret = NULL;
 
-	list_for_each_entry_rcu(mem,
-			&current->mm->context.iommu_group_mem_list,
-			next) {
+	list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list, next) {
 		if ((mem->ua <= ua) &&
 				(ua + size <= mem->ua +
 				 (mem->entries << PAGE_SHIFT))) {
@@ -251,14 +242,12 @@ struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
 }
 EXPORT_SYMBOL_GPL(mm_iommu_lookup);
 
-struct mm_iommu_table_group_mem_t *mm_iommu_find(unsigned long ua,
-		unsigned long entries)
+struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
+		unsigned long ua, unsigned long entries)
 {
 	struct mm_iommu_table_group_mem_t *mem, *ret = NULL;
 
-	list_for_each_entry_rcu(mem,
-			&current->mm->context.iommu_group_mem_list,
-			next) {
+	list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list, next) {
 		if ((mem->ua == ua) && (mem->entries == entries)) {
 			ret = mem;
 			break;
@@ -300,16 +289,17 @@ void mm_iommu_mapped_dec(struct mm_iommu_table_group_mem_t *mem)
 }
 EXPORT_SYMBOL_GPL(mm_iommu_mapped_dec);
 
-void mm_iommu_init(mm_context_t *ctx)
+void mm_iommu_init(struct mm_struct *mm)
 {
-	INIT_LIST_HEAD_RCU(&ctx->iommu_group_mem_list);
+	INIT_LIST_HEAD_RCU(&mm->context.iommu_group_mem_list);
 }
 
-void mm_iommu_cleanup(mm_context_t *ctx)
+void mm_iommu_cleanup(struct mm_struct *mm)
 {
 	struct mm_iommu_table_group_mem_t *mem, *tmp;
 
-	list_for_each_entry_safe(mem, tmp, &ctx->iommu_group_mem_list, next) {
+	list_for_each_entry_safe(mem, tmp, &mm->context.iommu_group_mem_list,
+			next) {
 		list_del_rcu(&mem->next);
 		mm_iommu_do_free(mem);
 	}
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 80378dd..9752e77 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -98,6 +98,7 @@ struct tce_container {
 	bool enabled;
 	bool v2;
 	unsigned long locked_pages;
+	struct mm_struct *mm;
 	struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
 	struct list_head group_list;
 };
@@ -110,11 +111,11 @@ static long tce_iommu_unregister_pages(struct tce_container *container,
 	if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK))
 		return -EINVAL;
 
-	mem = mm_iommu_find(vaddr, size >> PAGE_SHIFT);
+	mem = mm_iommu_find(container->mm, vaddr, size >> PAGE_SHIFT);
 	if (!mem)
 		return -ENOENT;
 
-	return mm_iommu_put(mem);
+	return mm_iommu_put(container->mm, mem);
 }
 
 static long tce_iommu_register_pages(struct tce_container *container,
@@ -128,10 +129,17 @@ static long tce_iommu_register_pages(struct tce_container *container,
 			((vaddr + size) < vaddr))
 		return -EINVAL;
 
-	ret = mm_iommu_get(vaddr, entries, &mem);
+	if (!container->mm) {
+		if (!current->mm)
+			return -ESRCH; /* process exited */
+
+		atomic_inc(&current->mm->mm_count);
+		container->mm = current->mm;
+	}
+
+	ret = mm_iommu_get(container->mm, vaddr, entries, &mem);
 	if (ret)
 		return ret;
-
 	container->enabled = true;
 
 	return 0;
@@ -354,6 +362,8 @@ static void tce_iommu_release(void *iommu_data)
 		tce_iommu_free_table(tbl);
 	}
 
+	if (container->mm)
+		mmdrop(container->mm);
 	tce_iommu_disable(container);
 	mutex_destroy(&container->lock);
 
@@ -369,13 +379,14 @@ static void tce_iommu_unuse_page(struct tce_container *container,
 	put_page(page);
 }
 
-static int tce_iommu_prereg_ua_to_hpa(unsigned long tce, unsigned long size,
+static int tce_iommu_prereg_ua_to_hpa(struct tce_container *container,
+		unsigned long tce, unsigned long size,
 		unsigned long *phpa, struct mm_iommu_table_group_mem_t **pmem)
 {
 	long ret = 0;
 	struct mm_iommu_table_group_mem_t *mem;
 
-	mem = mm_iommu_lookup(tce, size);
+	mem = mm_iommu_lookup(container->mm, tce, size);
 	if (!mem)
 		return -EINVAL;
 
@@ -388,18 +399,18 @@ static int tce_iommu_prereg_ua_to_hpa(unsigned long tce, unsigned long size,
 	return 0;
 }
 
-static void tce_iommu_unuse_page_v2(struct iommu_table *tbl,
-		unsigned long entry)
+static void tce_iommu_unuse_page_v2(struct tce_container *container,
+		struct iommu_table *tbl, unsigned long entry)
 {
 	struct mm_iommu_table_group_mem_t *mem = NULL;
 	int ret;
 	unsigned long hpa = 0;
 	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
 
-	if (!pua || !current || !current->mm)
+	if (!pua)
 		return;
 
-	ret = tce_iommu_prereg_ua_to_hpa(*pua, IOMMU_PAGE_SIZE(tbl),
+	ret = tce_iommu_prereg_ua_to_hpa(container, *pua, IOMMU_PAGE_SIZE(tbl),
 			&hpa, &mem);
 	if (ret)
 		pr_debug("%s: tce %lx at #%lx was not cached, ret=%d\n",
@@ -429,7 +440,7 @@ static int tce_iommu_clear(struct tce_container *container,
 			continue;
 
 		if (container->v2) {
-			tce_iommu_unuse_page_v2(tbl, entry);
+			tce_iommu_unuse_page_v2(container, tbl, entry);
 			continue;
 		}
 
@@ -514,8 +525,8 @@ static long tce_iommu_build_v2(struct tce_container *container,
 		unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl,
 				entry + i);
 
-		ret = tce_iommu_prereg_ua_to_hpa(tce, IOMMU_PAGE_SIZE(tbl),
-				&hpa, &mem);
+		ret = tce_iommu_prereg_ua_to_hpa(container,
+				tce, IOMMU_PAGE_SIZE(tbl), &hpa, &mem);
 		if (ret)
 			break;
 
@@ -536,7 +547,7 @@ static long tce_iommu_build_v2(struct tce_container *container,
 		ret = iommu_tce_xchg(tbl, entry + i, &hpa, &dirtmp);
 		if (ret) {
 			/* dirtmp cannot be DMA_NONE here */
-			tce_iommu_unuse_page_v2(tbl, entry + i);
+			tce_iommu_unuse_page_v2(container, tbl, entry + i);
 			pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%ld\n",
 					__func__, entry << tbl->it_page_shift,
 					tce, ret);
@@ -544,7 +555,7 @@ static long tce_iommu_build_v2(struct tce_container *container,
 		}
 
 		if (dirtmp != DMA_NONE)
-			tce_iommu_unuse_page_v2(tbl, entry + i);
+			tce_iommu_unuse_page_v2(container, tbl, entry + i);
 
 		*pua = tce;
 
-- 
2.5.0.rc3

* [PATCH kernel 06/15] powerpc/mm/iommu: Put pages on process exit
  2016-08-03  8:40 [PATCH kernel 00/15] powerpc/kvm/vfio: Enable in-kernel acceleration Alexey Kardashevskiy
                   ` (4 preceding siblings ...)
  2016-08-03  8:40 ` [PATCH kernel 05/15] powerpc/iommu: Stop using @current in mm_iommu_xxx Alexey Kardashevskiy
@ 2016-08-03  8:40 ` Alexey Kardashevskiy
  2016-08-03 10:11   ` Nicholas Piggin
  2016-08-12  3:13   ` David Gibson
  2016-08-03  8:40 ` [PATCH kernel 07/15] powerpc/iommu: Cleanup iommu_table disposal Alexey Kardashevskiy
                   ` (8 subsequent siblings)
  14 siblings, 2 replies; 60+ messages in thread
From: Alexey Kardashevskiy @ 2016-08-03  8:40 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, Alex Williamson, Paul Mackerras

At the moment the VFIO IOMMU SPAPR v2 driver pins all guest RAM pages
when the userspace starts using VFIO. When the userspace process
finishes, all the pinned pages need to be put; this is done as part of
the userspace memory context (MM) destruction which happens on
the very last mmdrop().

This approach has a problem: the MM of the userspace process may live
longer than the userspace process itself, as kernel threads keep
a lazy reference to the MM of whichever userspace process was running
on the CPU they got scheduled on. If this happens, the MM remains
referenced until that exact kernel thread wakes up again and releases
the very last reference to the MM; on an idle system this can take
hours.

This takes and caches an MM reference once per container and tracks
how many times each preregistered area was registered in a specific
container. This way we do not depend on @current pointing to a valid
task descriptor.

This changes the userspace interface to return EBUSY if a memory area
is already registered (mm_iommu_get() used to just increment
a counter); however, it should not have any practical effect as
the only userspace tool available now registers a memory area only
once per container anyway.

As tce_iommu_register_pages/tce_iommu_unregister_pages are called
under container->lock, this does not need additional locking.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

# Conflicts:
#	arch/powerpc/include/asm/mmu_context.h
#	arch/powerpc/mm/mmu_context_book3s64.c
#	arch/powerpc/mm/mmu_context_iommu.c
---
 arch/powerpc/include/asm/mmu_context.h |  1 -
 arch/powerpc/mm/mmu_context_book3s64.c |  4 ---
 arch/powerpc/mm/mmu_context_iommu.c    | 11 -------
 drivers/vfio/vfio_iommu_spapr_tce.c    | 52 +++++++++++++++++++++++++++++++++-
 4 files changed, 51 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index b85cc7b..a4c4ed5 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -25,7 +25,6 @@ extern long mm_iommu_get(struct mm_struct *mm,
 extern long mm_iommu_put(struct mm_struct *mm,
 		struct mm_iommu_table_group_mem_t *mem);
 extern void mm_iommu_init(struct mm_struct *mm);
-extern void mm_iommu_cleanup(struct mm_struct *mm);
 extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
 		unsigned long ua, unsigned long size);
 extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c
index ad82735..1a07969 100644
--- a/arch/powerpc/mm/mmu_context_book3s64.c
+++ b/arch/powerpc/mm/mmu_context_book3s64.c
@@ -159,10 +159,6 @@ static inline void destroy_pagetable_page(struct mm_struct *mm)
 
 void destroy_context(struct mm_struct *mm)
 {
-#ifdef CONFIG_SPAPR_TCE_IOMMU
-	mm_iommu_cleanup(mm);
-#endif
-
 #ifdef CONFIG_PPC_ICSWX
 	drop_cop(mm->context.acop, mm);
 	kfree(mm->context.cop_lockp);
diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index ee6685b..10f01fe 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -293,14 +293,3 @@ void mm_iommu_init(struct mm_struct *mm)
 {
 	INIT_LIST_HEAD_RCU(&mm->context.iommu_group_mem_list);
 }
-
-void mm_iommu_cleanup(struct mm_struct *mm)
-{
-	struct mm_iommu_table_group_mem_t *mem, *tmp;
-
-	list_for_each_entry_safe(mem, tmp, &mm->context.iommu_group_mem_list,
-			next) {
-		list_del_rcu(&mem->next);
-		mm_iommu_do_free(mem);
-	}
-}
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 9752e77..40e71a0 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -89,6 +89,15 @@ struct tce_iommu_group {
 };
 
 /*
+ * A container needs to remember which preregistered areas and how many times
+ * it has referenced to do proper cleanup at the userspace process exit.
+ */
+struct tce_iommu_prereg {
+	struct list_head next;
+	struct mm_iommu_table_group_mem_t *mem;
+};
+
+/*
  * The container descriptor supports only a single group per container.
  * Required by the API as the container is not supplied with the IOMMU group
  * at the moment of initialization.
@@ -101,12 +110,26 @@ struct tce_container {
 	struct mm_struct *mm;
 	struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
 	struct list_head group_list;
+	struct list_head prereg_list;
 };
 
+static long tce_iommu_prereg_free(struct tce_container *container,
+		struct tce_iommu_prereg *tcemem)
+{
+	long ret;
+
+	list_del(&tcemem->next);
+	ret = mm_iommu_put(container->mm, tcemem->mem);
+	kfree(tcemem);
+
+	return ret;
+}
+
 static long tce_iommu_unregister_pages(struct tce_container *container,
 		__u64 vaddr, __u64 size)
 {
 	struct mm_iommu_table_group_mem_t *mem;
+	struct tce_iommu_prereg *tcemem;
 
 	if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK))
 		return -EINVAL;
@@ -115,7 +138,12 @@ static long tce_iommu_unregister_pages(struct tce_container *container,
 	if (!mem)
 		return -ENOENT;
 
-	return mm_iommu_put(container->mm, mem);
+	list_for_each_entry(tcemem, &container->prereg_list, next) {
+		if (tcemem->mem == mem)
+			return tce_iommu_prereg_free(container, tcemem);
+	}
+
+	return -ENOENT;
 }
 
 static long tce_iommu_register_pages(struct tce_container *container,
@@ -123,6 +151,7 @@ static long tce_iommu_register_pages(struct tce_container *container,
 {
 	long ret = 0;
 	struct mm_iommu_table_group_mem_t *mem = NULL;
+	struct tce_iommu_prereg *tcemem;
 	unsigned long entries = size >> PAGE_SHIFT;
 
 	if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK) ||
@@ -140,6 +169,18 @@ static long tce_iommu_register_pages(struct tce_container *container,
 	ret = mm_iommu_get(container->mm, vaddr, entries, &mem);
 	if (ret)
 		return ret;
+
+	list_for_each_entry(tcemem, &container->prereg_list, next) {
+		if (tcemem->mem == mem) {
+			mm_iommu_put(container->mm, mem);
+			return -EBUSY;
+		}
+	}
+
+	tcemem = kzalloc(sizeof(*tcemem), GFP_KERNEL);
+	tcemem->mem = mem;
+	list_add(&tcemem->next, &container->prereg_list);
+
 	container->enabled = true;
 
 	return 0;
@@ -325,6 +366,7 @@ static void *tce_iommu_open(unsigned long arg)
 
 	mutex_init(&container->lock);
 	INIT_LIST_HEAD_RCU(&container->group_list);
+	INIT_LIST_HEAD_RCU(&container->prereg_list);
 
 	container->v2 = arg == VFIO_SPAPR_TCE_v2_IOMMU;
 
@@ -362,6 +404,14 @@ static void tce_iommu_release(void *iommu_data)
 		tce_iommu_free_table(tbl);
 	}
 
+	while (!list_empty(&container->prereg_list)) {
+		struct tce_iommu_prereg *tcemem;
+
+		tcemem = list_first_entry(&container->prereg_list,
+				struct tce_iommu_prereg, next);
+		tce_iommu_prereg_free(container, tcemem);
+	}
+
 	if (container->mm)
 		mmdrop(container->mm);
 	tce_iommu_disable(container);
-- 
2.5.0.rc3

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH kernel 07/15] powerpc/iommu: Cleanup iommu_table disposal
  2016-08-03  8:40 [PATCH kernel 00/15] powerpc/kvm/vfio: Enable in-kernel acceleration Alexey Kardashevskiy
                   ` (5 preceding siblings ...)
  2016-08-03  8:40 ` [PATCH kernel 06/15] powerpc/mm/iommu: Put pages on process exit Alexey Kardashevskiy
@ 2016-08-03  8:40 ` Alexey Kardashevskiy
  2016-08-12  3:18   ` David Gibson
  2016-08-03  8:40 ` [PATCH kernel 08/15] powerpc/vfio_spapr_tce: Add reference counting to iommu_table Alexey Kardashevskiy
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 60+ messages in thread
From: Alexey Kardashevskiy @ 2016-08-03  8:40 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, Alex Williamson, Paul Mackerras

At the moment an iommu_table can be disposed of either by calling
iommu_free_table() directly or via it_ops::free(), whose only
implementation (for IODA2) calls iommu_free_table() anyway.

As we are going to have reference counting on tables, we need a unified
way of disposing of tables.

This moves the it_ops::free() call into iommu_free_table() and makes use
of the latter everywhere. The free() callback now handles only
platform-specific data.

This should cause no behavioral change.
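
As an aside for readers, the disposal flow after this patch can be sketched
in plain userspace C; every name below (table_ops, my_table, table_dispose)
is invented for illustration and is not the kernel API:

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative only: a userspace model of the unified disposal flow. */
struct my_table;

struct table_ops {
	/* after the patch this frees platform-specific data only */
	void (*free)(struct my_table *tbl);
};

struct my_table {
	const struct table_ops *ops;
	void *platform_data;
};

static int platform_frees;	/* counts callback invocations for the demo */

static void platform_free(struct my_table *tbl)
{
	platform_frees++;
	free(tbl->platform_data);	/* no free(tbl) here any more */
	tbl->platform_data = NULL;
}

static const struct table_ops demo_ops = { .free = platform_free };

/* single disposal entry point, mirroring iommu_free_table(): it runs the
 * optional platform callback first, then frees the generic object */
static void table_dispose(struct my_table *tbl)
{
	if (!tbl)
		return;
	if (tbl->ops && tbl->ops->free)
		tbl->ops->free(tbl);
	free(tbl);
}

static struct my_table *table_alloc(void)
{
	struct my_table *tbl = calloc(1, sizeof(*tbl));

	tbl->ops = &demo_ops;
	tbl->platform_data = malloc(16);
	return tbl;
}
```

The point is that callers only ever use the one disposal function; the
optional callback cleans platform data and never frees the table itself.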

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/kernel/iommu.c               | 4 ++++
 arch/powerpc/platforms/powernv/pci-ioda.c | 6 ++----
 drivers/vfio/vfio_iommu_spapr_tce.c       | 2 +-
 3 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index a8e3490..13263b0 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -718,6 +718,9 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 	if (!tbl)
 		return;
 
+	if (tbl->it_ops->free)
+		tbl->it_ops->free(tbl);
+
 	if (!tbl->it_map) {
 		kfree(tbl);
 		return;
@@ -744,6 +747,7 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 	/* free table */
 	kfree(tbl);
 }
+EXPORT_SYMBOL_GPL(iommu_free_table);
 
 /* Creates TCEs for a user provided buffer.  The user buffer must be
  * contiguous real kernel storage (not vmalloc).  The address passed here
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 59c7e7d..74ab8382 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1394,7 +1394,6 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe
 		iommu_group_put(pe->table_group.group);
 		BUG_ON(pe->table_group.group);
 	}
-	pnv_pci_ioda2_table_free_pages(tbl);
 	iommu_free_table(tbl, of_node_full_name(dev->dev.of_node));
 }
 
@@ -1987,7 +1986,6 @@ static void pnv_ioda2_tce_free(struct iommu_table *tbl, long index,
 static void pnv_ioda2_table_free(struct iommu_table *tbl)
 {
 	pnv_pci_ioda2_table_free_pages(tbl);
-	iommu_free_table(tbl, "pnv");
 }
 
 static struct iommu_table_ops pnv_ioda2_iommu_ops = {
@@ -2313,7 +2311,7 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
 	if (rc) {
 		pe_err(pe, "Failed to configure 32-bit TCE table, err %ld\n",
 				rc);
-		pnv_ioda2_table_free(tbl);
+		iommu_free_table(tbl, "");
 		return rc;
 	}
 
@@ -2399,7 +2397,7 @@ static void pnv_ioda2_take_ownership(struct iommu_table_group *table_group)
 
 	pnv_pci_ioda2_set_bypass(pe, false);
 	pnv_pci_ioda2_unset_window(&pe->table_group, 0);
-	pnv_ioda2_table_free(tbl);
+	iommu_free_table(tbl, "pnv");
 }
 
 static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group)
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 40e71a0..79f26c7 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -660,7 +660,7 @@ static void tce_iommu_free_table(struct iommu_table *tbl)
 	unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT;
 
 	tce_iommu_userspace_view_free(tbl);
-	tbl->it_ops->free(tbl);
+	iommu_free_table(tbl, "");
 	decrement_locked_vm(pages);
 }
 
-- 
2.5.0.rc3


* [PATCH kernel 08/15] powerpc/vfio_spapr_tce: Add reference counting to iommu_table
  2016-08-03  8:40 [PATCH kernel 00/15] powerpc/kvm/vfio: Enable in-kernel acceleration Alexey Kardashevskiy
                   ` (6 preceding siblings ...)
  2016-08-03  8:40 ` [PATCH kernel 07/15] powerpc/iommu: Cleanup iommu_table disposal Alexey Kardashevskiy
@ 2016-08-03  8:40 ` Alexey Kardashevskiy
  2016-08-12  3:29   ` David Gibson
  2016-08-03  8:40 ` [PATCH kernel 09/15] powerpc/mmu: Add real mode support for IOMMU preregistered memory Alexey Kardashevskiy
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 60+ messages in thread
From: Alexey Kardashevskiy @ 2016-08-03  8:40 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, Alex Williamson, Paul Mackerras

So far iommu_table objects were only used in virtual mode and had
a single owner. This is about to change as we implement in-kernel
acceleration of DMA mapping requests, including real mode.

This adds a kref to iommu_table and defines new helpers to update it.
This replaces iommu_free_table() with iommu_table_put(); the freeing
code becomes a static kref release callback, iommu_table_free().
iommu_table_get() is not used in this patch but will be in the
following one.

While we are here, this removes the @node_name parameter as it has never
been really useful on powernv, and plumbing it through the pseries
platform code into iommu_free_table() seems quite useless too.

This should cause no behavioral change.
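
For illustration only, the get/put lifetime this patch introduces can be
modelled in userspace C with a plain counter standing in for struct kref;
all names below are invented:

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative only: a userspace model of the kref pattern. */
struct ref_table {
	int refcount;		/* kernel code uses struct kref */
	int *payload;
};

static int releases;		/* counts release() calls for the demo */

static void release(struct ref_table *tbl)
{
	releases++;
	free(tbl->payload);
	free(tbl);
}

static struct ref_table *table_create(void)
{
	struct ref_table *tbl = calloc(1, sizeof(*tbl));

	tbl->refcount = 1;	/* kref_init(): creator holds one reference */
	tbl->payload = malloc(8);
	return tbl;
}

static void table_get(struct ref_table *tbl)
{
	tbl->refcount++;	/* kref_get() */
}

static void table_put(struct ref_table *tbl)
{
	if (!tbl)
		return;		/* iommu_table_put() tolerates NULL */
	if (--tbl->refcount == 0)
		release(tbl);	/* kref_put() invokes the release callback */
}
```

Only the last table_put() actually frees the table, which is what lets two
owners (e.g. platform code and VFIO) hold the same iommu_table safely.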

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/iommu.h          |  5 +++--
 arch/powerpc/kernel/iommu.c               | 24 +++++++++++++++++++-----
 arch/powerpc/kernel/vio.c                 |  2 +-
 arch/powerpc/platforms/powernv/pci-ioda.c | 14 +++++++-------
 arch/powerpc/platforms/powernv/pci.c      |  1 +
 arch/powerpc/platforms/pseries/iommu.c    |  3 ++-
 drivers/vfio/vfio_iommu_spapr_tce.c       |  2 +-
 7 files changed, 34 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index f49a72a..cd4df44 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -114,6 +114,7 @@ struct iommu_table {
 	struct list_head it_group_list;/* List of iommu_table_group_link */
 	unsigned long *it_userspace; /* userspace view of the table */
 	struct iommu_table_ops *it_ops;
+	struct kref    it_kref;
 };
 
 #define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \
@@ -146,8 +147,8 @@ static inline void *get_iommu_table_base(struct device *dev)
 
 extern int dma_iommu_dma_supported(struct device *dev, u64 mask);
 
-/* Frees table for an individual device node */
-extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
+extern void iommu_table_get(struct iommu_table *tbl);
+extern void iommu_table_put(struct iommu_table *tbl);
 
 /* Initializes an iommu_table based in values set in the passed-in
  * structure
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 13263b0..a8f017a 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -710,13 +710,13 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
 	return tbl;
 }
 
-void iommu_free_table(struct iommu_table *tbl, const char *node_name)
+static void iommu_table_free(struct kref *kref)
 {
 	unsigned long bitmap_sz;
 	unsigned int order;
+	struct iommu_table *tbl;
 
-	if (!tbl)
-		return;
+	tbl = container_of(kref, struct iommu_table, it_kref);
 
 	if (tbl->it_ops->free)
 		tbl->it_ops->free(tbl);
@@ -735,7 +735,7 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 
 	/* verify that table contains no entries */
 	if (!bitmap_empty(tbl->it_map, tbl->it_size))
-		pr_warn("%s: Unexpected TCEs for %s\n", __func__, node_name);
+		pr_warn("%s: Unexpected TCEs\n", __func__);
 
 	/* calculate bitmap size in bytes */
 	bitmap_sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
@@ -747,7 +747,21 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 	/* free table */
 	kfree(tbl);
 }
-EXPORT_SYMBOL_GPL(iommu_free_table);
+
+void iommu_table_get(struct iommu_table *tbl)
+{
+	kref_get(&tbl->it_kref);
+}
+EXPORT_SYMBOL_GPL(iommu_table_get);
+
+void iommu_table_put(struct iommu_table *tbl)
+{
+	if (!tbl)
+		return;
+
+	kref_put(&tbl->it_kref, iommu_table_free);
+}
+EXPORT_SYMBOL_GPL(iommu_table_put);
 
 /* Creates TCEs for a user provided buffer.  The user buffer must be
  * contiguous real kernel storage (not vmalloc).  The address passed here
diff --git a/arch/powerpc/kernel/vio.c b/arch/powerpc/kernel/vio.c
index 8d7358f..188f452 100644
--- a/arch/powerpc/kernel/vio.c
+++ b/arch/powerpc/kernel/vio.c
@@ -1318,7 +1318,7 @@ static void vio_dev_release(struct device *dev)
 	struct iommu_table *tbl = get_iommu_table_base(dev);
 
 	if (tbl)
-		iommu_free_table(tbl, of_node_full_name(dev->of_node));
+		iommu_table_put(tbl);
 	of_node_put(dev->of_node);
 	kfree(to_vio_dev(dev));
 }
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 74ab8382..c04afd2 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1394,7 +1394,7 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe
 		iommu_group_put(pe->table_group.group);
 		BUG_ON(pe->table_group.group);
 	}
-	iommu_free_table(tbl, of_node_full_name(dev->dev.of_node));
+	iommu_table_put(tbl);
 }
 
 static void pnv_ioda_release_vf_PE(struct pci_dev *pdev)
@@ -2171,7 +2171,7 @@ found:
 		__free_pages(tce_mem, get_order(tce32_segsz * segs));
 	if (tbl) {
 		pnv_pci_unlink_table_and_group(tbl, &pe->table_group);
-		iommu_free_table(tbl, "pnv");
+		iommu_table_put(tbl);
 	}
 }
 
@@ -2265,7 +2265,7 @@ static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
 			bus_offset, page_shift, window_size,
 			levels, tbl);
 	if (ret) {
-		iommu_free_table(tbl, "pnv");
+		iommu_table_put(tbl);
 		return ret;
 	}
 
@@ -2311,7 +2311,7 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
 	if (rc) {
 		pe_err(pe, "Failed to configure 32-bit TCE table, err %ld\n",
 				rc);
-		iommu_free_table(tbl, "");
+		iommu_table_put(tbl);
 		return rc;
 	}
 
@@ -2397,7 +2397,7 @@ static void pnv_ioda2_take_ownership(struct iommu_table_group *table_group)
 
 	pnv_pci_ioda2_set_bypass(pe, false);
 	pnv_pci_ioda2_unset_window(&pe->table_group, 0);
-	iommu_free_table(tbl, "pnv");
+	iommu_table_put(tbl);
 }
 
 static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group)
@@ -3311,7 +3311,7 @@ static void pnv_pci_ioda1_release_pe_dma(struct pnv_ioda_pe *pe)
 	}
 
 	free_pages(tbl->it_base, get_order(tbl->it_size << 3));
-	iommu_free_table(tbl, "pnv");
+	iommu_table_put(tbl);
 }
 
 static void pnv_pci_ioda2_release_pe_dma(struct pnv_ioda_pe *pe)
@@ -3338,7 +3338,7 @@ static void pnv_pci_ioda2_release_pe_dma(struct pnv_ioda_pe *pe)
 	}
 
 	pnv_pci_ioda2_table_free_pages(tbl);
-	iommu_free_table(tbl, "pnv");
+	iommu_table_put(tbl);
 }
 
 static void pnv_ioda_free_pe_seg(struct pnv_ioda_pe *pe,
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index 6701dd5..5917439 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -767,6 +767,7 @@ struct iommu_table *pnv_pci_table_alloc(int nid)
 
 	tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, nid);
 	INIT_LIST_HEAD_RCU(&tbl->it_group_list);
+	kref_init(&tbl->it_kref);
 
 	return tbl;
 }
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 0056856..da29518 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -74,6 +74,7 @@ static struct iommu_table_group *iommu_pseries_alloc_group(int node)
 		goto fail_exit;
 
 	INIT_LIST_HEAD_RCU(&tbl->it_group_list);
+	kref_init(&tbl->it_kref);
 	tgl->table_group = table_group;
 	list_add_rcu(&tgl->next, &tbl->it_group_list);
 
@@ -115,7 +116,7 @@ static void iommu_pseries_free_group(struct iommu_table_group *table_group,
 		BUG_ON(table_group->group);
 	}
 #endif
-	iommu_free_table(tbl, node_name);
+	iommu_table_put(tbl);
 
 	kfree(table_group);
 }
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 79f26c7..3594ad3 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -660,7 +660,7 @@ static void tce_iommu_free_table(struct iommu_table *tbl)
 	unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT;
 
 	tce_iommu_userspace_view_free(tbl);
-	iommu_free_table(tbl, "");
+	iommu_table_put(tbl);
 	decrement_locked_vm(pages);
 }
 
-- 
2.5.0.rc3


* [PATCH kernel 09/15] powerpc/mmu: Add real mode support for IOMMU preregistered memory
  2016-08-03  8:40 [PATCH kernel 00/15] powerpc/kvm/vfio: Enable in-kernel acceleration Alexey Kardashevskiy
                   ` (7 preceding siblings ...)
  2016-08-03  8:40 ` [PATCH kernel 08/15] powerpc/vfio_spapr_tce: Add reference counting to iommu_table Alexey Kardashevskiy
@ 2016-08-03  8:40 ` Alexey Kardashevskiy
  2016-08-12  4:02   ` David Gibson
  2016-08-03  8:40 ` [PATCH kernel 10/15] KVM: PPC: Use preregistered memory API to access TCE list Alexey Kardashevskiy
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 60+ messages in thread
From: Alexey Kardashevskiy @ 2016-08-03  8:40 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, Alex Williamson, Paul Mackerras

This makes the lookup of preregistered memory work in real mode by
replacing list_for_each_entry_rcu() (which can do debug checks that may
fail in real mode) with list_for_each_entry_lockless().

This adds a real-mode version of mm_iommu_ua_to_hpa() which performs an
explicit vmalloc'd-to-linear address conversion.
Unlike mm_iommu_ua_to_hpa(), mm_iommu_ua_to_hpa_rm() can fail.

This changes mm_iommu_preregistered() to receive @mm, as in real mode
@current does not always hold a valid pointer.

This adds a real-mode version of mm_iommu_lookup() which receives @mm
(for the same reason as mm_iommu_preregistered()) and uses the lockless
variant of list_for_each_entry_rcu().
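
The ua-to-hpa arithmetic used by the real-mode helper can be illustrated
with a small userspace sketch; the names, the fixed 4K page size and the
addresses are illustrative, not the kernel code:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: the index into the flat hpas[] array is the page
 * offset of @ua from the region start, and the in-page offset carries
 * over into the returned host physical address. */
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PAGE_MASK  (~(PAGE_SIZE - 1))

struct prereg_region {
	uint64_t ua;        /* userspace address of the region start */
	uint64_t entries;   /* number of 4K pages in the region */
	uint64_t *hpas;     /* flat array of host physical page addresses */
};

static int ua_to_hpa(struct prereg_region *mem, uint64_t ua, uint64_t *hpa)
{
	uint64_t entry = (ua - mem->ua) >> PAGE_SHIFT;

	if (entry >= mem->entries)
		return -1;  /* outside the preregistered range */

	*hpa = mem->hpas[entry] | (ua & ~PAGE_MASK);
	return 0;
}
```

In the kernel the hpas[] read additionally goes through vmalloc_to_phys()
because the array itself lives in vmalloc space, which is what makes the
real-mode variant able to fail.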

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/mmu_context.h |  4 ++++
 arch/powerpc/mm/mmu_context_iommu.c    | 39 ++++++++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index a4c4ed5..939030c 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -27,10 +27,14 @@ extern long mm_iommu_put(struct mm_struct *mm,
 extern void mm_iommu_init(struct mm_struct *mm);
 extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
 		unsigned long ua, unsigned long size);
+extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(
+		struct mm_struct *mm, unsigned long ua, unsigned long size);
 extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
 		unsigned long ua, unsigned long entries);
 extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
 		unsigned long ua, unsigned long *hpa);
+extern long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
+		unsigned long ua, unsigned long *hpa);
 extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem);
 extern void mm_iommu_mapped_dec(struct mm_iommu_table_group_mem_t *mem);
 #endif
diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index 10f01fe..36a906c 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -242,6 +242,25 @@ struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
 }
 EXPORT_SYMBOL_GPL(mm_iommu_lookup);
 
+struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(struct mm_struct *mm,
+		unsigned long ua, unsigned long size)
+{
+	struct mm_iommu_table_group_mem_t *mem, *ret = NULL;
+
+	list_for_each_entry_lockless(mem, &mm->context.iommu_group_mem_list,
+			next) {
+		if ((mem->ua <= ua) &&
+				(ua + size <= mem->ua +
+				 (mem->entries << PAGE_SHIFT))) {
+			ret = mem;
+			break;
+		}
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(mm_iommu_lookup_rm);
+
 struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
 		unsigned long ua, unsigned long entries)
 {
@@ -273,6 +292,26 @@ long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
 }
 EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa);
 
+long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
+		unsigned long ua, unsigned long *hpa)
+{
+	const long entry = (ua - mem->ua) >> PAGE_SHIFT;
+	void *va = &mem->hpas[entry];
+	unsigned long *ra;
+
+	if (entry >= mem->entries)
+		return -EFAULT;
+
+	ra = (void *) vmalloc_to_phys(va);
+	if (!ra)
+		return -EFAULT;
+
+	*hpa = *ra | (ua & ~PAGE_MASK);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa_rm);
+
 long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem)
 {
 	if (atomic64_inc_not_zero(&mem->mapped))
-- 
2.5.0.rc3


* [PATCH kernel 10/15] KVM: PPC: Use preregistered memory API to access TCE list
  2016-08-03  8:40 [PATCH kernel 00/15] powerpc/kvm/vfio: Enable in-kernel acceleration Alexey Kardashevskiy
                   ` (8 preceding siblings ...)
  2016-08-03  8:40 ` [PATCH kernel 09/15] powerpc/mmu: Add real mode support for IOMMU preregistered memory Alexey Kardashevskiy
@ 2016-08-03  8:40 ` Alexey Kardashevskiy
  2016-08-12  4:17   ` David Gibson
  2016-08-03  8:40 ` [PATCH kernel 11/15] powerpc/powernv/iommu: Add real mode version of iommu_table_ops::exchange() Alexey Kardashevskiy
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 60+ messages in thread
From: Alexey Kardashevskiy @ 2016-08-03  8:40 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, Alex Williamson, Paul Mackerras

VFIO on sPAPR already implements guest memory pre-registration, in which
the entire guest RAM gets pinned. This can be used to translate
the physical address of a guest page containing the TCE list
from H_PUT_TCE_INDIRECT.

This makes use of the pre-registered memory API to access TCE list
pages in order to avoid unnecessary locking on the KVM memory
reverse map: we know that all of guest memory is pinned, and
we have a flat array mapping GPA to HPA, which makes it simpler and
quicker to index into that array (even with looking up the
kernel page tables in vmalloc_to_phys()) than to find the memslot,
lock the rmap entry, look up the user page tables, and unlock the rmap
entry. Note that the rmap pointer is initialized to NULL where it is
declared (not in this patch).
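
The two-path dispatch described above can be sketched in userspace C;
the lock counters, the constant offsets standing in for translation, and
all names are invented for the illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative only: fast preregistered path vs. slow rmap-locked path. */
static int rmap_locks, rmap_unlocks;

static void lock_rmap(void)   { rmap_locks++; }
static void unlock_rmap(void) { rmap_unlocks++; }

/* returns 0 on success and stores a (fake) host address in *tces */
static int get_tce_list(bool preregistered, uint64_t gpa, uint64_t *tces)
{
	bool locked = false;

	if (preregistered) {
		/* fast path: flat GPA->HPA array, no rmap lock needed */
		*tces = gpa + 0x1000;
	} else {
		/* slow path: pin the rmap entry while translating via HPT */
		lock_rmap();
		locked = true;
		*tces = gpa + 0x2000;
	}

	/* ... TCE list entries would be processed here ... */

	if (locked)
		unlock_rmap();	/* mirrors the "if (rmap)" unlock in the patch */
	return 0;
}
```

The asymmetric unlock is why the patch changes the unconditional
unlock_rmap() into "if (rmap) unlock_rmap(rmap)".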

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v2:
* updated the commit log with Paul's comment
---
 arch/powerpc/kvm/book3s_64_vio_hv.c | 65 ++++++++++++++++++++++++++++---------
 1 file changed, 49 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index d461c44..a3be4bd 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -180,6 +180,17 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
 EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
 
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+static inline bool kvmppc_preregistered(struct kvm_vcpu *vcpu)
+{
+	return mm_iommu_preregistered(vcpu->kvm->mm);
+}
+
+static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
+		struct kvm_vcpu *vcpu, unsigned long ua, unsigned long size)
+{
+	return mm_iommu_lookup_rm(vcpu->kvm->mm, ua, size);
+}
+
 long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		unsigned long ioba, unsigned long tce)
 {
@@ -260,23 +271,44 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	if (ret != H_SUCCESS)
 		return ret;
 
-	if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
-		return H_TOO_HARD;
+	if (kvmppc_preregistered(vcpu)) {
+		/*
+		 * We get here if guest memory was pre-registered which
+		 * is normally VFIO case and gpa->hpa translation does not
+		 * depend on hpt.
+		 */
+		struct mm_iommu_table_group_mem_t *mem;
 
-	rmap = (void *) vmalloc_to_phys(rmap);
+		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
+			return H_TOO_HARD;
 
-	/*
-	 * Synchronize with the MMU notifier callbacks in
-	 * book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
-	 * While we have the rmap lock, code running on other CPUs
-	 * cannot finish unmapping the host real page that backs
-	 * this guest real page, so we are OK to access the host
-	 * real page.
-	 */
-	lock_rmap(rmap);
-	if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
-		ret = H_TOO_HARD;
-		goto unlock_exit;
+		mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K);
+		if (!mem || mm_iommu_ua_to_hpa_rm(mem, ua, &tces))
+			return H_TOO_HARD;
+	} else {
+		/*
+		 * This is emulated devices case.
+		 * We do not require memory to be preregistered in this case
+		 * so lock rmap and do __find_linux_pte_or_hugepte().
+		 */
+		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
+			return H_TOO_HARD;
+
+		rmap = (void *) vmalloc_to_phys(rmap);
+
+		/*
+		 * Synchronize with the MMU notifier callbacks in
+		 * book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
+		 * While we have the rmap lock, code running on other CPUs
+		 * cannot finish unmapping the host real page that backs
+		 * this guest real page, so we are OK to access the host
+		 * real page.
+		 */
+		lock_rmap(rmap);
+		if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
+			ret = H_TOO_HARD;
+			goto unlock_exit;
+		}
 	}
 
 	for (i = 0; i < npages; ++i) {
@@ -290,7 +322,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	}
 
 unlock_exit:
-	unlock_rmap(rmap);
+	if (rmap)
+		unlock_rmap(rmap);
 
 	return ret;
 }
-- 
2.5.0.rc3


* [PATCH kernel 11/15] powerpc/powernv/iommu: Add real mode version of iommu_table_ops::exchange()
  2016-08-03  8:40 [PATCH kernel 00/15] powerpc/kvm/vfio: Enable in-kernel acceleration Alexey Kardashevskiy
                   ` (9 preceding siblings ...)
  2016-08-03  8:40 ` [PATCH kernel 10/15] KVM: PPC: Use preregistered memory API to access TCE list Alexey Kardashevskiy
@ 2016-08-03  8:40 ` Alexey Kardashevskiy
  2016-08-12  4:29   ` David Gibson
  2016-08-03  8:40 ` [PATCH kernel 12/15] KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently Alexey Kardashevskiy
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 60+ messages in thread
From: Alexey Kardashevskiy @ 2016-08-03  8:40 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, Alex Williamson, Paul Mackerras

In real mode, TCE tables are invalidated using special
cache-inhibited store instructions which are not available in
virtual mode.

This defines and implements an exchange_rm() callback. It does not
define set_rm/clear_rm/flush_rm callbacks as there is no user for those;
exchange/exchange_rm are only to be used by KVM for VFIO.

The exchange_rm callback is defined for IODA1/IODA2 powernv platforms.

This replaces list_for_each_entry_rcu() with its lockless version, as
from now on pnv_pci_ioda2_tce_invalidate() can be called in
real mode too.
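
The dirty-marking and undo logic of iommu_tce_xchg_rm() can be modelled
in userspace C with a single-entry table; all types and names are
simplified stand-ins for the kernel ones:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative only: after a successful exchange, *d holds the OLD
 * direction, and a page the device may have written must be marked
 * dirty; if the page cannot be resolved in real mode, the old entry is
 * put back by exchanging again with the returned values. */
enum dir { DMA_TO_DEVICE, DMA_FROM_DEVICE, DMA_BIDIRECTIONAL, DMA_NONE };

static struct { unsigned long hpa; enum dir d; } entry;	/* one TCE */
static int exchanges;		/* counts exchange calls for the demo */
static bool page_dirty;

/* swaps the entry with *hpa/*d, returning the previous values in place */
static int exchange_rm(unsigned long *hpa, enum dir *d)
{
	unsigned long old_hpa = entry.hpa;
	enum dir old_d = entry.d;

	entry.hpa = *hpa;
	entry.d = *d;
	*hpa = old_hpa;
	*d = old_d;
	exchanges++;
	return 0;
}

static int tce_xchg_rm(unsigned long *hpa, enum dir *d, bool page_resolvable)
{
	int ret = exchange_rm(hpa, d);

	if (!ret && (*d == DMA_FROM_DEVICE || *d == DMA_BIDIRECTIONAL)) {
		if (page_resolvable) {
			page_dirty = true;	/* SetPageDirty() */
		} else {
			/* realmode_pfn_to_page() failed: restore the old
			 * hpa/direction and report an error */
			exchange_rm(hpa, d);
			ret = -1;		/* -EFAULT in the kernel */
		}
	}
	return ret;
}
```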

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/iommu.h          |  7 +++++++
 arch/powerpc/kernel/iommu.c               | 23 +++++++++++++++++++++++
 arch/powerpc/platforms/powernv/pci-ioda.c | 26 +++++++++++++++++++++++++-
 3 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index cd4df44..a13d207 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -64,6 +64,11 @@ struct iommu_table_ops {
 			long index,
 			unsigned long *hpa,
 			enum dma_data_direction *direction);
+	/* Real mode */
+	int (*exchange_rm)(struct iommu_table *tbl,
+			long index,
+			unsigned long *hpa,
+			enum dma_data_direction *direction);
 #endif
 	void (*clear)(struct iommu_table *tbl,
 			long index, long npages);
@@ -209,6 +214,8 @@ extern void iommu_del_device(struct device *dev);
 extern int __init tce_iommu_bus_notifier_init(void);
 extern long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
 		unsigned long *hpa, enum dma_data_direction *direction);
+extern long iommu_tce_xchg_rm(struct iommu_table *tbl, unsigned long entry,
+		unsigned long *hpa, enum dma_data_direction *direction);
 #else
 static inline void iommu_register_group(struct iommu_table_group *table_group,
 					int pci_domain_number,
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index a8f017a..65b2dac 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1020,6 +1020,29 @@ long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
 }
 EXPORT_SYMBOL_GPL(iommu_tce_xchg);
 
+long iommu_tce_xchg_rm(struct iommu_table *tbl, unsigned long entry,
+		unsigned long *hpa, enum dma_data_direction *direction)
+{
+	long ret;
+
+	ret = tbl->it_ops->exchange_rm(tbl, entry, hpa, direction);
+
+	if (!ret && ((*direction == DMA_FROM_DEVICE) ||
+			(*direction == DMA_BIDIRECTIONAL))) {
+		struct page *pg = realmode_pfn_to_page(*hpa >> PAGE_SHIFT);
+
+		if (likely(pg)) {
+			SetPageDirty(pg);
+		} else {
+			tbl->it_ops->exchange_rm(tbl, entry, hpa, direction);
+			ret = -EFAULT;
+		}
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_tce_xchg_rm);
+
 int iommu_take_ownership(struct iommu_table *tbl)
 {
 	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index c04afd2..a0b5ea6 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1827,6 +1827,17 @@ static int pnv_ioda1_tce_xchg(struct iommu_table *tbl, long index,
 
 	return ret;
 }
+
+static int pnv_ioda1_tce_xchg_rm(struct iommu_table *tbl, long index,
+		unsigned long *hpa, enum dma_data_direction *direction)
+{
+	long ret = pnv_tce_xchg(tbl, index, hpa, direction);
+
+	if (!ret)
+		pnv_pci_p7ioc_tce_invalidate(tbl, index, 1, true);
+
+	return ret;
+}
 #endif
 
 static void pnv_ioda1_tce_free(struct iommu_table *tbl, long index,
@@ -1841,6 +1852,7 @@ static struct iommu_table_ops pnv_ioda1_iommu_ops = {
 	.set = pnv_ioda1_tce_build,
 #ifdef CONFIG_IOMMU_API
 	.exchange = pnv_ioda1_tce_xchg,
+	.exchange_rm = pnv_ioda1_tce_xchg_rm,
 #endif
 	.clear = pnv_ioda1_tce_free,
 	.get = pnv_tce_get,
@@ -1915,7 +1927,7 @@ static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
 {
 	struct iommu_table_group_link *tgl;
 
-	list_for_each_entry_rcu(tgl, &tbl->it_group_list, next) {
+	list_for_each_entry_lockless(tgl, &tbl->it_group_list, next) {
 		struct pnv_ioda_pe *pe = container_of(tgl->table_group,
 				struct pnv_ioda_pe, table_group);
 		struct pnv_phb *phb = pe->phb;
@@ -1973,6 +1985,17 @@ static int pnv_ioda2_tce_xchg(struct iommu_table *tbl, long index,
 
 	return ret;
 }
+
+static int pnv_ioda2_tce_xchg_rm(struct iommu_table *tbl, long index,
+		unsigned long *hpa, enum dma_data_direction *direction)
+{
+	long ret = pnv_tce_xchg(tbl, index, hpa, direction);
+
+	if (!ret)
+		pnv_pci_ioda2_tce_invalidate(tbl, index, 1, true);
+
+	return ret;
+}
 #endif
 
 static void pnv_ioda2_tce_free(struct iommu_table *tbl, long index,
@@ -1992,6 +2015,7 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = {
 	.set = pnv_ioda2_tce_build,
 #ifdef CONFIG_IOMMU_API
 	.exchange = pnv_ioda2_tce_xchg,
+	.exchange_rm = pnv_ioda2_tce_xchg_rm,
 #endif
 	.clear = pnv_ioda2_tce_free,
 	.get = pnv_tce_get,
-- 
2.5.0.rc3


* [PATCH kernel 12/15] KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently
  2016-08-03  8:40 [PATCH kernel 00/15] powerpc/kvm/vfio: Enable in-kernel acceleration Alexey Kardashevskiy
                   ` (10 preceding siblings ...)
  2016-08-03  8:40 ` [PATCH kernel 11/15] powerpc/powernv/iommu: Add real mode version of iommu_table_ops::exchange() Alexey Kardashevskiy
@ 2016-08-03  8:40 ` Alexey Kardashevskiy
  2016-08-12  4:34   ` David Gibson
  2016-08-03  8:40 ` [PATCH kernel 13/15] KVM: PPC: Pass kvm* to kvmppc_find_table() Alexey Kardashevskiy
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 60+ messages in thread
From: Alexey Kardashevskiy @ 2016-08-03  8:40 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, Alex Williamson, Paul Mackerras

It does not make much sense to have KVM on book3s-64 without the
IOMMU bits for PCI passthrough support, as they cost little
and allow VFIO to function on book3s KVM.

Having IOMMU_API always enabled makes it unnecessary to have a lot of
"#ifdef IOMMU_API" in arch/powerpc/kvm/book3s_64_vio*. With those
ifdefs we could only accelerate user-space emulated devices
(but not VFIO), which does not seem very useful.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/kvm/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index b7c494b..63b60a8 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -65,6 +65,7 @@ config KVM_BOOK3S_64
 	select KVM
 	select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
 	select KVM_VFIO if VFIO
+	select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
 	---help---
 	  Support running unmodified book3s_64 and book3s_32 guest kernels
 	  in virtual machines on book3s_64 host processors.
-- 
2.5.0.rc3

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH kernel 13/15] KVM: PPC: Pass kvm* to kvmppc_find_table()
  2016-08-03  8:40 [PATCH kernel 00/15] powerpc/kvm/vfio: Enable in-kernel acceleration Alexey Kardashevskiy
                   ` (11 preceding siblings ...)
  2016-08-03  8:40 ` [PATCH kernel 12/15] KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently Alexey Kardashevskiy
@ 2016-08-03  8:40 ` Alexey Kardashevskiy
  2016-08-12  4:45   ` David Gibson
  2016-08-03  8:40 ` [PATCH kernel 14/15] vfio/spapr_tce: Export container API for external users Alexey Kardashevskiy
  2016-08-03  8:40 ` [PATCH kernel 15/15] KVM: PPC: Add in-kernel acceleration for VFIO Alexey Kardashevskiy
  14 siblings, 1 reply; 60+ messages in thread
From: Alexey Kardashevskiy @ 2016-08-03  8:40 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, Alex Williamson, Paul Mackerras

The guest-view TCE tables are per KVM instance anyway (not per VCPU),
so pass kvm* instead. This will be used by the following patches,
which attach VFIO containers to LIOBNs via an ioctl() on the KVM fd
(rather than on a VCPU).

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/kvm_ppc.h  |  2 +-
 arch/powerpc/kvm/book3s_64_vio.c    |  7 ++++---
 arch/powerpc/kvm/book3s_64_vio_hv.c | 13 +++++++------
 3 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 2544eda..7f1abe9 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -167,7 +167,7 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce_64 *args);
 extern struct kvmppc_spapr_tce_table *kvmppc_find_table(
-		struct kvm_vcpu *vcpu, unsigned long liobn);
+		struct kvm *kvm, unsigned long liobn);
 extern long kvmppc_ioba_validate(struct kvmppc_spapr_tce_table *stt,
 		unsigned long ioba, unsigned long npages);
 extern long kvmppc_tce_validate(struct kvmppc_spapr_tce_table *tt,
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index c379ff5..15df8ae 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -212,12 +212,13 @@ fail:
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba, unsigned long tce)
 {
-	struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
+	struct kvmppc_spapr_tce_table *stt;
 	long ret;
 
 	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
 	/* 	    liobn, ioba, tce); */
 
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
@@ -245,7 +246,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	u64 __user *tces;
 	u64 tce;
 
-	stt = kvmppc_find_table(vcpu, liobn);
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
@@ -299,7 +300,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret;
 
-	stt = kvmppc_find_table(vcpu, liobn);
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index a3be4bd..8a6834e 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -49,10 +49,9 @@
  * WARNING: This will be called in real or virtual mode on HV KVM and virtual
  *          mode on PR KVM
  */
-struct kvmppc_spapr_tce_table *kvmppc_find_table(struct kvm_vcpu *vcpu,
+struct kvmppc_spapr_tce_table *kvmppc_find_table(struct kvm *kvm,
 		unsigned long liobn)
 {
-	struct kvm *kvm = vcpu->kvm;
 	struct kvmppc_spapr_tce_table *stt;
 
 	list_for_each_entry_lockless(stt, &kvm->arch.spapr_tce_tables, list)
@@ -194,12 +193,13 @@ static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
 long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		unsigned long ioba, unsigned long tce)
 {
-	struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
+	struct kvmppc_spapr_tce_table *stt;
 	long ret;
 
 	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
 	/* 	    liobn, ioba, tce); */
 
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
@@ -252,7 +252,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	unsigned long tces, entry, ua = 0;
 	unsigned long *rmap = NULL;
 
-	stt = kvmppc_find_table(vcpu, liobn);
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
@@ -335,7 +335,7 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret;
 
-	stt = kvmppc_find_table(vcpu, liobn);
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
@@ -356,12 +356,13 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 long kvmppc_h_get_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba)
 {
-	struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
+	struct kvmppc_spapr_tce_table *stt;
 	long ret;
 	unsigned long idx;
 	struct page *page;
 	u64 *tbl;
 
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
 
-- 
2.5.0.rc3

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH kernel 14/15] vfio/spapr_tce: Export container API for external users
  2016-08-03  8:40 [PATCH kernel 00/15] powerpc/kvm/vfio: Enable in-kernel acceleration Alexey Kardashevskiy
                   ` (12 preceding siblings ...)
  2016-08-03  8:40 ` [PATCH kernel 13/15] KVM: PPC: Pass kvm* to kvmppc_find_table() Alexey Kardashevskiy
@ 2016-08-03  8:40 ` Alexey Kardashevskiy
  2016-08-08 16:43   ` Alex Williamson
  2016-08-12  5:25   ` David Gibson
  2016-08-03  8:40 ` [PATCH kernel 15/15] KVM: PPC: Add in-kernel acceleration for VFIO Alexey Kardashevskiy
  14 siblings, 2 replies; 60+ messages in thread
From: Alexey Kardashevskiy @ 2016-08-03  8:40 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, Alex Williamson, Paul Mackerras

This exports helpers which are needed to keep a VFIO container in
memory while there are external users such as KVM.
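The contract these helpers implement is plain get/put reference
counting on the container object: an external user such as KVM pins
the container with a get, and the container stays alive until the
matching put. A minimal standalone illustration of that contract (mock
types and names, not the kernel implementation):

```c
/* Mock illustration of the get/put contract behind
 * vfio_container_get_ext()/vfio_container_put_ext(): the container
 * object survives as long as at least one reference is held.
 * All types and names here are illustrative only. */
#include <assert.h>
#include <stdlib.h>

struct mock_container {
	int refcount;
	void *iommu_data;	/* becomes NULL once the container fd is closed */
};

/* Creating the container takes the initial reference. */
static struct mock_container *container_create(void)
{
	struct mock_container *c = calloc(1, sizeof(*c));

	if (c)
		c->refcount = 1;
	return c;
}

/* An external user (e.g. KVM) pins the container. */
static struct mock_container *container_get(struct mock_container *c)
{
	c->refcount++;
	return c;
}

/* Dropping the last reference frees the container. */
static void container_put(struct mock_container *c)
{
	if (--c->refcount == 0)
		free(c);
}
```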

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 drivers/vfio/vfio.c                 | 30 ++++++++++++++++++++++++++++++
 drivers/vfio/vfio_iommu_spapr_tce.c | 16 +++++++++++++++-
 include/linux/vfio.h                |  6 ++++++
 3 files changed, 51 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index d1d70e0..baf6a9c 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1729,6 +1729,36 @@ long vfio_external_check_extension(struct vfio_group *group, unsigned long arg)
 EXPORT_SYMBOL_GPL(vfio_external_check_extension);
 
 /**
+ * External user API for containers, exported by symbols to be linked
+ * dynamically.
+ *
+ */
+struct vfio_container *vfio_container_get_ext(struct file *filep)
+{
+	struct vfio_container *container = filep->private_data;
+
+	if (filep->f_op != &vfio_fops)
+		return ERR_PTR(-EINVAL);
+
+	vfio_container_get(container);
+
+	return container;
+}
+EXPORT_SYMBOL_GPL(vfio_container_get_ext);
+
+void vfio_container_put_ext(struct vfio_container *container)
+{
+	vfio_container_put(container);
+}
+EXPORT_SYMBOL_GPL(vfio_container_put_ext);
+
+void *vfio_container_get_iommu_data_ext(struct vfio_container *container)
+{
+	return container->iommu_data;
+}
+EXPORT_SYMBOL_GPL(vfio_container_get_iommu_data_ext);
+
+/**
  * Sub-module support
  */
 /*
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 3594ad3..fceea3d 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -1331,6 +1331,21 @@ const struct vfio_iommu_driver_ops tce_iommu_driver_ops = {
 	.detach_group	= tce_iommu_detach_group,
 };
 
+struct iommu_table *vfio_container_spapr_tce_table_get_ext(void *iommu_data,
+		u64 offset)
+{
+	struct tce_container *container = iommu_data;
+	struct iommu_table *tbl = NULL;
+
+	if (tce_iommu_find_table(container, offset, &tbl) < 0)
+		return NULL;
+
+	iommu_table_get(tbl);
+
+	return tbl;
+}
+EXPORT_SYMBOL_GPL(vfio_container_spapr_tce_table_get_ext);
+
 static int __init tce_iommu_init(void)
 {
 	return vfio_register_iommu_driver(&tce_iommu_driver_ops);
@@ -1348,4 +1363,3 @@ MODULE_VERSION(DRIVER_VERSION);
 MODULE_LICENSE("GPL v2");
 MODULE_AUTHOR(DRIVER_AUTHOR);
 MODULE_DESCRIPTION(DRIVER_DESC);
-
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 0ecae0b..1c2138a 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -91,6 +91,12 @@ extern void vfio_group_put_external_user(struct vfio_group *group);
 extern int vfio_external_user_iommu_id(struct vfio_group *group);
 extern long vfio_external_check_extension(struct vfio_group *group,
 					  unsigned long arg);
+extern struct vfio_container *vfio_container_get_ext(struct file *filep);
+extern void vfio_container_put_ext(struct vfio_container *container);
+extern void *vfio_container_get_iommu_data_ext(
+		struct vfio_container *container);
+extern struct iommu_table *vfio_container_spapr_tce_table_get_ext(
+		void *iommu_data, u64 offset);
 
 /*
  * Sub-module helpers
-- 
2.5.0.rc3

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH kernel 15/15] KVM: PPC: Add in-kernel acceleration for VFIO
  2016-08-03  8:40 [PATCH kernel 00/15] powerpc/kvm/vfio: Enable in-kernel acceleration Alexey Kardashevskiy
                   ` (13 preceding siblings ...)
  2016-08-03  8:40 ` [PATCH kernel 14/15] vfio/spapr_tce: Export container API for external users Alexey Kardashevskiy
@ 2016-08-03  8:40 ` Alexey Kardashevskiy
  14 siblings, 0 replies; 60+ messages in thread
From: Alexey Kardashevskiy @ 2016-08-03  8:40 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, Alex Williamson, Paul Mackerras

This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
without passing them to user space, which saves the time otherwise
spent switching to user space and back.

Both real and virtual modes are supported. The kernel first tries to
handle a TCE request in real mode; if that fails, it passes the
request to virtual mode to complete the operation. If the virtual mode
handler also fails, the request is passed to user space, though this
is never expected to happen.

The first user of this is VFIO on POWER. Trampolines to the VFIO external
user API functions are required for this patch.

This adds an ioctl() interface to the SPAPR TCE fd, which already
handles in-kernel acceleration for emulated IO by allocating the guest
view of the TCE table in KVM. The new ioctls allow userspace to
attach/detach VFIO containers to the kernel-allocated TCE table so
that hardware TCE table updates can be handled in the kernel. The new
interface accepts a VFIO container fd and uses the exported API to get
to the actual hardware TCE table. Until the _unset() ioctl is called,
the VFIO container is referenced to guarantee the TCE table's presence
in memory.

This also releases unused containers when a new container is
registered. The criterion for "unused" is
vfio_container_get_iommu_data_ext() returning NULL, which happens when
the container fd is closed.

Note that this interface does not operate on IOMMU groups, as TCE
tables are owned by VFIO containers (which may even have no IOMMU
groups attached).

This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to user
space.

Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb Ethernet card).
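As a rough sketch of how userspace might drive the new interface, the
following attaches a VFIO container to a KVM-allocated TCE table. The
struct layout and ioctl encoding mirror the uapi additions in this
patch; the helper name and the fd handling around it are hypothetical:

```c
/* Hypothetical userspace sketch of attaching a VFIO container to a
 * KVM-allocated TCE table via the new KVM_SPAPR_TCE_VFIO_SET ioctl.
 * Struct layout and ioctl encoding follow the uapi additions in this
 * patch; everything else is illustrative only. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>

struct kvm_spapr_tce_vfio {
	uint32_t argsz;
	uint32_t flags;
	uint32_t container_fd;
};

#define KVM_SPAPR_TCE			(':')
#define KVM_SPAPR_TCE_VFIO_SET		_IOW(KVM_SPAPR_TCE, 0x00, \
					     struct kvm_spapr_tce_vfio)
#define KVM_SPAPR_TCE_VFIO_UNSET	_IOW(KVM_SPAPR_TCE, 0x01, \
					     struct kvm_spapr_tce_vfio)

/* Attach the VFIO container behind @container_fd to the TCE table
 * behind @tablefd (the fd returned by KVM_CREATE_SPAPR_TCE_64). */
static int spapr_tce_attach_container(int tablefd, int container_fd)
{
	struct kvm_spapr_tce_vfio param = {
		.argsz = sizeof(param),
		.flags = 0,
		.container_fd = (uint32_t)container_fd,
	};

	return ioctl(tablefd, KVM_SPAPR_TCE_VFIO_SET, &param);
}
```

Detaching would use the same struct with KVM_SPAPR_TCE_VFIO_UNSET; any
non-zero flags are rejected with -EINVAL by the kernel-side handler.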

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/kvm_host.h |   8 +
 arch/powerpc/include/uapi/asm/kvm.h |  12 ++
 arch/powerpc/kvm/book3s_64_vio.c    | 403 ++++++++++++++++++++++++++++++++++++
 arch/powerpc/kvm/book3s_64_vio_hv.c | 173 ++++++++++++++++
 arch/powerpc/kvm/powerpc.c          |   2 +
 5 files changed, 598 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index ec35af3..3e3d65f 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -182,6 +182,13 @@ struct kvmppc_pginfo {
 	atomic_t refcnt;
 };
 
+struct kvmppc_spapr_tce_container {
+	struct list_head next;
+	struct rcu_head rcu;
+	struct vfio_container *vfiocontainer;
+	struct iommu_table *tbl;
+};
+
 struct kvmppc_spapr_tce_table {
 	struct list_head list;
 	struct kvm *kvm;
@@ -190,6 +197,7 @@ struct kvmppc_spapr_tce_table {
 	u32 page_shift;
 	u64 offset;		/* in pages */
 	u64 size;		/* window size in pages */
+	struct list_head containers;
 	struct page *pages[0];
 };
 
diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index c93cf35..cbeb7bb 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -342,6 +342,18 @@ struct kvm_create_spapr_tce_64 {
 	__u64 size;	/* in pages */
 };
 
+#define KVM_SPAPR_TCE			(':')
+#define KVM_SPAPR_TCE_VFIO_SET		_IOW(KVM_SPAPR_TCE,  0x00, \
+					     struct kvm_spapr_tce_vfio)
+#define KVM_SPAPR_TCE_VFIO_UNSET	_IOW(KVM_SPAPR_TCE,  0x01, \
+					     struct kvm_spapr_tce_vfio)
+
+struct kvm_spapr_tce_vfio {
+	__u32 argsz;
+	__u32 flags;
+	__u32 container_fd;
+};
+
 /* for KVM_ALLOCATE_RMA */
 struct kvm_allocate_rma {
 	__u64 rma_size;
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 15df8ae..d420ee0 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -27,6 +27,10 @@
 #include <linux/hugetlb.h>
 #include <linux/list.h>
 #include <linux/anon_inodes.h>
+#include <linux/module.h>
+#include <linux/compat.h>
+#include <linux/vfio.h>
+#include <linux/file.h>
 
 #include <asm/tlbflush.h>
 #include <asm/kvm_ppc.h>
@@ -39,6 +43,70 @@
 #include <asm/udbg.h>
 #include <asm/iommu.h>
 #include <asm/tce.h>
+#include <asm/mmu_context.h>
+
+static struct iommu_table *kvm_vfio_container_spapr_tce_table_get_ext(
+		void *iommu_data, u64 offset)
+{
+	struct iommu_table *tbl;
+	struct iommu_table *(*fn)(void *, u64);
+
+	fn = symbol_get(vfio_container_spapr_tce_table_get_ext);
+	if (!fn)
+		return NULL;
+
+	tbl = fn(iommu_data, offset);
+
+	symbol_put(vfio_container_spapr_tce_table_get_ext);
+
+	return tbl;
+}
+
+static struct vfio_container *kvm_vfio_container_get_ext(struct file *filep)
+{
+	struct vfio_container *container;
+	struct vfio_container *(*fn)(struct file *);
+
+	fn = symbol_get(vfio_container_get_ext);
+	if (!fn)
+		return NULL;
+
+	container = fn(filep);
+
+	symbol_put(vfio_container_get_ext);
+
+	return container;
+}
+
+static void kvm_vfio_container_put_ext(struct vfio_container *container)
+{
+	void (*fn)(struct vfio_container *container);
+
+	fn = symbol_get(vfio_container_put_ext);
+	if (!fn)
+		return;
+
+	fn(container);
+
+	symbol_put(vfio_container_put_ext);
+}
+
+static void *kvm_vfio_container_get_iommu_data_ext(
+		struct vfio_container *container)
+{
+	void *iommu_data;
+	void *(*fn)(struct vfio_container *);
+
+	fn = symbol_get(vfio_container_get_iommu_data_ext);
+	if (!fn)
+		return NULL;
+
+	iommu_data = fn(container);
+
+	symbol_put(vfio_container_get_iommu_data_ext);
+
+	return iommu_data;
+}
 
 static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
 {
@@ -90,15 +158,39 @@ static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
 	return ret;
 }
 
+static void kvm_spapr_tce_release_container_cb(struct rcu_head *head)
+{
+	struct kvmppc_spapr_tce_container *kc = container_of(head,
+			struct kvmppc_spapr_tce_container, rcu);
+
+	kvm_vfio_container_put_ext(kc->vfiocontainer);
+	iommu_table_put(kc->tbl);
+	kfree(kc);
+}
+
+static void kvm_spapr_tce_release_container(
+		struct kvmppc_spapr_tce_container *kc)
+{
+	list_del_rcu(&kc->next);
+	call_rcu(&kc->rcu, kvm_spapr_tce_release_container_cb);
+}
+
 static void release_spapr_tce_table(struct rcu_head *head)
 {
 	struct kvmppc_spapr_tce_table *stt = container_of(head,
 			struct kvmppc_spapr_tce_table, rcu);
 	unsigned long i, npages = kvmppc_tce_pages(stt->size);
+	struct kvmppc_spapr_tce_container *kc;
 
 	for (i = 0; i < npages; i++)
 		__free_page(stt->pages[i]);
 
+	while (!list_empty(&stt->containers)) {
+		kc = list_first_entry(&stt->containers,
+				struct kvmppc_spapr_tce_container, next);
+		kvm_spapr_tce_release_container(kc);
+	}
+
 	kfree(stt);
 }
 
@@ -141,9 +233,148 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
 	return 0;
 }
 
+static void kvm_spapr_tce_release_unused_containers(
+		struct kvmppc_spapr_tce_table *stt)
+{
+	struct kvmppc_spapr_tce_container *kc, *kctmp;
+
+	list_for_each_entry_safe(kc, kctmp, &stt->containers, next) {
+		if (kvm_vfio_container_get_iommu_data_ext(kc->vfiocontainer))
+			continue;
+
+		kvm_spapr_tce_release_container(kc);
+	}
+}
+
+static long kvm_spapr_tce_set_container(struct kvmppc_spapr_tce_table *stt,
+		int container_fd)
+{
+	void *iommu_data = NULL;
+	struct vfio_container *container;
+	struct iommu_table *tbl;
+	struct kvmppc_spapr_tce_container *kc;
+	struct fd f;
+
+	f = fdget(container_fd);
+	if (!f.file)
+		return -EBADF;
+
+	container = kvm_vfio_container_get_ext(f.file);
+	fdput(f);
+	if (IS_ERR(container))
+		return PTR_ERR(container);
+
+	iommu_data = kvm_vfio_container_get_iommu_data_ext(container);
+	if (!iommu_data) {
+		kvm_vfio_container_put_ext(container);
+		return -ENOENT;
+	}
+
+	list_for_each_entry_rcu(kc, &stt->containers, next) {
+		if (kc->vfiocontainer == container) {
+			kvm_vfio_container_put_ext(container);
+			return -EBUSY;
+		}
+	}
+
+	tbl = kvm_vfio_container_spapr_tce_table_get_ext(
+			iommu_data, stt->offset << stt->page_shift);
+
+	kc = kzalloc(sizeof(*kc), GFP_KERNEL);
+	kc->vfiocontainer = container;
+	kc->tbl = tbl;
+	list_add_rcu(&kc->next, &stt->containers);
+
+	return 0;
+}
+
+static long kvm_spapr_tce_unset_container(struct kvmppc_spapr_tce_table *stt,
+		int container_fd)
+{
+	struct vfio_container *container;
+	struct kvmppc_spapr_tce_container *kc;
+	struct fd f;
+	long ret;
+
+	f = fdget(container_fd);
+	if (!f.file)
+		return -EBADF;
+
+	container = kvm_vfio_container_get_ext(f.file);
+	fdput(f);
+	if (IS_ERR(container))
+		return PTR_ERR(container);
+
+	ret = -ENOENT;
+
+	list_for_each_entry_rcu(kc, &stt->containers, next) {
+		if (kc->vfiocontainer != container)
+			continue;
+
+		kvm_spapr_tce_release_container(kc);
+		ret = 0;
+		break;
+	}
+	kvm_vfio_container_put_ext(container);
+
+	return ret;
+}
+
+static long kvm_spapr_tce_unl_ioctl(struct file *filp,
+		unsigned int cmd, unsigned long arg)
+{
+	struct kvmppc_spapr_tce_table *stt = filp->private_data;
+	struct kvm_spapr_tce_vfio param;
+	unsigned long minsz;
+	long ret = -EINVAL;
+
+	if (!stt)
+		return ret;
+
+	minsz = offsetofend(struct kvm_spapr_tce_vfio, container_fd);
+
+	if (copy_from_user(&param, (void __user *)arg, minsz))
+		return -EFAULT;
+
+	if (param.argsz < minsz)
+		return -EINVAL;
+
+	if (param.flags)
+		return -EINVAL;
+
+	mutex_lock(&stt->kvm->lock);
+
+	switch (cmd) {
+	case KVM_SPAPR_TCE_VFIO_SET:
+		kvm_spapr_tce_release_unused_containers(stt);
+		ret = kvm_spapr_tce_set_container(stt, param.container_fd);
+		break;
+	case KVM_SPAPR_TCE_VFIO_UNSET:
+		ret = kvm_spapr_tce_unset_container(stt, param.container_fd);
+		break;
+	}
+
+	mutex_unlock(&stt->kvm->lock);
+
+	return ret;
+}
+
+#ifdef CONFIG_COMPAT
+static long kvm_spapr_tce_compat_ioctl(struct file *filep,
+		unsigned int cmd, unsigned long arg)
+{
+	arg = (unsigned long)compat_ptr(arg);
+	return kvm_spapr_tce_unl_ioctl(filep, cmd, arg);
+}
+#endif	/* CONFIG_COMPAT */
+
 static const struct file_operations kvm_spapr_tce_fops = {
 	.mmap           = kvm_spapr_tce_mmap,
 	.release	= kvm_spapr_tce_release,
+	.unlocked_ioctl	= kvm_spapr_tce_unl_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= kvm_spapr_tce_compat_ioctl,
+#endif
 };
 
 long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
@@ -181,6 +412,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 	stt->offset = args->offset;
 	stt->size = size;
 	stt->kvm = kvm;
+	INIT_LIST_HEAD_RCU(&stt->containers);
 
 	for (i = 0; i < npages; i++) {
 		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
@@ -209,11 +441,160 @@ fail:
 	return ret;
 }
 
+static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	struct mm_iommu_table_group_mem_t *mem = NULL;
+	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
+	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+
+	if (!pua)
+		return H_HARDWARE;
+
+	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
+	if (!mem)
+		return H_HARDWARE;
+
+	mm_iommu_mapped_dec(mem);
+
+	*pua = 0;
+
+	return H_SUCCESS;
+}
+
+static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	enum dma_data_direction dir = DMA_NONE;
+	unsigned long hpa = 0;
+
+	if (iommu_tce_xchg(tbl, entry, &hpa, &dir))
+		return H_HARDWARE;
+
+	if (dir == DMA_NONE)
+		return H_SUCCESS;
+
+	return kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
+}
+
+long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
+		unsigned long entry, unsigned long gpa,
+		enum dma_data_direction dir)
+{
+	long ret;
+	unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+	struct mm_iommu_table_group_mem_t *mem;
+
+	if (!pua)
+		return H_HARDWARE;
+
+	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
+		return H_HARDWARE;
+
+	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
+	if (!mem)
+		return H_HARDWARE;
+
+	if (mm_iommu_ua_to_hpa(mem, ua, &hpa))
+		return H_HARDWARE;
+
+	if (mm_iommu_mapped_inc(mem))
+		return H_HARDWARE;
+
+	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
+	if (ret) {
+		mm_iommu_mapped_dec(mem);
+		return H_TOO_HARD;
+	}
+
+	if (dir != DMA_NONE)
+		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
+
+	*pua = ua;
+
+	return 0;
+}
+
+long kvmppc_h_put_tce_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce)
+{
+	long idx, ret = H_HARDWARE;
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+	const unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+	const enum dma_data_direction dir = iommu_tce_direction(tce);
+
+	/* Clear TCE */
+	if (dir == DMA_NONE) {
+		if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
+			return H_PARAMETER;
+
+		return kvmppc_tce_iommu_unmap(vcpu->kvm, tbl, entry);
+	}
+
+	/* Put TCE */
+	if (iommu_tce_put_param_check(tbl, ioba, gpa))
+		return H_PARAMETER;
+
+	idx = srcu_read_lock(&vcpu->kvm->srcu);
+	ret = kvmppc_tce_iommu_map(vcpu->kvm, tbl, entry, gpa, dir);
+	srcu_read_unlock(&vcpu->kvm->srcu, idx);
+
+	return ret;
+}
+
+static long kvmppc_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long ioba,
+		u64 __user *tces, unsigned long npages)
+{
+	unsigned long i, ret, tce, gpa;
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+
+	for (i = 0; i < npages; ++i) {
+		gpa = be64_to_cpu(tces[i]) & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+		if (iommu_tce_put_param_check(tbl, ioba +
+				(i << tbl->it_page_shift), gpa))
+			return H_PARAMETER;
+	}
+
+	for (i = 0; i < npages; ++i) {
+		tce = be64_to_cpu(tces[i]);
+		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+		ret = kvmppc_tce_iommu_map(vcpu->kvm, tbl, entry + i, gpa,
+				iommu_tce_direction(tce));
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
+	return H_SUCCESS;
+}
+
+long kvmppc_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	unsigned long i;
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+
+	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i)
+		kvmppc_tce_iommu_unmap(vcpu->kvm, tbl, entry + i);
+
+	return H_SUCCESS;
+}
+
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba, unsigned long tce)
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long ret;
+	struct kvmppc_spapr_tce_container *kc;
 
 	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
 	/* 	    liobn, ioba, tce); */
@@ -230,6 +611,12 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (ret != H_SUCCESS)
 		return ret;
 
+	list_for_each_entry_lockless(kc, &stt->containers, next) {
+		ret = kvmppc_h_put_tce_iommu(vcpu, kc->tbl, liobn, ioba, tce);
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
 	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
 
 	return H_SUCCESS;
@@ -245,6 +632,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	unsigned long entry, ua = 0;
 	u64 __user *tces;
 	u64 tce;
+	struct kvmppc_spapr_tce_container *kc;
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -272,6 +660,13 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	}
 	tces = (u64 __user *) ua;
 
+	list_for_each_entry_lockless(kc, &stt->containers, next) {
+		ret = kvmppc_h_put_tce_indirect_iommu(vcpu,
+				kc->tbl, ioba, tces, npages);
+		if (ret != H_SUCCESS)
+			goto unlock_exit;
+	}
+
 	for (i = 0; i < npages; ++i) {
 		if (get_user(tce, tces + i)) {
 			ret = H_TOO_HARD;
@@ -299,6 +694,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret;
+	struct kvmppc_spapr_tce_container *kc;
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -312,6 +708,13 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
 		return H_PARAMETER;
 
+	list_for_each_entry_lockless(kc, &stt->containers, next) {
+		ret = kvmppc_h_stuff_tce_iommu(vcpu, kc->tbl, liobn, ioba,
+				tce_value, npages);
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
 	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
 		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
 
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 8a6834e..4bc09f4 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -190,11 +190,161 @@ static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
 	return mm_iommu_lookup_rm(vcpu->kvm->mm, ua, size);
 }
 
+static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	struct mm_iommu_table_group_mem_t *mem = NULL;
+	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
+	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+
+	if (!pua)
+		return H_SUCCESS;
+
+	pua = (void *) vmalloc_to_phys(pua);
+	if (!pua)
+		return H_SUCCESS;
+
+	mem = kvmppc_rm_iommu_lookup(vcpu, *pua, pgsize);
+	if (!mem)
+		return H_HARDWARE;
+
+	mm_iommu_mapped_dec(mem);
+
+	*pua = 0;
+
+	return H_SUCCESS;
+}
+
+static long kvmppc_rm_tce_iommu_unmap(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	enum dma_data_direction dir = DMA_NONE;
+	unsigned long hpa = 0;
+
+	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
+		return H_HARDWARE;
+
+	if (dir == DMA_NONE)
+		return H_SUCCESS;
+
+	return kvmppc_rm_tce_iommu_mapped_dec(vcpu, tbl, entry);
+}
+
+long kvmppc_rm_tce_iommu_map(struct kvm_vcpu *vcpu, struct iommu_table *tbl,
+		unsigned long entry, unsigned long gpa,
+		enum dma_data_direction dir)
+{
+	long ret;
+	unsigned long hpa = 0, ua;
+	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+	struct mm_iommu_table_group_mem_t *mem;
+
+	if (kvmppc_gpa_to_ua(vcpu->kvm, gpa, &ua, NULL))
+		return H_HARDWARE;
+
+	mem = kvmppc_rm_iommu_lookup(vcpu, ua, 1ULL << tbl->it_page_shift);
+	if (!mem)
+		return H_HARDWARE;
+
+	if (mm_iommu_ua_to_hpa_rm(mem, ua, &hpa))
+		return H_HARDWARE;
+
+	pua = (void *) vmalloc_to_phys(pua);
+	if (!pua)
+		return H_HARDWARE;
+
+	if (mm_iommu_mapped_inc(mem))
+		return H_HARDWARE;
+
+	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
+	if (ret) {
+		mm_iommu_mapped_dec(mem);
+		return H_TOO_HARD;
+	}
+
+	if (dir != DMA_NONE)
+		kvmppc_rm_tce_iommu_mapped_dec(vcpu, tbl, entry);
+
+	*pua = ua;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(kvmppc_rm_tce_iommu_map);
+
+static long kvmppc_rm_h_put_tce_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long liobn,
+		unsigned long ioba, unsigned long tce)
+{
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+	const unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+	const enum dma_data_direction dir = iommu_tce_direction(tce);
+
+	/* Clear TCE */
+	if (dir == DMA_NONE) {
+		if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
+			return H_PARAMETER;
+
+		return kvmppc_rm_tce_iommu_unmap(vcpu, tbl, entry);
+	}
+
+	/* Put TCE */
+	if (iommu_tce_put_param_check(tbl, ioba, gpa))
+		return H_PARAMETER;
+
+	return kvmppc_rm_tce_iommu_map(vcpu, tbl, entry, gpa, dir);
+}
+
+static long kvmppc_rm_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long ioba,
+		u64 *tces, unsigned long npages)
+{
+	unsigned long i, ret, tce, gpa;
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+
+	for (i = 0; i < npages; ++i) {
+		gpa = be64_to_cpu(tces[i]) & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+		if (iommu_tce_put_param_check(tbl, ioba +
+				(i << tbl->it_page_shift), gpa))
+			return H_PARAMETER;
+	}
+
+	for (i = 0; i < npages; ++i) {
+		tce = be64_to_cpu(tces[i]);
+		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+		ret = kvmppc_rm_tce_iommu_map(vcpu, tbl, entry + i, gpa,
+				iommu_tce_direction(tce));
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
+	return H_SUCCESS;
+}
+
+static long kvmppc_rm_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	unsigned long i;
+	const unsigned long entry = ioba >> tbl->it_page_shift;
+
+	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i)
+		kvmppc_rm_tce_iommu_unmap(vcpu, tbl, entry + i);
+
+	return H_SUCCESS;
+}
+
 long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		unsigned long ioba, unsigned long tce)
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long ret;
+	struct kvmppc_spapr_tce_container *kc;
 
 	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
 	/* 	    liobn, ioba, tce); */
@@ -211,6 +361,13 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (ret != H_SUCCESS)
 		return ret;
 
+	list_for_each_entry_lockless(kc, &stt->containers, next) {
+		ret = kvmppc_rm_h_put_tce_iommu(vcpu, kc->tbl,
+				liobn, ioba, tce);
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
 	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
 
 	return H_SUCCESS;
@@ -278,6 +435,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 		 * depend on hpt.
 		 */
 		struct mm_iommu_table_group_mem_t *mem;
+		struct kvmppc_spapr_tce_container *kc;
 
 		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
 			return H_TOO_HARD;
@@ -285,6 +443,13 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 		mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K);
 		if (!mem || mm_iommu_ua_to_hpa_rm(mem, ua, &tces))
 			return H_TOO_HARD;
+
+		list_for_each_entry_lockless(kc, &stt->containers, next) {
+			ret = kvmppc_rm_h_put_tce_indirect_iommu(vcpu,
+					kc->tbl, ioba, (u64 *)tces, npages);
+			if (ret != H_SUCCESS)
+				return ret;
+		}
 	} else {
 		/*
 		 * This is emulated devices case.
@@ -334,6 +499,7 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret;
+	struct kvmppc_spapr_tce_container *kc;
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -347,6 +513,13 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
 		return H_PARAMETER;
 
+	list_for_each_entry_lockless(kc, &stt->containers, next) {
+		ret = kvmppc_rm_h_stuff_tce_iommu(vcpu, kc->tbl,
+				liobn, ioba, tce_value, npages);
+		if (ret != H_SUCCESS)
+			return ret;
+	}
+
 	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
 		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
 
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 6ce40dd..303d393 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -524,6 +524,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 #ifdef CONFIG_PPC_BOOK3S_64
 	case KVM_CAP_SPAPR_TCE:
 	case KVM_CAP_SPAPR_TCE_64:
+		/* fallthrough */
+	case KVM_CAP_SPAPR_TCE_VFIO:
 	case KVM_CAP_PPC_ALLOC_HTAB:
 	case KVM_CAP_PPC_RTAS:
 	case KVM_CAP_PPC_FIXUP_HCALL:
-- 
2.5.0.rc3

^ permalink raw reply related	[flat|nested] 60+ messages in thread
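For review convenience, the control flow of the hunks above (validate the request, walk every attached container's hardware table, and only then update the guest-visible table) can be modeled in plain userspace C. This is a hedged sketch: `hw_unmap` and `stuff_tce` are invented stand-ins for kvmppc_rm_tce_iommu_unmap() and kvmppc_rm_h_stuff_tce(), not kernel API.

```c
#include <assert.h>
#include <stddef.h>

#define H_SUCCESS   0
#define H_PARAMETER -4
#define MAX_TABLES  4

/* Stand-in for one hardware IOMMU table attached to the guest LIOBN. */
struct hw_table {
	int fail;                /* simulate an unmap failure */
	unsigned long cleared;   /* entries cleared so far */
};

/* Stand-in for kvmppc_rm_tce_iommu_unmap(): clear one table entry. */
static long hw_unmap(struct hw_table *t, unsigned long entry)
{
	(void)entry;
	if (t->fail)
		return H_PARAMETER;
	t->cleared++;
	return H_SUCCESS;
}

/*
 * Mirrors the shape of kvmppc_rm_h_stuff_tce() in the patch: every
 * attached hardware table must accept the update before the
 * guest-visible (shadow) table is touched; any failure aborts early.
 */
static long stuff_tce(struct hw_table *tables, size_t ntables,
		unsigned long entry, unsigned long npages,
		unsigned long *shadow_updates)
{
	for (size_t t = 0; t < ntables; t++)
		for (unsigned long i = 0; i < npages; i++)
			if (hw_unmap(&tables[t], entry + i) != H_SUCCESS)
				return H_PARAMETER;

	*shadow_updates += npages;   /* kvmppc_tce_put() per page */
	return H_SUCCESS;
}
```

The point of the ordering is that a failure in any one container leaves the emulated table unchanged, so the guest never sees a TCE that some hardware table rejected.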

* Re: [PATCH kernel 05/15] powerpc/iommu: Stop using @current in mm_iommu_xxx
  2016-08-03  8:40 ` [PATCH kernel 05/15] powerpc/iommu: Stop using @current in mm_iommu_xxx Alexey Kardashevskiy
@ 2016-08-03 10:10   ` Nicholas Piggin
  2016-08-05  7:00   ` Michael Ellerman
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 60+ messages in thread
From: Nicholas Piggin @ 2016-08-03 10:10 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Alex Williamson, Paul Mackerras, David Gibson

On Wed,  3 Aug 2016 18:40:46 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> In some situations the userspace memory context may live longer than
> the userspace process itself so if we need to do proper memory context
> cleanup, we better cache @mm and use it later when the process is gone
> (@current or @current->mm are NULL).
> 
> This changes mm_iommu_xxx API to receive mm_struct instead of using one
> from @current.
> 
> This is needed by the following patch to do proper cleanup in time.
> This depends on "powerpc/powernv/ioda: Fix endianness when reading TCEs"
> to do proper cleanup via tce_iommu_clear() patch.
> 
> To keep API consistent, this replaces mm_context_t with mm_struct;
> we stick to mm_struct as mm_iommu_adjust_locked_vm() helper needs
> access to &mm->mmap_sem.
> 
> This should cause no behavioral change.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>

I still have some questions about the use of mm in the driver, but
those aren't issues introduced by this patch, so as it is I think
the bug fix of this and the next patch is good.


* Re: [PATCH kernel 06/15] powerpc/mm/iommu: Put pages on process exit
  2016-08-03  8:40 ` [PATCH kernel 06/15] powerpc/mm/iommu: Put pages on process exit Alexey Kardashevskiy
@ 2016-08-03 10:11   ` Nicholas Piggin
  2016-08-12  3:13   ` David Gibson
  1 sibling, 0 replies; 60+ messages in thread
From: Nicholas Piggin @ 2016-08-03 10:11 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Alex Williamson, Paul Mackerras, David Gibson

On Wed,  3 Aug 2016 18:40:47 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> At the moment VFIO IOMMU SPAPR v2 driver pins all guest RAM pages when
> the userspace starts using VFIO. When the userspace process finishes,
> all the pinned pages need to be put; this is done as a part of
> the userspace memory context (MM) destruction which happens on
> the very last mmdrop().
> 
> This approach has a problem that a MM of the userspace process
> may live longer than the userspace process itself as kernel threads
> use userspace process MMs which was runnning on a CPU where
> the kernel thread was scheduled to. If this happened, the MM remains
> referenced until this exact kernel thread wakes up again
> and releases the very last reference to the MM, on an idle system this
> can take even hours.
> 
> This patch references and caches the MM once per container and tracks
> how many times each preregistered area was registered in
> a specific container. This way we do not depend on @current pointing to
> a valid task descriptor.
> 
> This changes the userspace interface to return EBUSY if memory is
> already registered (mm_iommu_get() used to increment the counter);
> however it should not have any practical effect as the only
> userspace tool available now does register memory area once per
> container anyway.
> 
> As tce_iommu_register_pages/tce_iommu_unregister_pages are called
> under container->lock, this does not need additional locking.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
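The lifetime pattern the patch describes — take one mm_count reference the first time a container registers memory, and mmdrop() it only when the container itself is released — can be sketched as a refcounting model in plain userspace C. The struct and function names below are illustrative stand-ins, not the kernel's:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Toy stand-in for struct mm_struct's mm_count lifetime. */
struct mm {
	atomic_int mm_count;
	int alive;               /* 1 while any reference exists */
};

struct container {
	struct mm *mm;           /* cached on first registration */
	unsigned long registrations;
};

static void mm_grab(struct mm *mm)
{
	atomic_fetch_add(&mm->mm_count, 1);
}

static void mm_drop(struct mm *mm)
{
	if (atomic_fetch_sub(&mm->mm_count, 1) == 1)
		mm->alive = 0;   /* last reference gone: "free" the mm */
}

/* First registration caches the caller's mm once, like the patched
 * tce_iommu_register_pages(); later calls only bump the per-area count. */
static int register_pages(struct container *c, struct mm *current_mm)
{
	if (!c->mm) {
		mm_grab(current_mm);
		c->mm = current_mm;
	}
	c->registrations++;
	return 0;
}

/* Container release drops the cached reference, like tce_iommu_release(). */
static void container_release(struct container *c)
{
	if (c->mm)
		mm_drop(c->mm);
	c->mm = NULL;
}
```

In this model the mm outlives the "process" (its own reference) for exactly as long as the container holds it, which is the behavior the patch wants instead of relying on @current at cleanup time.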


* Re: [PATCH kernel 02/15] KVM: PPC: Finish enabling VFIO KVM device on POWER
  2016-08-03  8:40 ` [PATCH kernel 02/15] KVM: PPC: Finish enabling VFIO KVM device on POWER Alexey Kardashevskiy
@ 2016-08-04  5:21   ` David Gibson
  0 siblings, 0 replies; 60+ messages in thread
From: David Gibson @ 2016-08-04  5:21 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: linuxppc-dev, Alex Williamson, Paul Mackerras

On Wed, Aug 03, 2016 at 06:40:43PM +1000, Alexey Kardashevskiy wrote:
> 178a787502 "vfio: Enable VFIO device for powerpc" made an attempt to
> enable VFIO KVM device on POWER.
> 
> However, as CONFIG_KVM_BOOK3S_64 does not use "common-objs-y",
> the VFIO KVM device was not enabled for Book3s KVM; this adds VFIO to
> the kvm-book3s_64-objs-y list.
> 
> While we are here, enforce KVM_VFIO on KVM_BOOK3S as other platforms
> already do.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

This should be merged regardless of the rest of the series.  There's
no reason not to include the kvm device on Power, and it makes life
easier for userspace because it doesn't need conditionals
about whether to instantiate it or not.

> ---
>  arch/powerpc/kvm/Kconfig  | 1 +
>  arch/powerpc/kvm/Makefile | 3 +++
>  2 files changed, 4 insertions(+)
> 
> diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
> index c2024ac..b7c494b 100644
> --- a/arch/powerpc/kvm/Kconfig
> +++ b/arch/powerpc/kvm/Kconfig
> @@ -64,6 +64,7 @@ config KVM_BOOK3S_64
>  	select KVM_BOOK3S_64_HANDLER
>  	select KVM
>  	select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
> +	select KVM_VFIO if VFIO
>  	---help---
>  	  Support running unmodified book3s_64 and book3s_32 guest kernels
>  	  in virtual machines on book3s_64 host processors.
> diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
> index 1f9e552..8907af9 100644
> --- a/arch/powerpc/kvm/Makefile
> +++ b/arch/powerpc/kvm/Makefile
> @@ -88,6 +88,9 @@ endif
>  kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \
>  	book3s_xics.o
>  
> +kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) += \
> +	$(KVM)/vfio.o
> +
>  kvm-book3s_64-module-objs += \
>  	$(KVM)/kvm_main.o \
>  	$(KVM)/eventfd.o \

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH kernel 04/15] powerpc/powernv/ioda: Fix TCE invalidate to work in real mode again
  2016-08-03  8:40 ` [PATCH kernel 04/15] powerpc/powernv/ioda: Fix TCE invalidate to work in real mode again Alexey Kardashevskiy
@ 2016-08-04  5:23   ` David Gibson
  2016-08-09 11:26   ` [kernel, " Michael Ellerman
  1 sibling, 0 replies; 60+ messages in thread
From: David Gibson @ 2016-08-04  5:23 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: linuxppc-dev, Alex Williamson, Paul Mackerras

On Wed, Aug 03, 2016 at 06:40:45PM +1000, Alexey Kardashevskiy wrote:
> "powerpc/powernv/pci: Rework accessing the TCE invalidate register"
> broke TCE invalidation on IODA2/PHB3 for real mode.
> 
> This makes invalidate work again.
> 
> Fixes: fd141d1a99a3
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  arch/powerpc/platforms/powernv/pci-ioda.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 53b56c0..59c7e7d 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1877,7 +1877,7 @@ static void pnv_pci_phb3_tce_invalidate(struct pnv_ioda_pe *pe, bool rm,
>  					unsigned shift, unsigned long index,
>  					unsigned long npages)
>  {
> -	__be64 __iomem *invalidate = pnv_ioda_get_inval_reg(pe->phb, false);
> +	__be64 __iomem *invalidate = pnv_ioda_get_inval_reg(pe->phb, rm);
>  	unsigned long start, end, inc;
>  
>  	/* We'll invalidate DMA address in PE scope */
> @@ -1935,10 +1935,12 @@ static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
>  			pnv_pci_phb3_tce_invalidate(pe, rm, shift,
>  						    index, npages);
>  		else if (rm)
> +		{
>  			opal_rm_pci_tce_kill(phb->opal_id,
>  					     OPAL_PCI_TCE_KILL_PAGES,
>  					     pe->pe_number, 1u << shift,
>  					     index << shift, npages);
> +		}

These braces look a) unrelated to the actual point of the patch, b)
unnecessary and c) not in keeping with normal coding style.

>  		else
>  			opal_pci_tce_kill(phb->opal_id,
>  					  OPAL_PCI_TCE_KILL_PAGES,

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH kernel 05/15] powerpc/iommu: Stop using @current in mm_iommu_xxx
  2016-08-03  8:40 ` [PATCH kernel 05/15] powerpc/iommu: Stop using @current in mm_iommu_xxx Alexey Kardashevskiy
  2016-08-03 10:10   ` Nicholas Piggin
@ 2016-08-05  7:00   ` Michael Ellerman
  2016-08-09  5:29     ` Alexey Kardashevskiy
  2016-08-09  4:43   ` Balbir Singh
  2016-08-12  2:57   ` David Gibson
  3 siblings, 1 reply; 60+ messages in thread
From: Michael Ellerman @ 2016-08-05  7:00 UTC (permalink / raw)
  To: Alexey Kardashevskiy, linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, Paul Mackerras, David Gibson

Alexey Kardashevskiy <aik@ozlabs.ru> writes:

> In some situations the userspace memory context may live longer than
> the userspace process itself so if we need to do proper memory context
> cleanup, we better cache @mm and use it later when the process is gone
> (@current or @current->mm are NULL).
>
> This changes mm_iommu_xxx API to receive mm_struct instead of using one
> from @current.
>
> This is needed by the following patch to do proper cleanup in time.
> This depends on "powerpc/powernv/ioda: Fix endianness when reading TCEs"
> to do proper cleanup via tce_iommu_clear() patch.
>
> To keep API consistent, this replaces mm_context_t with mm_struct;
> we stick to mm_struct as mm_iommu_adjust_locked_vm() helper needs
> access to &mm->mmap_sem.
>
> This should cause no behavioral change.

Is this a theoretical bug, or do we hit it in practice?

In other words, should I merge this as a fix for 4.8, or can it wait for
4.9 with the rest of the series?

> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  arch/powerpc/include/asm/mmu_context.h | 20 +++++++------
>  arch/powerpc/kernel/setup-common.c     |  2 +-
>  arch/powerpc/mm/mmu_context_book3s64.c |  4 +--
>  arch/powerpc/mm/mmu_context_iommu.c    | 54 ++++++++++++++--------------------

>  drivers/vfio/vfio_iommu_spapr_tce.c    | 41 ++++++++++++++++----------

I'd need an ACK from Alex for that part.

cheers


* Re: [PATCH kernel 14/15] vfio/spapr_tce: Export container API for external users
  2016-08-03  8:40 ` [PATCH kernel 14/15] vfio/spapr_tce: Export container API for external users Alexey Kardashevskiy
@ 2016-08-08 16:43   ` Alex Williamson
  2016-08-09  5:19     ` Alexey Kardashevskiy
  2016-08-12  5:25   ` David Gibson
  1 sibling, 1 reply; 60+ messages in thread
From: Alex Williamson @ 2016-08-08 16:43 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: linuxppc-dev, David Gibson, Paul Mackerras

On Wed,  3 Aug 2016 18:40:55 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> This exports helpers which are needed to keep a VFIO container in
> memory while there are external users such as KVM.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  drivers/vfio/vfio.c                 | 30 ++++++++++++++++++++++++++++++
>  drivers/vfio/vfio_iommu_spapr_tce.c | 16 +++++++++++++++-
>  include/linux/vfio.h                |  6 ++++++
>  3 files changed, 51 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index d1d70e0..baf6a9c 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -1729,6 +1729,36 @@ long vfio_external_check_extension(struct vfio_group *group, unsigned long arg)
>  EXPORT_SYMBOL_GPL(vfio_external_check_extension);
>  
>  /**
> + * External user API for containers, exported by symbols to be linked
> + * dynamically.
> + *
> + */
> +struct vfio_container *vfio_container_get_ext(struct file *filep)
> +{
> +	struct vfio_container *container = filep->private_data;
> +
> +	if (filep->f_op != &vfio_fops)
> +		return ERR_PTR(-EINVAL);
> +
> +	vfio_container_get(container);
> +
> +	return container;
> +}
> +EXPORT_SYMBOL_GPL(vfio_container_get_ext);
> +
> +void vfio_container_put_ext(struct vfio_container *container)
> +{
> +	vfio_container_put(container);
> +}
> +EXPORT_SYMBOL_GPL(vfio_container_put_ext);
> +
> +void *vfio_container_get_iommu_data_ext(struct vfio_container *container)
> +{
> +	return container->iommu_data;
> +}
> +EXPORT_SYMBOL_GPL(vfio_container_get_iommu_data_ext);
> +
> +/**
>   * Sub-module support
>   */
>  /*
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index 3594ad3..fceea3d 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -1331,6 +1331,21 @@ const struct vfio_iommu_driver_ops tce_iommu_driver_ops = {
>  	.detach_group	= tce_iommu_detach_group,
>  };
>  
> +struct iommu_table *vfio_container_spapr_tce_table_get_ext(void *iommu_data,
> +		u64 offset)
> +{
> +	struct tce_container *container = iommu_data;
> +	struct iommu_table *tbl = NULL;
> +
> +	if (tce_iommu_find_table(container, offset, &tbl) < 0)
> +		return NULL;
> +
> +	iommu_table_get(tbl);
> +
> +	return tbl;
> +}
> +EXPORT_SYMBOL_GPL(vfio_container_spapr_tce_table_get_ext);
> +
>  static int __init tce_iommu_init(void)
>  {
>  	return vfio_register_iommu_driver(&tce_iommu_driver_ops);
> @@ -1348,4 +1363,3 @@ MODULE_VERSION(DRIVER_VERSION);
>  MODULE_LICENSE("GPL v2");
>  MODULE_AUTHOR(DRIVER_AUTHOR);
>  MODULE_DESCRIPTION(DRIVER_DESC);
> -
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 0ecae0b..1c2138a 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -91,6 +91,12 @@ extern void vfio_group_put_external_user(struct vfio_group *group);
>  extern int vfio_external_user_iommu_id(struct vfio_group *group);
>  extern long vfio_external_check_extension(struct vfio_group *group,
>  					  unsigned long arg);
> +extern struct vfio_container *vfio_container_get_ext(struct file *filep);
> +extern void vfio_container_put_ext(struct vfio_container *container);
> +extern void *vfio_container_get_iommu_data_ext(
> +		struct vfio_container *container);
> +extern struct iommu_table *vfio_container_spapr_tce_table_get_ext(
> +		void *iommu_data, u64 offset);
>  
>  /*
>   * Sub-module helpers


I think you need to take a closer look at the lifecycle of a container:
having a reference means the container itself won't go away, but only
having a group set within that container holds the actual IOMMU
references.  container->iommu_data is going to be NULL once the
groups are lost.  Thanks,

Alex
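Alex's point can be modeled in plain C: holding a reference pins the container object, but not the IOMMU backend behind it, so an external user must re-check iommu_data on every use. The names below are illustrative stand-ins for the patch's vfio_container_* helpers, not actual VFIO API:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a VFIO container: the object's own refcount is
 * independent of the IOMMU backend, which lives only while at
 * least one group is attached. */
struct toy_container {
	int refs;
	int groups;
	void *iommu_data;        /* NULL once the last group detaches */
};

static void container_get(struct toy_container *c) { c->refs++; }
static void container_put(struct toy_container *c) { c->refs--; }

static void group_detach(struct toy_container *c)
{
	if (--c->groups == 0)
		c->iommu_data = NULL;   /* backend torn down */
}

/* What an external user (e.g. KVM) must do: never cache iommu_data
 * across operations; re-fetch and NULL-check it every time. */
static int use_container(struct toy_container *c)
{
	void *data = c->iommu_data;
	if (!data)
		return -1;              /* groups gone, nothing to operate on */
	return 0;
}
```

The model shows the gap: after group_detach() the container reference is still valid (the put/get pair balances), yet any cached iommu_data pointer would now be dangling state rather than a live backend.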


* Re: [PATCH kernel 05/15] powerpc/iommu: Stop using @current in mm_iommu_xxx
  2016-08-03  8:40 ` [PATCH kernel 05/15] powerpc/iommu: Stop using @current in mm_iommu_xxx Alexey Kardashevskiy
  2016-08-03 10:10   ` Nicholas Piggin
  2016-08-05  7:00   ` Michael Ellerman
@ 2016-08-09  4:43   ` Balbir Singh
  2016-08-09  6:04     ` Nicholas Piggin
  2016-08-12  2:57   ` David Gibson
  3 siblings, 1 reply; 60+ messages in thread
From: Balbir Singh @ 2016-08-09  4:43 UTC (permalink / raw)
  To: Alexey Kardashevskiy, linuxppc-dev
  Cc: Alex Williamson, Paul Mackerras, David Gibson



On 03/08/16 18:40, Alexey Kardashevskiy wrote:
> In some situations the userspace memory context may live longer than
> the userspace process itself so if we need to do proper memory context
> cleanup, we better cache @mm and use it later when the process is gone
> (@current or @current->mm are NULL).
> 
> This changes mm_iommu_xxx API to receive mm_struct instead of using one
> from @current.
> 
> This is needed by the following patch to do proper cleanup in time.
> This depends on "powerpc/powernv/ioda: Fix endianness when reading TCEs"
> to do proper cleanup via tce_iommu_clear() patch.
> 
> To keep API consistent, this replaces mm_context_t with mm_struct;
> we stick to mm_struct as mm_iommu_adjust_locked_vm() helper needs
> access to &mm->mmap_sem.
> 
> This should cause no behavioral change.
> 

Looks good, minor nits below

Acked-by: Balbir Singh <bsingharora@gmail.com>

> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  arch/powerpc/include/asm/mmu_context.h | 20 +++++++------
>  arch/powerpc/kernel/setup-common.c     |  2 +-
>  arch/powerpc/mm/mmu_context_book3s64.c |  4 +--
>  arch/powerpc/mm/mmu_context_iommu.c    | 54 ++++++++++++++--------------------
>  drivers/vfio/vfio_iommu_spapr_tce.c    | 41 ++++++++++++++++----------
>  5 files changed, 62 insertions(+), 59 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
> index 9d2cd0c..b85cc7b 100644
> --- a/arch/powerpc/include/asm/mmu_context.h
> +++ b/arch/powerpc/include/asm/mmu_context.h
> @@ -18,16 +18,18 @@ extern void destroy_context(struct mm_struct *mm);
>  #ifdef CONFIG_SPAPR_TCE_IOMMU
>  struct mm_iommu_table_group_mem_t;
>  
> -extern bool mm_iommu_preregistered(void);
> -extern long mm_iommu_get(unsigned long ua, unsigned long entries,
> +extern bool mm_iommu_preregistered(struct mm_struct *mm);
> +extern long mm_iommu_get(struct mm_struct *mm,
> +		unsigned long ua, unsigned long entries,
>  		struct mm_iommu_table_group_mem_t **pmem);
> -extern long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem);
> -extern void mm_iommu_init(mm_context_t *ctx);
> -extern void mm_iommu_cleanup(mm_context_t *ctx);
> -extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
> -		unsigned long size);
> -extern struct mm_iommu_table_group_mem_t *mm_iommu_find(unsigned long ua,
> -		unsigned long entries);
> +extern long mm_iommu_put(struct mm_struct *mm,
> +		struct mm_iommu_table_group_mem_t *mem);
> +extern void mm_iommu_init(struct mm_struct *mm);
> +extern void mm_iommu_cleanup(struct mm_struct *mm);
> +extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
> +		unsigned long ua, unsigned long size);
> +extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
> +		unsigned long ua, unsigned long entries);
>  extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
>  		unsigned long ua, unsigned long *hpa);
>  extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem);
> diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
> index 714b4ba..e90b68a 100644
> --- a/arch/powerpc/kernel/setup-common.c
> +++ b/arch/powerpc/kernel/setup-common.c
> @@ -905,7 +905,7 @@ void __init setup_arch(char **cmdline_p)
>  	init_mm.context.pte_frag = NULL;
>  #endif
>  #ifdef CONFIG_SPAPR_TCE_IOMMU
> -	mm_iommu_init(&init_mm.context);
> +	mm_iommu_init(&init_mm);
>  #endif
>  	irqstack_early_init();
>  	exc_lvl_early_init();
> diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c
> index b114f8b..ad82735 100644
> --- a/arch/powerpc/mm/mmu_context_book3s64.c
> +++ b/arch/powerpc/mm/mmu_context_book3s64.c
> @@ -115,7 +115,7 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
>  	mm->context.pte_frag = NULL;
>  #endif
>  #ifdef CONFIG_SPAPR_TCE_IOMMU
> -	mm_iommu_init(&mm->context);
> +	mm_iommu_init(mm);
>  #endif
>  	return 0;
>  }
> @@ -160,7 +160,7 @@ static inline void destroy_pagetable_page(struct mm_struct *mm)
>  void destroy_context(struct mm_struct *mm)
>  {
>  #ifdef CONFIG_SPAPR_TCE_IOMMU
> -	mm_iommu_cleanup(&mm->context);
> +	mm_iommu_cleanup(mm);
>  #endif
>  
>  #ifdef CONFIG_PPC_ICSWX
> diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
> index da6a216..ee6685b 100644
> --- a/arch/powerpc/mm/mmu_context_iommu.c
> +++ b/arch/powerpc/mm/mmu_context_iommu.c
> @@ -53,7 +53,7 @@ static long mm_iommu_adjust_locked_vm(struct mm_struct *mm,
>  	}
>  
>  	pr_debug("[%d] RLIMIT_MEMLOCK HASH64 %c%ld %ld/%ld\n",
> -			current->pid,
> +			current ? current->pid : 0,
>  			incr ? '+' : '-',
>  			npages << PAGE_SHIFT,
>  			mm->locked_vm << PAGE_SHIFT,
> @@ -63,28 +63,22 @@ static long mm_iommu_adjust_locked_vm(struct mm_struct *mm,
>  	return ret;
>  }
>  
> -bool mm_iommu_preregistered(void)
> +bool mm_iommu_preregistered(struct mm_struct *mm)
>  {
> -	if (!current || !current->mm)
> -		return false;
> -
> -	return !list_empty(&current->mm->context.iommu_group_mem_list);
> +	return !list_empty(&mm->context.iommu_group_mem_list);
>  }
>  EXPORT_SYMBOL_GPL(mm_iommu_preregistered);
>  
> -long mm_iommu_get(unsigned long ua, unsigned long entries,
> +long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
>  		struct mm_iommu_table_group_mem_t **pmem)
>  {
>  	struct mm_iommu_table_group_mem_t *mem;
>  	long i, j, ret = 0, locked_entries = 0;
>  	struct page *page = NULL;
>  
> -	if (!current || !current->mm)
> -		return -ESRCH; /* process exited */

VM_BUG_ON(mm == NULL)?

> -
>  	mutex_lock(&mem_list_mutex);
>  
> -	list_for_each_entry_rcu(mem, &current->mm->context.iommu_group_mem_list,
> +	list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list,
>  			next) {
>  		if ((mem->ua == ua) && (mem->entries == entries)) {
>  			++mem->used;
> @@ -102,7 +96,7 @@ long mm_iommu_get(unsigned long ua, unsigned long entries,
>  
>  	}
>  
> -	ret = mm_iommu_adjust_locked_vm(current->mm, entries, true);
> +	ret = mm_iommu_adjust_locked_vm(mm, entries, true);
>  	if (ret)
>  		goto unlock_exit;
>  
> @@ -142,11 +136,11 @@ long mm_iommu_get(unsigned long ua, unsigned long entries,
>  	mem->entries = entries;
>  	*pmem = mem;
>  
> -	list_add_rcu(&mem->next, &current->mm->context.iommu_group_mem_list);
> +	list_add_rcu(&mem->next, &mm->context.iommu_group_mem_list);
>  
>  unlock_exit:
>  	if (locked_entries && ret)
> -		mm_iommu_adjust_locked_vm(current->mm, locked_entries, false);
> +		mm_iommu_adjust_locked_vm(mm, locked_entries, false);
>  
>  	mutex_unlock(&mem_list_mutex);
>  
> @@ -191,16 +185,13 @@ static void mm_iommu_free(struct rcu_head *head)
>  static void mm_iommu_release(struct mm_iommu_table_group_mem_t *mem)
>  {
>  	list_del_rcu(&mem->next);
> -	mm_iommu_adjust_locked_vm(current->mm, mem->entries, false);
>  	call_rcu(&mem->rcu, mm_iommu_free);
>  }
>  
> -long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem)
> +long mm_iommu_put(struct mm_struct *mm, struct mm_iommu_table_group_mem_t *mem)
>  {
>  	long ret = 0;
>  
> -	if (!current || !current->mm)
> -		return -ESRCH; /* process exited */
>  
>  	mutex_lock(&mem_list_mutex);
>  
> @@ -224,6 +215,8 @@ long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem)
>  	/* @mapped became 0 so now mappings are disabled, release the region */
>  	mm_iommu_release(mem);
>  
> +	mm_iommu_adjust_locked_vm(mm, mem->entries, false);
> +
>  unlock_exit:
>  	mutex_unlock(&mem_list_mutex);
>  
> @@ -231,14 +224,12 @@ unlock_exit:
>  }
>  EXPORT_SYMBOL_GPL(mm_iommu_put);
>  
> -struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
> -		unsigned long size)
> +struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
> +		unsigned long ua, unsigned long size)
>  {
>  	struct mm_iommu_table_group_mem_t *mem, *ret = NULL;
>  
> -	list_for_each_entry_rcu(mem,
> -			&current->mm->context.iommu_group_mem_list,
> -			next) {
> +	list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list, next) {
>  		if ((mem->ua <= ua) &&
>  				(ua + size <= mem->ua +
>  				 (mem->entries << PAGE_SHIFT))) {
> @@ -251,14 +242,12 @@ struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
>  }
>  EXPORT_SYMBOL_GPL(mm_iommu_lookup);
>  
> -struct mm_iommu_table_group_mem_t *mm_iommu_find(unsigned long ua,
> -		unsigned long entries)
> +struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
> +		unsigned long ua, unsigned long entries)
>  {
>  	struct mm_iommu_table_group_mem_t *mem, *ret = NULL;
>  
> -	list_for_each_entry_rcu(mem,
> -			&current->mm->context.iommu_group_mem_list,
> -			next) {
> +	list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list, next) {
>  		if ((mem->ua == ua) && (mem->entries == entries)) {
>  			ret = mem;
>  			break;
> @@ -300,16 +289,17 @@ void mm_iommu_mapped_dec(struct mm_iommu_table_group_mem_t *mem)
>  }
>  EXPORT_SYMBOL_GPL(mm_iommu_mapped_dec);
>  
> -void mm_iommu_init(mm_context_t *ctx)
> +void mm_iommu_init(struct mm_struct *mm)
>  {
> -	INIT_LIST_HEAD_RCU(&ctx->iommu_group_mem_list);
> +	INIT_LIST_HEAD_RCU(&mm->context.iommu_group_mem_list);
>  }
>  
> -void mm_iommu_cleanup(mm_context_t *ctx)
> +void mm_iommu_cleanup(struct mm_struct *mm)
>  {
>  	struct mm_iommu_table_group_mem_t *mem, *tmp;
>  
> -	list_for_each_entry_safe(mem, tmp, &ctx->iommu_group_mem_list, next) {
> +	list_for_each_entry_safe(mem, tmp, &mm->context.iommu_group_mem_list,
> +			next) {
>  		list_del_rcu(&mem->next);
>  		mm_iommu_do_free(mem);
>  	}
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index 80378dd..9752e77 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -98,6 +98,7 @@ struct tce_container {
>  	bool enabled;
>  	bool v2;
>  	unsigned long locked_pages;
> +	struct mm_struct *mm;
>  	struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
>  	struct list_head group_list;
>  };
> @@ -110,11 +111,11 @@ static long tce_iommu_unregister_pages(struct tce_container *container,
>  	if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK))
>  		return -EINVAL;
>  
> -	mem = mm_iommu_find(vaddr, size >> PAGE_SHIFT);
> +	mem = mm_iommu_find(container->mm, vaddr, size >> PAGE_SHIFT);
>  	if (!mem)
>  		return -ENOENT;
>  
> -	return mm_iommu_put(mem);
> +	return mm_iommu_put(container->mm, mem);
>  }
>  
>  static long tce_iommu_register_pages(struct tce_container *container,
> @@ -128,10 +129,17 @@ static long tce_iommu_register_pages(struct tce_container *container,
>  			((vaddr + size) < vaddr))
>  		return -EINVAL;
>  
> -	ret = mm_iommu_get(vaddr, entries, &mem);
> +	if (!container->mm) {
> +		if (!current->mm)
> +			return -ESRCH; /* process exited */

You may even want to check for PF_EXITING and ignore those tasks?

> +
> +		atomic_inc(&current->mm->mm_count);
> +		container->mm = current->mm;
> +	}
> +
> +	ret = mm_iommu_get(container->mm, vaddr, entries, &mem);
>  	if (ret)
>  		return ret;
> -
>  	container->enabled = true;
>  
>  	return 0;
> @@ -354,6 +362,8 @@ static void tce_iommu_release(void *iommu_data)
>  		tce_iommu_free_table(tbl);
>  	}
>  
> +	if (container->mm)
> +		mmdrop(container->mm);
>  	tce_iommu_disable(container);
>  	mutex_destroy(&container->lock);
>  
> @@ -369,13 +379,14 @@ static void tce_iommu_unuse_page(struct tce_container *container,
>  	put_page(page);
>  }
>  
> -static int tce_iommu_prereg_ua_to_hpa(unsigned long tce, unsigned long size,
> +static int tce_iommu_prereg_ua_to_hpa(struct tce_container *container,
> +		unsigned long tce, unsigned long size,
>  		unsigned long *phpa, struct mm_iommu_table_group_mem_t **pmem)
>  {
>  	long ret = 0;
>  	struct mm_iommu_table_group_mem_t *mem;
>  
> -	mem = mm_iommu_lookup(tce, size);
> +	mem = mm_iommu_lookup(container->mm, tce, size);
>  	if (!mem)
>  		return -EINVAL;
>  
> @@ -388,18 +399,18 @@ static int tce_iommu_prereg_ua_to_hpa(unsigned long tce, unsigned long size,
>  	return 0;
>  }
>  
> -static void tce_iommu_unuse_page_v2(struct iommu_table *tbl,
> -		unsigned long entry)
> +static void tce_iommu_unuse_page_v2(struct tce_container *container,
> +		struct iommu_table *tbl, unsigned long entry)
>  {
>  	struct mm_iommu_table_group_mem_t *mem = NULL;
>  	int ret;
>  	unsigned long hpa = 0;
>  	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>  
> -	if (!pua || !current || !current->mm)
> +	if (!pua)
>  		return;
>  
> -	ret = tce_iommu_prereg_ua_to_hpa(*pua, IOMMU_PAGE_SIZE(tbl),
> +	ret = tce_iommu_prereg_ua_to_hpa(container, *pua, IOMMU_PAGE_SIZE(tbl),
>  			&hpa, &mem);
>  	if (ret)
>  		pr_debug("%s: tce %lx at #%lx was not cached, ret=%d\n",
> @@ -429,7 +440,7 @@ static int tce_iommu_clear(struct tce_container *container,
>  			continue;
>  
>  		if (container->v2) {
> -			tce_iommu_unuse_page_v2(tbl, entry);
> +			tce_iommu_unuse_page_v2(container, tbl, entry);
>  			continue;
>  		}
>  
> @@ -514,8 +525,8 @@ static long tce_iommu_build_v2(struct tce_container *container,
>  		unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl,
>  				entry + i);
>  
> -		ret = tce_iommu_prereg_ua_to_hpa(tce, IOMMU_PAGE_SIZE(tbl),
> -				&hpa, &mem);
> +		ret = tce_iommu_prereg_ua_to_hpa(container,
> +				tce, IOMMU_PAGE_SIZE(tbl), &hpa, &mem);
>  		if (ret)
>  			break;
>  
> @@ -536,7 +547,7 @@ static long tce_iommu_build_v2(struct tce_container *container,
>  		ret = iommu_tce_xchg(tbl, entry + i, &hpa, &dirtmp);
>  		if (ret) {
>  			/* dirtmp cannot be DMA_NONE here */
> -			tce_iommu_unuse_page_v2(tbl, entry + i);
> +			tce_iommu_unuse_page_v2(container, tbl, entry + i);
>  			pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%ld\n",
>  					__func__, entry << tbl->it_page_shift,
>  					tce, ret);
> @@ -544,7 +555,7 @@ static long tce_iommu_build_v2(struct tce_container *container,
>  		}
>  
>  		if (dirtmp != DMA_NONE)
> -			tce_iommu_unuse_page_v2(tbl, entry + i);
> +			tce_iommu_unuse_page_v2(container, tbl, entry + i);
>  
>  		*pua = tce;
>  
> 


* Re: [PATCH kernel 14/15] vfio/spapr_tce: Export container API for external users
  2016-08-08 16:43   ` Alex Williamson
@ 2016-08-09  5:19     ` Alexey Kardashevskiy
  2016-08-09 12:16       ` Alex Williamson
  0 siblings, 1 reply; 60+ messages in thread
From: Alexey Kardashevskiy @ 2016-08-09  5:19 UTC (permalink / raw)
  To: Alex Williamson; +Cc: linuxppc-dev, David Gibson, Paul Mackerras

On 09/08/16 02:43, Alex Williamson wrote:
> On Wed,  3 Aug 2016 18:40:55 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> This exports helpers which are needed to keep a VFIO container in
>> memory while there are external users such as KVM.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>  drivers/vfio/vfio.c                 | 30 ++++++++++++++++++++++++++++++
>>  drivers/vfio/vfio_iommu_spapr_tce.c | 16 +++++++++++++++-
>>  include/linux/vfio.h                |  6 ++++++
>>  3 files changed, 51 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>> index d1d70e0..baf6a9c 100644
>> --- a/drivers/vfio/vfio.c
>> +++ b/drivers/vfio/vfio.c
>> @@ -1729,6 +1729,36 @@ long vfio_external_check_extension(struct vfio_group *group, unsigned long arg)
>>  EXPORT_SYMBOL_GPL(vfio_external_check_extension);
>>  
>>  /**
>> + * External user API for containers, exported by symbols to be linked
>> + * dynamically.
>> + *
>> + */
>> +struct vfio_container *vfio_container_get_ext(struct file *filep)
>> +{
>> +	struct vfio_container *container = filep->private_data;
>> +
>> +	if (filep->f_op != &vfio_fops)
>> +		return ERR_PTR(-EINVAL);
>> +
>> +	vfio_container_get(container);
>> +
>> +	return container;
>> +}
>> +EXPORT_SYMBOL_GPL(vfio_container_get_ext);
>> +
>> +void vfio_container_put_ext(struct vfio_container *container)
>> +{
>> +	vfio_container_put(container);
>> +}
>> +EXPORT_SYMBOL_GPL(vfio_container_put_ext);
>> +
>> +void *vfio_container_get_iommu_data_ext(struct vfio_container *container)
>> +{
>> +	return container->iommu_data;
>> +}
>> +EXPORT_SYMBOL_GPL(vfio_container_get_iommu_data_ext);
>> +
>> +/**
>>   * Sub-module support
>>   */
>>  /*
>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
>> index 3594ad3..fceea3d 100644
>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
>> @@ -1331,6 +1331,21 @@ const struct vfio_iommu_driver_ops tce_iommu_driver_ops = {
>>  	.detach_group	= tce_iommu_detach_group,
>>  };
>>  
>> +struct iommu_table *vfio_container_spapr_tce_table_get_ext(void *iommu_data,
>> +		u64 offset)
>> +{
>> +	struct tce_container *container = iommu_data;
>> +	struct iommu_table *tbl = NULL;
>> +
>> +	if (tce_iommu_find_table(container, offset, &tbl) < 0)
>> +		return NULL;
>> +
>> +	iommu_table_get(tbl);
>> +
>> +	return tbl;
>> +}
>> +EXPORT_SYMBOL_GPL(vfio_container_spapr_tce_table_get_ext);
>> +
>>  static int __init tce_iommu_init(void)
>>  {
>>  	return vfio_register_iommu_driver(&tce_iommu_driver_ops);
>> @@ -1348,4 +1363,3 @@ MODULE_VERSION(DRIVER_VERSION);
>>  MODULE_LICENSE("GPL v2");
>>  MODULE_AUTHOR(DRIVER_AUTHOR);
>>  MODULE_DESCRIPTION(DRIVER_DESC);
>> -
>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
>> index 0ecae0b..1c2138a 100644
>> --- a/include/linux/vfio.h
>> +++ b/include/linux/vfio.h
>> @@ -91,6 +91,12 @@ extern void vfio_group_put_external_user(struct vfio_group *group);
>>  extern int vfio_external_user_iommu_id(struct vfio_group *group);
>>  extern long vfio_external_check_extension(struct vfio_group *group,
>>  					  unsigned long arg);
>> +extern struct vfio_container *vfio_container_get_ext(struct file *filep);
>> +extern void vfio_container_put_ext(struct vfio_container *container);
>> +extern void *vfio_container_get_iommu_data_ext(
>> +		struct vfio_container *container);
>> +extern struct iommu_table *vfio_container_spapr_tce_table_get_ext(
>> +		void *iommu_data, u64 offset);
>>  
>>  /*
>>   * Sub-module helpers
> 
> 
> I think you need to take a closer look of the lifecycle of a container,
> having a reference means the container itself won't go away, but only
> having a group set within that container holds the actual IOMMU
> references.  container->iommu_data is going to be NULL once the
> groups are lost.  Thanks,


Container owns the iommu tables and this is what I care about here, groups
attached or not - this is handled separately via IOMMU group list in a
specific iommu_table struct, these groups get detached from iommu_table
when they are removed from a container.


-- 
Alexey


* Re: [PATCH kernel 05/15] powerpc/iommu: Stop using @current in mm_iommu_xxx
  2016-08-05  7:00   ` Michael Ellerman
@ 2016-08-09  5:29     ` Alexey Kardashevskiy
  0 siblings, 0 replies; 60+ messages in thread
From: Alexey Kardashevskiy @ 2016-08-09  5:29 UTC (permalink / raw)
  To: Michael Ellerman, linuxppc-dev
  Cc: Alex Williamson, Paul Mackerras, David Gibson

On 05/08/16 17:00, Michael Ellerman wrote:
> Alexey Kardashevskiy <aik@ozlabs.ru> writes:
> 
>> In some situations the userspace memory context may live longer than
>> the userspace process itself so if we need to do proper memory context
>> cleanup, we better cache @mm and use it later when the process is gone
>> (@current or @current->mm are NULL).
>>
>> This changes mm_iommu_xxx API to receive mm_struct instead of using one
>> from @current.
>>
>> This is needed by the following patch to do proper cleanup in time.
>> This depends on "powerpc/powernv/ioda: Fix endianness when reading TCEs"
>> to do proper cleanup via tce_iommu_clear() patch.
>>
>> To keep API consistent, this replaces mm_context_t with mm_struct;
>> we stick to mm_struct as mm_iommu_adjust_locked_vm() helper needs
>> access to &mm->mmap_sem.
>>
>> This should cause no behavioral change.
> 
> Is this a theoretical bug, or do we hit it in practice?

Actual bug.

> 
> In other words, should I merge this as a fix for 4.8, or can it wait for
> 4.9 with the rest of the series?

Assuming this does not have "rb" or "ab" from anyone familiar with IOMMU on
powernv, this has to wait :-/

> 
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>  arch/powerpc/include/asm/mmu_context.h | 20 +++++++------
>>  arch/powerpc/kernel/setup-common.c     |  2 +-
>>  arch/powerpc/mm/mmu_context_book3s64.c |  4 +--
>>  arch/powerpc/mm/mmu_context_iommu.c    | 54 ++++++++++++++--------------------
> 
>>  drivers/vfio/vfio_iommu_spapr_tce.c    | 41 ++++++++++++++++----------
> 
> I'd need an ACK from Alex for that part.



-- 
Alexey


* Re: [PATCH kernel 05/15] powerpc/iommu: Stop using @current in mm_iommu_xxx
  2016-08-09  4:43   ` Balbir Singh
@ 2016-08-09  6:04     ` Nicholas Piggin
  2016-08-09  6:17       ` Balbir Singh
  0 siblings, 1 reply; 60+ messages in thread
From: Nicholas Piggin @ 2016-08-09  6:04 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Alexey Kardashevskiy, linuxppc-dev, Alex Williamson,
	Paul Mackerras, David Gibson

On Tue, 9 Aug 2016 14:43:00 +1000
Balbir Singh <bsingharora@gmail.com> wrote:

> On 03/08/16 18:40, Alexey Kardashevskiy wrote:

> > -long mm_iommu_get(unsigned long ua, unsigned long entries,
> > +long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
> >  		struct mm_iommu_table_group_mem_t **pmem)
> >  {
> >  	struct mm_iommu_table_group_mem_t *mem;
> >  	long i, j, ret = 0, locked_entries = 0;
> >  	struct page *page = NULL;
> >  
> > -	if (!current || !current->mm)
> > -		return -ESRCH; /* process exited */  
> 
> VM_BUG_ON(mm == NULL)?


> > @@ -128,10 +129,17 @@ static long tce_iommu_register_pages(struct tce_container *container,
> >  			((vaddr + size) < vaddr))
> >  		return -EINVAL;
> >  
> > -	ret = mm_iommu_get(vaddr, entries, &mem);
> > +	if (!container->mm) {
> > +		if (!current->mm)
> > +			return -ESRCH; /* process exited */  
> 
> You may even want to check for PF_EXITING and ignore those tasks?


These are related to some of the questions I had about the patch.

But I think it makes sense just to take this approach as a minimal
bug fix without changing the logic too much or adding BUG_ONs, and then
we can consider how the iommu takes references to mm and uses them
(if anybody finds the time).

Thanks,
Nick


* Re: [PATCH kernel 05/15] powerpc/iommu: Stop using @current in mm_iommu_xxx
  2016-08-09  6:04     ` Nicholas Piggin
@ 2016-08-09  6:17       ` Balbir Singh
  0 siblings, 0 replies; 60+ messages in thread
From: Balbir Singh @ 2016-08-09  6:17 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Alexey Kardashevskiy, linuxppc-dev, Alex Williamson,
	Paul Mackerras, David Gibson



On 09/08/16 16:04, Nicholas Piggin wrote:
> On Tue, 9 Aug 2016 14:43:00 +1000
> Balbir Singh <bsingharora@gmail.com> wrote:
> 
>> On 03/08/16 18:40, Alexey Kardashevskiy wrote:
> 
>>> -long mm_iommu_get(unsigned long ua, unsigned long entries,
>>> +long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
>>>  		struct mm_iommu_table_group_mem_t **pmem)
>>>  {
>>>  	struct mm_iommu_table_group_mem_t *mem;
>>>  	long i, j, ret = 0, locked_entries = 0;
>>>  	struct page *page = NULL;
>>>  
>>> -	if (!current || !current->mm)
>>> -		return -ESRCH; /* process exited */  
>>
>> VM_BUG_ON(mm == NULL)?
> 
> 
>>> @@ -128,10 +129,17 @@ static long tce_iommu_register_pages(struct tce_container *container,
>>>  			((vaddr + size) < vaddr))
>>>  		return -EINVAL;
>>>  
>>> -	ret = mm_iommu_get(vaddr, entries, &mem);
>>> +	if (!container->mm) {
>>> +		if (!current->mm)
>>> +			return -ESRCH; /* process exited */  
>>
>> You may even want to check for PF_EXITING and ignore those tasks?
> 
> 
> These are related to some of the questions I had about the patch.
> 
> But I think it makes sense just to take this approach as a minimal
> bug fix without changing the logic too much or adding BUG_ONs, and then
> we can consider how the iommu takes references to mm and uses them
> (if anybody finds the time).
> 

Agreed

Balbir


* Re: [kernel, 04/15] powerpc/powernv/ioda: Fix TCE invalidate to work in real mode again
  2016-08-03  8:40 ` [PATCH kernel 04/15] powerpc/powernv/ioda: Fix TCE invalidate to work in real mode again Alexey Kardashevskiy
  2016-08-04  5:23   ` David Gibson
@ 2016-08-09 11:26   ` Michael Ellerman
  1 sibling, 0 replies; 60+ messages in thread
From: Michael Ellerman @ 2016-08-09 11:26 UTC (permalink / raw)
  To: Alexey Kardashevskiy, linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, Paul Mackerras, David Gibson

On Wed, 2016-03-08 at 08:40:45 UTC, Alexey Kardashevskiy wrote:
> "powerpc/powernv/pci: Rework accessing the TCE invalidate register"
> broke TCE invalidation on IODA2/PHB3 for real mode.
> 
> This makes invalidate work again.
> 
> Fixes: fd141d1a99a3
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/4d9021957b5218310e28767f25

cheers


* Re: [PATCH kernel 14/15] vfio/spapr_tce: Export container API for external users
  2016-08-09  5:19     ` Alexey Kardashevskiy
@ 2016-08-09 12:16       ` Alex Williamson
  2016-08-10  5:37         ` Alexey Kardashevskiy
  2016-08-15  3:59         ` Paul Mackerras
  0 siblings, 2 replies; 60+ messages in thread
From: Alex Williamson @ 2016-08-09 12:16 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: linuxppc-dev, David Gibson, Paul Mackerras

On Tue, 9 Aug 2016 15:19:39 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 09/08/16 02:43, Alex Williamson wrote:
> > On Wed,  3 Aug 2016 18:40:55 +1000
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >   
> >> This exports helpers which are needed to keep a VFIO container in
> >> memory while there are external users such as KVM.
> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> ---
> >>  drivers/vfio/vfio.c                 | 30 ++++++++++++++++++++++++++++++
> >>  drivers/vfio/vfio_iommu_spapr_tce.c | 16 +++++++++++++++-
> >>  include/linux/vfio.h                |  6 ++++++
> >>  3 files changed, 51 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> >> index d1d70e0..baf6a9c 100644
> >> --- a/drivers/vfio/vfio.c
> >> +++ b/drivers/vfio/vfio.c
> >> @@ -1729,6 +1729,36 @@ long vfio_external_check_extension(struct vfio_group *group, unsigned long arg)
> >>  EXPORT_SYMBOL_GPL(vfio_external_check_extension);
> >>  
> >>  /**
> >> + * External user API for containers, exported by symbols to be linked
> >> + * dynamically.
> >> + *
> >> + */
> >> +struct vfio_container *vfio_container_get_ext(struct file *filep)
> >> +{
> >> +	struct vfio_container *container = filep->private_data;
> >> +
> >> +	if (filep->f_op != &vfio_fops)
> >> +		return ERR_PTR(-EINVAL);
> >> +
> >> +	vfio_container_get(container);
> >> +
> >> +	return container;
> >> +}
> >> +EXPORT_SYMBOL_GPL(vfio_container_get_ext);
> >> +
> >> +void vfio_container_put_ext(struct vfio_container *container)
> >> +{
> >> +	vfio_container_put(container);
> >> +}
> >> +EXPORT_SYMBOL_GPL(vfio_container_put_ext);
> >> +
> >> +void *vfio_container_get_iommu_data_ext(struct vfio_container *container)
> >> +{
> >> +	return container->iommu_data;
> >> +}
> >> +EXPORT_SYMBOL_GPL(vfio_container_get_iommu_data_ext);
> >> +
> >> +/**
> >>   * Sub-module support
> >>   */
> >>  /*
> >> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> >> index 3594ad3..fceea3d 100644
> >> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> >> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> >> @@ -1331,6 +1331,21 @@ const struct vfio_iommu_driver_ops tce_iommu_driver_ops = {
> >>  	.detach_group	= tce_iommu_detach_group,
> >>  };
> >>  
> >> +struct iommu_table *vfio_container_spapr_tce_table_get_ext(void *iommu_data,
> >> +		u64 offset)
> >> +{
> >> +	struct tce_container *container = iommu_data;
> >> +	struct iommu_table *tbl = NULL;
> >> +
> >> +	if (tce_iommu_find_table(container, offset, &tbl) < 0)
> >> +		return NULL;
> >> +
> >> +	iommu_table_get(tbl);
> >> +
> >> +	return tbl;
> >> +}
> >> +EXPORT_SYMBOL_GPL(vfio_container_spapr_tce_table_get_ext);
> >> +
> >>  static int __init tce_iommu_init(void)
> >>  {
> >>  	return vfio_register_iommu_driver(&tce_iommu_driver_ops);
> >> @@ -1348,4 +1363,3 @@ MODULE_VERSION(DRIVER_VERSION);
> >>  MODULE_LICENSE("GPL v2");
> >>  MODULE_AUTHOR(DRIVER_AUTHOR);
> >>  MODULE_DESCRIPTION(DRIVER_DESC);
> >> -
> >> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> >> index 0ecae0b..1c2138a 100644
> >> --- a/include/linux/vfio.h
> >> +++ b/include/linux/vfio.h
> >> @@ -91,6 +91,12 @@ extern void vfio_group_put_external_user(struct vfio_group *group);
> >>  extern int vfio_external_user_iommu_id(struct vfio_group *group);
> >>  extern long vfio_external_check_extension(struct vfio_group *group,
> >>  					  unsigned long arg);
> >> +extern struct vfio_container *vfio_container_get_ext(struct file *filep);
> >> +extern void vfio_container_put_ext(struct vfio_container *container);
> >> +extern void *vfio_container_get_iommu_data_ext(
> >> +		struct vfio_container *container);
> >> +extern struct iommu_table *vfio_container_spapr_tce_table_get_ext(
> >> +		void *iommu_data, u64 offset);
> >>  
> >>  /*
> >>   * Sub-module helpers  
> > 
> > 
> > I think you need to take a closer look of the lifecycle of a container,
> > having a reference means the container itself won't go away, but only
> > having a group set within that container holds the actual IOMMU
> > references.  container->iommu_data is going to be NULL once the
> > groups are lost.  Thanks,  
> 
> 
> Container owns the iommu tables and this is what I care about here, groups
> attached or not - this is handled separately via IOMMU group list in a
> specific iommu_table struct, these groups get detached from iommu_table
> when they are removed from a container.

The container doesn't own anything, the container is privileged by the
groups being attached to it.  When groups are closed, they detach from
the container and once the container group list is empty the iommu
backend is released and iommu_data is NULL.  A container reference
doesn't give you what you're looking for.  It implies nothing about the
iommu backend.
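
The lifecycle described above can be sketched as a toy userspace model (all names here are invented for illustration; this is not VFIO code): external references pin the container struct itself, but the IOMMU backend lives and dies with the attached groups, so iommu_data can go away while references are still held.

```c
#include <stdlib.h>

/*
 * Toy model of the container lifecycle (invented names, not VFIO code):
 * external references keep the struct alive, but the IOMMU backend is
 * released as soon as the last group detaches.
 */
struct container {
	int refs;         /* external references, e.g. taken by KVM */
	int groups;       /* groups currently attached */
	void *iommu_data; /* IOMMU backend state */
};

/* vfio_container_get() analogue: pins the struct only */
static void container_get(struct container *c)
{
	c->refs++;
}

/* group close path: the last detach releases the backend */
static void group_detach(struct container *c)
{
	if (--c->groups == 0) {
		free(c->iommu_data);
		c->iommu_data = NULL; /* backend gone despite c->refs > 0 */
	}
}
```

An external user holding only container_get() references still sees iommu_data go NULL once userspace closes the last group fd, which is the point being made here.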


* Re: [PATCH kernel 14/15] vfio/spapr_tce: Export container API for external users
  2016-08-09 12:16       ` Alex Williamson
@ 2016-08-10  5:37         ` Alexey Kardashevskiy
  2016-08-10 16:46           ` Alex Williamson
  2016-08-15  3:59         ` Paul Mackerras
  1 sibling, 1 reply; 60+ messages in thread
From: Alexey Kardashevskiy @ 2016-08-10  5:37 UTC (permalink / raw)
  To: Alex Williamson; +Cc: linuxppc-dev, David Gibson, Paul Mackerras

On 09/08/16 22:16, Alex Williamson wrote:
> On Tue, 9 Aug 2016 15:19:39 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> On 09/08/16 02:43, Alex Williamson wrote:
>>> On Wed,  3 Aug 2016 18:40:55 +1000
>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>   
>>>> This exports helpers which are needed to keep a VFIO container in
>>>> memory while there are external users such as KVM.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> ---
>>>>  drivers/vfio/vfio.c                 | 30 ++++++++++++++++++++++++++++++
>>>>  drivers/vfio/vfio_iommu_spapr_tce.c | 16 +++++++++++++++-
>>>>  include/linux/vfio.h                |  6 ++++++
>>>>  3 files changed, 51 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>>>> index d1d70e0..baf6a9c 100644
>>>> --- a/drivers/vfio/vfio.c
>>>> +++ b/drivers/vfio/vfio.c
>>>> @@ -1729,6 +1729,36 @@ long vfio_external_check_extension(struct vfio_group *group, unsigned long arg)
>>>>  EXPORT_SYMBOL_GPL(vfio_external_check_extension);
>>>>  
>>>>  /**
>>>> + * External user API for containers, exported by symbols to be linked
>>>> + * dynamically.
>>>> + *
>>>> + */
>>>> +struct vfio_container *vfio_container_get_ext(struct file *filep)
>>>> +{
>>>> +	struct vfio_container *container = filep->private_data;
>>>> +
>>>> +	if (filep->f_op != &vfio_fops)
>>>> +		return ERR_PTR(-EINVAL);
>>>> +
>>>> +	vfio_container_get(container);
>>>> +
>>>> +	return container;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(vfio_container_get_ext);
>>>> +
>>>> +void vfio_container_put_ext(struct vfio_container *container)
>>>> +{
>>>> +	vfio_container_put(container);
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(vfio_container_put_ext);
>>>> +
>>>> +void *vfio_container_get_iommu_data_ext(struct vfio_container *container)
>>>> +{
>>>> +	return container->iommu_data;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(vfio_container_get_iommu_data_ext);
>>>> +
>>>> +/**
>>>>   * Sub-module support
>>>>   */
>>>>  /*
>>>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
>>>> index 3594ad3..fceea3d 100644
>>>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
>>>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
>>>> @@ -1331,6 +1331,21 @@ const struct vfio_iommu_driver_ops tce_iommu_driver_ops = {
>>>>  	.detach_group	= tce_iommu_detach_group,
>>>>  };
>>>>  
>>>> +struct iommu_table *vfio_container_spapr_tce_table_get_ext(void *iommu_data,
>>>> +		u64 offset)
>>>> +{
>>>> +	struct tce_container *container = iommu_data;
>>>> +	struct iommu_table *tbl = NULL;
>>>> +
>>>> +	if (tce_iommu_find_table(container, offset, &tbl) < 0)
>>>> +		return NULL;
>>>> +
>>>> +	iommu_table_get(tbl);
>>>> +
>>>> +	return tbl;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(vfio_container_spapr_tce_table_get_ext);
>>>> +
>>>>  static int __init tce_iommu_init(void)
>>>>  {
>>>>  	return vfio_register_iommu_driver(&tce_iommu_driver_ops);
>>>> @@ -1348,4 +1363,3 @@ MODULE_VERSION(DRIVER_VERSION);
>>>>  MODULE_LICENSE("GPL v2");
>>>>  MODULE_AUTHOR(DRIVER_AUTHOR);
>>>>  MODULE_DESCRIPTION(DRIVER_DESC);
>>>> -
>>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
>>>> index 0ecae0b..1c2138a 100644
>>>> --- a/include/linux/vfio.h
>>>> +++ b/include/linux/vfio.h
>>>> @@ -91,6 +91,12 @@ extern void vfio_group_put_external_user(struct vfio_group *group);
>>>>  extern int vfio_external_user_iommu_id(struct vfio_group *group);
>>>>  extern long vfio_external_check_extension(struct vfio_group *group,
>>>>  					  unsigned long arg);
>>>> +extern struct vfio_container *vfio_container_get_ext(struct file *filep);
>>>> +extern void vfio_container_put_ext(struct vfio_container *container);
>>>> +extern void *vfio_container_get_iommu_data_ext(
>>>> +		struct vfio_container *container);
>>>> +extern struct iommu_table *vfio_container_spapr_tce_table_get_ext(
>>>> +		void *iommu_data, u64 offset);
>>>>  
>>>>  /*
>>>>   * Sub-module helpers  
>>>
>>>
>>> I think you need to take a closer look of the lifecycle of a container,
>>> having a reference means the container itself won't go away, but only
>>> having a group set within that container holds the actual IOMMU
>>> references.  container->iommu_data is going to be NULL once the
>>> groups are lost.  Thanks,  
>>
>>
>> Container owns the iommu tables and this is what I care about here, groups
>> attached or not - this is handled separately via IOMMU group list in a
>> specific iommu_table struct, these groups get detached from iommu_table
>> when they are removed from a container.
> 
> The container doesn't own anything, the container is privileged by the
> groups being attached to it.  When groups are closed, they detach from
> the container and once the container group list is empty the iommu
> backend is released and iommu_data is NULL.  A container reference
> doesn't give you what you're looking for.  It implies nothing about the
> iommu backend.


Well. Backend is a part of a container and since a backend owns tables, a
container owns them too.

The problem I am trying to solve here is when KVM may release the
iommu_table objects.

"Set" ioctl() to KVM-spapr-tce-table (or KVM itself, does not really
matter) makes a link between KVM-spapr-tce-table and container and KVM can
start using tables (with referencing them).

First I tried adding an "unset" ioctl to KVM-spapr-tce-table, called it
from region_del() and this works if QEMU removes a window. However if QEMU
removes a vfio-pci device, region_del() is not called and KVM does not get
notified that it can release the iommu_table's because the
KVM-spapr-tce-table remains alive and does not get destroyed (as it is
still used by emulated devices or other containers).

So it was suggested that we could do such "unset" somehow later assuming,
for example, on every "set" I could check if some of currently attached
containers are no more used - and this is where being able to know if there
is no backend helps - KVM remembers a container pointer and can check this
via vfio_container_get_iommu_data_ext().

The other option would be changing vfio_container_get_ext() to take a
callback+opaque which container would call when it destroys iommu_data.
This looks more intrusive and not very intuitive how to make it right -
container would have to keep track of all registered external users and
vfio_container_put_ext() would have to pass the same callback+opaque to
unregister the exact external user.


Or I could store container file* in KVM. Then iommu_data would never be
released until KVM-spapr-tce-table is destroyed.

Recreating KVM-spapr-tce-table on every vfio-pci hotunplug (closing its fd
would "unset" container from KVM-spapr-tce-table) is not an option as there
still may be devices using this KVM-spapr-tce-table.

What obvious and nice solution am I missing here? Thanks.
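
For reference, the "callback+opaque" option above could look roughly like the following toy userspace model (all names are invented; this is not a proposed kernel API): each external user registers a release callback, and the container invokes every registered callback when it destroys iommu_data.

```c
#include <stddef.h>

/*
 * Toy model of the "callback+opaque" option (invented names, not a
 * proposed kernel API).
 */
#define MAX_EXT_USERS 4

struct ext_user {
	void (*release)(void *opaque);
	void *opaque;
};

struct container {
	struct ext_user users[MAX_EXT_USERS];
	int nusers;
	void *iommu_data;
};

/* vfio_container_get_ext() analogue: register a callback+opaque pair */
static int container_get_ext(struct container *c,
			     void (*release)(void *), void *opaque)
{
	if (c->nusers == MAX_EXT_USERS)
		return -1;
	c->users[c->nusers++] = (struct ext_user){ release, opaque };
	return 0;
}

/* vfio_container_put_ext() analogue: the identical pair unregisters */
static void container_put_ext(struct container *c,
			      void (*release)(void *), void *opaque)
{
	for (int i = 0; i < c->nusers; i++)
		if (c->users[i].release == release &&
		    c->users[i].opaque == opaque) {
			c->users[i] = c->users[--c->nusers];
			return;
		}
}

/* last group detached: notify every external user, drop the backend */
static void container_release_backend(struct container *c)
{
	for (int i = 0; i < c->nusers; i++)
		c->users[i].release(c->users[i].opaque);
	c->iommu_data = NULL;
}

/* example callback: KVM would drop its cached iommu_table pointers here */
static void kvm_release_cb(void *opaque)
{
	*(int *)opaque = 1;
}
```

This shows exactly the bookkeeping called intrusive above: the container must track every registered external user, and put_ext() must be passed the identical callback+opaque pair to find the right entry.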


-- 
Alexey


* Re: [PATCH kernel 14/15] vfio/spapr_tce: Export container API for external users
  2016-08-10  5:37         ` Alexey Kardashevskiy
@ 2016-08-10 16:46           ` Alex Williamson
  2016-08-12  5:46             ` David Gibson
  0 siblings, 1 reply; 60+ messages in thread
From: Alex Williamson @ 2016-08-10 16:46 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: linuxppc-dev, David Gibson, Paul Mackerras

On Wed, 10 Aug 2016 15:37:17 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 09/08/16 22:16, Alex Williamson wrote:
> > On Tue, 9 Aug 2016 15:19:39 +1000
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >   
> >> On 09/08/16 02:43, Alex Williamson wrote:  
> >>> On Wed,  3 Aug 2016 18:40:55 +1000
> >>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >>>     
> >>>> This exports helpers which are needed to keep a VFIO container in
> >>>> memory while there are external users such as KVM.
> >>>>
> >>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>> ---
> >>>>  drivers/vfio/vfio.c                 | 30 ++++++++++++++++++++++++++++++
> >>>>  drivers/vfio/vfio_iommu_spapr_tce.c | 16 +++++++++++++++-
> >>>>  include/linux/vfio.h                |  6 ++++++
> >>>>  3 files changed, 51 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> >>>> index d1d70e0..baf6a9c 100644
> >>>> --- a/drivers/vfio/vfio.c
> >>>> +++ b/drivers/vfio/vfio.c
> >>>> @@ -1729,6 +1729,36 @@ long vfio_external_check_extension(struct vfio_group *group, unsigned long arg)
> >>>>  EXPORT_SYMBOL_GPL(vfio_external_check_extension);
> >>>>  
> >>>>  /**
> >>>> + * External user API for containers, exported by symbols to be linked
> >>>> + * dynamically.
> >>>> + *
> >>>> + */
> >>>> +struct vfio_container *vfio_container_get_ext(struct file *filep)
> >>>> +{
> >>>> +	struct vfio_container *container = filep->private_data;
> >>>> +
> >>>> +	if (filep->f_op != &vfio_fops)
> >>>> +		return ERR_PTR(-EINVAL);
> >>>> +
> >>>> +	vfio_container_get(container);
> >>>> +
> >>>> +	return container;
> >>>> +}
> >>>> +EXPORT_SYMBOL_GPL(vfio_container_get_ext);
> >>>> +
> >>>> +void vfio_container_put_ext(struct vfio_container *container)
> >>>> +{
> >>>> +	vfio_container_put(container);
> >>>> +}
> >>>> +EXPORT_SYMBOL_GPL(vfio_container_put_ext);
> >>>> +
> >>>> +void *vfio_container_get_iommu_data_ext(struct vfio_container *container)
> >>>> +{
> >>>> +	return container->iommu_data;
> >>>> +}
> >>>> +EXPORT_SYMBOL_GPL(vfio_container_get_iommu_data_ext);
> >>>> +
> >>>> +/**
> >>>>   * Sub-module support
> >>>>   */
> >>>>  /*
> >>>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> >>>> index 3594ad3..fceea3d 100644
> >>>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> >>>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> >>>> @@ -1331,6 +1331,21 @@ const struct vfio_iommu_driver_ops tce_iommu_driver_ops = {
> >>>>  	.detach_group	= tce_iommu_detach_group,
> >>>>  };
> >>>>  
> >>>> +struct iommu_table *vfio_container_spapr_tce_table_get_ext(void *iommu_data,
> >>>> +		u64 offset)
> >>>> +{
> >>>> +	struct tce_container *container = iommu_data;
> >>>> +	struct iommu_table *tbl = NULL;
> >>>> +
> >>>> +	if (tce_iommu_find_table(container, offset, &tbl) < 0)
> >>>> +		return NULL;
> >>>> +
> >>>> +	iommu_table_get(tbl);
> >>>> +
> >>>> +	return tbl;
> >>>> +}
> >>>> +EXPORT_SYMBOL_GPL(vfio_container_spapr_tce_table_get_ext);
> >>>> +
> >>>>  static int __init tce_iommu_init(void)
> >>>>  {
> >>>>  	return vfio_register_iommu_driver(&tce_iommu_driver_ops);
> >>>> @@ -1348,4 +1363,3 @@ MODULE_VERSION(DRIVER_VERSION);
> >>>>  MODULE_LICENSE("GPL v2");
> >>>>  MODULE_AUTHOR(DRIVER_AUTHOR);
> >>>>  MODULE_DESCRIPTION(DRIVER_DESC);
> >>>> -
> >>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> >>>> index 0ecae0b..1c2138a 100644
> >>>> --- a/include/linux/vfio.h
> >>>> +++ b/include/linux/vfio.h
> >>>> @@ -91,6 +91,12 @@ extern void vfio_group_put_external_user(struct vfio_group *group);
> >>>>  extern int vfio_external_user_iommu_id(struct vfio_group *group);
> >>>>  extern long vfio_external_check_extension(struct vfio_group *group,
> >>>>  					  unsigned long arg);
> >>>> +extern struct vfio_container *vfio_container_get_ext(struct file *filep);
> >>>> +extern void vfio_container_put_ext(struct vfio_container *container);
> >>>> +extern void *vfio_container_get_iommu_data_ext(
> >>>> +		struct vfio_container *container);
> >>>> +extern struct iommu_table *vfio_container_spapr_tce_table_get_ext(
> >>>> +		void *iommu_data, u64 offset);
> >>>>  
> >>>>  /*
> >>>>   * Sub-module helpers    
> >>>
> >>>
> >>> I think you need to take a closer look of the lifecycle of a container,
> >>> having a reference means the container itself won't go away, but only
> >>> having a group set within that container holds the actual IOMMU
> >>> references.  container->iommu_data is going to be NULL once the
> >>> groups are lost.  Thanks,    
> >>
> >>
> >> Container owns the iommu tables and this is what I care about here, groups
> >> attached or not - this is handled separately via IOMMU group list in a
> >> specific iommu_table struct, these groups get detached from iommu_table
> >> when they are removed from a container.  
> > 
> > The container doesn't own anything, the container is privileged by the
> > groups being attached to it.  When groups are closed, they detach from
> > the container and once the container group list is empty the iommu
> > backend is released and iommu_data is NULL.  A container reference
> > doesn't give you what you're looking for.  It implies nothing about the
> > iommu backend.  
> 
> 
> Well. Backend is a part of a container and since a backend owns tables, a
> container owns them too.

The IOMMU backend is accessed through the container, but that backend
is privileged by the groups it contains.  Once those groups are gone,
the IOMMU backend is released, regardless of whatever reference you
have to the container itself such as you're attempting to do here.  In
that sense, the container does not own those tables.

> The problem I am trying to solve here is when KVM may release the
> iommu_table objects.
> 
> "Set" ioctl() to KVM-spapr-tce-table (or KVM itself, does not really
> matter) makes a link between KVM-spapr-tce-table and container and KVM can
> start using tables (with referencing them).
> 
> First I tried adding an "unset" ioctl to KVM-spapr-tce-table, called it
> from region_del() and this works if QEMU removes a window. However if QEMU
> removes a vfio-pci device, region_del() is not called and KVM does not get
> notified that it can release the iommu_table's because the
> KVM-spapr-tce-table remains alive and does not get destroyed (as it is
> still used by emulated devices or other containers).
> 
> So it was suggested that we could do such "unset" somehow later assuming,
> for example, on every "set" I could check if some of currently attached
> containers are no more used - and this is where being able to know if there
> is no backend helps - KVM remembers a container pointer and can check this
> via vfio_container_get_iommu_data_ext().
> 
> The other option would be changing vfio_container_get_ext() to take a
> callback+opaque which container would call when it destroys iommu_data.
> This looks more intrusive and not very intuitive how to make it right -
> container would have to keep track of all registered external users and
> vfio_container_put_ext() would have to pass the same callback+opaque to
> unregister the exact external user.

I'm not in favor of anything resembling the code above or extensions
beyond it, the container is the wrong place to do this.

> Or I could store container file* in KVM. Then iommu_data would never be
> released until KVM-spapr-tce-table is destroyed.

See above, holding a file pointer to the container doesn't do squat.
The groups that are held by the container empower the IOMMU backend,
references to the container itself don't matter.  Those references will
not maintain the IOMMU data.
 
> Recreating KVM-spapr-tce-table on every vfio-pci hotunplug (closing its fd
> would "unset" container from KVM-spapr-tce-table) is not an option as there
> still may be devices using this KVM-spapr-tce-table.
> 
> What obvious and nice solution am I missing here? Thanks.

The interactions with the IOMMU backend that seem relevant are
vfio_iommu_drivers_ops.{detach_group,release}.  The kvm-vfio pseudo
device is also used to tell kvm about groups as they come and go and
has a way to check extensions, and thus properties of the IOMMU
backend.  All of these are available for your {ab}use.  Thanks,

Alex


* Re: [PATCH kernel 05/15] powerpc/iommu: Stop using @current in mm_iommu_xxx
  2016-08-03  8:40 ` [PATCH kernel 05/15] powerpc/iommu: Stop using @current in mm_iommu_xxx Alexey Kardashevskiy
                     ` (2 preceding siblings ...)
  2016-08-09  4:43   ` Balbir Singh
@ 2016-08-12  2:57   ` David Gibson
  2016-08-12  4:56     ` Alexey Kardashevskiy
  3 siblings, 1 reply; 60+ messages in thread
From: David Gibson @ 2016-08-12  2:57 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: linuxppc-dev, Alex Williamson, Paul Mackerras

[-- Attachment #1: Type: text/plain, Size: 15059 bytes --]

On Wed, Aug 03, 2016 at 06:40:46PM +1000, Alexey Kardashevskiy wrote:
> In some situations the userspace memory context may live longer than
> the userspace process itself, so if we need to do proper memory context
> cleanup, we had better cache @mm and use it later, when the process is
> gone (i.e. @current or @current->mm is NULL).
> 
> This changes mm_iommu_xxx API to receive mm_struct instead of using one
> from @current.
> 
> This is needed by the following patch to do proper cleanup in time.
> This depends on the "powerpc/powernv/ioda: Fix endianness when reading
> TCEs" patch to do proper cleanup via tce_iommu_clear().
> 
> To keep the API consistent, this replaces mm_context_t with mm_struct;
> we stick to mm_struct as the mm_iommu_adjust_locked_vm() helper needs
> access to &mm->mmap_sem.
> 
> This should cause no behavioral change.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  arch/powerpc/include/asm/mmu_context.h | 20 +++++++------
>  arch/powerpc/kernel/setup-common.c     |  2 +-
>  arch/powerpc/mm/mmu_context_book3s64.c |  4 +--
>  arch/powerpc/mm/mmu_context_iommu.c    | 54 ++++++++++++++--------------------
>  drivers/vfio/vfio_iommu_spapr_tce.c    | 41 ++++++++++++++++----------
>  5 files changed, 62 insertions(+), 59 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
> index 9d2cd0c..b85cc7b 100644
> --- a/arch/powerpc/include/asm/mmu_context.h
> +++ b/arch/powerpc/include/asm/mmu_context.h
> @@ -18,16 +18,18 @@ extern void destroy_context(struct mm_struct *mm);
>  #ifdef CONFIG_SPAPR_TCE_IOMMU
>  struct mm_iommu_table_group_mem_t;
>  
> -extern bool mm_iommu_preregistered(void);
> -extern long mm_iommu_get(unsigned long ua, unsigned long entries,
> +extern bool mm_iommu_preregistered(struct mm_struct *mm);
> +extern long mm_iommu_get(struct mm_struct *mm,
> +		unsigned long ua, unsigned long entries,
>  		struct mm_iommu_table_group_mem_t **pmem);
> -extern long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem);
> -extern void mm_iommu_init(mm_context_t *ctx);
> -extern void mm_iommu_cleanup(mm_context_t *ctx);
> -extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
> -		unsigned long size);
> -extern struct mm_iommu_table_group_mem_t *mm_iommu_find(unsigned long ua,
> -		unsigned long entries);
> +extern long mm_iommu_put(struct mm_struct *mm,
> +		struct mm_iommu_table_group_mem_t *mem);
> +extern void mm_iommu_init(struct mm_struct *mm);
> +extern void mm_iommu_cleanup(struct mm_struct *mm);
> +extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
> +		unsigned long ua, unsigned long size);
> +extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
> +		unsigned long ua, unsigned long entries);
>  extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
>  		unsigned long ua, unsigned long *hpa);
>  extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem);
> diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
> index 714b4ba..e90b68a 100644
> --- a/arch/powerpc/kernel/setup-common.c
> +++ b/arch/powerpc/kernel/setup-common.c
> @@ -905,7 +905,7 @@ void __init setup_arch(char **cmdline_p)
>  	init_mm.context.pte_frag = NULL;
>  #endif
>  #ifdef CONFIG_SPAPR_TCE_IOMMU
> -	mm_iommu_init(&init_mm.context);
> +	mm_iommu_init(&init_mm);
>  #endif
>  	irqstack_early_init();
>  	exc_lvl_early_init();
> diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c
> index b114f8b..ad82735 100644
> --- a/arch/powerpc/mm/mmu_context_book3s64.c
> +++ b/arch/powerpc/mm/mmu_context_book3s64.c
> @@ -115,7 +115,7 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
>  	mm->context.pte_frag = NULL;
>  #endif
>  #ifdef CONFIG_SPAPR_TCE_IOMMU
> -	mm_iommu_init(&mm->context);
> +	mm_iommu_init(mm);
>  #endif
>  	return 0;
>  }
> @@ -160,7 +160,7 @@ static inline void destroy_pagetable_page(struct mm_struct *mm)
>  void destroy_context(struct mm_struct *mm)
>  {
>  #ifdef CONFIG_SPAPR_TCE_IOMMU
> -	mm_iommu_cleanup(&mm->context);
> +	mm_iommu_cleanup(mm);
>  #endif
>  
>  #ifdef CONFIG_PPC_ICSWX
> diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
> index da6a216..ee6685b 100644
> --- a/arch/powerpc/mm/mmu_context_iommu.c
> +++ b/arch/powerpc/mm/mmu_context_iommu.c
> @@ -53,7 +53,7 @@ static long mm_iommu_adjust_locked_vm(struct mm_struct *mm,
>  	}
>  
>  	pr_debug("[%d] RLIMIT_MEMLOCK HASH64 %c%ld %ld/%ld\n",
> -			current->pid,
> +			current ? current->pid : 0,
>  			incr ? '+' : '-',
>  			npages << PAGE_SHIFT,
>  			mm->locked_vm << PAGE_SHIFT,
> @@ -63,28 +63,22 @@ static long mm_iommu_adjust_locked_vm(struct mm_struct *mm,
>  	return ret;
>  }
>  
> -bool mm_iommu_preregistered(void)
> +bool mm_iommu_preregistered(struct mm_struct *mm)
>  {
> -	if (!current || !current->mm)
> -		return false;
> -
> -	return !list_empty(&current->mm->context.iommu_group_mem_list);
> +	return !list_empty(&mm->context.iommu_group_mem_list);
>  }
>  EXPORT_SYMBOL_GPL(mm_iommu_preregistered);
>  
> -long mm_iommu_get(unsigned long ua, unsigned long entries,
> +long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
>  		struct mm_iommu_table_group_mem_t **pmem)
>  {
>  	struct mm_iommu_table_group_mem_t *mem;
>  	long i, j, ret = 0, locked_entries = 0;
>  	struct page *page = NULL;
>  
> -	if (!current || !current->mm)
> -		return -ESRCH; /* process exited */
> -
>  	mutex_lock(&mem_list_mutex);
>  
> -	list_for_each_entry_rcu(mem, &current->mm->context.iommu_group_mem_list,
> +	list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list,
>  			next) {
>  		if ((mem->ua == ua) && (mem->entries == entries)) {
>  			++mem->used;
> @@ -102,7 +96,7 @@ long mm_iommu_get(unsigned long ua, unsigned long entries,
>  
>  	}
>  
> -	ret = mm_iommu_adjust_locked_vm(current->mm, entries, true);
> +	ret = mm_iommu_adjust_locked_vm(mm, entries, true);
>  	if (ret)
>  		goto unlock_exit;
>  
> @@ -142,11 +136,11 @@ long mm_iommu_get(unsigned long ua, unsigned long entries,
>  	mem->entries = entries;
>  	*pmem = mem;
>  
> -	list_add_rcu(&mem->next, &current->mm->context.iommu_group_mem_list);
> +	list_add_rcu(&mem->next, &mm->context.iommu_group_mem_list);
>  
>  unlock_exit:
>  	if (locked_entries && ret)
> -		mm_iommu_adjust_locked_vm(current->mm, locked_entries, false);
> +		mm_iommu_adjust_locked_vm(mm, locked_entries, false);
>  
>  	mutex_unlock(&mem_list_mutex);
>  
> @@ -191,16 +185,13 @@ static void mm_iommu_free(struct rcu_head *head)
>  static void mm_iommu_release(struct mm_iommu_table_group_mem_t *mem)
>  {
>  	list_del_rcu(&mem->next);
> -	mm_iommu_adjust_locked_vm(current->mm, mem->entries, false);

AFAICT, you've moved this call from _release() to _put().  Won't that cause a
behavioural change?

>  	call_rcu(&mem->rcu, mm_iommu_free);
>  }
>  
> -long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem)
> +long mm_iommu_put(struct mm_struct *mm, struct mm_iommu_table_group_mem_t *mem)
>  {
>  	long ret = 0;
>  
> -	if (!current || !current->mm)
> -		return -ESRCH; /* process exited */
>  
>  	mutex_lock(&mem_list_mutex);
>  
> @@ -224,6 +215,8 @@ long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem)
>  	/* @mapped became 0 so now mappings are disabled, release the region */
>  	mm_iommu_release(mem);
>  
> +	mm_iommu_adjust_locked_vm(mm, mem->entries, false);
> +
>  unlock_exit:
>  	mutex_unlock(&mem_list_mutex);
>  
> @@ -231,14 +224,12 @@ unlock_exit:
>  }
>  EXPORT_SYMBOL_GPL(mm_iommu_put);
>  
> -struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
> -		unsigned long size)
> +struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
> +		unsigned long ua, unsigned long size)
>  {
>  	struct mm_iommu_table_group_mem_t *mem, *ret = NULL;
>  
> -	list_for_each_entry_rcu(mem,
> -			&current->mm->context.iommu_group_mem_list,
> -			next) {
> +	list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list, next) {
>  		if ((mem->ua <= ua) &&
>  				(ua + size <= mem->ua +
>  				 (mem->entries << PAGE_SHIFT))) {
> @@ -251,14 +242,12 @@ struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
>  }
>  EXPORT_SYMBOL_GPL(mm_iommu_lookup);
>  
> -struct mm_iommu_table_group_mem_t *mm_iommu_find(unsigned long ua,
> -		unsigned long entries)
> +struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
> +		unsigned long ua, unsigned long entries)
>  {
>  	struct mm_iommu_table_group_mem_t *mem, *ret = NULL;
>  
> -	list_for_each_entry_rcu(mem,
> -			&current->mm->context.iommu_group_mem_list,
> -			next) {
> +	list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list, next) {
>  		if ((mem->ua == ua) && (mem->entries == entries)) {
>  			ret = mem;
>  			break;
> @@ -300,16 +289,17 @@ void mm_iommu_mapped_dec(struct mm_iommu_table_group_mem_t *mem)
>  }
>  EXPORT_SYMBOL_GPL(mm_iommu_mapped_dec);
>  
> -void mm_iommu_init(mm_context_t *ctx)
> +void mm_iommu_init(struct mm_struct *mm)
>  {
> -	INIT_LIST_HEAD_RCU(&ctx->iommu_group_mem_list);
> +	INIT_LIST_HEAD_RCU(&mm->context.iommu_group_mem_list);
>  }
>  
> -void mm_iommu_cleanup(mm_context_t *ctx)
> +void mm_iommu_cleanup(struct mm_struct *mm)
>  {
>  	struct mm_iommu_table_group_mem_t *mem, *tmp;
>  
> -	list_for_each_entry_safe(mem, tmp, &ctx->iommu_group_mem_list, next) {
> +	list_for_each_entry_safe(mem, tmp, &mm->context.iommu_group_mem_list,
> +			next) {
>  		list_del_rcu(&mem->next);
>  		mm_iommu_do_free(mem);
>  	}
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index 80378dd..9752e77 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -98,6 +98,7 @@ struct tce_container {
>  	bool enabled;
>  	bool v2;
>  	unsigned long locked_pages;
> +	struct mm_struct *mm;
>  	struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
>  	struct list_head group_list;
>  };
> @@ -110,11 +111,11 @@ static long tce_iommu_unregister_pages(struct tce_container *container,
>  	if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK))
>  		return -EINVAL;
>  
> -	mem = mm_iommu_find(vaddr, size >> PAGE_SHIFT);
> +	mem = mm_iommu_find(container->mm, vaddr, size >> PAGE_SHIFT);
>  	if (!mem)
>  		return -ENOENT;
>  
> -	return mm_iommu_put(mem);
> +	return mm_iommu_put(container->mm, mem);
>  }
>  
>  static long tce_iommu_register_pages(struct tce_container *container,
> @@ -128,10 +129,17 @@ static long tce_iommu_register_pages(struct tce_container *container,
>  			((vaddr + size) < vaddr))
>  		return -EINVAL;
>  
> -	ret = mm_iommu_get(vaddr, entries, &mem);
> +	if (!container->mm) {
> +		if (!current->mm)
> +			return -ESRCH; /* process exited */

Can this ever happen?  Surely the ioctl() path shouldn't be called
after the process mm has been cleaned up?  i.e. should this be a
WARN_ON()?

> +
> +		atomic_inc(&current->mm->mm_count);

What balances this atomic_inc()?  Is it the mmdrop() added to
tce_iommu_release()?

> +		container->mm = current->mm;
> +	}

Surely you need an error (or else a BUG_ON()) if container->mm is
non-NULL and current->mm != container->mm.  I believe VFIO already
assumes the container is owned by a single mm, but it looks like you
should verify that here.
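
The check being asked for - cache the first mm, take a reference on it, and
reject any other mm - can be modelled in plain C (names are ours, not the
kernel's; -1 stands in for -EPERM/-ESRCH):

```c
#include <assert.h>
#include <stddef.h>

struct toy_mm { int refs; };
struct toy_container { struct toy_mm *mm; };

/* Cache the first mm that registers memory; refuse a different one. */
static int toy_cache_mm(struct toy_container *c, struct toy_mm *current_mm)
{
	if (!current_mm)
		return -1;		/* process already exited */
	if (!c->mm) {
		current_mm->refs++;	/* balanced by mmdrop() on release */
		c->mm = current_mm;
	} else if (c->mm != current_mm) {
		return -1;		/* container owned by another mm */
	}
	return 0;
}
```

Note the reference is taken exactly once, on the first registration, which is
what pairs with the single mmdrop() in the release path.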

> +
> +	ret = mm_iommu_get(container->mm, vaddr, entries, &mem);
>  	if (ret)
>  		return ret;
> -
>  	container->enabled = true;
>  
>  	return 0;
> @@ -354,6 +362,8 @@ static void tce_iommu_release(void *iommu_data)
>  		tce_iommu_free_table(tbl);
>  	}
>  
> +	if (container->mm)
> +		mmdrop(container->mm);
>  	tce_iommu_disable(container);
>  	mutex_destroy(&container->lock);
>  
> @@ -369,13 +379,14 @@ static void tce_iommu_unuse_page(struct tce_container *container,
>  	put_page(page);
>  }
>  
> -static int tce_iommu_prereg_ua_to_hpa(unsigned long tce, unsigned long size,
> +static int tce_iommu_prereg_ua_to_hpa(struct tce_container *container,
> +		unsigned long tce, unsigned long size,
>  		unsigned long *phpa, struct mm_iommu_table_group_mem_t **pmem)
>  {
>  	long ret = 0;
>  	struct mm_iommu_table_group_mem_t *mem;
>  
> -	mem = mm_iommu_lookup(tce, size);
> +	mem = mm_iommu_lookup(container->mm, tce, size);
>  	if (!mem)
>  		return -EINVAL;
>  
> @@ -388,18 +399,18 @@ static int tce_iommu_prereg_ua_to_hpa(unsigned long tce, unsigned long size,
>  	return 0;
>  }
>  
> -static void tce_iommu_unuse_page_v2(struct iommu_table *tbl,
> -		unsigned long entry)
> +static void tce_iommu_unuse_page_v2(struct tce_container *container,
> +		struct iommu_table *tbl, unsigned long entry)
>  {
>  	struct mm_iommu_table_group_mem_t *mem = NULL;
>  	int ret;
>  	unsigned long hpa = 0;
>  	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>  
> -	if (!pua || !current || !current->mm)
> +	if (!pua)
>  		return;
>  
> -	ret = tce_iommu_prereg_ua_to_hpa(*pua, IOMMU_PAGE_SIZE(tbl),
> +	ret = tce_iommu_prereg_ua_to_hpa(container, *pua, IOMMU_PAGE_SIZE(tbl),
>  			&hpa, &mem);
>  	if (ret)
>  		pr_debug("%s: tce %lx at #%lx was not cached, ret=%d\n",
> @@ -429,7 +440,7 @@ static int tce_iommu_clear(struct tce_container *container,
>  			continue;
>  
>  		if (container->v2) {
> -			tce_iommu_unuse_page_v2(tbl, entry);
> +			tce_iommu_unuse_page_v2(container, tbl, entry);
>  			continue;
>  		}
>  
> @@ -514,8 +525,8 @@ static long tce_iommu_build_v2(struct tce_container *container,
>  		unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl,
>  				entry + i);
>  
> -		ret = tce_iommu_prereg_ua_to_hpa(tce, IOMMU_PAGE_SIZE(tbl),
> -				&hpa, &mem);
> +		ret = tce_iommu_prereg_ua_to_hpa(container,
> +				tce, IOMMU_PAGE_SIZE(tbl), &hpa, &mem);
>  		if (ret)
>  			break;
>  
> @@ -536,7 +547,7 @@ static long tce_iommu_build_v2(struct tce_container *container,
>  		ret = iommu_tce_xchg(tbl, entry + i, &hpa, &dirtmp);
>  		if (ret) {
>  			/* dirtmp cannot be DMA_NONE here */
> -			tce_iommu_unuse_page_v2(tbl, entry + i);
> +			tce_iommu_unuse_page_v2(container, tbl, entry + i);
>  			pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%ld\n",
>  					__func__, entry << tbl->it_page_shift,
>  					tce, ret);
> @@ -544,7 +555,7 @@ static long tce_iommu_build_v2(struct tce_container *container,
>  		}
>  
>  		if (dirtmp != DMA_NONE)
> -			tce_iommu_unuse_page_v2(tbl, entry + i);
> +			tce_iommu_unuse_page_v2(container, tbl, entry + i);
>  
>  		*pua = tce;
>  

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]


* Re: [PATCH kernel 06/15] powerpc/mm/iommu: Put pages on process exit
  2016-08-03  8:40 ` [PATCH kernel 06/15] powerpc/mm/iommu: Put pages on process exit Alexey Kardashevskiy
  2016-08-03 10:11   ` Nicholas Piggin
@ 2016-08-12  3:13   ` David Gibson
  1 sibling, 0 replies; 60+ messages in thread
From: David Gibson @ 2016-08-12  3:13 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: linuxppc-dev, Alex Williamson, Paul Mackerras

[-- Attachment #1: Type: text/plain, Size: 8248 bytes --]

On Wed, Aug 03, 2016 at 06:40:47PM +1000, Alexey Kardashevskiy wrote:
> At the moment VFIO IOMMU SPAPR v2 driver pins all guest RAM pages when
> the userspace starts using VFIO.

This doesn't sound accurate.  Isn't it userspace that decides what
gets pinned, not the VFIO driver?

>When the userspace process finishes,
> all the pinned pages need to be put; this is done as a part of
> the userspace memory context (MM) destruction which happens on
> the very last mmdrop().
> 
> This approach has a problem: an MM of a userspace process may live
> longer than the userspace process itself, as kernel threads reuse the
> MMs of userspace processes which were running on the CPU where
> the kernel thread was scheduled. If this happens, the MM remains
> referenced until that exact kernel thread wakes up again
> and releases the very last reference to the MM; on an idle system this
> can take hours.
> 
> This takes a reference to and caches the MM once per container, and adds
> tracking of how many times each preregistered area has been registered in
> a specific container. This way we do not depend on @current pointing to
> a valid task descriptor.

The handling of @current and refcounting the mm sounds more like it's
describing the previous patch.

The description of counting how many times each prereg area is
registered doesn't seem accurate, since you block multiple
registration with an EBUSY.  Or else it's describing the 'used'
counter in the lower-level mm_iommu_table_group_mem_t tracking,
rather than anything changed by this patch.
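
That lower-level 'used' counter in mm_iommu_table_group_mem_t can be
sketched as a single-region toy model (our names; simplified - the real
code keeps a list and pins pages on first use):

```c
#include <assert.h>

struct toy_region { unsigned long ua, entries, used; };

/* Second "get" of the same area bumps the counter instead of
 * registering (pinning) the pages again. */
static int toy_region_get(struct toy_region *m, unsigned long ua,
			  unsigned long entries)
{
	if (m->used) {
		if (m->ua == ua && m->entries == entries) {
			m->used++;	/* already pinned, just take a ref */
			return 0;
		}
		return -1;		/* toy model holds only one region */
	}
	m->ua = ua;			/* first use: record and "pin" */
	m->entries = entries;
	m->used = 1;
	return 0;
}
```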

> This changes the userspace interface to return EBUSY if memory is
> already registered (mm_iommu_get() used to increment the counter);
> however it should not have any practical effect as the only
> userspace tool available now does register memory area once per
> container anyway.
> 
> As tce_iommu_register_pages/tce_iommu_unregister_pages are called
> under container->lock, this does not need additional locking.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> 
> # Conflicts:
> #	arch/powerpc/include/asm/mmu_context.h
> #	arch/powerpc/mm/mmu_context_book3s64.c
> #	arch/powerpc/mm/mmu_context_iommu.c

Looks like some lines to be cleaned up in the message.

> ---
>  arch/powerpc/include/asm/mmu_context.h |  1 -
>  arch/powerpc/mm/mmu_context_book3s64.c |  4 ---
>  arch/powerpc/mm/mmu_context_iommu.c    | 11 -------
>  drivers/vfio/vfio_iommu_spapr_tce.c    | 52 +++++++++++++++++++++++++++++++++-
>  4 files changed, 51 insertions(+), 17 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
> index b85cc7b..a4c4ed5 100644
> --- a/arch/powerpc/include/asm/mmu_context.h
> +++ b/arch/powerpc/include/asm/mmu_context.h
> @@ -25,7 +25,6 @@ extern long mm_iommu_get(struct mm_struct *mm,
>  extern long mm_iommu_put(struct mm_struct *mm,
>  		struct mm_iommu_table_group_mem_t *mem);
>  extern void mm_iommu_init(struct mm_struct *mm);
> -extern void mm_iommu_cleanup(struct mm_struct *mm);
>  extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
>  		unsigned long ua, unsigned long size);
>  extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
> diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c
> index ad82735..1a07969 100644
> --- a/arch/powerpc/mm/mmu_context_book3s64.c
> +++ b/arch/powerpc/mm/mmu_context_book3s64.c
> @@ -159,10 +159,6 @@ static inline void destroy_pagetable_page(struct mm_struct *mm)
>  
>  void destroy_context(struct mm_struct *mm)
>  {
> -#ifdef CONFIG_SPAPR_TCE_IOMMU
> -	mm_iommu_cleanup(mm);
> -#endif
> -
>  #ifdef CONFIG_PPC_ICSWX
>  	drop_cop(mm->context.acop, mm);
>  	kfree(mm->context.cop_lockp);
> diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
> index ee6685b..10f01fe 100644
> --- a/arch/powerpc/mm/mmu_context_iommu.c
> +++ b/arch/powerpc/mm/mmu_context_iommu.c
> @@ -293,14 +293,3 @@ void mm_iommu_init(struct mm_struct *mm)
>  {
>  	INIT_LIST_HEAD_RCU(&mm->context.iommu_group_mem_list);
>  }
> -
> -void mm_iommu_cleanup(struct mm_struct *mm)
> -{
> -	struct mm_iommu_table_group_mem_t *mem, *tmp;
> -
> -	list_for_each_entry_safe(mem, tmp, &mm->context.iommu_group_mem_list,
> -			next) {
> -		list_del_rcu(&mem->next);
> -		mm_iommu_do_free(mem);
> -	}
> -}
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index 9752e77..40e71a0 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -89,6 +89,15 @@ struct tce_iommu_group {
>  };
>  
>  /*
> + * A container needs to remember which preregistered areas and how many times
> + * it has referenced to do proper cleanup at the userspace process exit.
> + */
> +struct tce_iommu_prereg {
> +	struct list_head next;
> +	struct mm_iommu_table_group_mem_t *mem;
> +};
> +
> +/*
>   * The container descriptor supports only a single group per container.
>   * Required by the API as the container is not supplied with the IOMMU group
>   * at the moment of initialization.
> @@ -101,12 +110,26 @@ struct tce_container {
>  	struct mm_struct *mm;
>  	struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
>  	struct list_head group_list;
> +	struct list_head prereg_list;
>  };
>  
> +static long tce_iommu_prereg_free(struct tce_container *container,
> +		struct tce_iommu_prereg *tcemem)
> +{
> +	long ret;
> +
> +	list_del(&tcemem->next);
> +	ret = mm_iommu_put(container->mm, tcemem->mem);
> +	kfree(tcemem);
> +
> +	return ret;
> +}
> +
>  static long tce_iommu_unregister_pages(struct tce_container *container,
>  		__u64 vaddr, __u64 size)
>  {
>  	struct mm_iommu_table_group_mem_t *mem;
> +	struct tce_iommu_prereg *tcemem;
>  
>  	if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK))
>  		return -EINVAL;
> @@ -115,7 +138,12 @@ static long tce_iommu_unregister_pages(struct tce_container *container,
>  	if (!mem)
>  		return -ENOENT;
>  
> -	return mm_iommu_put(container->mm, mem);
> +	list_for_each_entry(tcemem, &container->prereg_list, next) {
> +		if (tcemem->mem == mem)
> +			return tce_iommu_prereg_free(container, tcemem);
> +	}
> +
> +	return -ENOENT;
>  }
>  
>  static long tce_iommu_register_pages(struct tce_container *container,
> @@ -123,6 +151,7 @@ static long tce_iommu_register_pages(struct tce_container *container,
>  {
>  	long ret = 0;
>  	struct mm_iommu_table_group_mem_t *mem = NULL;
> +	struct tce_iommu_prereg *tcemem;
>  	unsigned long entries = size >> PAGE_SHIFT;
>  
>  	if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK) ||
> @@ -140,6 +169,18 @@ static long tce_iommu_register_pages(struct tce_container *container,
>  	ret = mm_iommu_get(container->mm, vaddr, entries, &mem);
>  	if (ret)
>  		return ret;
> +
> +	list_for_each_entry(tcemem, &container->prereg_list, next) {
> +		if (tcemem->mem == mem) {
> +			mm_iommu_put(container->mm, mem);
> +			return -EBUSY;
> +		}
> +	}
> +
> +	tcemem = kzalloc(sizeof(*tcemem), GFP_KERNEL);
> +	tcemem->mem = mem;
> +	list_add(&tcemem->next, &container->prereg_list);
> +
>  	container->enabled = true;
>  
>  	return 0;
> @@ -325,6 +366,7 @@ static void *tce_iommu_open(unsigned long arg)
>  
>  	mutex_init(&container->lock);
>  	INIT_LIST_HEAD_RCU(&container->group_list);
> +	INIT_LIST_HEAD_RCU(&container->prereg_list);
>  
>  	container->v2 = arg == VFIO_SPAPR_TCE_v2_IOMMU;
>  
> @@ -362,6 +404,14 @@ static void tce_iommu_release(void *iommu_data)
>  		tce_iommu_free_table(tbl);
>  	}
>  
> +	while (!list_empty(&container->prereg_list)) {
> +		struct tce_iommu_prereg *tcemem;
> +
> +		tcemem = list_first_entry(&container->prereg_list,
> +				struct tce_iommu_prereg, next);
> +		tce_iommu_prereg_free(container, tcemem);
> +	}
> +
>  	if (container->mm)
>  		mmdrop(container->mm);
>  	tce_iommu_disable(container);


* Re: [PATCH kernel 07/15] powerpc/iommu: Cleanup iommu_table disposal
  2016-08-03  8:40 ` [PATCH kernel 07/15] powerpc/iommu: Cleanup iommu_table disposal Alexey Kardashevskiy
@ 2016-08-12  3:18   ` David Gibson
  0 siblings, 0 replies; 60+ messages in thread
From: David Gibson @ 2016-08-12  3:18 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: linuxppc-dev, Alex Williamson, Paul Mackerras

[-- Attachment #1: Type: text/plain, Size: 3908 bytes --]

On Wed, Aug 03, 2016 at 06:40:48PM +1000, Alexey Kardashevskiy wrote:
> At the moment an iommu_table can be disposed of either by calling
> iommu_free_table() directly or via it_ops::free(), whose only
> implementation (for IODA2) calls iommu_free_table() anyway.
> 
> As we are going to have reference counting on tables, we need a unified
> way of disposing of tables.
> 
> This moves it_ops::free() call into iommu_free_table() and makes use
> of the latter everywhere. The free() callback now handles only
> platform-specific data.
> 
> This should cause no behavioral change.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  arch/powerpc/kernel/iommu.c               | 4 ++++
>  arch/powerpc/platforms/powernv/pci-ioda.c | 6 ++----
>  drivers/vfio/vfio_iommu_spapr_tce.c       | 2 +-
>  3 files changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index a8e3490..13263b0 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -718,6 +718,9 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
>  	if (!tbl)
>  		return;
>  
> +	if (tbl->it_ops->free)
> +		tbl->it_ops->free(tbl);
> +
>  	if (!tbl->it_map) {
>  		kfree(tbl);
>  		return;
> @@ -744,6 +747,7 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
>  	/* free table */
>  	kfree(tbl);
>  }
> +EXPORT_SYMBOL_GPL(iommu_free_table);
>  
>  /* Creates TCEs for a user provided buffer.  The user buffer must be
>   * contiguous real kernel storage (not vmalloc).  The address passed here
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 59c7e7d..74ab8382 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1394,7 +1394,6 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe
>  		iommu_group_put(pe->table_group.group);
>  		BUG_ON(pe->table_group.group);
>  	}
> -	pnv_pci_ioda2_table_free_pages(tbl);
>  	iommu_free_table(tbl, of_node_full_name(dev->dev.of_node));
>  }
>  
> @@ -1987,7 +1986,6 @@ static void pnv_ioda2_tce_free(struct iommu_table *tbl, long index,
>  static void pnv_ioda2_table_free(struct iommu_table *tbl)
>  {
>  	pnv_pci_ioda2_table_free_pages(tbl);
> -	iommu_free_table(tbl, "pnv");
>  }
>  
>  static struct iommu_table_ops pnv_ioda2_iommu_ops = {
> @@ -2313,7 +2311,7 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
>  	if (rc) {
>  		pe_err(pe, "Failed to configure 32-bit TCE table, err %ld\n",
>  				rc);
> -		pnv_ioda2_table_free(tbl);
> +		iommu_free_table(tbl, "");
>  		return rc;
>  	}
>  
> @@ -2399,7 +2397,7 @@ static void pnv_ioda2_take_ownership(struct iommu_table_group *table_group)
>  
>  	pnv_pci_ioda2_set_bypass(pe, false);
>  	pnv_pci_ioda2_unset_window(&pe->table_group, 0);
> -	pnv_ioda2_table_free(tbl);
> +	iommu_free_table(tbl, "pnv");
>  }
>  
>  static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group)
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index 40e71a0..79f26c7 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -660,7 +660,7 @@ static void tce_iommu_free_table(struct iommu_table *tbl)
>  	unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT;
>  
>  	tce_iommu_userspace_view_free(tbl);
> -	tbl->it_ops->free(tbl);
> +	iommu_free_table(tbl, "");
>  	decrement_locked_vm(pages);
>  }
>  


* Re: [PATCH kernel 08/15] powerpc/vfio_spapr_tce: Add reference counting to iommu_table
  2016-08-03  8:40 ` [PATCH kernel 08/15] powerpc/vfio_spapr_tce: Add reference counting to iommu_table Alexey Kardashevskiy
@ 2016-08-12  3:29   ` David Gibson
  0 siblings, 0 replies; 60+ messages in thread
From: David Gibson @ 2016-08-12  3:29 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: linuxppc-dev, Alex Williamson, Paul Mackerras

[-- Attachment #1: Type: text/plain, Size: 8954 bytes --]

On Wed, Aug 03, 2016 at 06:40:49PM +1000, Alexey Kardashevskiy wrote:
> So far iommu_table objects were only used in virtual mode and had
> a single owner. We are going to change that by implementing in-kernel
> acceleration of DMA mapping requests, including real mode.
> 
> This adds a kref to iommu_table and defines new helpers to update it.
> This replaces iommu_free_table() with iommu_table_put() and makes
> iommu_free_table() static. iommu_table_get() is not used in this patch
> but will be in the following one.
> 
> While we are here, this removes the @node_name parameter as it has never
> been really useful on powernv, and carrying it from the pseries platform
> code to iommu_free_table() seems quite useless too.
> 
> This should cause no behavioral change.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
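
The refcounting scheme this patch adds - iommu_table_get()/iommu_table_put()
with the free routine running only on the last put - reduces to the usual
kref pattern, sketched here as a userspace toy (names mirror the patch, the
counter stands in for the kernel's struct kref):

```c
#include <assert.h>

struct toy_table {
	int refs;
	int freed;	/* set when the "free" path runs */
};

static void toy_table_get(struct toy_table *t)
{
	t->refs++;
}

static void toy_table_put(struct toy_table *t)
{
	/* kref_put() would invoke iommu_table_free() as the release
	 * callback when the count drops to zero */
	if (--t->refs == 0)
		t->freed = 1;
}
```

This is what lets the same iommu_table be shared between VFIO and KVM in the
later patches: each holder takes its own reference, and whichever side puts
last triggers the actual teardown.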

> ---
>  arch/powerpc/include/asm/iommu.h          |  5 +++--
>  arch/powerpc/kernel/iommu.c               | 24 +++++++++++++++++++-----
>  arch/powerpc/kernel/vio.c                 |  2 +-
>  arch/powerpc/platforms/powernv/pci-ioda.c | 14 +++++++-------
>  arch/powerpc/platforms/powernv/pci.c      |  1 +
>  arch/powerpc/platforms/pseries/iommu.c    |  3 ++-
>  drivers/vfio/vfio_iommu_spapr_tce.c       |  2 +-
>  7 files changed, 34 insertions(+), 17 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> index f49a72a..cd4df44 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -114,6 +114,7 @@ struct iommu_table {
>  	struct list_head it_group_list;/* List of iommu_table_group_link */
>  	unsigned long *it_userspace; /* userspace view of the table */
>  	struct iommu_table_ops *it_ops;
> +	struct kref    it_kref;
>  };
>  
>  #define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \
> @@ -146,8 +147,8 @@ static inline void *get_iommu_table_base(struct device *dev)
>  
>  extern int dma_iommu_dma_supported(struct device *dev, u64 mask);
>  
> -/* Frees table for an individual device node */
> -extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
> +extern void iommu_table_get(struct iommu_table *tbl);
> +extern void iommu_table_put(struct iommu_table *tbl);
>  
>  /* Initializes an iommu_table based in values set in the passed-in
>   * structure
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index 13263b0..a8f017a 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -710,13 +710,13 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
>  	return tbl;
>  }
>  
> -void iommu_free_table(struct iommu_table *tbl, const char *node_name)
> +static void iommu_table_free(struct kref *kref)
>  {
>  	unsigned long bitmap_sz;
>  	unsigned int order;
> +	struct iommu_table *tbl;
>  
> -	if (!tbl)
> -		return;
> +	tbl = container_of(kref, struct iommu_table, it_kref);
>  
>  	if (tbl->it_ops->free)
>  		tbl->it_ops->free(tbl);
> @@ -735,7 +735,7 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
>  
>  	/* verify that table contains no entries */
>  	if (!bitmap_empty(tbl->it_map, tbl->it_size))
> -		pr_warn("%s: Unexpected TCEs for %s\n", __func__, node_name);
> +		pr_warn("%s: Unexpected TCEs\n", __func__);
>  
>  	/* calculate bitmap size in bytes */
>  	bitmap_sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
> @@ -747,7 +747,21 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
>  	/* free table */
>  	kfree(tbl);
>  }
> -EXPORT_SYMBOL_GPL(iommu_free_table);
> +
> +void iommu_table_get(struct iommu_table *tbl)
> +{
> +	kref_get(&tbl->it_kref);
> +}
> +EXPORT_SYMBOL_GPL(iommu_table_get);
> +
> +void iommu_table_put(struct iommu_table *tbl)
> +{
> +	if (!tbl)
> +		return;
> +
> +	kref_put(&tbl->it_kref, iommu_table_free);
> +}
> +EXPORT_SYMBOL_GPL(iommu_table_put);
>  
>  /* Creates TCEs for a user provided buffer.  The user buffer must be
>   * contiguous real kernel storage (not vmalloc).  The address passed here
> diff --git a/arch/powerpc/kernel/vio.c b/arch/powerpc/kernel/vio.c
> index 8d7358f..188f452 100644
> --- a/arch/powerpc/kernel/vio.c
> +++ b/arch/powerpc/kernel/vio.c
> @@ -1318,7 +1318,7 @@ static void vio_dev_release(struct device *dev)
>  	struct iommu_table *tbl = get_iommu_table_base(dev);
>  
>  	if (tbl)
> -		iommu_free_table(tbl, of_node_full_name(dev->of_node));
> +		iommu_table_put(tbl);
>  	of_node_put(dev->of_node);
>  	kfree(to_vio_dev(dev));
>  }
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 74ab8382..c04afd2 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1394,7 +1394,7 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe
>  		iommu_group_put(pe->table_group.group);
>  		BUG_ON(pe->table_group.group);
>  	}
> -	iommu_free_table(tbl, of_node_full_name(dev->dev.of_node));
> +	iommu_table_put(tbl);
>  }
>  
>  static void pnv_ioda_release_vf_PE(struct pci_dev *pdev)
> @@ -2171,7 +2171,7 @@ found:
>  		__free_pages(tce_mem, get_order(tce32_segsz * segs));
>  	if (tbl) {
>  		pnv_pci_unlink_table_and_group(tbl, &pe->table_group);
> -		iommu_free_table(tbl, "pnv");
> +		iommu_table_put(tbl);
>  	}
>  }
>  
> @@ -2265,7 +2265,7 @@ static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
>  			bus_offset, page_shift, window_size,
>  			levels, tbl);
>  	if (ret) {
> -		iommu_free_table(tbl, "pnv");
> +		iommu_table_put(tbl);
>  		return ret;
>  	}
>  
> @@ -2311,7 +2311,7 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
>  	if (rc) {
>  		pe_err(pe, "Failed to configure 32-bit TCE table, err %ld\n",
>  				rc);
> -		iommu_free_table(tbl, "");
> +		iommu_table_put(tbl);
>  		return rc;
>  	}
>  
> @@ -2397,7 +2397,7 @@ static void pnv_ioda2_take_ownership(struct iommu_table_group *table_group)
>  
>  	pnv_pci_ioda2_set_bypass(pe, false);
>  	pnv_pci_ioda2_unset_window(&pe->table_group, 0);
> -	iommu_free_table(tbl, "pnv");
> +	iommu_table_put(tbl);
>  }
>  
>  static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group)
> @@ -3311,7 +3311,7 @@ static void pnv_pci_ioda1_release_pe_dma(struct pnv_ioda_pe *pe)
>  	}
>  
>  	free_pages(tbl->it_base, get_order(tbl->it_size << 3));
> -	iommu_free_table(tbl, "pnv");
> +	iommu_table_put(tbl);
>  }
>  
>  static void pnv_pci_ioda2_release_pe_dma(struct pnv_ioda_pe *pe)
> @@ -3338,7 +3338,7 @@ static void pnv_pci_ioda2_release_pe_dma(struct pnv_ioda_pe *pe)
>  	}
>  
>  	pnv_pci_ioda2_table_free_pages(tbl);
> -	iommu_free_table(tbl, "pnv");
> +	iommu_table_put(tbl);
>  }
>  
>  static void pnv_ioda_free_pe_seg(struct pnv_ioda_pe *pe,
> diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
> index 6701dd5..5917439 100644
> --- a/arch/powerpc/platforms/powernv/pci.c
> +++ b/arch/powerpc/platforms/powernv/pci.c
> @@ -767,6 +767,7 @@ struct iommu_table *pnv_pci_table_alloc(int nid)
>  
>  	tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, nid);
>  	INIT_LIST_HEAD_RCU(&tbl->it_group_list);
> +	kref_init(&tbl->it_kref);
>  
>  	return tbl;
>  }
> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> index 0056856..da29518 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -74,6 +74,7 @@ static struct iommu_table_group *iommu_pseries_alloc_group(int node)
>  		goto fail_exit;
>  
>  	INIT_LIST_HEAD_RCU(&tbl->it_group_list);
> +	kref_init(&tbl->it_kref);
>  	tgl->table_group = table_group;
>  	list_add_rcu(&tgl->next, &tbl->it_group_list);
>  
> @@ -115,7 +116,7 @@ static void iommu_pseries_free_group(struct iommu_table_group *table_group,
>  		BUG_ON(table_group->group);
>  	}
>  #endif
> -	iommu_free_table(tbl, node_name);
> +	iommu_table_put(tbl);
>  
>  	kfree(table_group);
>  }
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index 79f26c7..3594ad3 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -660,7 +660,7 @@ static void tce_iommu_free_table(struct iommu_table *tbl)
>  	unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT;
>  
>  	tce_iommu_userspace_view_free(tbl);
> -	iommu_free_table(tbl, "");
> +	iommu_table_put(tbl);
>  	decrement_locked_vm(pages);
>  }
>  
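
The patch above replaces iommu_free_table() with kref-based get/put lifetime management. A rough userspace sketch of the semantics (kref_init starts the count at one; the final put runs the release callback exactly once) might look like this — the struct and function names are illustrative stand-ins, not the kernel's kref API:

```c
/* Userspace model of the kref lifetime pattern applied to iommu_table:
 * the allocator holds the initial reference, users take and drop extra
 * references, and the release hook runs exactly once, on the last put. */
struct table {
	long refcount;	/* stands in for struct kref it_kref */
	int released;	/* set by the release callback (would kfree() for real) */
};

static void table_release(struct table *t)
{
	t->released = 1;
}

void table_init(struct table *t)	/* models kref_init() */
{
	t->refcount = 1;
	t->released = 0;
}

void table_get(struct table *t)		/* models iommu_table_get() */
{
	t->refcount++;
}

void table_put(struct table *t)		/* models iommu_table_put() */
{
	if (!t)		/* keeps the old iommu_free_table(NULL) tolerance */
		return;
	if (--t->refcount == 0)
		table_release(t);
}
```

A real kref is atomic; the plain counter here only illustrates the ownership rule that every former iommu_free_table() call site now drops one reference instead of freeing unconditionally.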

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH kernel 09/15] powerpc/mmu: Add real mode support for IOMMU preregistered memory
  2016-08-03  8:40 ` [PATCH kernel 09/15] powerpc/mmu: Add real mode support for IOMMU preregistered memory Alexey Kardashevskiy
@ 2016-08-12  4:02   ` David Gibson
  0 siblings, 0 replies; 60+ messages in thread
From: David Gibson @ 2016-08-12  4:02 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: linuxppc-dev, Alex Williamson, Paul Mackerras

[-- Attachment #1: Type: text/plain, Size: 4172 bytes --]

On Wed, Aug 03, 2016 at 06:40:50PM +1000, Alexey Kardashevskiy wrote:
> This makes mm_iommu_lookup() able to work in real mode by replacing
> list_for_each_entry_rcu() (whose debug checks can fail in real mode)
> with list_for_each_entry_lockless().
> 
> This adds a real mode version of mm_iommu_ua_to_hpa() which performs
> an explicit vmalloc'd-to-linear address conversion.
> Unlike mm_iommu_ua_to_hpa(), mm_iommu_ua_to_hpa_rm() can fail.
> 
> This changes mm_iommu_preregistered() to receive @mm as, in real mode,
> @current does not always hold a valid pointer.
> 
> This adds a real mode version of mm_iommu_lookup() which receives @mm
> (for the same reason as mm_iommu_preregistered()) and uses the
> lockless version of list_for_each_entry_rcu().
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  arch/powerpc/include/asm/mmu_context.h |  4 ++++
>  arch/powerpc/mm/mmu_context_iommu.c    | 39 ++++++++++++++++++++++++++++++++++
>  2 files changed, 43 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
> index a4c4ed5..939030c 100644
> --- a/arch/powerpc/include/asm/mmu_context.h
> +++ b/arch/powerpc/include/asm/mmu_context.h
> @@ -27,10 +27,14 @@ extern long mm_iommu_put(struct mm_struct *mm,
>  extern void mm_iommu_init(struct mm_struct *mm);
>  extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
>  		unsigned long ua, unsigned long size);
> +extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(
> +		struct mm_struct *mm, unsigned long ua, unsigned long size);
>  extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
>  		unsigned long ua, unsigned long entries);
>  extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
>  		unsigned long ua, unsigned long *hpa);
> +extern long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
> +		unsigned long ua, unsigned long *hpa);
>  extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem);
>  extern void mm_iommu_mapped_dec(struct mm_iommu_table_group_mem_t *mem);
>  #endif
> diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
> index 10f01fe..36a906c 100644
> --- a/arch/powerpc/mm/mmu_context_iommu.c
> +++ b/arch/powerpc/mm/mmu_context_iommu.c
> @@ -242,6 +242,25 @@ struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
>  }
>  EXPORT_SYMBOL_GPL(mm_iommu_lookup);
>  
> +struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(struct mm_struct *mm,
> +		unsigned long ua, unsigned long size)
> +{
> +	struct mm_iommu_table_group_mem_t *mem, *ret = NULL;
> +
> +	list_for_each_entry_lockless(mem, &mm->context.iommu_group_mem_list,
> +			next) {
> +		if ((mem->ua <= ua) &&
> +				(ua + size <= mem->ua +
> +				 (mem->entries << PAGE_SHIFT))) {
> +			ret = mem;
> +			break;
> +		}
> +	}
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(mm_iommu_lookup_rm);
> +
>  struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
>  		unsigned long ua, unsigned long entries)
>  {
> @@ -273,6 +292,26 @@ long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
>  }
>  EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa);
>  
> +long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
> +		unsigned long ua, unsigned long *hpa)
> +{
> +	const long entry = (ua - mem->ua) >> PAGE_SHIFT;
> +	void *va = &mem->hpas[entry];
> +	unsigned long *ra;
> +
> +	if (entry >= mem->entries)
> +		return -EFAULT;
> +
> +	ra = (void *) vmalloc_to_phys(va);
> +	if (!ra)
> +		return -EFAULT;
> +
> +	*hpa = *ra | (ua & ~PAGE_MASK);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa_rm);
> +
>  long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem)
>  {
>  	if (atomic64_inc_not_zero(&mem->mapped))
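
The flat-array translation performed by mm_iommu_ua_to_hpa_rm() above can be modelled outside the kernel as follows. The struct layout and the 64K page size are assumptions for illustration only; the entry/offset arithmetic mirrors the hunk above:

```c
#define PAGE_SHIFT	16	/* assuming 64K pages, common on these hosts */
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define PAGE_MASK	(~(PAGE_SIZE - 1))

/* Simplified stand-in for mm_iommu_table_group_mem_t: a preregistered
 * userspace range [ua, ua + (entries << PAGE_SHIFT)) with one host
 * physical address per page. */
struct prereg_mem {
	unsigned long ua;
	unsigned long entries;
	unsigned long *hpas;	/* hpas[i] = host physical address of page i */
};

/* The containment test used by mm_iommu_lookup()/mm_iommu_lookup_rm() */
int mem_contains(const struct prereg_mem *mem, unsigned long ua,
		unsigned long size)
{
	return mem->ua <= ua &&
		ua + size <= mem->ua + (mem->entries << PAGE_SHIFT);
}

/* The translation done by mm_iommu_ua_to_hpa{,_rm}(): index the flat
 * array by page number, then reattach the in-page offset. */
long ua_to_hpa(const struct prereg_mem *mem, unsigned long ua,
		unsigned long *hpa)
{
	unsigned long entry = (ua - mem->ua) >> PAGE_SHIFT;

	if (entry >= mem->entries)
		return -1;	/* the kernel returns -EFAULT here */
	*hpa = mem->hpas[entry] | (ua & ~PAGE_MASK);
	return 0;
}
```

The real-mode variant additionally has to convert the vmalloc'd &mem->hpas[entry] pointer via vmalloc_to_phys() before dereferencing it, which is why mm_iommu_ua_to_hpa_rm() can fail where the virtual-mode version cannot.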

* Re: [PATCH kernel 10/15] KVM: PPC: Use preregistered memory API to access TCE list
  2016-08-03  8:40 ` [PATCH kernel 10/15] KVM: PPC: Use preregistered memory API to access TCE list Alexey Kardashevskiy
@ 2016-08-12  4:17   ` David Gibson
  0 siblings, 0 replies; 60+ messages in thread
From: David Gibson @ 2016-08-12  4:17 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: linuxppc-dev, Alex Williamson, Paul Mackerras

On Wed, Aug 03, 2016 at 06:40:51PM +1000, Alexey Kardashevskiy wrote:
> VFIO on sPAPR already implements guest memory pre-registration
> when the entire guest RAM gets pinned. This can be used to translate
> the physical address of a guest page containing the TCE list
> from H_PUT_TCE_INDIRECT.
> 
> This makes use of the pre-registered memory API to access TCE list
> pages in order to avoid unnecessary locking on the KVM memory
> reverse map: we know that all of guest memory is pinned and
> we have a flat array mapping GPA to HPA, which makes it simpler and
> quicker to index into that array (even with looking up the
> kernel page tables in vmalloc_to_phys) than it is to find the memslot,
> lock the rmap entry, look up the user page tables, and unlock the rmap
> entry. Note that the rmap pointer is initialized to NULL where it is
> declared (not in this patch).
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v2:
> * updated the commit log with Paul's comment
> ---
>  arch/powerpc/kvm/book3s_64_vio_hv.c | 65 ++++++++++++++++++++++++++++---------
>  1 file changed, 49 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> index d461c44..a3be4bd 100644
> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -180,6 +180,17 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
>  EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
>  
>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> +static inline bool kvmppc_preregistered(struct kvm_vcpu *vcpu)
> +{
> +	return mm_iommu_preregistered(vcpu->kvm->mm);
> +}
> +
> +static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
> +		struct kvm_vcpu *vcpu, unsigned long ua, unsigned long size)
> +{
> +	return mm_iommu_lookup_rm(vcpu->kvm->mm, ua, size);
> +}
> +
>  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		unsigned long ioba, unsigned long tce)
>  {
> @@ -260,23 +271,44 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	if (ret != H_SUCCESS)
>  		return ret;
>  
> -	if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
> -		return H_TOO_HARD;
> +	if (kvmppc_preregistered(vcpu)) {
> +		/*
> +		 * We get here if guest memory was pre-registered which
> +		 * is normally VFIO case and gpa->hpa translation does not
> +		 * depend on hpt.
> +		 */
> +		struct mm_iommu_table_group_mem_t *mem;
>  
> -	rmap = (void *) vmalloc_to_phys(rmap);
> +		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
> +			return H_TOO_HARD;

Wouldn't it be clearer to put the gpa->ua lookup outside the if?
You'd have to throw away the rmap you get in the prereg case, but it
shouldn't be harmful, should it?

>  
> -	/*
> -	 * Synchronize with the MMU notifier callbacks in
> -	 * book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
> -	 * While we have the rmap lock, code running on other CPUs
> -	 * cannot finish unmapping the host real page that backs
> -	 * this guest real page, so we are OK to access the host
> -	 * real page.
> -	 */
> -	lock_rmap(rmap);
> -	if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
> -		ret = H_TOO_HARD;
> -		goto unlock_exit;
> +		mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K);
> +		if (!mem || mm_iommu_ua_to_hpa_rm(mem, ua, &tces))
> +			return H_TOO_HARD;

This doesn't fall back to the rmap approach if it can't locate the
page in question in the prereg map.  IIUC that means that this will
now work less well than previously if you have a userspace which
preregisters some memory, but not all of guest RAM.  I'm not sure if
we care about that, since no such userspace currently exists.


> +	} else {
> +		/*
> +		 * This is emulated devices case.

This is a bit misleading - this case will only be triggered if there
are *no* prereg-ed VFIO devices.  The case above can be used even for
emulated devices, if there happen to also be VFIO devices present
which have preregistered guest RAM.

> +		 * We do not require memory to be preregistered in this case
> +		 * so lock rmap and do __find_linux_pte_or_hugepte().
> +		 */
> +		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
> +			return H_TOO_HARD;
> +
> +		rmap = (void *) vmalloc_to_phys(rmap);
> +
> +		/*
> +		 * Synchronize with the MMU notifier callbacks in
> +		 * book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
> +		 * While we have the rmap lock, code running on other CPUs
> +		 * cannot finish unmapping the host real page that backs
> +		 * this guest real page, so we are OK to access the host
> +		 * real page.
> +		 */
> +		lock_rmap(rmap);
> +		if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
> +			ret = H_TOO_HARD;
> +			goto unlock_exit;
> +		}
>  	}
>  
>  	for (i = 0; i < npages; ++i) {
> @@ -290,7 +322,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	}
>  
>  unlock_exit:
> -	unlock_rmap(rmap);
> +	if (rmap)
> +		unlock_rmap(rmap);
>  
>  	return ret;
>  }
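
The control flow under discussion — prereg fast path versus rmap-locked fallback, sharing one unlock_exit label — can be reduced to a small model. Everything below is an illustrative skeleton, not kernel code:

```c
/* Skeleton of kvmppc_rm_h_put_tce_indirect()'s two translation paths.
 * With preregistered memory the rmap stays NULL and no lock is taken;
 * otherwise the rmap entry is locked around the page-table walk and
 * the shared exit path unlocks it only if it was taken. */
struct vm_state {
	int prereg;		/* guest RAM preregistered (VFIO case)? */
	int rmap_locked;	/* models lock_rmap()/unlock_rmap() */
};

int put_tce_indirect(struct vm_state *vm, unsigned long *tces)
{
	int *rmap = 0;		/* initialized to NULL where declared */

	if (vm->prereg) {
		*tces = 0x1000;	/* flat-array gpa->hpa, no locking needed */
	} else {
		rmap = &vm->rmap_locked;
		*rmap = 1;	/* lock_rmap(): pins the backing host page */
		*tces = 0x2000;	/* hpt-based translation under the lock */
	}

	/* ... the TCE list entries would be processed here ... */

	if (rmap)		/* unlock_exit: only the slow path unlocks */
		*rmap = 0;
	return 0;		/* H_SUCCESS */
}
```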

* Re: [PATCH kernel 11/15] powerpc/powernv/iommu: Add real mode version of iommu_table_ops::exchange()
  2016-08-03  8:40 ` [PATCH kernel 11/15] powerpc/powernv/iommu: Add real mode version of iommu_table_ops::exchange() Alexey Kardashevskiy
@ 2016-08-12  4:29   ` David Gibson
  0 siblings, 0 replies; 60+ messages in thread
From: David Gibson @ 2016-08-12  4:29 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: linuxppc-dev, Alex Williamson, Paul Mackerras

On Wed, Aug 03, 2016 at 06:40:52PM +1000, Alexey Kardashevskiy wrote:
> In real mode, TCE tables are invalidated using special
> cache-inhibited store instructions which are not available in
> virtual mode.
> 
> This defines and implements exchange_rm() callback. This does not
> define set_rm/clear_rm/flush_rm callbacks as there is no user for those -
> exchange/exchange_rm are only to be used by KVM for VFIO.
> 
> The exchange_rm callback is defined for IODA1/IODA2 powernv platforms.
> 
> This replaces list_for_each_entry_rcu() with its lockless version as,
> from now on, pnv_pci_ioda2_tce_invalidate() can be called in
> real mode too.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  arch/powerpc/include/asm/iommu.h          |  7 +++++++
>  arch/powerpc/kernel/iommu.c               | 23 +++++++++++++++++++++++
>  arch/powerpc/platforms/powernv/pci-ioda.c | 26 +++++++++++++++++++++++++-
>  3 files changed, 55 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> index cd4df44..a13d207 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -64,6 +64,11 @@ struct iommu_table_ops {
>  			long index,
>  			unsigned long *hpa,
>  			enum dma_data_direction *direction);
> +	/* Real mode */
> +	int (*exchange_rm)(struct iommu_table *tbl,
> +			long index,
> +			unsigned long *hpa,
> +			enum dma_data_direction *direction);
>  #endif
>  	void (*clear)(struct iommu_table *tbl,
>  			long index, long npages);
> @@ -209,6 +214,8 @@ extern void iommu_del_device(struct device *dev);
>  extern int __init tce_iommu_bus_notifier_init(void);
>  extern long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
>  		unsigned long *hpa, enum dma_data_direction *direction);
> +extern long iommu_tce_xchg_rm(struct iommu_table *tbl, unsigned long entry,
> +		unsigned long *hpa, enum dma_data_direction *direction);
>  #else
>  static inline void iommu_register_group(struct iommu_table_group *table_group,
>  					int pci_domain_number,
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index a8f017a..65b2dac 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -1020,6 +1020,29 @@ long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
>  }
>  EXPORT_SYMBOL_GPL(iommu_tce_xchg);
>  
> +long iommu_tce_xchg_rm(struct iommu_table *tbl, unsigned long entry,
> +		unsigned long *hpa, enum dma_data_direction *direction)
> +{
> +	long ret;
> +
> +	ret = tbl->it_ops->exchange_rm(tbl, entry, hpa, direction);
> +
> +	if (!ret && ((*direction == DMA_FROM_DEVICE) ||
> +			(*direction == DMA_BIDIRECTIONAL))) {
> +		struct page *pg = realmode_pfn_to_page(*hpa >> PAGE_SHIFT);
> +
> +		if (likely(pg)) {
> +			SetPageDirty(pg);
> +		} else {

Isn't there a race here, if someone else updates this TCE entry
between your initial exchange and the rollback exchange below?

> +			tbl->it_ops->exchange_rm(tbl, entry, hpa, direction);
> +			ret = -EFAULT;
> +		}
> +	}
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_tce_xchg_rm);
> +
>  int iommu_take_ownership(struct iommu_table *tbl)
>  {
>  	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index c04afd2..a0b5ea6 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1827,6 +1827,17 @@ static int pnv_ioda1_tce_xchg(struct iommu_table *tbl, long index,
>  
>  	return ret;
>  }
> +
> +static int pnv_ioda1_tce_xchg_rm(struct iommu_table *tbl, long index,
> +		unsigned long *hpa, enum dma_data_direction *direction)
> +{
> +	long ret = pnv_tce_xchg(tbl, index, hpa, direction);
> +
> +	if (!ret)
> +		pnv_pci_p7ioc_tce_invalidate(tbl, index, 1, true);
> +
> +	return ret;
> +}
>  #endif
>  
>  static void pnv_ioda1_tce_free(struct iommu_table *tbl, long index,
> @@ -1841,6 +1852,7 @@ static struct iommu_table_ops pnv_ioda1_iommu_ops = {
>  	.set = pnv_ioda1_tce_build,
>  #ifdef CONFIG_IOMMU_API
>  	.exchange = pnv_ioda1_tce_xchg,
> +	.exchange_rm = pnv_ioda1_tce_xchg_rm,
>  #endif
>  	.clear = pnv_ioda1_tce_free,
>  	.get = pnv_tce_get,
> @@ -1915,7 +1927,7 @@ static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
>  {
>  	struct iommu_table_group_link *tgl;
>  
> -	list_for_each_entry_rcu(tgl, &tbl->it_group_list, next) {
> +	list_for_each_entry_lockless(tgl, &tbl->it_group_list, next) {

So.. IIUC, previously this had a bool rm parameter, but wouldn't
actually work in real mode even if it was set?

>  		struct pnv_ioda_pe *pe = container_of(tgl->table_group,
>  				struct pnv_ioda_pe, table_group);
>  		struct pnv_phb *phb = pe->phb;
> @@ -1973,6 +1985,17 @@ static int pnv_ioda2_tce_xchg(struct iommu_table *tbl, long index,
>  
>  	return ret;
>  }
> +
> +static int pnv_ioda2_tce_xchg_rm(struct iommu_table *tbl, long index,
> +		unsigned long *hpa, enum dma_data_direction *direction)
> +{
> +	long ret = pnv_tce_xchg(tbl, index, hpa, direction);
> +
> +	if (!ret)
> +		pnv_pci_ioda2_tce_invalidate(tbl, index, 1, true);
> +
> +	return ret;
> +}
>  #endif
>  
>  static void pnv_ioda2_tce_free(struct iommu_table *tbl, long index,
> @@ -1992,6 +2015,7 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = {
>  	.set = pnv_ioda2_tce_build,
>  #ifdef CONFIG_IOMMU_API
>  	.exchange = pnv_ioda2_tce_xchg,
> +	.exchange_rm = pnv_ioda2_tce_xchg_rm,
>  #endif
>  	.clear = pnv_ioda2_tce_free,
>  	.get = pnv_tce_get,


* Re: [PATCH kernel 12/15] KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently
  2016-08-03  8:40 ` [PATCH kernel 12/15] KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently Alexey Kardashevskiy
@ 2016-08-12  4:34   ` David Gibson
  0 siblings, 0 replies; 60+ messages in thread
From: David Gibson @ 2016-08-12  4:34 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: linuxppc-dev, Alex Williamson, Paul Mackerras

On Wed, Aug 03, 2016 at 06:40:53PM +1000, Alexey Kardashevskiy wrote:
> It does not make much sense to have KVM on book3s-64 without
> the IOMMU bits for PCI pass-through support, as they cost little
> and allow VFIO to function on book3s KVM.
> 
> Having IOMMU_API always enabled makes it unnecessary to have a lot of
> "#ifdef IOMMU_API" in arch/powerpc/kvm/book3s_64_vio*. With those
> ifdefs we could only have user space emulated devices accelerated
> (but not VFIO), which does not seem very useful.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  arch/powerpc/kvm/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
> index b7c494b..63b60a8 100644
> --- a/arch/powerpc/kvm/Kconfig
> +++ b/arch/powerpc/kvm/Kconfig
> @@ -65,6 +65,7 @@ config KVM_BOOK3S_64
>  	select KVM
>  	select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
>  	select KVM_VFIO if VFIO
> +	select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
>  	---help---
>  	  Support running unmodified book3s_64 and book3s_32 guest kernels
>  	  in virtual machines on book3s_64 host processors.

I don't quite see how this change accomplishes the stated goal.
AFAICT even with this change you can still turn off IOMMU_SUPPORT,
which will break the IOMMU for VFIO passthrough, but not IOMMU
acceleration for emulated devices (since that requires no interaction
with the hardware IOMMU).

* Re: [PATCH kernel 13/15] KVM: PPC: Pass kvm* to kvmppc_find_table()
  2016-08-03  8:40 ` [PATCH kernel 13/15] KVM: PPC: Pass kvm* to kvmppc_find_table() Alexey Kardashevskiy
@ 2016-08-12  4:45   ` David Gibson
  0 siblings, 0 replies; 60+ messages in thread
From: David Gibson @ 2016-08-12  4:45 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: linuxppc-dev, Alex Williamson, Paul Mackerras

On Wed, Aug 03, 2016 at 06:40:54PM +1000, Alexey Kardashevskiy wrote:
> The guest-view TCE tables are per KVM anyway (not per VCPU), so pass
> kvm* there. This will be used in the following patches where we will be
> attaching VFIO containers to LIOBNs via an ioctl() to KVM (rather than
> to a VCPU).
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  arch/powerpc/include/asm/kvm_ppc.h  |  2 +-
>  arch/powerpc/kvm/book3s_64_vio.c    |  7 ++++---
>  arch/powerpc/kvm/book3s_64_vio_hv.c | 13 +++++++------
>  3 files changed, 12 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> index 2544eda..7f1abe9 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -167,7 +167,7 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  				struct kvm_create_spapr_tce_64 *args);
>  extern struct kvmppc_spapr_tce_table *kvmppc_find_table(
> -		struct kvm_vcpu *vcpu, unsigned long liobn);
> +		struct kvm *kvm, unsigned long liobn);
>  extern long kvmppc_ioba_validate(struct kvmppc_spapr_tce_table *stt,
>  		unsigned long ioba, unsigned long npages);
>  extern long kvmppc_tce_validate(struct kvmppc_spapr_tce_table *tt,
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> index c379ff5..15df8ae 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -212,12 +212,13 @@ fail:
>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		      unsigned long ioba, unsigned long tce)
>  {
> -	struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
> +	struct kvmppc_spapr_tce_table *stt;
>  	long ret;
>  
>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>  	/* 	    liobn, ioba, tce); */
>  
> +	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
>  		return H_TOO_HARD;
>  
> @@ -245,7 +246,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	u64 __user *tces;
>  	u64 tce;
>  
> -	stt = kvmppc_find_table(vcpu, liobn);
> +	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
>  		return H_TOO_HARD;
>  
> @@ -299,7 +300,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret;
>  
> -	stt = kvmppc_find_table(vcpu, liobn);
> +	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
>  		return H_TOO_HARD;
>  
> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> index a3be4bd..8a6834e 100644
> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -49,10 +49,9 @@
>   * WARNING: This will be called in real or virtual mode on HV KVM and virtual
>   *          mode on PR KVM
>   */
> -struct kvmppc_spapr_tce_table *kvmppc_find_table(struct kvm_vcpu *vcpu,
> +struct kvmppc_spapr_tce_table *kvmppc_find_table(struct kvm *kvm,
>  		unsigned long liobn)
>  {
> -	struct kvm *kvm = vcpu->kvm;
>  	struct kvmppc_spapr_tce_table *stt;
>  
>  	list_for_each_entry_lockless(stt, &kvm->arch.spapr_tce_tables, list)
> @@ -194,12 +193,13 @@ static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
>  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		unsigned long ioba, unsigned long tce)
>  {
> -	struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
> +	struct kvmppc_spapr_tce_table *stt;
>  	long ret;
>  
>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>  	/* 	    liobn, ioba, tce); */
>  
> +	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
>  		return H_TOO_HARD;
>  
> @@ -252,7 +252,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	unsigned long tces, entry, ua = 0;
>  	unsigned long *rmap = NULL;
>  
> -	stt = kvmppc_find_table(vcpu, liobn);
> +	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
>  		return H_TOO_HARD;
>  
> @@ -335,7 +335,7 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret;
>  
> -	stt = kvmppc_find_table(vcpu, liobn);
> +	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
>  		return H_TOO_HARD;
>  
> @@ -356,12 +356,13 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>  long kvmppc_h_get_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		      unsigned long ioba)
>  {
> -	struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
> +	struct kvmppc_spapr_tce_table *stt;
>  	long ret;
>  	unsigned long idx;
>  	struct page *page;
>  	u64 *tbl;
>  
> +	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
>  		return H_TOO_HARD;
>  

* Re: [PATCH kernel 05/15] powerpc/iommu: Stop using @current in mm_iommu_xxx
  2016-08-12  2:57   ` David Gibson
@ 2016-08-12  4:56     ` Alexey Kardashevskiy
  2016-08-15 10:58       ` David Gibson
  0 siblings, 1 reply; 60+ messages in thread
From: Alexey Kardashevskiy @ 2016-08-12  4:56 UTC (permalink / raw)
  To: David Gibson; +Cc: linuxppc-dev, Alex Williamson, Paul Mackerras


On 12/08/16 12:57, David Gibson wrote:
> On Wed, Aug 03, 2016 at 06:40:46PM +1000, Alexey Kardashevskiy wrote:
>> In some situations the userspace memory context may live longer than
>> the userspace process itself, so if we need to do proper memory context
>> cleanup, we had better cache @mm and use it later, when the process is
>> gone (@current or @current->mm is NULL).
>>
>> This changes mm_iommu_xxx API to receive mm_struct instead of using one
>> from @current.
>>
>> This is needed by the following patch to do proper cleanup in time.
>> This depends on "powerpc/powernv/ioda: Fix endianness when reading TCEs"
>> to do proper cleanup via tce_iommu_clear() patch.
>>
>> To keep the API consistent, this replaces mm_context_t with mm_struct;
>> we stick to mm_struct as the mm_iommu_adjust_locked_vm() helper needs
>> access to &mm->mmap_sem.
>>
>> This should cause no behavioral change.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>  arch/powerpc/include/asm/mmu_context.h | 20 +++++++------
>>  arch/powerpc/kernel/setup-common.c     |  2 +-
>>  arch/powerpc/mm/mmu_context_book3s64.c |  4 +--
>>  arch/powerpc/mm/mmu_context_iommu.c    | 54 ++++++++++++++--------------------
>>  drivers/vfio/vfio_iommu_spapr_tce.c    | 41 ++++++++++++++++----------
>>  5 files changed, 62 insertions(+), 59 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
>> index 9d2cd0c..b85cc7b 100644
>> --- a/arch/powerpc/include/asm/mmu_context.h
>> +++ b/arch/powerpc/include/asm/mmu_context.h
>> @@ -18,16 +18,18 @@ extern void destroy_context(struct mm_struct *mm);
>>  #ifdef CONFIG_SPAPR_TCE_IOMMU
>>  struct mm_iommu_table_group_mem_t;
>>  
>> -extern bool mm_iommu_preregistered(void);
>> -extern long mm_iommu_get(unsigned long ua, unsigned long entries,
>> +extern bool mm_iommu_preregistered(struct mm_struct *mm);
>> +extern long mm_iommu_get(struct mm_struct *mm,
>> +		unsigned long ua, unsigned long entries,
>>  		struct mm_iommu_table_group_mem_t **pmem);
>> -extern long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem);
>> -extern void mm_iommu_init(mm_context_t *ctx);
>> -extern void mm_iommu_cleanup(mm_context_t *ctx);
>> -extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
>> -		unsigned long size);
>> -extern struct mm_iommu_table_group_mem_t *mm_iommu_find(unsigned long ua,
>> -		unsigned long entries);
>> +extern long mm_iommu_put(struct mm_struct *mm,
>> +		struct mm_iommu_table_group_mem_t *mem);
>> +extern void mm_iommu_init(struct mm_struct *mm);
>> +extern void mm_iommu_cleanup(struct mm_struct *mm);
>> +extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
>> +		unsigned long ua, unsigned long size);
>> +extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
>> +		unsigned long ua, unsigned long entries);
>>  extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
>>  		unsigned long ua, unsigned long *hpa);
>>  extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem);
>> diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
>> index 714b4ba..e90b68a 100644
>> --- a/arch/powerpc/kernel/setup-common.c
>> +++ b/arch/powerpc/kernel/setup-common.c
>> @@ -905,7 +905,7 @@ void __init setup_arch(char **cmdline_p)
>>  	init_mm.context.pte_frag = NULL;
>>  #endif
>>  #ifdef CONFIG_SPAPR_TCE_IOMMU
>> -	mm_iommu_init(&init_mm.context);
>> +	mm_iommu_init(&init_mm);
>>  #endif
>>  	irqstack_early_init();
>>  	exc_lvl_early_init();
>> diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c
>> index b114f8b..ad82735 100644
>> --- a/arch/powerpc/mm/mmu_context_book3s64.c
>> +++ b/arch/powerpc/mm/mmu_context_book3s64.c
>> @@ -115,7 +115,7 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
>>  	mm->context.pte_frag = NULL;
>>  #endif
>>  #ifdef CONFIG_SPAPR_TCE_IOMMU
>> -	mm_iommu_init(&mm->context);
>> +	mm_iommu_init(mm);
>>  #endif
>>  	return 0;
>>  }
>> @@ -160,7 +160,7 @@ static inline void destroy_pagetable_page(struct mm_struct *mm)
>>  void destroy_context(struct mm_struct *mm)
>>  {
>>  #ifdef CONFIG_SPAPR_TCE_IOMMU
>> -	mm_iommu_cleanup(&mm->context);
>> +	mm_iommu_cleanup(mm);
>>  #endif
>>  
>>  #ifdef CONFIG_PPC_ICSWX
>> diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
>> index da6a216..ee6685b 100644
>> --- a/arch/powerpc/mm/mmu_context_iommu.c
>> +++ b/arch/powerpc/mm/mmu_context_iommu.c
>> @@ -53,7 +53,7 @@ static long mm_iommu_adjust_locked_vm(struct mm_struct *mm,
>>  	}
>>  
>>  	pr_debug("[%d] RLIMIT_MEMLOCK HASH64 %c%ld %ld/%ld\n",
>> -			current->pid,
>> +			current ? current->pid : 0,
>>  			incr ? '+' : '-',
>>  			npages << PAGE_SHIFT,
>>  			mm->locked_vm << PAGE_SHIFT,
>> @@ -63,28 +63,22 @@ static long mm_iommu_adjust_locked_vm(struct mm_struct *mm,
>>  	return ret;
>>  }
>>  
>> -bool mm_iommu_preregistered(void)
>> +bool mm_iommu_preregistered(struct mm_struct *mm)
>>  {
>> -	if (!current || !current->mm)
>> -		return false;
>> -
>> -	return !list_empty(&current->mm->context.iommu_group_mem_list);
>> +	return !list_empty(&mm->context.iommu_group_mem_list);
>>  }
>>  EXPORT_SYMBOL_GPL(mm_iommu_preregistered);
>>  
>> -long mm_iommu_get(unsigned long ua, unsigned long entries,
>> +long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
>>  		struct mm_iommu_table_group_mem_t **pmem)
>>  {
>>  	struct mm_iommu_table_group_mem_t *mem;
>>  	long i, j, ret = 0, locked_entries = 0;
>>  	struct page *page = NULL;
>>  
>> -	if (!current || !current->mm)
>> -		return -ESRCH; /* process exited */
>> -
>>  	mutex_lock(&mem_list_mutex);
>>  
>> -	list_for_each_entry_rcu(mem, &current->mm->context.iommu_group_mem_list,
>> +	list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list,
>>  			next) {
>>  		if ((mem->ua == ua) && (mem->entries == entries)) {
>>  			++mem->used;
>> @@ -102,7 +96,7 @@ long mm_iommu_get(unsigned long ua, unsigned long entries,
>>  
>>  	}
>>  
>> -	ret = mm_iommu_adjust_locked_vm(current->mm, entries, true);
>> +	ret = mm_iommu_adjust_locked_vm(mm, entries, true);
>>  	if (ret)
>>  		goto unlock_exit;
>>  
>> @@ -142,11 +136,11 @@ long mm_iommu_get(unsigned long ua, unsigned long entries,
>>  	mem->entries = entries;
>>  	*pmem = mem;
>>  
>> -	list_add_rcu(&mem->next, &current->mm->context.iommu_group_mem_list);
>> +	list_add_rcu(&mem->next, &mm->context.iommu_group_mem_list);
>>  
>>  unlock_exit:
>>  	if (locked_entries && ret)
>> -		mm_iommu_adjust_locked_vm(current->mm, locked_entries, false);
>> +		mm_iommu_adjust_locked_vm(mm, locked_entries, false);
>>  
>>  	mutex_unlock(&mem_list_mutex);
>>  
>> @@ -191,16 +185,13 @@ static void mm_iommu_free(struct rcu_head *head)
>>  static void mm_iommu_release(struct mm_iommu_table_group_mem_t *mem)
>>  {
>>  	list_del_rcu(&mem->next);
>> -	mm_iommu_adjust_locked_vm(current->mm, mem->entries, false);
> 
> AFAICT, you've moved this call from _release() to _put().  Won't that cause a
> behavioural change?


mm_iommu_put() calls mm_iommu_adjust_locked_vm() right after
mm_iommu_release(), so no, it does not look like it.



>>  	call_rcu(&mem->rcu, mm_iommu_free);
>>  }
>>  
>> -long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem)
>> +long mm_iommu_put(struct mm_struct *mm, struct mm_iommu_table_group_mem_t *mem)
>>  {
>>  	long ret = 0;
>>  
>> -	if (!current || !current->mm)
>> -		return -ESRCH; /* process exited */
>>  
>>  	mutex_lock(&mem_list_mutex);
>>  
>> @@ -224,6 +215,8 @@ long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem)
>>  	/* @mapped became 0 so now mappings are disabled, release the region */
>>  	mm_iommu_release(mem);
>>  
>> +	mm_iommu_adjust_locked_vm(mm, mem->entries, false);
>> +
>>  unlock_exit:
>>  	mutex_unlock(&mem_list_mutex);
>>  
>> @@ -231,14 +224,12 @@ unlock_exit:
>>  }
>>  EXPORT_SYMBOL_GPL(mm_iommu_put);
>>  
>> -struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
>> -		unsigned long size)
>> +struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
>> +		unsigned long ua, unsigned long size)
>>  {
>>  	struct mm_iommu_table_group_mem_t *mem, *ret = NULL;
>>  
>> -	list_for_each_entry_rcu(mem,
>> -			&current->mm->context.iommu_group_mem_list,
>> -			next) {
>> +	list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list, next) {
>>  		if ((mem->ua <= ua) &&
>>  				(ua + size <= mem->ua +
>>  				 (mem->entries << PAGE_SHIFT))) {
>> @@ -251,14 +242,12 @@ struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
>>  }
>>  EXPORT_SYMBOL_GPL(mm_iommu_lookup);
>>  
>> -struct mm_iommu_table_group_mem_t *mm_iommu_find(unsigned long ua,
>> -		unsigned long entries)
>> +struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
>> +		unsigned long ua, unsigned long entries)
>>  {
>>  	struct mm_iommu_table_group_mem_t *mem, *ret = NULL;
>>  
>> -	list_for_each_entry_rcu(mem,
>> -			&current->mm->context.iommu_group_mem_list,
>> -			next) {
>> +	list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list, next) {
>>  		if ((mem->ua == ua) && (mem->entries == entries)) {
>>  			ret = mem;
>>  			break;
>> @@ -300,16 +289,17 @@ void mm_iommu_mapped_dec(struct mm_iommu_table_group_mem_t *mem)
>>  }
>>  EXPORT_SYMBOL_GPL(mm_iommu_mapped_dec);
>>  
>> -void mm_iommu_init(mm_context_t *ctx)
>> +void mm_iommu_init(struct mm_struct *mm)
>>  {
>> -	INIT_LIST_HEAD_RCU(&ctx->iommu_group_mem_list);
>> +	INIT_LIST_HEAD_RCU(&mm->context.iommu_group_mem_list);
>>  }
>>  
>> -void mm_iommu_cleanup(mm_context_t *ctx)
>> +void mm_iommu_cleanup(struct mm_struct *mm)
>>  {
>>  	struct mm_iommu_table_group_mem_t *mem, *tmp;
>>  
>> -	list_for_each_entry_safe(mem, tmp, &ctx->iommu_group_mem_list, next) {
>> +	list_for_each_entry_safe(mem, tmp, &mm->context.iommu_group_mem_list,
>> +			next) {
>>  		list_del_rcu(&mem->next);
>>  		mm_iommu_do_free(mem);
>>  	}
>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
>> index 80378dd..9752e77 100644
>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
>> @@ -98,6 +98,7 @@ struct tce_container {
>>  	bool enabled;
>>  	bool v2;
>>  	unsigned long locked_pages;
>> +	struct mm_struct *mm;
>>  	struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
>>  	struct list_head group_list;
>>  };
>> @@ -110,11 +111,11 @@ static long tce_iommu_unregister_pages(struct tce_container *container,
>>  	if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK))
>>  		return -EINVAL;
>>  
>> -	mem = mm_iommu_find(vaddr, size >> PAGE_SHIFT);
>> +	mem = mm_iommu_find(container->mm, vaddr, size >> PAGE_SHIFT);
>>  	if (!mem)
>>  		return -ENOENT;
>>  
>> -	return mm_iommu_put(mem);
>> +	return mm_iommu_put(container->mm, mem);
>>  }
>>  
>>  static long tce_iommu_register_pages(struct tce_container *container,
>> @@ -128,10 +129,17 @@ static long tce_iommu_register_pages(struct tce_container *container,
>>  			((vaddr + size) < vaddr))
>>  		return -EINVAL;
>>  
>> -	ret = mm_iommu_get(vaddr, entries, &mem);
>> +	if (!container->mm) {
>> +		if (!current->mm)
>> +			return -ESRCH; /* process exited */
> 
> Can this ever happen?  Surely the ioctl() path shouldn't be called
> after the process mm has been cleaned up?  i.e. should this be a
> WARN_ON().

With SMP (one thread doing the ioctl() while another is exiting QEMU) I am
not sure it is impossible, but it is certainly quite hard to trigger this check.


> 
>> +
>> +		atomic_inc(&current->mm->mm_count);
> 
> What balances this atomic_inc()?  Is it the mmdrop() added to
> tce_iommu_release()?

Yes. Surprisingly there is no mmget(); there is mmget_not_zero(), but that
is for mm->mm_users.
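The lifetime rule being discussed here, pin mm->mm_count on first use and
balance it with mmdrop() on container release, can be sketched in plain C
with stand-in structures (all names below are hypothetical; C11 atomics
stand in for the kernel's atomic_inc() and mmdrop()):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Stand-in for struct mm_struct: only the counter discussed above. */
struct fake_mm {
	atomic_int mm_count;	/* pinned via atomic_inc(), released via mmdrop() */
};

struct fake_container {
	struct fake_mm *mm;	/* cached on the first preregistration ioctl */
};

/* First ioctl caches and pins the caller's mm; later calls are no-ops. */
static void container_cache_mm(struct fake_container *c, struct fake_mm *current_mm)
{
	if (!c->mm) {
		atomic_fetch_add(&current_mm->mm_count, 1);	/* like atomic_inc(&mm->mm_count) */
		c->mm = current_mm;
	}
}

/* Container release path: drop the pin taken above (like mmdrop()). */
static void container_release(struct fake_container *c)
{
	if (c->mm) {
		atomic_fetch_sub(&c->mm->mm_count, 1);
		c->mm = NULL;
	}
}
```

The second call being a no-op is what keeps the pin balanced: exactly one
increment per container lifetime, matched by the single mmdrop() in
tce_iommu_release().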

> 
>> +		container->mm = current->mm;
>> +	}
> 
> Surely you need an error (or else a BUG_ON()) if container->mm is
> non-NULL and differs from current->mm.  I believe VFIO already assumes
> the container is owned by a single mm, but it looks like you should
> verify that here.

I am not sure I really want to enforce it, do I? Who knows what kind of
crazy person would create a container, pin pages and then fork() that
userspace tool, which may not be QEMU but something custom using DPDK.

What harm can not having this BUG_ON() cause?
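For reference, the check under discussion does not have to be a BUG_ON() at
all; a softer variant could simply fail the ioctl. A minimal userspace
sketch, with hypothetical names and plain C stand-ins for the kernel
structures:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct owner_mm;			/* opaque stand-in for struct mm_struct */

struct owned_container {
	struct owner_mm *mm;		/* first claimant owns the container */
};

/*
 * Hypothetical helper, not the posted code: refuse the ioctl with an
 * error when the container has already been claimed by a different mm.
 */
static int container_claim_mm(struct owned_container *c, struct owner_mm *current_mm)
{
	if (!current_mm)
		return -ESRCH;		/* process exited */
	if (c->mm && c->mm != current_mm)
		return -EPERM;		/* already owned by another mm */
	c->mm = current_mm;		/* first caller claims ownership */
	return 0;
}
```

Returning -EPERM here would stop a forked child (or anything else with a
different mm) from mixing address spaces without crashing the kernel.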


> 
>> +
>> +	ret = mm_iommu_get(container->mm, vaddr, entries, &mem);
>>  	if (ret)
>>  		return ret;
>> -
>>  	container->enabled = true;
>>  
>>  	return 0;
>> @@ -354,6 +362,8 @@ static void tce_iommu_release(void *iommu_data)
>>  		tce_iommu_free_table(tbl);
>>  	}
>>  
>> +	if (container->mm)
>> +		mmdrop(container->mm);
>>  	tce_iommu_disable(container);
>>  	mutex_destroy(&container->lock);
>>  
>> @@ -369,13 +379,14 @@ static void tce_iommu_unuse_page(struct tce_container *container,
>>  	put_page(page);
>>  }
>>  
>> -static int tce_iommu_prereg_ua_to_hpa(unsigned long tce, unsigned long size,
>> +static int tce_iommu_prereg_ua_to_hpa(struct tce_container *container,
>> +		unsigned long tce, unsigned long size,
>>  		unsigned long *phpa, struct mm_iommu_table_group_mem_t **pmem)
>>  {
>>  	long ret = 0;
>>  	struct mm_iommu_table_group_mem_t *mem;
>>  
>> -	mem = mm_iommu_lookup(tce, size);
>> +	mem = mm_iommu_lookup(container->mm, tce, size);
>>  	if (!mem)
>>  		return -EINVAL;
>>  
>> @@ -388,18 +399,18 @@ static int tce_iommu_prereg_ua_to_hpa(unsigned long tce, unsigned long size,
>>  	return 0;
>>  }
>>  
>> -static void tce_iommu_unuse_page_v2(struct iommu_table *tbl,
>> -		unsigned long entry)
>> +static void tce_iommu_unuse_page_v2(struct tce_container *container,
>> +		struct iommu_table *tbl, unsigned long entry)
>>  {
>>  	struct mm_iommu_table_group_mem_t *mem = NULL;
>>  	int ret;
>>  	unsigned long hpa = 0;
>>  	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>>  
>> -	if (!pua || !current || !current->mm)
>> +	if (!pua)
>>  		return;
>>  
>> -	ret = tce_iommu_prereg_ua_to_hpa(*pua, IOMMU_PAGE_SIZE(tbl),
>> +	ret = tce_iommu_prereg_ua_to_hpa(container, *pua, IOMMU_PAGE_SIZE(tbl),
>>  			&hpa, &mem);
>>  	if (ret)
>>  		pr_debug("%s: tce %lx at #%lx was not cached, ret=%d\n",
>> @@ -429,7 +440,7 @@ static int tce_iommu_clear(struct tce_container *container,
>>  			continue;
>>  
>>  		if (container->v2) {
>> -			tce_iommu_unuse_page_v2(tbl, entry);
>> +			tce_iommu_unuse_page_v2(container, tbl, entry);
>>  			continue;
>>  		}
>>  
>> @@ -514,8 +525,8 @@ static long tce_iommu_build_v2(struct tce_container *container,
>>  		unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl,
>>  				entry + i);
>>  
>> -		ret = tce_iommu_prereg_ua_to_hpa(tce, IOMMU_PAGE_SIZE(tbl),
>> -				&hpa, &mem);
>> +		ret = tce_iommu_prereg_ua_to_hpa(container,
>> +				tce, IOMMU_PAGE_SIZE(tbl), &hpa, &mem);
>>  		if (ret)
>>  			break;
>>  
>> @@ -536,7 +547,7 @@ static long tce_iommu_build_v2(struct tce_container *container,
>>  		ret = iommu_tce_xchg(tbl, entry + i, &hpa, &dirtmp);
>>  		if (ret) {
>>  			/* dirtmp cannot be DMA_NONE here */
>> -			tce_iommu_unuse_page_v2(tbl, entry + i);
>> +			tce_iommu_unuse_page_v2(container, tbl, entry + i);
>>  			pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%ld\n",
>>  					__func__, entry << tbl->it_page_shift,
>>  					tce, ret);
>> @@ -544,7 +555,7 @@ static long tce_iommu_build_v2(struct tce_container *container,
>>  		}
>>  
>>  		if (dirtmp != DMA_NONE)
>> -			tce_iommu_unuse_page_v2(tbl, entry + i);
>> +			tce_iommu_unuse_page_v2(container, tbl, entry + i);
>>  
>>  		*pua = tce;
>>  
> 


-- 
Alexey



* Re: [PATCH kernel 14/15] vfio/spapr_tce: Export container API for external users
  2016-08-03  8:40 ` [PATCH kernel 14/15] vfio/spapr_tce: Export container API for external users Alexey Kardashevskiy
  2016-08-08 16:43   ` Alex Williamson
@ 2016-08-12  5:25   ` David Gibson
  1 sibling, 0 replies; 60+ messages in thread
From: David Gibson @ 2016-08-12  5:25 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: linuxppc-dev, Alex Williamson, Paul Mackerras


On Wed, Aug 03, 2016 at 06:40:55PM +1000, Alexey Kardashevskiy wrote:
> This exports helpers which are needed to keep a VFIO container in
> memory while there are external users such as KVM.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

I'll address Alex W's broader concerns in a different mail.  But
there are some more superficial problems with this as well.

> ---
>  drivers/vfio/vfio.c                 | 30 ++++++++++++++++++++++++++++++
>  drivers/vfio/vfio_iommu_spapr_tce.c | 16 +++++++++++++++-
>  include/linux/vfio.h                |  6 ++++++
>  3 files changed, 51 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index d1d70e0..baf6a9c 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -1729,6 +1729,36 @@ long vfio_external_check_extension(struct vfio_group *group, unsigned long arg)
>  EXPORT_SYMBOL_GPL(vfio_external_check_extension);
>  
>  /**
> + * External user API for containers, exported by symbols to be linked
> + * dynamically.
> + *
> + */
> +struct vfio_container *vfio_container_get_ext(struct file *filep)
> +{
> +	struct vfio_container *container = filep->private_data;
> +
> +	if (filep->f_op != &vfio_fops)
> +		return ERR_PTR(-EINVAL);
> +
> +	vfio_container_get(container);
> +
> +	return container;
> +}
> +EXPORT_SYMBOL_GPL(vfio_container_get_ext);
> +
> +void vfio_container_put_ext(struct vfio_container *container)
> +{
> +	vfio_container_put(container);
> +}
> +EXPORT_SYMBOL_GPL(vfio_container_put_ext);
> +
> +void *vfio_container_get_iommu_data_ext(struct vfio_container *container)
> +{
> +	return container->iommu_data;
> +}
> +EXPORT_SYMBOL_GPL(vfio_container_get_iommu_data_ext);
> +
> +/**
>   * Sub-module support
>   */
>  /*
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index 3594ad3..fceea3d 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -1331,6 +1331,21 @@ const struct vfio_iommu_driver_ops tce_iommu_driver_ops = {
>  	.detach_group	= tce_iommu_detach_group,
>  };
>  
> +struct iommu_table *vfio_container_spapr_tce_table_get_ext(void *iommu_data,
> +		u64 offset)

I really dislike this name.  I was confused for a while about why this
exists on top of vfio_container_get_ext(); the names are so similar.


Making it take a void * is also really nasty since that void * has to
be something specific.  It would be better to have this take a
vfio_container *, verify that the container really does have an
spapr_tce backend, then lookup the tce_container and the actual IOMMU
tables within.

That might also let you drop vfio_container_get_iommu_data_ext()
entirely.
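The typed variant being suggested might look roughly like this. This is a
userspace sketch with stand-in structures and a hypothetical function name;
the real lookup would go through tce_iommu_find_table() and take the table
reference:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-ins for the VFIO structures involved. */
struct iommu_table { unsigned long it_offset; };

struct vfio_iommu_driver_ops { const char *name; };
static const struct vfio_iommu_driver_ops tce_ops = { "vfio-iommu-spapr-tce" };

struct vfio_container {
	const struct vfio_iommu_driver_ops *ops;  /* backend driving iommu_data */
	void *iommu_data;			  /* struct tce_container iff ops match */
};

struct tce_container { struct iommu_table *tbl; };

/*
 * Hypothetical typed lookup: take the container itself, verify it really
 * carries the spapr-tce backend, and only then reach into iommu_data.
 */
static struct iommu_table *
spapr_tce_table_from_container(struct vfio_container *container, unsigned long offset)
{
	struct tce_container *tce;

	if (!container || container->ops != &tce_ops)
		return NULL;			/* not an spapr-tce container */
	tce = container->iommu_data;
	(void)offset;				/* real code: tce_iommu_find_table() */
	return tce ? tce->tbl : NULL;
}
```

Because the backend check happens inside the helper, the caller never
handles a raw void *, which is exactly what would let
vfio_container_get_iommu_data_ext() go away.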

> +{
> +	struct tce_container *container = iommu_data;
> +	struct iommu_table *tbl = NULL;
> +
> +	if (tce_iommu_find_table(container, offset, &tbl) < 0)
> +		return NULL;
> +
> +	iommu_table_get(tbl);
> +
> +	return tbl;
> +}
> +EXPORT_SYMBOL_GPL(vfio_container_spapr_tce_table_get_ext);
> +
>  static int __init tce_iommu_init(void)
>  {
>  	return vfio_register_iommu_driver(&tce_iommu_driver_ops);
> @@ -1348,4 +1363,3 @@ MODULE_VERSION(DRIVER_VERSION);
>  MODULE_LICENSE("GPL v2");
>  MODULE_AUTHOR(DRIVER_AUTHOR);
>  MODULE_DESCRIPTION(DRIVER_DESC);
> -
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 0ecae0b..1c2138a 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -91,6 +91,12 @@ extern void vfio_group_put_external_user(struct vfio_group *group);
>  extern int vfio_external_user_iommu_id(struct vfio_group *group);
>  extern long vfio_external_check_extension(struct vfio_group *group,
>  					  unsigned long arg);
> +extern struct vfio_container *vfio_container_get_ext(struct file *filep);
> +extern void vfio_container_put_ext(struct vfio_container *container);
> +extern void *vfio_container_get_iommu_data_ext(
> +		struct vfio_container *container);
> +extern struct iommu_table *vfio_container_spapr_tce_table_get_ext(
> +		void *iommu_data, u64 offset);
>  
>  /*
>   * Sub-module helpers

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH kernel 14/15] vfio/spapr_tce: Export container API for external users
  2016-08-10 16:46           ` Alex Williamson
@ 2016-08-12  5:46             ` David Gibson
  2016-08-12  6:12               ` Alexey Kardashevskiy
  2016-08-12 15:22               ` Alex Williamson
  0 siblings, 2 replies; 60+ messages in thread
From: David Gibson @ 2016-08-12  5:46 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Alexey Kardashevskiy, linuxppc-dev, Paul Mackerras


On Wed, Aug 10, 2016 at 10:46:30AM -0600, Alex Williamson wrote:
> On Wed, 10 Aug 2016 15:37:17 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
> > On 09/08/16 22:16, Alex Williamson wrote:
> > > On Tue, 9 Aug 2016 15:19:39 +1000
> > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > >   
> > >> On 09/08/16 02:43, Alex Williamson wrote:  
> > >>> On Wed,  3 Aug 2016 18:40:55 +1000
> > >>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > >>>     
> > >>>> This exports helpers which are needed to keep a VFIO container in
> > >>>> memory while there are external users such as KVM.
> > >>>>
> > >>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > >>>> ---
> > >>>>  drivers/vfio/vfio.c                 | 30 ++++++++++++++++++++++++++++++
> > >>>>  drivers/vfio/vfio_iommu_spapr_tce.c | 16 +++++++++++++++-
> > >>>>  include/linux/vfio.h                |  6 ++++++
> > >>>>  3 files changed, 51 insertions(+), 1 deletion(-)
> > >>>>
> > >>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > >>>> index d1d70e0..baf6a9c 100644
> > >>>> --- a/drivers/vfio/vfio.c
> > >>>> +++ b/drivers/vfio/vfio.c
> > >>>> @@ -1729,6 +1729,36 @@ long vfio_external_check_extension(struct vfio_group *group, unsigned long arg)
> > >>>>  EXPORT_SYMBOL_GPL(vfio_external_check_extension);
> > >>>>  
> > >>>>  /**
> > >>>> + * External user API for containers, exported by symbols to be linked
> > >>>> + * dynamically.
> > >>>> + *
> > >>>> + */
> > >>>> +struct vfio_container *vfio_container_get_ext(struct file *filep)
> > >>>> +{
> > >>>> +	struct vfio_container *container = filep->private_data;
> > >>>> +
> > >>>> +	if (filep->f_op != &vfio_fops)
> > >>>> +		return ERR_PTR(-EINVAL);
> > >>>> +
> > >>>> +	vfio_container_get(container);
> > >>>> +
> > >>>> +	return container;
> > >>>> +}
> > >>>> +EXPORT_SYMBOL_GPL(vfio_container_get_ext);
> > >>>> +
> > >>>> +void vfio_container_put_ext(struct vfio_container *container)
> > >>>> +{
> > >>>> +	vfio_container_put(container);
> > >>>> +}
> > >>>> +EXPORT_SYMBOL_GPL(vfio_container_put_ext);
> > >>>> +
> > >>>> +void *vfio_container_get_iommu_data_ext(struct vfio_container *container)
> > >>>> +{
> > >>>> +	return container->iommu_data;
> > >>>> +}
> > >>>> +EXPORT_SYMBOL_GPL(vfio_container_get_iommu_data_ext);
> > >>>> +
> > >>>> +/**
> > >>>>   * Sub-module support
> > >>>>   */
> > >>>>  /*
> > >>>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> > >>>> index 3594ad3..fceea3d 100644
> > >>>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> > >>>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> > >>>> @@ -1331,6 +1331,21 @@ const struct vfio_iommu_driver_ops tce_iommu_driver_ops = {
> > >>>>  	.detach_group	= tce_iommu_detach_group,
> > >>>>  };
> > >>>>  
> > >>>> +struct iommu_table *vfio_container_spapr_tce_table_get_ext(void *iommu_data,
> > >>>> +		u64 offset)
> > >>>> +{
> > >>>> +	struct tce_container *container = iommu_data;
> > >>>> +	struct iommu_table *tbl = NULL;
> > >>>> +
> > >>>> +	if (tce_iommu_find_table(container, offset, &tbl) < 0)
> > >>>> +		return NULL;
> > >>>> +
> > >>>> +	iommu_table_get(tbl);
> > >>>> +
> > >>>> +	return tbl;
> > >>>> +}
> > >>>> +EXPORT_SYMBOL_GPL(vfio_container_spapr_tce_table_get_ext);
> > >>>> +
> > >>>>  static int __init tce_iommu_init(void)
> > >>>>  {
> > >>>>  	return vfio_register_iommu_driver(&tce_iommu_driver_ops);
> > >>>> @@ -1348,4 +1363,3 @@ MODULE_VERSION(DRIVER_VERSION);
> > >>>>  MODULE_LICENSE("GPL v2");
> > >>>>  MODULE_AUTHOR(DRIVER_AUTHOR);
> > >>>>  MODULE_DESCRIPTION(DRIVER_DESC);
> > >>>> -
> > >>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> > >>>> index 0ecae0b..1c2138a 100644
> > >>>> --- a/include/linux/vfio.h
> > >>>> +++ b/include/linux/vfio.h
> > >>>> @@ -91,6 +91,12 @@ extern void vfio_group_put_external_user(struct vfio_group *group);
> > >>>>  extern int vfio_external_user_iommu_id(struct vfio_group *group);
> > >>>>  extern long vfio_external_check_extension(struct vfio_group *group,
> > >>>>  					  unsigned long arg);
> > >>>> +extern struct vfio_container *vfio_container_get_ext(struct file *filep);
> > >>>> +extern void vfio_container_put_ext(struct vfio_container *container);
> > >>>> +extern void *vfio_container_get_iommu_data_ext(
> > >>>> +		struct vfio_container *container);
> > >>>> +extern struct iommu_table *vfio_container_spapr_tce_table_get_ext(
> > >>>> +		void *iommu_data, u64 offset);
> > >>>>  
> > >>>>  /*
> > >>>>   * Sub-module helpers    
> > >>>
> > >>>
> > >>> I think you need to take a closer look of the lifecycle of a container,
> > >>> having a reference means the container itself won't go away, but only
> > >>> having a group set within that container holds the actual IOMMU
> > >>> references.  container->iommu_data is going to be NULL once the
> > >>> groups are lost.  Thanks,    
> > >>
> > >>
> > >> Container owns the iommu tables and this is what I care about here, groups
> > >> attached or not - this is handled separately via IOMMU group list in a
> > >> specific iommu_table struct, these groups get detached from iommu_table
> > >> when they are removed from a container.  
> > > 
> > > The container doesn't own anything, the container is privileged by the
> > > groups being attached to it.  When groups are closed, they detach from
> > > the container and once the container group list is empty the iommu
> > > backend is released and iommu_data is NULL.  A container reference
> > > doesn't give you what you're looking for.  It implies nothing about the
> > > iommu backend.  
> > 
> > 
> > Well, the backend is part of the container, and since the backend owns
> > the tables, the container owns them too.
> 
> The IOMMU backend is accessed through the container, but that backend
> is privileged by the groups it contains.  Once those groups are gone,
> the IOMMU backend is released, regardless of whatever reference you
> have to the container itself such as you're attempting to do here.  In
> that sense, the container does not own those tables.

So, the thing is that what KVM fundamentally needs is a handle on the
container.  KVM is essentially modelling the DMA address space of a
single guest bus, and the container is what's attached to that.

The first part of the problem is that KVM wants to basically invoke
vfio_dma_map() operations without bouncing via qemu.  Because
vfio_dma_map() works on the container level, that's the handle that
KVM needs to hold.

The second part of the problem is that in order to reduce overhead
further, we want to operate in real mode, which means bypassing most
of the usual VFIO structure and going directly(ish) from the KVM
hcall emulation to the IOMMU backend behind VFIO.  This complicates
matters a fair bit.  Because it is, explicitly, a performance hack,
some degree of ugliness is probably inevitable.

Alexey - actually implementing this in two stages might make this
clearer.  The first stage wouldn't allow real mode, and would call
through the same vfio_dma_map() path as qemu calls through now.  The
second stage would then put in place the necessary hacks to add real
mode support.

> > The problem I am trying to solve here is when KVM may release the
> > iommu_table objects.
> > 
> > "Set" ioctl() to KVM-spapr-tce-table (or KVM itself, does not really
> > matter) makes a link between KVM-spapr-tce-table and container and KVM can
> > start using tables (with referencing them).
> > 
> > First I tried adding an "unset" ioctl to KVM-spapr-tce-table, called it
> > from region_del() and this works if QEMU removes a window. However if QEMU
> > removes a vfio-pci device, region_del() is not called and KVM does not get
> > notified that it can release the iommu_table objects because the
> > KVM-spapr-tce-table remains alive and does not get destroyed (as it is
> > still used by emulated devices or other containers).
> > 
> > So it was suggested that we could do such an "unset" somehow later: for
> > example, on every "set" I could check whether some of the currently
> > attached containers are no longer used - and this is where being able to
> > know that there is no backend helps - KVM remembers a container pointer
> > and can check it via vfio_container_get_iommu_data_ext().
> > 
> > The other option would be changing vfio_container_get_ext() to take a
> > callback+opaque which container would call when it destroys iommu_data.
> > This looks more intrusive and not very intuitive how to make it right -
> > container would have to keep track of all registered external users and
> > vfio_container_put_ext() would have to pass the same callback+opaque to
> > unregister the exact external user.
> 
> I'm not in favor of anything resembling the code above or extensions
> beyond it, the container is the wrong place to do this.
> 
> > Or I could store container file* in KVM. Then iommu_data would never be
> > released until KVM-spapr-tce-table is destroyed.
> 
> See above, holding a file pointer to the container doesn't do squat.
> The groups that are held by the container empower the IOMMU backend,
> references to the container itself don't matter.  Those references will
> not maintain the IOMMU data.
>  
> > Recreating KVM-spapr-tce-table on every vfio-pci hotunplug (closing its fd
> > would "unset" container from KVM-spapr-tce-table) is not an option as there
> > still may be devices using this KVM-spapr-tce-table.
> > 
> > What obvious and nice solution am I missing here? Thanks.
> 
> The interactions with the IOMMU backend that seem relevant are
> vfio_iommu_drivers_ops.{detach_group,release}.  The kvm-vfio pseudo
> device is also used to tell kvm about groups as they come and go and
> has a way to check extensions, and thus properties of the IOMMU
> backend.  All of these are available for your {ab}use.  Thanks,

So, Alexey started trying to do this via the KVM-VFIO device, but it's
a really bad fit.  As noted above, fundamentally it's a container we
need to attach to the kvm-spapr-tce-table object, since what that
represents is a guest bus DMA address space, and by definition all the
groups in a container must have the same DMA address space.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH kernel 14/15] vfio/spapr_tce: Export container API for external users
  2016-08-12  5:46             ` David Gibson
@ 2016-08-12  6:12               ` Alexey Kardashevskiy
  2016-08-15 11:07                 ` David Gibson
  2016-08-12 15:22               ` Alex Williamson
  1 sibling, 1 reply; 60+ messages in thread
From: Alexey Kardashevskiy @ 2016-08-12  6:12 UTC (permalink / raw)
  To: David Gibson, Alex Williamson; +Cc: linuxppc-dev, Paul Mackerras



On 12/08/16 15:46, David Gibson wrote:
> On Wed, Aug 10, 2016 at 10:46:30AM -0600, Alex Williamson wrote:
>> On Wed, 10 Aug 2016 15:37:17 +1000
>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>
>>> On 09/08/16 22:16, Alex Williamson wrote:
>>>> On Tue, 9 Aug 2016 15:19:39 +1000
>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>   
>>>>> On 09/08/16 02:43, Alex Williamson wrote:  
>>>>>> On Wed,  3 Aug 2016 18:40:55 +1000
>>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>>>     
>>>>>>> This exports helpers which are needed to keep a VFIO container in
>>>>>>> memory while there are external users such as KVM.
>>>>>>>
>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>> ---
>>>>>>>  drivers/vfio/vfio.c                 | 30 ++++++++++++++++++++++++++++++
>>>>>>>  drivers/vfio/vfio_iommu_spapr_tce.c | 16 +++++++++++++++-
>>>>>>>  include/linux/vfio.h                |  6 ++++++
>>>>>>>  3 files changed, 51 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>>>>>>> index d1d70e0..baf6a9c 100644
>>>>>>> --- a/drivers/vfio/vfio.c
>>>>>>> +++ b/drivers/vfio/vfio.c
>>>>>>> @@ -1729,6 +1729,36 @@ long vfio_external_check_extension(struct vfio_group *group, unsigned long arg)
>>>>>>>  EXPORT_SYMBOL_GPL(vfio_external_check_extension);
>>>>>>>  
>>>>>>>  /**
>>>>>>> + * External user API for containers, exported by symbols to be linked
>>>>>>> + * dynamically.
>>>>>>> + *
>>>>>>> + */
>>>>>>> +struct vfio_container *vfio_container_get_ext(struct file *filep)
>>>>>>> +{
>>>>>>> +	struct vfio_container *container = filep->private_data;
>>>>>>> +
>>>>>>> +	if (filep->f_op != &vfio_fops)
>>>>>>> +		return ERR_PTR(-EINVAL);
>>>>>>> +
>>>>>>> +	vfio_container_get(container);
>>>>>>> +
>>>>>>> +	return container;
>>>>>>> +}
>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_get_ext);
>>>>>>> +
>>>>>>> +void vfio_container_put_ext(struct vfio_container *container)
>>>>>>> +{
>>>>>>> +	vfio_container_put(container);
>>>>>>> +}
>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_put_ext);
>>>>>>> +
>>>>>>> +void *vfio_container_get_iommu_data_ext(struct vfio_container *container)
>>>>>>> +{
>>>>>>> +	return container->iommu_data;
>>>>>>> +}
>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_get_iommu_data_ext);
>>>>>>> +
>>>>>>> +/**
>>>>>>>   * Sub-module support
>>>>>>>   */
>>>>>>>  /*
>>>>>>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
>>>>>>> index 3594ad3..fceea3d 100644
>>>>>>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
>>>>>>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
>>>>>>> @@ -1331,6 +1331,21 @@ const struct vfio_iommu_driver_ops tce_iommu_driver_ops = {
>>>>>>>  	.detach_group	= tce_iommu_detach_group,
>>>>>>>  };
>>>>>>>  
>>>>>>> +struct iommu_table *vfio_container_spapr_tce_table_get_ext(void *iommu_data,
>>>>>>> +		u64 offset)
>>>>>>> +{
>>>>>>> +	struct tce_container *container = iommu_data;
>>>>>>> +	struct iommu_table *tbl = NULL;
>>>>>>> +
>>>>>>> +	if (tce_iommu_find_table(container, offset, &tbl) < 0)
>>>>>>> +		return NULL;
>>>>>>> +
>>>>>>> +	iommu_table_get(tbl);
>>>>>>> +
>>>>>>> +	return tbl;
>>>>>>> +}
>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_spapr_tce_table_get_ext);
>>>>>>> +
>>>>>>>  static int __init tce_iommu_init(void)
>>>>>>>  {
>>>>>>>  	return vfio_register_iommu_driver(&tce_iommu_driver_ops);
>>>>>>> @@ -1348,4 +1363,3 @@ MODULE_VERSION(DRIVER_VERSION);
>>>>>>>  MODULE_LICENSE("GPL v2");
>>>>>>>  MODULE_AUTHOR(DRIVER_AUTHOR);
>>>>>>>  MODULE_DESCRIPTION(DRIVER_DESC);
>>>>>>> -
>>>>>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
>>>>>>> index 0ecae0b..1c2138a 100644
>>>>>>> --- a/include/linux/vfio.h
>>>>>>> +++ b/include/linux/vfio.h
>>>>>>> @@ -91,6 +91,12 @@ extern void vfio_group_put_external_user(struct vfio_group *group);
>>>>>>>  extern int vfio_external_user_iommu_id(struct vfio_group *group);
>>>>>>>  extern long vfio_external_check_extension(struct vfio_group *group,
>>>>>>>  					  unsigned long arg);
>>>>>>> +extern struct vfio_container *vfio_container_get_ext(struct file *filep);
>>>>>>> +extern void vfio_container_put_ext(struct vfio_container *container);
>>>>>>> +extern void *vfio_container_get_iommu_data_ext(
>>>>>>> +		struct vfio_container *container);
>>>>>>> +extern struct iommu_table *vfio_container_spapr_tce_table_get_ext(
>>>>>>> +		void *iommu_data, u64 offset);
>>>>>>>  
>>>>>>>  /*
>>>>>>>   * Sub-module helpers    
>>>>>>
>>>>>>
>>>>>> I think you need to take a closer look of the lifecycle of a container,
>>>>>> having a reference means the container itself won't go away, but only
>>>>>> having a group set within that container holds the actual IOMMU
>>>>>> references.  container->iommu_data is going to be NULL once the
>>>>>> groups are lost.  Thanks,    
>>>>>
>>>>>
>>>>> Container owns the iommu tables and this is what I care about here, groups
>>>>> attached or not - this is handled separately via IOMMU group list in a
>>>>> specific iommu_table struct, these groups get detached from iommu_table
>>>>> when they are removed from a container.  
>>>>
>>>> The container doesn't own anything, the container is privileged by the
>>>> groups being attached to it.  When groups are closed, they detach from
>>>> the container and once the container group list is empty the iommu
>>>> backend is released and iommu_data is NULL.  A container reference
>>>> doesn't give you what you're looking for.  It implies nothing about the
>>>> iommu backend.  
>>>
>>>
>>> Well. Backend is a part of a container and since a backend owns tables, a
>>> container owns them too.
>>
>> The IOMMU backend is accessed through the container, but that backend
>> is privileged by the groups it contains.  Once those groups are gone,
>> the IOMMU backend is released, regardless of whatever reference you
>> have to the container itself such as you're attempting to do here.  In
>> that sense, the container does not own those tables.
> 
> So, the thing is that what KVM fundamentally needs is a handle on the
> container.  KVM is essentially modelling the DMA address space of a
> single guest bus, and the container is what's attached to that.
> 
> The first part of the problem is that KVM wants to basically invoke
> vfio_dma_map() operations without bouncing via qemu.  Because
> vfio_dma_map() works on the container level, that's the handle that
> KVM needs to hold.


Well, I do not need to hold the reference to the container all the time, I
just need it to get to the IOMMU backend and take a reference to an
iommu_table from it; holding the container reference makes sure the backend
does not go away before we reference the iommu_table.

After that I only keep a reference to the container to know if/when I can
release a particular iommu_table. This can be worked around by counting how
many groups were attached to this particular KVM-spapr-tce-table and
looking at the IOMMU group list attached to an iommu_table - if the list is
empty, decrement the iommu_table reference counter and that's it, no extra
references to a VFIO container.

Or I need an alternative way of getting iommu_tables, i.e. QEMU should
somehow tell KVM that this LIOBN is this VFIO container fd (easy - can be
done via the region_add/region_del interface) or VFIO IOMMU group fd(s)
(more tricky as this needs to be done from more places - vfio-pci
hotplug/unplug, window add/remove).
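The counting scheme described above can be sketched as a tiny user-space
model (all names here are illustrative stand-ins, not the actual kernel
structures):

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: a KVM-spapr-tce-table caches a reference-counted
 * iommu_table taken from a container's IOMMU backend, and drops its
 * reference once the table's IOMMU group list goes empty.  Illustrative
 * names only, not the kernel's iommu_table_get()/put(). */

struct toy_iommu_table {
	int refs;		/* models the table reference counter */
	int groups_attached;	/* models the IOMMU group list, as a count */
};

static void toy_table_get(struct toy_iommu_table *tbl)
{
	tbl->refs++;
}

/* Returns 1 when the table was actually freed (refs hit zero). */
static int toy_table_put(struct toy_iommu_table *tbl)
{
	return --tbl->refs == 0;
}

/* The "unset via counting" idea: once the table's group list is empty,
 * the external user (KVM) drops its reference - no container reference
 * is needed for this decision. */
static int toy_kvm_release_if_unused(struct toy_iommu_table *tbl)
{
	if (tbl->groups_attached == 0)
		return toy_table_put(tbl);
	return 0;
}
```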


> The second part of the problem is that in order to reduce overhead
> further, we want to operate in real mode, which means bypassing most
> of the usual VFIO structure and going directly(ish) from the KVM
> hcall emulation to the IOMMU backend behind VFIO.  This complicates
> matters a fair bit.  Because it is, explicitly, a performance hack,
> some degree of ugliness is probably inevitable.
> 
> Alexey - actually implementing this in two stages might make this
> clearer.  The first stage wouldn't allow real mode, and would call
> through the same vfio_dma_map() path as qemu calls through now.  The
> second stage would then put in place the necessary hacks to add real
> mode support.
> 
>>> The problem I am trying to solve here is when KVM may release the
>>> iommu_table objects.
>>>
>>> "Set" ioctl() to KVM-spapr-tce-table (or KVM itself, does not really
>>> matter) makes a link between KVM-spapr-tce-table and container and KVM can
>>> start using tables (with referencing them).
>>>
>>> First I tried adding an "unset" ioctl to KVM-spapr-tce-table, called it
>>> from region_del() and this works if QEMU removes a window. However if QEMU
>>> removes a vfio-pci device, region_del() is not called and KVM does not get
>>> notified that it can release the iommu_table's because the
>>> KVM-spapr-tce-table remains alive and does not get destroyed (as it is
>>> still used by emulated devices or other containers).
>>>
>>> So it was suggested that we could do such "unset" somehow later assuming,
>>> for example, on every "set" I could check if some of currently attached
>>> containers are no more used - and this is where being able to know if there
>>> is no backend helps - KVM remembers a container pointer and can check this
>>> via vfio_container_get_iommu_data_ext().
>>>
>>> The other option would be changing vfio_container_get_ext() to take a
>>> callback+opaque which container would call when it destroys iommu_data.
>>> This looks more intrusive and not very intuitive how to make it right -
>>> container would have to keep track of all registered external users and
>>> vfio_container_put_ext() would have to pass the same callback+opaque to
>>> unregister the exact external user.
>>
>> I'm not in favor of anything resembling the code above or extensions
>> beyond it, the container is the wrong place to do this.
>>
>>> Or I could store container file* in KVM. Then iommu_data would never be
>>> released until KVM-spapr-tce-table is destroyed.
>>
>> See above, holding a file pointer to the container doesn't do squat.
>> The groups that are held by the container empower the IOMMU backend,
>> references to the container itself don't matter.  Those references will
>> not maintain the IOMMU data.
>>  
>>> Recreating KVM-spapr-tce-table on every vfio-pci hotunplug (closing its fd
>>> would "unset" container from KVM-spapr-tce-table) is not an option as there
>>> still may be devices using this KVM-spapr-tce-table.
>>>
>>> What obvious and nice solution am I missing here? Thanks.
>>
>> The interactions with the IOMMU backend that seem relevant are
>> vfio_iommu_drivers_ops.{detach_group,release}.  The kvm-vfio pseudo
>> device is also used to tell kvm about groups as they come and go and
>> has a way to check extensions, and thus properties of the IOMMU
>> backend.  All of these are available for your {ab}use.  Thanks,
> 
> So, Alexey started trying to do this via the KVM-VFIO device, but it's
> a really bad fit.  As noted above, fundamentally it's a container we
> need to attach to the kvm-spapr-tce-table object, since what that
> represents is a guest bus DMA address space, and by definition all the
> groups in a container must have the same DMA address space.


Well, in the worst case a LIOBN/kvm-spapr-tce-table has multiple containers
attached, so it is not 1:1...



-- 
Alexey




* Re: [PATCH kernel 14/15] vfio/spapr_tce: Export container API for external users
  2016-08-12  5:46             ` David Gibson
  2016-08-12  6:12               ` Alexey Kardashevskiy
@ 2016-08-12 15:22               ` Alex Williamson
  2016-08-17  3:17                 ` David Gibson
  1 sibling, 1 reply; 60+ messages in thread
From: Alex Williamson @ 2016-08-12 15:22 UTC (permalink / raw)
  To: David Gibson; +Cc: Alexey Kardashevskiy, linuxppc-dev, Paul Mackerras

On Fri, 12 Aug 2016 15:46:01 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Wed, Aug 10, 2016 at 10:46:30AM -0600, Alex Williamson wrote:
> > On Wed, 10 Aug 2016 15:37:17 +1000
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >   
> > > On 09/08/16 22:16, Alex Williamson wrote:  
> > > > On Tue, 9 Aug 2016 15:19:39 +1000
> > > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > > >     
> > > >> On 09/08/16 02:43, Alex Williamson wrote:    
> > > >>> On Wed,  3 Aug 2016 18:40:55 +1000
> > > >>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > > >>>       
> > > >>>> This exports helpers which are needed to keep a VFIO container in
> > > >>>> memory while there are external users such as KVM.
> > > >>>>
> > > >>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > > >>>> ---
> > > >>>>  drivers/vfio/vfio.c                 | 30 ++++++++++++++++++++++++++++++
> > > >>>>  drivers/vfio/vfio_iommu_spapr_tce.c | 16 +++++++++++++++-
> > > >>>>  include/linux/vfio.h                |  6 ++++++
> > > >>>>  3 files changed, 51 insertions(+), 1 deletion(-)
> > > >>>>
> > > >>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > > >>>> index d1d70e0..baf6a9c 100644
> > > >>>> --- a/drivers/vfio/vfio.c
> > > >>>> +++ b/drivers/vfio/vfio.c
> > > >>>> @@ -1729,6 +1729,36 @@ long vfio_external_check_extension(struct vfio_group *group, unsigned long arg)
> > > >>>>  EXPORT_SYMBOL_GPL(vfio_external_check_extension);
> > > >>>>  
> > > >>>>  /**
> > > >>>> + * External user API for containers, exported by symbols to be linked
> > > >>>> + * dynamically.
> > > >>>> + *
> > > >>>> + */
> > > >>>> +struct vfio_container *vfio_container_get_ext(struct file *filep)
> > > >>>> +{
> > > >>>> +	struct vfio_container *container = filep->private_data;
> > > >>>> +
> > > >>>> +	if (filep->f_op != &vfio_fops)
> > > >>>> +		return ERR_PTR(-EINVAL);
> > > >>>> +
> > > >>>> +	vfio_container_get(container);
> > > >>>> +
> > > >>>> +	return container;
> > > >>>> +}
> > > >>>> +EXPORT_SYMBOL_GPL(vfio_container_get_ext);
> > > >>>> +
> > > >>>> +void vfio_container_put_ext(struct vfio_container *container)
> > > >>>> +{
> > > >>>> +	vfio_container_put(container);
> > > >>>> +}
> > > >>>> +EXPORT_SYMBOL_GPL(vfio_container_put_ext);
> > > >>>> +
> > > >>>> +void *vfio_container_get_iommu_data_ext(struct vfio_container *container)
> > > >>>> +{
> > > >>>> +	return container->iommu_data;
> > > >>>> +}
> > > >>>> +EXPORT_SYMBOL_GPL(vfio_container_get_iommu_data_ext);
> > > >>>> +
> > > >>>> +/**
> > > >>>>   * Sub-module support
> > > >>>>   */
> > > >>>>  /*
> > > >>>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> > > >>>> index 3594ad3..fceea3d 100644
> > > >>>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> > > >>>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> > > >>>> @@ -1331,6 +1331,21 @@ const struct vfio_iommu_driver_ops tce_iommu_driver_ops = {
> > > >>>>  	.detach_group	= tce_iommu_detach_group,
> > > >>>>  };
> > > >>>>  
> > > >>>> +struct iommu_table *vfio_container_spapr_tce_table_get_ext(void *iommu_data,
> > > >>>> +		u64 offset)
> > > >>>> +{
> > > >>>> +	struct tce_container *container = iommu_data;
> > > >>>> +	struct iommu_table *tbl = NULL;
> > > >>>> +
> > > >>>> +	if (tce_iommu_find_table(container, offset, &tbl) < 0)
> > > >>>> +		return NULL;
> > > >>>> +
> > > >>>> +	iommu_table_get(tbl);
> > > >>>> +
> > > >>>> +	return tbl;
> > > >>>> +}
> > > >>>> +EXPORT_SYMBOL_GPL(vfio_container_spapr_tce_table_get_ext);
> > > >>>> +
> > > >>>>  static int __init tce_iommu_init(void)
> > > >>>>  {
> > > >>>>  	return vfio_register_iommu_driver(&tce_iommu_driver_ops);
> > > >>>> @@ -1348,4 +1363,3 @@ MODULE_VERSION(DRIVER_VERSION);
> > > >>>>  MODULE_LICENSE("GPL v2");
> > > >>>>  MODULE_AUTHOR(DRIVER_AUTHOR);
> > > >>>>  MODULE_DESCRIPTION(DRIVER_DESC);
> > > >>>> -
> > > >>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> > > >>>> index 0ecae0b..1c2138a 100644
> > > >>>> --- a/include/linux/vfio.h
> > > >>>> +++ b/include/linux/vfio.h
> > > >>>> @@ -91,6 +91,12 @@ extern void vfio_group_put_external_user(struct vfio_group *group);
> > > >>>>  extern int vfio_external_user_iommu_id(struct vfio_group *group);
> > > >>>>  extern long vfio_external_check_extension(struct vfio_group *group,
> > > >>>>  					  unsigned long arg);
> > > >>>> +extern struct vfio_container *vfio_container_get_ext(struct file *filep);
> > > >>>> +extern void vfio_container_put_ext(struct vfio_container *container);
> > > >>>> +extern void *vfio_container_get_iommu_data_ext(
> > > >>>> +		struct vfio_container *container);
> > > >>>> +extern struct iommu_table *vfio_container_spapr_tce_table_get_ext(
> > > >>>> +		void *iommu_data, u64 offset);
> > > >>>>  
> > > >>>>  /*
> > > >>>>   * Sub-module helpers      
> > > >>>
> > > >>>
> > > >>> I think you need to take a closer look of the lifecycle of a container,
> > > >>> having a reference means the container itself won't go away, but only
> > > >>> having a group set within that container holds the actual IOMMU
> > > >>> references.  container->iommu_data is going to be NULL once the
> > > >>> groups are lost.  Thanks,      
> > > >>
> > > >>
> > > >> Container owns the iommu tables and this is what I care about here, groups
> > > >> attached or not - this is handled separately via IOMMU group list in a
> > > >> specific iommu_table struct, these groups get detached from iommu_table
> > > >> when they are removed from a container.    
> > > > 
> > > > The container doesn't own anything, the container is privileged by the
> > > > groups being attached to it.  When groups are closed, they detach from
> > > > the container and once the container group list is empty the iommu
> > > > backend is released and iommu_data is NULL.  A container reference
> > > > doesn't give you what you're looking for.  It implies nothing about the
> > > > iommu backend.    
> > > 
> > > 
> > > Well. Backend is a part of a container and since a backend owns tables, a
> > > container owns them too.  
> > 
> > The IOMMU backend is accessed through the container, but that backend
> > is privileged by the groups it contains.  Once those groups are gone,
> > the IOMMU backend is released, regardless of whatever reference you
> > have to the container itself such as you're attempting to do here.  In
> > that sense, the container does not own those tables.  
> 
> So, the thing is that what KVM fundamentally needs is a handle on the
> container.  KVM is essentially modelling the DMA address space of a
> single guest bus, and the container is what's attached to that.
> 
> The first part of the problem is that KVM wants to basically invoke
> vfio_dma_map() operations without bouncing via qemu.  Because
> vfio_dma_map() works on the container level, that's the handle that
> KVM needs to hold.
> 
> The second part of the problem is that in order to reduce overhead
> further, we want to operate in real mode, which means bypassing most
> of the usual VFIO structure and going directly(ish) from the KVM
> hcall emulation to the IOMMU backend behind VFIO.  This complicates
> matters a fair bit.  Because it is, explicitly, a performance hack,
> some degree of ugliness is probably inevitable.
> 
> Alexey - actually implementing this in two stages might make this
> clearer.  The first stage wouldn't allow real mode, and would call
> through the same vfio_dma_map() path as qemu calls through now.  The
> second stage would then put in place the necessary hacks to add real
> mode support.
> 
> > > The problem I am trying to solve here is when KVM may release the
> > > iommu_table objects.
> > > 
> > > "Set" ioctl() to KVM-spapr-tce-table (or KVM itself, does not really
> > > matter) makes a link between KVM-spapr-tce-table and container and KVM can
> > > start using tables (with referencing them).
> > > 
> > > First I tried adding an "unset" ioctl to KVM-spapr-tce-table, called it
> > > from region_del() and this works if QEMU removes a window. However if QEMU
> > > removes a vfio-pci device, region_del() is not called and KVM does not get
> > > notified that it can release the iommu_table's because the
> > > KVM-spapr-tce-table remains alive and does not get destroyed (as it is
> > > still used by emulated devices or other containers).
> > > 
> > > So it was suggested that we could do such "unset" somehow later assuming,
> > > for example, on every "set" I could check if some of currently attached
> > > containers are no more used - and this is where being able to know if there
> > > is no backend helps - KVM remembers a container pointer and can check this
> > > via vfio_container_get_iommu_data_ext().
> > > 
> > > The other option would be changing vfio_container_get_ext() to take a
> > > callback+opaque which container would call when it destroys iommu_data.
> > > This looks more intrusive and not very intuitive how to make it right -
> > > container would have to keep track of all registered external users and
> > > vfio_container_put_ext() would have to pass the same callback+opaque to
> > > unregister the exact external user.  
> > 
> > I'm not in favor of anything resembling the code above or extensions
> > beyond it, the container is the wrong place to do this.
> >   
> > > Or I could store container file* in KVM. Then iommu_data would never be
> > > released until KVM-spapr-tce-table is destroyed.  
> > 
> > See above, holding a file pointer to the container doesn't do squat.
> > The groups that are held by the container empower the IOMMU backend,
> > references to the container itself don't matter.  Those references will
> > not maintain the IOMMU data.
> >    
> > > Recreating KVM-spapr-tce-table on every vfio-pci hotunplug (closing its fd
> > > would "unset" container from KVM-spapr-tce-table) is not an option as there
> > > still may be devices using this KVM-spapr-tce-table.
> > > 
> > > What obvious and nice solution am I missing here? Thanks.  
> > 
> > The interactions with the IOMMU backend that seem relevant are
> > vfio_iommu_drivers_ops.{detach_group,release}.  The kvm-vfio pseudo
> > device is also used to tell kvm about groups as they come and go and
> > has a way to check extensions, and thus properties of the IOMMU
> > backend.  All of these are available for your {ab}use.  Thanks,  
> 
> So, Alexey started trying to do this via the KVM-VFIO device, but it's
> a really bad fit.  As noted above, fundamentally it's a container we
> need to attach to the kvm-spapr-tce-table object, since what that
> represents is a guest bus DMA address space, and by definition all the
> groups in a container must have the same DMA address space.

That's all fine and good, but the point remains that a reference to the
container is no assurance of the iommu state.  The iommu state is
maintained by the user and the groups attached to the container.  If
the groups are removed, your container reference no longer has any iommu
backing and iommu_data is worthless.  The user can do this as well by
un-setting the iommu.  I understand what you're trying to do, it's just
wrong.  Thanks,

Alex
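The lifecycle being described can be modelled in a few lines of user-space
C (illustrative names; the real objects are struct vfio_container and its
iommu_data, privileged by attached groups):

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: the container's iommu_data exists only while groups are
 * attached.  An external reference pins the container object itself,
 * not the backend behind it. */

struct toy_container {
	int refs;		/* models vfio_container_get()/put() */
	int ngroups;		/* groups currently attached */
	void *iommu_data;	/* backend state, NULL when unbacked */
};

static int backend_state = 42;	/* stand-in for the spapr-tce backend */

static void toy_group_attach(struct toy_container *c)
{
	if (c->ngroups++ == 0)
		c->iommu_data = &backend_state; /* first group enables backend */
}

static void toy_group_detach(struct toy_container *c)
{
	if (--c->ngroups == 0)
		c->iommu_data = NULL;	/* backend released with last group */
}
```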


* Re: [PATCH kernel 14/15] vfio/spapr_tce: Export container API for external users
  2016-08-09 12:16       ` Alex Williamson
  2016-08-10  5:37         ` Alexey Kardashevskiy
@ 2016-08-15  3:59         ` Paul Mackerras
  2016-08-15 15:32           ` Alex Williamson
  1 sibling, 1 reply; 60+ messages in thread
From: Paul Mackerras @ 2016-08-15  3:59 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Alexey Kardashevskiy, linuxppc-dev, David Gibson

On Tue, Aug 09, 2016 at 06:16:30AM -0600, Alex Williamson wrote:
> On Tue, 9 Aug 2016 15:19:39 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
> > On 09/08/16 02:43, Alex Williamson wrote:
> > > 
> > > I think you need to take a closer look of the lifecycle of a container,
> > > having a reference means the container itself won't go away, but only
> > > having a group set within that container holds the actual IOMMU
> > > references.  container->iommu_data is going to be NULL once the
> > > groups are lost.  Thanks,  
> > 
> > 
> > Container owns the iommu tables and this is what I care about here, groups
> > attached or not - this is handled separately via IOMMU group list in a
> > specific iommu_table struct, these groups get detached from iommu_table
> > when they are removed from a container.
> 
> The container doesn't own anything, the container is privileged by the
> groups being attached to it.  When groups are closed, they detach from
> the container and once the container group list is empty the iommu
> backend is released and iommu_data is NULL.  A container reference
> doesn't give you what you're looking for.  It implies nothing about the
> iommu backend.

Alex, I'd like to understand more what the objection is here - is it
just about the object lifetimes, or is it a more fundamental objection
to the style of interface?

Regarding lifetimes, my understanding was that Alexey's previous
patches added refcounting to the iommu tables, so that KVM could get a
reference to the iommu tables through the container and then safely
use the iommu tables directly.  There may still be a potential race in
the interval between asking the container about its iommu tables and
incrementing the tables' reference counts, but that should be solvable.
I don't see any unsolvable problem regarding lifetimes.
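One way to close that window, sketched as a toy model (illustrative names,
not kernel code), is to make the lookup and the reference increment a
single step under the same lock that the release path takes:

```c
#include <pthread.h>
#include <assert.h>
#include <stddef.h>

/* Toy model of a race-free "find table and take a reference" helper:
 * the presence check and the refcount bump happen atomically with
 * respect to removal, because both sides hold the same mutex. */

struct toy_table {
	int refs;
	int present;	/* still listed in the container */
};

static pthread_mutex_t toy_lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns the table with an elevated reference, or NULL if it was
 * removed from the container before we could pin it. */
static struct toy_table *toy_table_get_locked(struct toy_table *tbl)
{
	struct toy_table *ret = NULL;

	pthread_mutex_lock(&toy_lock);
	if (tbl->present) {	/* lookup and get are one atomic step */
		tbl->refs++;
		ret = tbl;
	}
	pthread_mutex_unlock(&toy_lock);
	return ret;
}
```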

Or is your objection about any external access to the container?
As far as I know, when a group is not part of a container it has its
own iommu tables, but when it is put in a container it loses its own
iommu tables and instead uses a common pair of iommu tables (one for
the 32-bit window, one for the 64-bit window) that belong to the
container.  So we do in fact need the container's iommu tables not the
individual groups' tables.

Regards,
Paul.


* Re: [PATCH kernel 01/15] Revert "iommu: Add a function to find an iommu group by id"
  2016-08-03  8:40 ` [PATCH kernel 01/15] Revert "iommu: Add a function to find an iommu group by id" Alexey Kardashevskiy
@ 2016-08-15  4:58   ` Paul Mackerras
  0 siblings, 0 replies; 60+ messages in thread
From: Paul Mackerras @ 2016-08-15  4:58 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: linuxppc-dev, David Gibson, Alex Williamson

On Wed, Aug 03, 2016 at 06:40:42PM +1000, Alexey Kardashevskiy wrote:
> This reverts commit aa16bea929ae
> ("iommu: Add a function to find an iommu group by id")
> as the iommu_group_get_by_id() helper has never been used
> and it is unlikely it will in foreseeable future. Dead code
> is broken code.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[...]
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index a35fb8b..93c69fa 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -215,7 +215,6 @@ extern int bus_set_iommu(struct bus_type *bus, const struct iommu_ops *ops);
>  extern bool iommu_present(struct bus_type *bus);
>  extern bool iommu_capable(struct bus_type *bus, enum iommu_cap cap);
>  extern struct iommu_domain *iommu_domain_alloc(struct bus_type *bus);
> -extern struct iommu_group *iommu_group_get_by_id(int id);
>  extern void iommu_domain_free(struct iommu_domain *domain);
>  extern int iommu_attach_device(struct iommu_domain *domain,
>  			       struct device *dev);

There's another definition of iommu_group_get_by_id() further down in
iommu.h (static inline after the #else on CONFIG_IOMMU_API) which also
needs to be removed.

Paul.
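The shape of the leftover definition is roughly this (sketched from the
usual CONFIG_IOMMU_API stub pattern; the kernel's exact stub body may
differ) - the revert needs to drop both the extern prototype and the
static inline fallback:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative reconstruction of the two definitions in
 * include/linux/iommu.h that the revert must remove: the real
 * prototype under CONFIG_IOMMU_API, and the no-op stub after #else. */

struct iommu_group;

#ifdef CONFIG_IOMMU_API
extern struct iommu_group *iommu_group_get_by_id(int id);
#else
static inline struct iommu_group *iommu_group_get_by_id(int id)
{
	return NULL;	/* no IOMMU API configured: stub resolves to nothing */
}
#endif
```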


* Re: [PATCH kernel 05/15] powerpc/iommu: Stop using @current in mm_iommu_xxx
  2016-08-12  4:56     ` Alexey Kardashevskiy
@ 2016-08-15 10:58       ` David Gibson
  0 siblings, 0 replies; 60+ messages in thread
From: David Gibson @ 2016-08-15 10:58 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: linuxppc-dev, Alex Williamson, Paul Mackerras


On Fri, Aug 12, 2016 at 02:56:59PM +1000, Alexey Kardashevskiy wrote:
> On 12/08/16 12:57, David Gibson wrote:
> > On Wed, Aug 03, 2016 at 06:40:46PM +1000, Alexey Kardashevskiy wrote:
> >> In some situations the userspace memory context may live longer than
> >> the userspace process itself so if we need to do proper memory context
> >> cleanup, we better cache @mm and use it later when the process is gone
> >> (@current or @current->mm are NULL).
> >>
> >> This changes mm_iommu_xxx API to receive mm_struct instead of using one
> >> from @current.
> >>
> >> This is needed by the following patch to do proper cleanup in time.
> >> This depends on "powerpc/powernv/ioda: Fix endianness when reading TCEs"
> >> to do proper cleanup via tce_iommu_clear() patch.
> >>
> >> To keep API consistent, this replaces mm_context_t with mm_struct;
> >> we stick to mm_struct as mm_iommu_adjust_locked_vm() helper needs
> >> access to &mm->mmap_sem.
> >>
> >> This should cause no behavioral change.
> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> ---
> >>  arch/powerpc/include/asm/mmu_context.h | 20 +++++++------
> >>  arch/powerpc/kernel/setup-common.c     |  2 +-
> >>  arch/powerpc/mm/mmu_context_book3s64.c |  4 +--
> >>  arch/powerpc/mm/mmu_context_iommu.c    | 54 ++++++++++++++--------------------
> >>  drivers/vfio/vfio_iommu_spapr_tce.c    | 41 ++++++++++++++++----------
> >>  5 files changed, 62 insertions(+), 59 deletions(-)
> >>
> >> diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
> >> index 9d2cd0c..b85cc7b 100644
> >> --- a/arch/powerpc/include/asm/mmu_context.h
> >> +++ b/arch/powerpc/include/asm/mmu_context.h
> >> @@ -18,16 +18,18 @@ extern void destroy_context(struct mm_struct *mm);
> >>  #ifdef CONFIG_SPAPR_TCE_IOMMU
> >>  struct mm_iommu_table_group_mem_t;
> >>  
> >> -extern bool mm_iommu_preregistered(void);
> >> -extern long mm_iommu_get(unsigned long ua, unsigned long entries,
> >> +extern bool mm_iommu_preregistered(struct mm_struct *mm);
> >> +extern long mm_iommu_get(struct mm_struct *mm,
> >> +		unsigned long ua, unsigned long entries,
> >>  		struct mm_iommu_table_group_mem_t **pmem);
> >> -extern long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem);
> >> -extern void mm_iommu_init(mm_context_t *ctx);
> >> -extern void mm_iommu_cleanup(mm_context_t *ctx);
> >> -extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
> >> -		unsigned long size);
> >> -extern struct mm_iommu_table_group_mem_t *mm_iommu_find(unsigned long ua,
> >> -		unsigned long entries);
> >> +extern long mm_iommu_put(struct mm_struct *mm,
> >> +		struct mm_iommu_table_group_mem_t *mem);
> >> +extern void mm_iommu_init(struct mm_struct *mm);
> >> +extern void mm_iommu_cleanup(struct mm_struct *mm);
> >> +extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
> >> +		unsigned long ua, unsigned long size);
> >> +extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
> >> +		unsigned long ua, unsigned long entries);
> >>  extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
> >>  		unsigned long ua, unsigned long *hpa);
> >>  extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem);
> >> diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
> >> index 714b4ba..e90b68a 100644
> >> --- a/arch/powerpc/kernel/setup-common.c
> >> +++ b/arch/powerpc/kernel/setup-common.c
> >> @@ -905,7 +905,7 @@ void __init setup_arch(char **cmdline_p)
> >>  	init_mm.context.pte_frag = NULL;
> >>  #endif
> >>  #ifdef CONFIG_SPAPR_TCE_IOMMU
> >> -	mm_iommu_init(&init_mm.context);
> >> +	mm_iommu_init(&init_mm);
> >>  #endif
> >>  	irqstack_early_init();
> >>  	exc_lvl_early_init();
> >> diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c
> >> index b114f8b..ad82735 100644
> >> --- a/arch/powerpc/mm/mmu_context_book3s64.c
> >> +++ b/arch/powerpc/mm/mmu_context_book3s64.c
> >> @@ -115,7 +115,7 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
> >>  	mm->context.pte_frag = NULL;
> >>  #endif
> >>  #ifdef CONFIG_SPAPR_TCE_IOMMU
> >> -	mm_iommu_init(&mm->context);
> >> +	mm_iommu_init(mm);
> >>  #endif
> >>  	return 0;
> >>  }
> >> @@ -160,7 +160,7 @@ static inline void destroy_pagetable_page(struct mm_struct *mm)
> >>  void destroy_context(struct mm_struct *mm)
> >>  {
> >>  #ifdef CONFIG_SPAPR_TCE_IOMMU
> >> -	mm_iommu_cleanup(&mm->context);
> >> +	mm_iommu_cleanup(mm);
> >>  #endif
> >>  
> >>  #ifdef CONFIG_PPC_ICSWX
> >> diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
> >> index da6a216..ee6685b 100644
> >> --- a/arch/powerpc/mm/mmu_context_iommu.c
> >> +++ b/arch/powerpc/mm/mmu_context_iommu.c
> >> @@ -53,7 +53,7 @@ static long mm_iommu_adjust_locked_vm(struct mm_struct *mm,
> >>  	}
> >>  
> >>  	pr_debug("[%d] RLIMIT_MEMLOCK HASH64 %c%ld %ld/%ld\n",
> >> -			current->pid,
> >> +			current ? current->pid : 0,
> >>  			incr ? '+' : '-',
> >>  			npages << PAGE_SHIFT,
> >>  			mm->locked_vm << PAGE_SHIFT,
> >> @@ -63,28 +63,22 @@ static long mm_iommu_adjust_locked_vm(struct mm_struct *mm,
> >>  	return ret;
> >>  }
> >>  
> >> -bool mm_iommu_preregistered(void)
> >> +bool mm_iommu_preregistered(struct mm_struct *mm)
> >>  {
> >> -	if (!current || !current->mm)
> >> -		return false;
> >> -
> >> -	return !list_empty(&current->mm->context.iommu_group_mem_list);
> >> +	return !list_empty(&mm->context.iommu_group_mem_list);
> >>  }
> >>  EXPORT_SYMBOL_GPL(mm_iommu_preregistered);
> >>  
> >> -long mm_iommu_get(unsigned long ua, unsigned long entries,
> >> +long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
> >>  		struct mm_iommu_table_group_mem_t **pmem)
> >>  {
> >>  	struct mm_iommu_table_group_mem_t *mem;
> >>  	long i, j, ret = 0, locked_entries = 0;
> >>  	struct page *page = NULL;
> >>  
> >> -	if (!current || !current->mm)
> >> -		return -ESRCH; /* process exited */
> >> -
> >>  	mutex_lock(&mem_list_mutex);
> >>  
> >> -	list_for_each_entry_rcu(mem, &current->mm->context.iommu_group_mem_list,
> >> +	list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list,
> >>  			next) {
> >>  		if ((mem->ua == ua) && (mem->entries == entries)) {
> >>  			++mem->used;
> >> @@ -102,7 +96,7 @@ long mm_iommu_get(unsigned long ua, unsigned long entries,
> >>  
> >>  	}
> >>  
> >> -	ret = mm_iommu_adjust_locked_vm(current->mm, entries, true);
> >> +	ret = mm_iommu_adjust_locked_vm(mm, entries, true);
> >>  	if (ret)
> >>  		goto unlock_exit;
> >>  
> >> @@ -142,11 +136,11 @@ long mm_iommu_get(unsigned long ua, unsigned long entries,
> >>  	mem->entries = entries;
> >>  	*pmem = mem;
> >>  
> >> -	list_add_rcu(&mem->next, &current->mm->context.iommu_group_mem_list);
> >> +	list_add_rcu(&mem->next, &mm->context.iommu_group_mem_list);
> >>  
> >>  unlock_exit:
> >>  	if (locked_entries && ret)
> >> -		mm_iommu_adjust_locked_vm(current->mm, locked_entries, false);
> >> +		mm_iommu_adjust_locked_vm(mm, locked_entries, false);
> >>  
> >>  	mutex_unlock(&mem_list_mutex);
> >>  
> >> @@ -191,16 +185,13 @@ static void mm_iommu_free(struct rcu_head *head)
> >>  static void mm_iommu_release(struct mm_iommu_table_group_mem_t *mem)
> >>  {
> >>  	list_del_rcu(&mem->next);
> >> -	mm_iommu_adjust_locked_vm(current->mm, mem->entries, false);
> > 
> > AFAICT, you've moved this call from _release() to _put().  Won't that cause a
> > behavioural change?
> 
> 
> mm_iommu_put() calls mm_iommu_adjust_locked_vm() right after
> mm_iommu_release() so no, it does not look so.

Ah, I guess not.  It seems a bit arbitrary in the context of the rest
of the changes, though.

> >>  	call_rcu(&mem->rcu, mm_iommu_free);
> >>  }
> >>  
> >> -long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem)
> >> +long mm_iommu_put(struct mm_struct *mm, struct mm_iommu_table_group_mem_t *mem)
> >>  {
> >>  	long ret = 0;
> >>  
> >> -	if (!current || !current->mm)
> >> -		return -ESRCH; /* process exited */
> >>  
> >>  	mutex_lock(&mem_list_mutex);
> >>  
> >> @@ -224,6 +215,8 @@ long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem)
> >>  	/* @mapped became 0 so now mappings are disabled, release the region */
> >>  	mm_iommu_release(mem);
> >>  
> >> +	mm_iommu_adjust_locked_vm(mm, mem->entries, false);
> >> +
> >>  unlock_exit:
> >>  	mutex_unlock(&mem_list_mutex);
> >>  
> >> @@ -231,14 +224,12 @@ unlock_exit:
> >>  }
> >>  EXPORT_SYMBOL_GPL(mm_iommu_put);
> >>  
> >> -struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
> >> -		unsigned long size)
> >> +struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
> >> +		unsigned long ua, unsigned long size)
> >>  {
> >>  	struct mm_iommu_table_group_mem_t *mem, *ret = NULL;
> >>  
> >> -	list_for_each_entry_rcu(mem,
> >> -			&current->mm->context.iommu_group_mem_list,
> >> -			next) {
> >> +	list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list, next) {
> >>  		if ((mem->ua <= ua) &&
> >>  				(ua + size <= mem->ua +
> >>  				 (mem->entries << PAGE_SHIFT))) {
> >> @@ -251,14 +242,12 @@ struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
> >>  }
> >>  EXPORT_SYMBOL_GPL(mm_iommu_lookup);
> >>  
> >> -struct mm_iommu_table_group_mem_t *mm_iommu_find(unsigned long ua,
> >> -		unsigned long entries)
> >> +struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
> >> +		unsigned long ua, unsigned long entries)
> >>  {
> >>  	struct mm_iommu_table_group_mem_t *mem, *ret = NULL;
> >>  
> >> -	list_for_each_entry_rcu(mem,
> >> -			&current->mm->context.iommu_group_mem_list,
> >> -			next) {
> >> +	list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list, next) {
> >>  		if ((mem->ua == ua) && (mem->entries == entries)) {
> >>  			ret = mem;
> >>  			break;
> >> @@ -300,16 +289,17 @@ void mm_iommu_mapped_dec(struct mm_iommu_table_group_mem_t *mem)
> >>  }
> >>  EXPORT_SYMBOL_GPL(mm_iommu_mapped_dec);
> >>  
> >> -void mm_iommu_init(mm_context_t *ctx)
> >> +void mm_iommu_init(struct mm_struct *mm)
> >>  {
> >> -	INIT_LIST_HEAD_RCU(&ctx->iommu_group_mem_list);
> >> +	INIT_LIST_HEAD_RCU(&mm->context.iommu_group_mem_list);
> >>  }
> >>  
> >> -void mm_iommu_cleanup(mm_context_t *ctx)
> >> +void mm_iommu_cleanup(struct mm_struct *mm)
> >>  {
> >>  	struct mm_iommu_table_group_mem_t *mem, *tmp;
> >>  
> >> -	list_for_each_entry_safe(mem, tmp, &ctx->iommu_group_mem_list, next) {
> >> +	list_for_each_entry_safe(mem, tmp, &mm->context.iommu_group_mem_list,
> >> +			next) {
> >>  		list_del_rcu(&mem->next);
> >>  		mm_iommu_do_free(mem);
> >>  	}
> >> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> >> index 80378dd..9752e77 100644
> >> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> >> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> >> @@ -98,6 +98,7 @@ struct tce_container {
> >>  	bool enabled;
> >>  	bool v2;
> >>  	unsigned long locked_pages;
> >> +	struct mm_struct *mm;
> >>  	struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
> >>  	struct list_head group_list;
> >>  };
> >> @@ -110,11 +111,11 @@ static long tce_iommu_unregister_pages(struct tce_container *container,
> >>  	if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK))
> >>  		return -EINVAL;
> >>  
> >> -	mem = mm_iommu_find(vaddr, size >> PAGE_SHIFT);
> >> +	mem = mm_iommu_find(container->mm, vaddr, size >> PAGE_SHIFT);
> >>  	if (!mem)
> >>  		return -ENOENT;
> >>  
> >> -	return mm_iommu_put(mem);
> >> +	return mm_iommu_put(container->mm, mem);
> >>  }
> >>  
> >>  static long tce_iommu_register_pages(struct tce_container *container,
> >> @@ -128,10 +129,17 @@ static long tce_iommu_register_pages(struct tce_container *container,
> >>  			((vaddr + size) < vaddr))
> >>  		return -EINVAL;
> >>  
> >> -	ret = mm_iommu_get(vaddr, entries, &mem);
> >> +	if (!container->mm) {
> >> +		if (!current->mm)
> >> +			return -ESRCH; /* process exited */
> > 
> > Can this ever happen?  Surely the ioctl() path shouldn't be called
> > after the process mm has been cleaned up?  i.e. should this be a
> > WARN_ON().
> 
> With SMP (one thread doing ioctl(), another exiting QEMU) I am not sure
> it is impossible, but it is quite hard to trigger this check.

I'm pretty sure the mm can't be cleaned up until all threads have
definitely stopped executing.

> >> +
> >> +		atomic_inc(&current->mm->mm_count);
> > 
> > What balances this atomic_inc()?  Is it the mmdrop() added to
> > tce_iommu_release()?
> 
> Yes. Surprisingly there is no mmget(), there is mmget_not_zero() but it is
> for mm->mm_users.

Ok.

> > 
> >> +		container->mm = current->mm;
> >> +	}
> > 
> > Surely you need an error (or else a BUG_ON()) if current->mm !=
> > container->mm != NULL.  I believe VFIO already assumes the container
> > is owned only by a single mm, but it looks like you should verify that here.
> 
> I am not sure I really want to enforce it, do I? Who knows what kind of a
> crazy person would create a container, pin pages and fork() that userspace
> tool which may not be QEMU but something custom using DPDK or something.
> 
> What harm can not having this BUG_ON() cause?

Hrm.  Well, if nothing else it lets one process lock (or unlock) pages
in an essentially unrelated process, which is pretty weird.  I don't
see any obvious way it will cause more serious problems.  But, as a
general rule it makes debugging easier if you check / enforce required
assumptions at the earliest possible point.

> > 
> >> +
> >> +	ret = mm_iommu_get(container->mm, vaddr, entries, &mem);
> >>  	if (ret)
> >>  		return ret;
> >> -
> >>  	container->enabled = true;
> >>  
> >>  	return 0;
> >> @@ -354,6 +362,8 @@ static void tce_iommu_release(void *iommu_data)
> >>  		tce_iommu_free_table(tbl);
> >>  	}
> >>  
> >> +	if (container->mm)
> >> +		mmdrop(container->mm);
> >>  	tce_iommu_disable(container);
> >>  	mutex_destroy(&container->lock);
> >>  
> >> @@ -369,13 +379,14 @@ static void tce_iommu_unuse_page(struct tce_container *container,
> >>  	put_page(page);
> >>  }
> >>  
> >> -static int tce_iommu_prereg_ua_to_hpa(unsigned long tce, unsigned long size,
> >> +static int tce_iommu_prereg_ua_to_hpa(struct tce_container *container,
> >> +		unsigned long tce, unsigned long size,
> >>  		unsigned long *phpa, struct mm_iommu_table_group_mem_t **pmem)
> >>  {
> >>  	long ret = 0;
> >>  	struct mm_iommu_table_group_mem_t *mem;
> >>  
> >> -	mem = mm_iommu_lookup(tce, size);
> >> +	mem = mm_iommu_lookup(container->mm, tce, size);
> >>  	if (!mem)
> >>  		return -EINVAL;
> >>  
> >> @@ -388,18 +399,18 @@ static int tce_iommu_prereg_ua_to_hpa(unsigned long tce, unsigned long size,
> >>  	return 0;
> >>  }
> >>  
> >> -static void tce_iommu_unuse_page_v2(struct iommu_table *tbl,
> >> -		unsigned long entry)
> >> +static void tce_iommu_unuse_page_v2(struct tce_container *container,
> >> +		struct iommu_table *tbl, unsigned long entry)
> >>  {
> >>  	struct mm_iommu_table_group_mem_t *mem = NULL;
> >>  	int ret;
> >>  	unsigned long hpa = 0;
> >>  	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> >>  
> >> -	if (!pua || !current || !current->mm)
> >> +	if (!pua)
> >>  		return;
> >>  
> >> -	ret = tce_iommu_prereg_ua_to_hpa(*pua, IOMMU_PAGE_SIZE(tbl),
> >> +	ret = tce_iommu_prereg_ua_to_hpa(container, *pua, IOMMU_PAGE_SIZE(tbl),
> >>  			&hpa, &mem);
> >>  	if (ret)
> >>  		pr_debug("%s: tce %lx at #%lx was not cached, ret=%d\n",
> >> @@ -429,7 +440,7 @@ static int tce_iommu_clear(struct tce_container *container,
> >>  			continue;
> >>  
> >>  		if (container->v2) {
> >> -			tce_iommu_unuse_page_v2(tbl, entry);
> >> +			tce_iommu_unuse_page_v2(container, tbl, entry);
> >>  			continue;
> >>  		}
> >>  
> >> @@ -514,8 +525,8 @@ static long tce_iommu_build_v2(struct tce_container *container,
> >>  		unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl,
> >>  				entry + i);
> >>  
> >> -		ret = tce_iommu_prereg_ua_to_hpa(tce, IOMMU_PAGE_SIZE(tbl),
> >> -				&hpa, &mem);
> >> +		ret = tce_iommu_prereg_ua_to_hpa(container,
> >> +				tce, IOMMU_PAGE_SIZE(tbl), &hpa, &mem);
> >>  		if (ret)
> >>  			break;
> >>  
> >> @@ -536,7 +547,7 @@ static long tce_iommu_build_v2(struct tce_container *container,
> >>  		ret = iommu_tce_xchg(tbl, entry + i, &hpa, &dirtmp);
> >>  		if (ret) {
> >>  			/* dirtmp cannot be DMA_NONE here */
> >> -			tce_iommu_unuse_page_v2(tbl, entry + i);
> >> +			tce_iommu_unuse_page_v2(container, tbl, entry + i);
> >>  			pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%ld\n",
> >>  					__func__, entry << tbl->it_page_shift,
> >>  					tce, ret);
> >> @@ -544,7 +555,7 @@ static long tce_iommu_build_v2(struct tce_container *container,
> >>  		}
> >>  
> >>  		if (dirtmp != DMA_NONE)
> >> -			tce_iommu_unuse_page_v2(tbl, entry + i);
> >> +			tce_iommu_unuse_page_v2(container, tbl, entry + i);
> >>  
> >>  		*pua = tce;
> >>  
> > 
> 
> 




-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH kernel 14/15] vfio/spapr_tce: Export container API for external users
  2016-08-12  6:12               ` Alexey Kardashevskiy
@ 2016-08-15 11:07                 ` David Gibson
  2016-08-17  8:31                   ` Alexey Kardashevskiy
  0 siblings, 1 reply; 60+ messages in thread
From: David Gibson @ 2016-08-15 11:07 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, linuxppc-dev, Paul Mackerras

[-- Attachment #1: Type: text/plain, Size: 13077 bytes --]

On Fri, Aug 12, 2016 at 04:12:17PM +1000, Alexey Kardashevskiy wrote:
> On 12/08/16 15:46, David Gibson wrote:
> > On Wed, Aug 10, 2016 at 10:46:30AM -0600, Alex Williamson wrote:
> >> On Wed, 10 Aug 2016 15:37:17 +1000
> >> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >>
> >>> On 09/08/16 22:16, Alex Williamson wrote:
> >>>> On Tue, 9 Aug 2016 15:19:39 +1000
> >>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >>>>   
> >>>>> On 09/08/16 02:43, Alex Williamson wrote:  
> >>>>>> On Wed,  3 Aug 2016 18:40:55 +1000
> >>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >>>>>>     
> >>>>>>> This exports helpers which are needed to keep a VFIO container in
> >>>>>>> memory while there are external users such as KVM.
> >>>>>>>
> >>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>>>>> ---
> >>>>>>>  drivers/vfio/vfio.c                 | 30 ++++++++++++++++++++++++++++++
> >>>>>>>  drivers/vfio/vfio_iommu_spapr_tce.c | 16 +++++++++++++++-
> >>>>>>>  include/linux/vfio.h                |  6 ++++++
> >>>>>>>  3 files changed, 51 insertions(+), 1 deletion(-)
> >>>>>>>
> >>>>>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> >>>>>>> index d1d70e0..baf6a9c 100644
> >>>>>>> --- a/drivers/vfio/vfio.c
> >>>>>>> +++ b/drivers/vfio/vfio.c
> >>>>>>> @@ -1729,6 +1729,36 @@ long vfio_external_check_extension(struct vfio_group *group, unsigned long arg)
> >>>>>>>  EXPORT_SYMBOL_GPL(vfio_external_check_extension);
> >>>>>>>  
> >>>>>>>  /**
> >>>>>>> + * External user API for containers, exported by symbols to be linked
> >>>>>>> + * dynamically.
> >>>>>>> + *
> >>>>>>> + */
> >>>>>>> +struct vfio_container *vfio_container_get_ext(struct file *filep)
> >>>>>>> +{
> >>>>>>> +	struct vfio_container *container = filep->private_data;
> >>>>>>> +
> >>>>>>> +	if (filep->f_op != &vfio_fops)
> >>>>>>> +		return ERR_PTR(-EINVAL);
> >>>>>>> +
> >>>>>>> +	vfio_container_get(container);
> >>>>>>> +
> >>>>>>> +	return container;
> >>>>>>> +}
> >>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_get_ext);
> >>>>>>> +
> >>>>>>> +void vfio_container_put_ext(struct vfio_container *container)
> >>>>>>> +{
> >>>>>>> +	vfio_container_put(container);
> >>>>>>> +}
> >>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_put_ext);
> >>>>>>> +
> >>>>>>> +void *vfio_container_get_iommu_data_ext(struct vfio_container *container)
> >>>>>>> +{
> >>>>>>> +	return container->iommu_data;
> >>>>>>> +}
> >>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_get_iommu_data_ext);
> >>>>>>> +
> >>>>>>> +/**
> >>>>>>>   * Sub-module support
> >>>>>>>   */
> >>>>>>>  /*
> >>>>>>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> >>>>>>> index 3594ad3..fceea3d 100644
> >>>>>>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> >>>>>>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> >>>>>>> @@ -1331,6 +1331,21 @@ const struct vfio_iommu_driver_ops tce_iommu_driver_ops = {
> >>>>>>>  	.detach_group	= tce_iommu_detach_group,
> >>>>>>>  };
> >>>>>>>  
> >>>>>>> +struct iommu_table *vfio_container_spapr_tce_table_get_ext(void *iommu_data,
> >>>>>>> +		u64 offset)
> >>>>>>> +{
> >>>>>>> +	struct tce_container *container = iommu_data;
> >>>>>>> +	struct iommu_table *tbl = NULL;
> >>>>>>> +
> >>>>>>> +	if (tce_iommu_find_table(container, offset, &tbl) < 0)
> >>>>>>> +		return NULL;
> >>>>>>> +
> >>>>>>> +	iommu_table_get(tbl);
> >>>>>>> +
> >>>>>>> +	return tbl;
> >>>>>>> +}
> >>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_spapr_tce_table_get_ext);
> >>>>>>> +
> >>>>>>>  static int __init tce_iommu_init(void)
> >>>>>>>  {
> >>>>>>>  	return vfio_register_iommu_driver(&tce_iommu_driver_ops);
> >>>>>>> @@ -1348,4 +1363,3 @@ MODULE_VERSION(DRIVER_VERSION);
> >>>>>>>  MODULE_LICENSE("GPL v2");
> >>>>>>>  MODULE_AUTHOR(DRIVER_AUTHOR);
> >>>>>>>  MODULE_DESCRIPTION(DRIVER_DESC);
> >>>>>>> -
> >>>>>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> >>>>>>> index 0ecae0b..1c2138a 100644
> >>>>>>> --- a/include/linux/vfio.h
> >>>>>>> +++ b/include/linux/vfio.h
> >>>>>>> @@ -91,6 +91,12 @@ extern void vfio_group_put_external_user(struct vfio_group *group);
> >>>>>>>  extern int vfio_external_user_iommu_id(struct vfio_group *group);
> >>>>>>>  extern long vfio_external_check_extension(struct vfio_group *group,
> >>>>>>>  					  unsigned long arg);
> >>>>>>> +extern struct vfio_container *vfio_container_get_ext(struct file *filep);
> >>>>>>> +extern void vfio_container_put_ext(struct vfio_container *container);
> >>>>>>> +extern void *vfio_container_get_iommu_data_ext(
> >>>>>>> +		struct vfio_container *container);
> >>>>>>> +extern struct iommu_table *vfio_container_spapr_tce_table_get_ext(
> >>>>>>> +		void *iommu_data, u64 offset);
> >>>>>>>  
> >>>>>>>  /*
> >>>>>>>   * Sub-module helpers    
> >>>>>>
> >>>>>>
> >>>>>> I think you need to take a closer look of the lifecycle of a container,
> >>>>>> having a reference means the container itself won't go away, but only
> >>>>>> having a group set within that container holds the actual IOMMU
> >>>>>> references.  container->iommu_data is going to be NULL once the
> >>>>>> groups are lost.  Thanks,    
> >>>>>
> >>>>>
> >>>>> Container owns the iommu tables and this is what I care about here, groups
> >>>>> attached or not - this is handled separately via IOMMU group list in a
> >>>>> specific iommu_table struct, these groups get detached from iommu_table
> >>>>> when they are removed from a container.  
> >>>>
> >>>> The container doesn't own anything, the container is privileged by the
> >>>> groups being attached to it.  When groups are closed, they detach from
> >>>> the container and once the container group list is empty the iommu
> >>>> backend is released and iommu_data is NULL.  A container reference
> >>>> doesn't give you what you're looking for.  It implies nothing about the
> >>>> iommu backend.  
> >>>
> >>>
> >>> Well. Backend is a part of a container and since a backend owns tables, a
> >>> container owns them too.
> >>
> >> The IOMMU backend is accessed through the container, but that backend
> >> is privileged by the groups it contains.  Once those groups are gone,
> >> the IOMMU backend is released, regardless of whatever reference you
> >> have to the container itself such as you're attempting to do here.  In
> >> that sense, the container does not own those tables.
> > 
> > So, the thing is that what KVM fundamentally needs is a handle on the
> > container.  KVM is essentially modelling the DMA address space of a
> > single guest bus, and the container is what's attached to that.
> > 
> > The first part of the problem is that KVM wants to basically invoke
> > vfio_dma_map() operations without bouncing via qemu.  Because
> > vfio_dma_map() works on the container level, that's the handle that
> > KVM needs to hold.
> 
> 
> Well, I do not need to hold the reference to the container all the time, I
> just need it to get to the IOMMU backend, get+reference an iommu_table from
> it, referencing here helps to make sure the backend is not going away
> before we reference iommu_table.

Yes, but I don't see a compelling reason *not* to hold the container
reference either - it seems like principle of least surprise would
suggest retaining the reference.

For example, I can imagine having a container reset call which threw
away the back end iommu table and created a new one.  It seems like
what you'd expect in this case is for the guest bus to remain bound to
the same container, not to the now stale iommu table.

> After that I only keep a reference to the container to know if/when I can
> release a particular iommu_table. This can be worked around by counting how
> many groups were attached to this particular KVM-spapr-tce-table and
> looking at the IOMMU group list attached to an iommu_table - if the list is
> empty, decrement the iommu_table reference counter and that's it, no extra
> references to a VFIO container.
> 
> Or I need an alternative way of getting iommu_table's, i.e. QEMU should
> somehow tell KVM that this LIOBN is this VFIO container fd (easy - can be
> done via region_add/region_del interface)

Um.. yes.. that's what I was expecting, I thought that was what you
were doing.

> or VFIO IOMMU group fd(s) (more
> tricky as this needs to be done from more places - vfio-pci hotplug/unplug,
> window add/remove).

More tricky and also wrong.  Again, having one group but not the whole
container bound to the guest LIOBN doesn't make any sense - by
definition, all the devices in the container should share the same DMA
address space.

> > The second part of the problem is that in order to reduce overhead
> > further, we want to operate in real mode, which means bypassing most
> > of the usual VFIO structure and going directly(ish) from the KVM
> > hcall emulation to the IOMMU backend behind VFIO.  This complicates
> > matters a fair bit.  Because it is, explicitly, a performance hack,
> > some degree of ugliness is probably inevitable.
> > 
> > Alexey - actually implementing this in two stages might make this
> > clearer.  The first stage wouldn't allow real mode, and would call
> > through the same vfio_dma_map() path as qemu calls through now.  The
> > second stage would then put in place the necessary hacks to add real
> > mode support.
> > 
> >>> The problem I am trying to solve here is when KVM may release the
> >>> iommu_table objects.
> >>>
> >>> "Set" ioctl() to KVM-spapr-tce-table (or KVM itself, does not really
> >>> matter) makes a link between KVM-spapr-tce-table and container and KVM can
> >>> start using tables (with referencing them).
> >>>
> >>> First I tried adding an "unset" ioctl to KVM-spapr-tce-table, called it
> >>> from region_del() and this works if QEMU removes a window. However if QEMU
> >>> removes a vfio-pci device, region_del() is not called and KVM does not get
> >>> notified that it can release the iommu_table's because the
> >>> KVM-spapr-tce-table remains alive and does not get destroyed (as it is
> >>> still used by emulated devices or other containers).
> >>>
> >>> So it was suggested that we could do such "unset" somehow later assuming,
> >>> for example, on every "set" I could check if some of currently attached
> >>> containers are no more used - and this is where being able to know if there
> >>> is no backend helps - KVM remembers a container pointer and can check this
> >>> via vfio_container_get_iommu_data_ext().
> >>>
> >>> The other option would be changing vfio_container_get_ext() to take a
> >>> callback+opaque which container would call when it destroys iommu_data.
> >>> This looks more intrusive and not very intuitive how to make it right -
> >>> container would have to keep track of all registered external users and
> >>> vfio_container_put_ext() would have to pass the same callback+opaque to
> >>> unregister the exact external user.
> >>
> >> I'm not in favor of anything resembling the code above or extensions
> >> beyond it, the container is the wrong place to do this.
> >>
> >>> Or I could store container file* in KVM. Then iommu_data would never be
> >>> released until KVM-spapr-tce-table is destroyed.
> >>
> >> See above, holding a file pointer to the container doesn't do squat.
> >> The groups that are held by the container empower the IOMMU backend,
> >> references to the container itself don't matter.  Those references will
> >> not maintain the IOMMU data.
> >>  
> >>> Recreating KVM-spapr-tce-table on every vfio-pci hotunplug (closing its fd
> >>> would "unset" container from KVM-spapr-tce-table) is not an option as there
> >>> still may be devices using this KVM-spapr-tce-table.
> >>>
> >>> What obvious and nice solution am I missing here? Thanks.
> >>
> >> The interactions with the IOMMU backend that seem relevant are
> >> vfio_iommu_drivers_ops.{detach_group,release}.  The kvm-vfio pseudo
> >> device is also used to tell kvm about groups as they come and go and
> >> has a way to check extensions, and thus properties of the IOMMU
> >> backend.  All of these are available for your {ab}use.  Thanks,
> > 
> > So, Alexey started trying to do this via the KVM-VFIO device, but it's
> > a really bad fit.  As noted above, fundamentally it's a container we
> > need to attach to the kvm-spapr-tce-table object, since what that
> > represents is a guest bus DMA address space, and by definition all the
> > groups in a container must have the same DMA address space.
> 
> 
> Well, in a bad case a LIOBN/kvm-spapr-tce-table has multiple containers
> attached so it is not 1:1...

I never said it was.  It's n:1, but it's *not* n:m.  You can have
multiple containers to a LIOBN, but never multiple LIOBNs to a
container.


-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH kernel 14/15] vfio/spapr_tce: Export container API for external users
  2016-08-15  3:59         ` Paul Mackerras
@ 2016-08-15 15:32           ` Alex Williamson
  0 siblings, 0 replies; 60+ messages in thread
From: Alex Williamson @ 2016-08-15 15:32 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Alexey Kardashevskiy, linuxppc-dev, David Gibson

On Mon, 15 Aug 2016 13:59:47 +1000
Paul Mackerras <paulus@ozlabs.org> wrote:

> On Tue, Aug 09, 2016 at 06:16:30AM -0600, Alex Williamson wrote:
> > On Tue, 9 Aug 2016 15:19:39 +1000
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >   
> > > On 09/08/16 02:43, Alex Williamson wrote:  
> > > > 
> > > > I think you need to take a closer look of the lifecycle of a container,
> > > > having a reference means the container itself won't go away, but only
> > > > having a group set within that container holds the actual IOMMU
> > > > references.  container->iommu_data is going to be NULL once the
> > > > groups are lost.  Thanks,    
> > > 
> > > 
> > > Container owns the iommu tables and this is what I care about here, groups
> > > attached or not - this is handled separately via IOMMU group list in a
> > > specific iommu_table struct, these groups get detached from iommu_table
> > > when they are removed from a container.  
> > 
> > The container doesn't own anything, the container is privileged by the
> > groups being attached to it.  When groups are closed, they detach from
> > the container and once the container group list is empty the iommu
> > backend is released and iommu_data is NULL.  A container reference
> > doesn't give you what you're looking for.  It implies nothing about the
> > iommu backend.  
> 
> Alex, I'd like to understand more what the objection is here - is it
> just about the object lifetimes, or is it a more fundamental objection
> to the style of interface?
> 
> Regarding lifetimes, my understanding was that Alexey's previous
> patches added refcounting to the iommu tables, so that KVM could get a
> reference to the iommu tables through the container and then safely
> use the iommu tables directly.  There may still be a potential race in
> the interval between asking the container about its iommu tables and
> incrementing the tables' reference counts, but that should be able to
> be solved.  I don't see any unsolvable problem regarding lifetimes.
> 
> Or is your objection about any external access to the container?
> As far as I know, when a group is not part of a container it has its
> own iommu tables, but when it is put in a container it loses its own
> iommu tables and instead uses a common pair of iommu tables (one for
> the 32-bit window, one for the 64-bit window) that belong to the
> container.  So we do in fact need the container's iommu tables not the
> individual groups' tables.

Hi Paul,

Have you looked at this?  The ends do not justify the means.  First off
we're trying to create an external user interface to get and put a
reference to a container for the purpose of getting a reference to
iommu data.  So you might expect that that reference actually maintains
that iommu data, right?  Wrong.  The container is just the gateway
through which we access the iommu, a reference to the container doesn't
actually include a reference to the iommu backing it.  The user can
unset and reconstitute a new iommu state any time they want to and it's
actually the groups that privilege the container to have an iommu state
at all, so removal of groups automatically de-privileges the container
and the reference to the state is lost.  So the reference we're
creating is meaningless for the intended context.

Furthermore, why are we trying to get this reference?  Alexey wants to
add an interface that allows an _external_ user to get the _opaque_,
_private_ data structure for the iommu backend.  Are bells and whistles
going off in your head yet?  Without any validation of what iommu
backend is running we pass that void* to a function that casts it as a
struct tce_container and starts iterating through it.  This is
horribly, horribly wrong.  Thanks,

Alex


* Re: [PATCH kernel 14/15] vfio/spapr_tce: Export container API for external users
  2016-08-12 15:22               ` Alex Williamson
@ 2016-08-17  3:17                 ` David Gibson
  2016-08-18  0:22                   ` Alexey Kardashevskiy
  0 siblings, 1 reply; 60+ messages in thread
From: David Gibson @ 2016-08-17  3:17 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Alexey Kardashevskiy, linuxppc-dev, Paul Mackerras

[-- Attachment #1: Type: text/plain, Size: 12344 bytes --]

On Fri, Aug 12, 2016 at 09:22:01AM -0600, Alex Williamson wrote:
> On Fri, 12 Aug 2016 15:46:01 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Wed, Aug 10, 2016 at 10:46:30AM -0600, Alex Williamson wrote:
> > > On Wed, 10 Aug 2016 15:37:17 +1000
> > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > >   
> > > > On 09/08/16 22:16, Alex Williamson wrote:  
> > > > > On Tue, 9 Aug 2016 15:19:39 +1000
> > > > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > > > >     
> > > > >> On 09/08/16 02:43, Alex Williamson wrote:    
> > > > >>> On Wed,  3 Aug 2016 18:40:55 +1000
> > > > >>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > > > >>>       
> > > > >>>> This exports helpers which are needed to keep a VFIO container in
> > > > >>>> memory while there are external users such as KVM.
> > > > >>>>
> > > > >>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > > > >>>> ---
> > > > >>>>  drivers/vfio/vfio.c                 | 30 ++++++++++++++++++++++++++++++
> > > > >>>>  drivers/vfio/vfio_iommu_spapr_tce.c | 16 +++++++++++++++-
> > > > >>>>  include/linux/vfio.h                |  6 ++++++
> > > > >>>>  3 files changed, 51 insertions(+), 1 deletion(-)
> > > > >>>>
> > > > >>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > > > >>>> index d1d70e0..baf6a9c 100644
> > > > >>>> --- a/drivers/vfio/vfio.c
> > > > >>>> +++ b/drivers/vfio/vfio.c
> > > > >>>> @@ -1729,6 +1729,36 @@ long vfio_external_check_extension(struct vfio_group *group, unsigned long arg)
> > > > >>>>  EXPORT_SYMBOL_GPL(vfio_external_check_extension);
> > > > >>>>  
> > > > >>>>  /**
> > > > >>>> + * External user API for containers, exported by symbols to be linked
> > > > >>>> + * dynamically.
> > > > >>>> + *
> > > > >>>> + */
> > > > >>>> +struct vfio_container *vfio_container_get_ext(struct file *filep)
> > > > >>>> +{
> > > > >>>> +	struct vfio_container *container = filep->private_data;
> > > > >>>> +
> > > > >>>> +	if (filep->f_op != &vfio_fops)
> > > > >>>> +		return ERR_PTR(-EINVAL);
> > > > >>>> +
> > > > >>>> +	vfio_container_get(container);
> > > > >>>> +
> > > > >>>> +	return container;
> > > > >>>> +}
> > > > >>>> +EXPORT_SYMBOL_GPL(vfio_container_get_ext);
> > > > >>>> +
> > > > >>>> +void vfio_container_put_ext(struct vfio_container *container)
> > > > >>>> +{
> > > > >>>> +	vfio_container_put(container);
> > > > >>>> +}
> > > > >>>> +EXPORT_SYMBOL_GPL(vfio_container_put_ext);
> > > > >>>> +
> > > > >>>> +void *vfio_container_get_iommu_data_ext(struct vfio_container *container)
> > > > >>>> +{
> > > > >>>> +	return container->iommu_data;
> > > > >>>> +}
> > > > >>>> +EXPORT_SYMBOL_GPL(vfio_container_get_iommu_data_ext);
> > > > >>>> +
> > > > >>>> +/**
> > > > >>>>   * Sub-module support
> > > > >>>>   */
> > > > >>>>  /*
> > > > >>>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> > > > >>>> index 3594ad3..fceea3d 100644
> > > > >>>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> > > > >>>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> > > > >>>> @@ -1331,6 +1331,21 @@ const struct vfio_iommu_driver_ops tce_iommu_driver_ops = {
> > > > >>>>  	.detach_group	= tce_iommu_detach_group,
> > > > >>>>  };
> > > > >>>>  
> > > > >>>> +struct iommu_table *vfio_container_spapr_tce_table_get_ext(void *iommu_data,
> > > > >>>> +		u64 offset)
> > > > >>>> +{
> > > > >>>> +	struct tce_container *container = iommu_data;
> > > > >>>> +	struct iommu_table *tbl = NULL;
> > > > >>>> +
> > > > >>>> +	if (tce_iommu_find_table(container, offset, &tbl) < 0)
> > > > >>>> +		return NULL;
> > > > >>>> +
> > > > >>>> +	iommu_table_get(tbl);
> > > > >>>> +
> > > > >>>> +	return tbl;
> > > > >>>> +}
> > > > >>>> +EXPORT_SYMBOL_GPL(vfio_container_spapr_tce_table_get_ext);
> > > > >>>> +
> > > > >>>>  static int __init tce_iommu_init(void)
> > > > >>>>  {
> > > > >>>>  	return vfio_register_iommu_driver(&tce_iommu_driver_ops);
> > > > >>>> @@ -1348,4 +1363,3 @@ MODULE_VERSION(DRIVER_VERSION);
> > > > >>>>  MODULE_LICENSE("GPL v2");
> > > > >>>>  MODULE_AUTHOR(DRIVER_AUTHOR);
> > > > >>>>  MODULE_DESCRIPTION(DRIVER_DESC);
> > > > >>>> -
> > > > >>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> > > > >>>> index 0ecae0b..1c2138a 100644
> > > > >>>> --- a/include/linux/vfio.h
> > > > >>>> +++ b/include/linux/vfio.h
> > > > >>>> @@ -91,6 +91,12 @@ extern void vfio_group_put_external_user(struct vfio_group *group);
> > > > >>>>  extern int vfio_external_user_iommu_id(struct vfio_group *group);
> > > > >>>>  extern long vfio_external_check_extension(struct vfio_group *group,
> > > > >>>>  					  unsigned long arg);
> > > > >>>> +extern struct vfio_container *vfio_container_get_ext(struct file *filep);
> > > > >>>> +extern void vfio_container_put_ext(struct vfio_container *container);
> > > > >>>> +extern void *vfio_container_get_iommu_data_ext(
> > > > >>>> +		struct vfio_container *container);
> > > > >>>> +extern struct iommu_table *vfio_container_spapr_tce_table_get_ext(
> > > > >>>> +		void *iommu_data, u64 offset);
> > > > >>>>  
> > > > >>>>  /*
> > > > >>>>   * Sub-module helpers      
> > > > >>>
> > > > >>>
> > > > >>> I think you need to take a closer look at the lifecycle of a container,
> > > > >>> having a reference means the container itself won't go away, but only
> > > > >>> having a group set within that container holds the actual IOMMU
> > > > >>> references.  container->iommu_data is going to be NULL once the
> > > > >>> groups are lost.  Thanks,      
> > > > >>
> > > > >>
> > > > >> Container owns the iommu tables and this is what I care about here, groups
> > > > >> attached or not - this is handled separately via IOMMU group list in a
> > > > >> specific iommu_table struct, these groups get detached from iommu_table
> > > > >> when they are removed from a container.    
> > > > > 
> > > > > The container doesn't own anything, the container is privileged by the
> > > > > groups being attached to it.  When groups are closed, they detach from
> > > > > the container and once the container group list is empty the iommu
> > > > > backend is released and iommu_data is NULL.  A container reference
> > > > > doesn't give you what you're looking for.  It implies nothing about the
> > > > > iommu backend.    
> > > > 
> > > > 
> > > > Well. Backend is a part of a container and since a backend owns tables, a
> > > > container owns them too.  
> > > 
> > > The IOMMU backend is accessed through the container, but that backend
> > > is privileged by the groups it contains.  Once those groups are gone,
> > > the IOMMU backend is released, regardless of whatever reference you
> > > have to the container itself such as you're attempting to do here.  In
> > > that sense, the container does not own those tables.  
> > 
> > So, the thing is that what KVM fundamentally needs is a handle on the
> > container.  KVM is essentially modelling the DMA address space of a
> > single guest bus, and the container is what's attached to that.
> > 
> > The first part of the problem is that KVM wants to basically invoke
> > vfio_dma_map() operations without bouncing via qemu.  Because
> > vfio_dma_map() works on the container level, that's the handle that
> > KVM needs to hold.
> > 
> > The second part of the problem is that in order to reduce overhead
> > further, we want to operate in real mode, which means bypassing most
> > of the usual VFIO structure and going directly(ish) from the KVM
> > hcall emulation to the IOMMU backend behind VFIO.  This complicates
> > matters a fair bit.  Because it is, explicitly, a performance hack,
> > some degree of ugliness is probably inevitable.
> > 
> > Alexey - actually implementing this in two stages might make this
> > clearer.  The first stage wouldn't allow real mode, and would call
> > through the same vfio_dma_map() path as qemu calls through now.  The
> > second stage would then put in place the necessary hacks to add real
> > mode support.
> > 
> > > > The problem I am trying to solve here is when KVM may release the
> > > > iommu_table objects.
> > > > 
> > > > "Set" ioctl() to KVM-spapr-tce-table (or KVM itself, does not really
> > > > matter) makes a link between KVM-spapr-tce-table and container and KVM can
> > > > start using tables (with referencing them).
> > > > 
> > > > First I tried adding an "unset" ioctl to KVM-spapr-tce-table, called it
> > > > from region_del() and this works if QEMU removes a window. However if QEMU
> > > > removes a vfio-pci device, region_del() is not called and KVM does not get
> > > > notified that it can release the iommu_table's because the
> > > > KVM-spapr-tce-table remains alive and does not get destroyed (as it is
> > > > still used by emulated devices or other containers).
> > > > 
> > > > So it was suggested that we could do such "unset" somehow later assuming,
> > > > for example, on every "set" I could check if some of currently attached
> > > > containers are no longer used - and this is where being able to know if there
> > > > is no backend helps - KVM remembers a container pointer and can check this
> > > > via vfio_container_get_iommu_data_ext().
> > > > 
> > > > The other option would be changing vfio_container_get_ext() to take a
> > > > callback+opaque which container would call when it destroys iommu_data.
> > > > This looks more intrusive and not very intuitive how to make it right -
> > > > container would have to keep track of all registered external users and
> > > > vfio_container_put_ext() would have to pass the same callback+opaque to
> > > > unregister the exact external user.  
> > > 
> > > I'm not in favor of anything resembling the code above or extensions
> > > beyond it, the container is the wrong place to do this.
> > >   
> > > > Or I could store container file* in KVM. Then iommu_data would never be
> > > > released until KVM-spapr-tce-table is destroyed.  
> > > 
> > > See above, holding a file pointer to the container doesn't do squat.
> > > The groups that are held by the container empower the IOMMU backend,
> > > references to the container itself don't matter.  Those references will
> > > not maintain the IOMMU data.
> > >    
> > > > Recreating KVM-spapr-tce-table on every vfio-pci hotunplug (closing its fd
> > > > would "unset" container from KVM-spapr-tce-table) is not an option as there
> > > > still may be devices using this KVM-spapr-tce-table.
> > > > 
> > > > What obvious and nice solution am I missing here? Thanks.  
> > > 
> > > The interactions with the IOMMU backend that seem relevant are
> > > vfio_iommu_driver_ops.{detach_group,release}.  The kvm-vfio pseudo
> > > device is also used to tell kvm about groups as they come and go and
> > > has a way to check extensions, and thus properties of the IOMMU
> > > backend.  All of these are available for your {ab}use.  Thanks,  
> > 
> > So, Alexey started trying to do this via the KVM-VFIO device, but it's
> > a really bad fit.  As noted above, fundamentally it's a container we
> > need to attach to the kvm-spapr-tce-table object, since what that
> > represents is a guest bus DMA address space, and by definition all the
> > groups in a container must have the same DMA address space.
> 
> That's all fine and good, but the point remains that a reference to the
> container is no assurance of the iommu state.  The iommu state is
> maintained by the user and the groups attached to the container.  If
> the groups are removed, your container reference no longer has any iommu
> backing and iommu_data is worthless.  The user can do this as well by
> un-setting the iommu.  I understand what you're trying to do, it's just
> wrong.  Thanks,

I'm trying to figure out how to do this right, and it's not at all
obvious.  The container may be the wrong place, but that doesn't make
the KVM-VFIO device any more useful.  Attempting to do this at the group
level is at least as wrong, for the reasons I've mentioned elsewhere.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH kernel 14/15] vfio/spapr_tce: Export container API for external users
  2016-08-15 11:07                 ` David Gibson
@ 2016-08-17  8:31                   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 60+ messages in thread
From: Alexey Kardashevskiy @ 2016-08-17  8:31 UTC (permalink / raw)
  To: David Gibson; +Cc: Alex Williamson, linuxppc-dev, Paul Mackerras


On 15/08/16 21:07, David Gibson wrote:
> On Fri, Aug 12, 2016 at 04:12:17PM +1000, Alexey Kardashevskiy wrote:
>> On 12/08/16 15:46, David Gibson wrote:
>>> On Wed, Aug 10, 2016 at 10:46:30AM -0600, Alex Williamson wrote:
>>>> On Wed, 10 Aug 2016 15:37:17 +1000
>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>
>>>>> On 09/08/16 22:16, Alex Williamson wrote:
>>>>>> On Tue, 9 Aug 2016 15:19:39 +1000
>>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>>>   
>>>>>>> On 09/08/16 02:43, Alex Williamson wrote:  
>>>>>>>> On Wed,  3 Aug 2016 18:40:55 +1000
>>>>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>>>>>     
>>>>>>>>> This exports helpers which are needed to keep a VFIO container in
>>>>>>>>> memory while there are external users such as KVM.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>>> ---
>>>>>>>>>  drivers/vfio/vfio.c                 | 30 ++++++++++++++++++++++++++++++
>>>>>>>>>  drivers/vfio/vfio_iommu_spapr_tce.c | 16 +++++++++++++++-
>>>>>>>>>  include/linux/vfio.h                |  6 ++++++
>>>>>>>>>  3 files changed, 51 insertions(+), 1 deletion(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>>>>>>>>> index d1d70e0..baf6a9c 100644
>>>>>>>>> --- a/drivers/vfio/vfio.c
>>>>>>>>> +++ b/drivers/vfio/vfio.c
>>>>>>>>> @@ -1729,6 +1729,36 @@ long vfio_external_check_extension(struct vfio_group *group, unsigned long arg)
>>>>>>>>>  EXPORT_SYMBOL_GPL(vfio_external_check_extension);
>>>>>>>>>  
>>>>>>>>>  /**
>>>>>>>>> + * External user API for containers, exported by symbols to be linked
>>>>>>>>> + * dynamically.
>>>>>>>>> + *
>>>>>>>>> + */
>>>>>>>>> +struct vfio_container *vfio_container_get_ext(struct file *filep)
>>>>>>>>> +{
>>>>>>>>> +	struct vfio_container *container = filep->private_data;
>>>>>>>>> +
>>>>>>>>> +	if (filep->f_op != &vfio_fops)
>>>>>>>>> +		return ERR_PTR(-EINVAL);
>>>>>>>>> +
>>>>>>>>> +	vfio_container_get(container);
>>>>>>>>> +
>>>>>>>>> +	return container;
>>>>>>>>> +}
>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_get_ext);
>>>>>>>>> +
>>>>>>>>> +void vfio_container_put_ext(struct vfio_container *container)
>>>>>>>>> +{
>>>>>>>>> +	vfio_container_put(container);
>>>>>>>>> +}
>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_put_ext);
>>>>>>>>> +
>>>>>>>>> +void *vfio_container_get_iommu_data_ext(struct vfio_container *container)
>>>>>>>>> +{
>>>>>>>>> +	return container->iommu_data;
>>>>>>>>> +}
>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_get_iommu_data_ext);
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>>   * Sub-module support
>>>>>>>>>   */
>>>>>>>>>  /*
>>>>>>>>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
>>>>>>>>> index 3594ad3..fceea3d 100644
>>>>>>>>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
>>>>>>>>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
>>>>>>>>> @@ -1331,6 +1331,21 @@ const struct vfio_iommu_driver_ops tce_iommu_driver_ops = {
>>>>>>>>>  	.detach_group	= tce_iommu_detach_group,
>>>>>>>>>  };
>>>>>>>>>  
>>>>>>>>> +struct iommu_table *vfio_container_spapr_tce_table_get_ext(void *iommu_data,
>>>>>>>>> +		u64 offset)
>>>>>>>>> +{
>>>>>>>>> +	struct tce_container *container = iommu_data;
>>>>>>>>> +	struct iommu_table *tbl = NULL;
>>>>>>>>> +
>>>>>>>>> +	if (tce_iommu_find_table(container, offset, &tbl) < 0)
>>>>>>>>> +		return NULL;
>>>>>>>>> +
>>>>>>>>> +	iommu_table_get(tbl);
>>>>>>>>> +
>>>>>>>>> +	return tbl;
>>>>>>>>> +}
>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_spapr_tce_table_get_ext);
>>>>>>>>> +
>>>>>>>>>  static int __init tce_iommu_init(void)
>>>>>>>>>  {
>>>>>>>>>  	return vfio_register_iommu_driver(&tce_iommu_driver_ops);
>>>>>>>>> @@ -1348,4 +1363,3 @@ MODULE_VERSION(DRIVER_VERSION);
>>>>>>>>>  MODULE_LICENSE("GPL v2");
>>>>>>>>>  MODULE_AUTHOR(DRIVER_AUTHOR);
>>>>>>>>>  MODULE_DESCRIPTION(DRIVER_DESC);
>>>>>>>>> -
>>>>>>>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
>>>>>>>>> index 0ecae0b..1c2138a 100644
>>>>>>>>> --- a/include/linux/vfio.h
>>>>>>>>> +++ b/include/linux/vfio.h
>>>>>>>>> @@ -91,6 +91,12 @@ extern void vfio_group_put_external_user(struct vfio_group *group);
>>>>>>>>>  extern int vfio_external_user_iommu_id(struct vfio_group *group);
>>>>>>>>>  extern long vfio_external_check_extension(struct vfio_group *group,
>>>>>>>>>  					  unsigned long arg);
>>>>>>>>> +extern struct vfio_container *vfio_container_get_ext(struct file *filep);
>>>>>>>>> +extern void vfio_container_put_ext(struct vfio_container *container);
>>>>>>>>> +extern void *vfio_container_get_iommu_data_ext(
>>>>>>>>> +		struct vfio_container *container);
>>>>>>>>> +extern struct iommu_table *vfio_container_spapr_tce_table_get_ext(
>>>>>>>>> +		void *iommu_data, u64 offset);
>>>>>>>>>  
>>>>>>>>>  /*
>>>>>>>>>   * Sub-module helpers    
>>>>>>>>
>>>>>>>>
>>>>>>>> I think you need to take a closer look at the lifecycle of a container,
>>>>>>>> having a reference means the container itself won't go away, but only
>>>>>>>> having a group set within that container holds the actual IOMMU
>>>>>>>> references.  container->iommu_data is going to be NULL once the
>>>>>>>> groups are lost.  Thanks,    
>>>>>>>
>>>>>>>
>>>>>>> Container owns the iommu tables and this is what I care about here, groups
>>>>>>> attached or not - this is handled separately via IOMMU group list in a
>>>>>>> specific iommu_table struct, these groups get detached from iommu_table
>>>>>>> when they are removed from a container.  
>>>>>>
>>>>>> The container doesn't own anything, the container is privileged by the
>>>>>> groups being attached to it.  When groups are closed, they detach from
>>>>>> the container and once the container group list is empty the iommu
>>>>>> backend is released and iommu_data is NULL.  A container reference
>>>>>> doesn't give you what you're looking for.  It implies nothing about the
>>>>>> iommu backend.  
>>>>>
>>>>>
>>>>> Well. Backend is a part of a container and since a backend owns tables, a
>>>>> container owns them too.
>>>>
>>>> The IOMMU backend is accessed through the container, but that backend
>>>> is privileged by the groups it contains.  Once those groups are gone,
>>>> the IOMMU backend is released, regardless of whatever reference you
>>>> have to the container itself such as you're attempting to do here.  In
>>>> that sense, the container does not own those tables.
>>>
>>> So, the thing is that what KVM fundamentally needs is a handle on the
>>> container.  KVM is essentially modelling the DMA address space of a
>>> single guest bus, and the container is what's attached to that.
>>>
>>> The first part of the problem is that KVM wants to basically invoke
>>> vfio_dma_map() operations without bouncing via qemu.  Because
>>> vfio_dma_map() works on the container level, that's the handle that
>>> KVM needs to hold.
>>
>>
>> Well, I do not need to hold the reference to the container all the time, I
>> just need it to get to the IOMMU backend and get+reference an iommu_table
>> from it; referencing the container helps to make sure the backend is not
>> going away before we reference the iommu_table.
> 
> Yes, but I don't see a compelling reason *not* to hold the container
> reference either - it seems like the principle of least surprise would
> suggest retaining the reference.
>
> For example, I can imagine having a container reset call which threw
> away the back end iommu table and created a new one.  It seems like
> what you'd expect in this case is for the guest bus to remain bound to
> the same container, not to the now stale iommu table.
> 
>> After that I only keep a reference to the container to know if/when I can
>> release a particular iommu_table. This can be worked around by counting how
>> many groups were attached to this particular KVM-spapr-tce-table and
>> looking at the IOMMU group list attached to an iommu_table - if the list is
>> empty, decrement the iommu_table reference counter and that's it, no extra
>> references to a VFIO container.
>>
>> Or I need an alternative way of getting iommu_table's, i.e. QEMU should
>> somehow tell KVM that this LIOBN is this VFIO container fd (easy - can be
>> done via region_add/region_del interface)
> 
> Um.. yes.. that's what I was expecting, I thought that was what you
> were doing.
> 
>> or VFIO IOMMU group fd(s) (more
>> tricky as this needs to be done from more places - vfio-pci hotplug/unplug,
>> window add/remove).
> 
> More tricky and also wrong.  Again, having one group but not the whole
> container bound to the guest LIOBN doesn't make any sense - by
> definition, all the devices in the container should share the same DMA
> address space.
> 
>>> The second part of the problem is that in order to reduce overhead
>>> further, we want to operate in real mode, which means bypassing most
>>> of the usual VFIO structure and going directly(ish) from the KVM
>>> hcall emulation to the IOMMU backend behind VFIO.  This complicates
>>> matters a fair bit.  Because it is, explicitly, a performance hack,
>>> some degree of ugliness is probably inevitable.
>>>
>>> Alexey - actually implementing this in two stages might make this
>>> clearer.  The first stage wouldn't allow real mode, and would call
>>> through the same vfio_dma_map() path as qemu calls through now.  The
>>> second stage would then put in place the necessary hacks to add real
>>> mode support.
>>>
>>>>> The problem I am trying to solve here is when KVM may release the
>>>>> iommu_table objects.
>>>>>
>>>>> "Set" ioctl() to KVM-spapr-tce-table (or KVM itself, does not really
>>>>> matter) makes a link between KVM-spapr-tce-table and container and KVM can
>>>>> start using tables (with referencing them).
>>>>>
>>>>> First I tried adding an "unset" ioctl to KVM-spapr-tce-table, called it
>>>>> from region_del() and this works if QEMU removes a window. However if QEMU
>>>>> removes a vfio-pci device, region_del() is not called and KVM does not get
>>>>> notified that it can release the iommu_table's because the
>>>>> KVM-spapr-tce-table remains alive and does not get destroyed (as it is
>>>>> still used by emulated devices or other containers).
>>>>>
>>>>> So it was suggested that we could do such "unset" somehow later assuming,
>>>>> for example, on every "set" I could check if some of currently attached
>>>>> containers are no longer used - and this is where being able to know if there
>>>>> is no backend helps - KVM remembers a container pointer and can check this
>>>>> via vfio_container_get_iommu_data_ext().
>>>>>
>>>>> The other option would be changing vfio_container_get_ext() to take a
>>>>> callback+opaque which container would call when it destroys iommu_data.
>>>>> This looks more intrusive and not very intuitive how to make it right -
>>>>> container would have to keep track of all registered external users and
>>>>> vfio_container_put_ext() would have to pass the same callback+opaque to
>>>>> unregister the exact external user.
>>>>
>>>> I'm not in favor of anything resembling the code above or extensions
>>>> beyond it, the container is the wrong place to do this.
>>>>
>>>>> Or I could store container file* in KVM. Then iommu_data would never be
>>>>> released until KVM-spapr-tce-table is destroyed.
>>>>
>>>> See above, holding a file pointer to the container doesn't do squat.
>>>> The groups that are held by the container empower the IOMMU backend,
>>>> references to the container itself don't matter.  Those references will
>>>> not maintain the IOMMU data.
>>>>  
>>>>> Recreating KVM-spapr-tce-table on every vfio-pci hotunplug (closing its fd
>>>>> would "unset" container from KVM-spapr-tce-table) is not an option as there
>>>>> still may be devices using this KVM-spapr-tce-table.
>>>>>
>>>>> What obvious and nice solution am I missing here? Thanks.
>>>>
>>>> The interactions with the IOMMU backend that seem relevant are
>>>> vfio_iommu_driver_ops.{detach_group,release}.  The kvm-vfio pseudo
>>>> device is also used to tell kvm about groups as they come and go and
>>>> has a way to check extensions, and thus properties of the IOMMU
>>>> backend.  All of these are available for your {ab}use.  Thanks,
>>>
>>> So, Alexey started trying to do this via the KVM-VFIO device, but it's
>>> a really bad fit.  As noted above, fundamentally it's a container we
>>> need to attach to the kvm-spapr-tce-table object, since what that
>>> represents is a guest bus DMA address space, and by definition all the
>>> groups in a container must have the same DMA address space.
>>
>>
>> Well, in a bad case a LIOBN/kvm-spapr-tce-table has multiple containers
>> attached so it is not 1:1...
> 
> I never said it was.  It's n:1, but it's *not* n:m.  You can have
> multiple containers to a LIOBN, but never multiple LIOBNs ot a
> container.


Just to clarify things - there are actually two LIOBNs (one per window) per
container - a 32-bit one and a 64-bit one :)



-- 
Alexey



* Re: [PATCH kernel 14/15] vfio/spapr_tce: Export container API for external users
  2016-08-17  3:17                 ` David Gibson
@ 2016-08-18  0:22                   ` Alexey Kardashevskiy
  2016-08-29  6:35                     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 60+ messages in thread
From: Alexey Kardashevskiy @ 2016-08-18  0:22 UTC (permalink / raw)
  To: David Gibson, Alex Williamson; +Cc: linuxppc-dev, Paul Mackerras


On 17/08/16 13:17, David Gibson wrote:
> On Fri, Aug 12, 2016 at 09:22:01AM -0600, Alex Williamson wrote:
>> On Fri, 12 Aug 2016 15:46:01 +1000
>> David Gibson <david@gibson.dropbear.id.au> wrote:
>>
>>> On Wed, Aug 10, 2016 at 10:46:30AM -0600, Alex Williamson wrote:
>>>> On Wed, 10 Aug 2016 15:37:17 +1000
>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>   
>>>>> On 09/08/16 22:16, Alex Williamson wrote:  
>>>>>> On Tue, 9 Aug 2016 15:19:39 +1000
>>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>>>     
>>>>>>> On 09/08/16 02:43, Alex Williamson wrote:    
>>>>>>>> On Wed,  3 Aug 2016 18:40:55 +1000
>>>>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>>>>>       
>>>>>>>>> This exports helpers which are needed to keep a VFIO container in
>>>>>>>>> memory while there are external users such as KVM.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>>> ---
>>>>>>>>>  drivers/vfio/vfio.c                 | 30 ++++++++++++++++++++++++++++++
>>>>>>>>>  drivers/vfio/vfio_iommu_spapr_tce.c | 16 +++++++++++++++-
>>>>>>>>>  include/linux/vfio.h                |  6 ++++++
>>>>>>>>>  3 files changed, 51 insertions(+), 1 deletion(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>>>>>>>>> index d1d70e0..baf6a9c 100644
>>>>>>>>> --- a/drivers/vfio/vfio.c
>>>>>>>>> +++ b/drivers/vfio/vfio.c
>>>>>>>>> @@ -1729,6 +1729,36 @@ long vfio_external_check_extension(struct vfio_group *group, unsigned long arg)
>>>>>>>>>  EXPORT_SYMBOL_GPL(vfio_external_check_extension);
>>>>>>>>>  
>>>>>>>>>  /**
>>>>>>>>> + * External user API for containers, exported by symbols to be linked
>>>>>>>>> + * dynamically.
>>>>>>>>> + *
>>>>>>>>> + */
>>>>>>>>> +struct vfio_container *vfio_container_get_ext(struct file *filep)
>>>>>>>>> +{
>>>>>>>>> +	struct vfio_container *container = filep->private_data;
>>>>>>>>> +
>>>>>>>>> +	if (filep->f_op != &vfio_fops)
>>>>>>>>> +		return ERR_PTR(-EINVAL);
>>>>>>>>> +
>>>>>>>>> +	vfio_container_get(container);
>>>>>>>>> +
>>>>>>>>> +	return container;
>>>>>>>>> +}
>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_get_ext);
>>>>>>>>> +
>>>>>>>>> +void vfio_container_put_ext(struct vfio_container *container)
>>>>>>>>> +{
>>>>>>>>> +	vfio_container_put(container);
>>>>>>>>> +}
>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_put_ext);
>>>>>>>>> +
>>>>>>>>> +void *vfio_container_get_iommu_data_ext(struct vfio_container *container)
>>>>>>>>> +{
>>>>>>>>> +	return container->iommu_data;
>>>>>>>>> +}
>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_get_iommu_data_ext);
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>>   * Sub-module support
>>>>>>>>>   */
>>>>>>>>>  /*
>>>>>>>>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
>>>>>>>>> index 3594ad3..fceea3d 100644
>>>>>>>>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
>>>>>>>>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
>>>>>>>>> @@ -1331,6 +1331,21 @@ const struct vfio_iommu_driver_ops tce_iommu_driver_ops = {
>>>>>>>>>  	.detach_group	= tce_iommu_detach_group,
>>>>>>>>>  };
>>>>>>>>>  
>>>>>>>>> +struct iommu_table *vfio_container_spapr_tce_table_get_ext(void *iommu_data,
>>>>>>>>> +		u64 offset)
>>>>>>>>> +{
>>>>>>>>> +	struct tce_container *container = iommu_data;
>>>>>>>>> +	struct iommu_table *tbl = NULL;
>>>>>>>>> +
>>>>>>>>> +	if (tce_iommu_find_table(container, offset, &tbl) < 0)
>>>>>>>>> +		return NULL;
>>>>>>>>> +
>>>>>>>>> +	iommu_table_get(tbl);
>>>>>>>>> +
>>>>>>>>> +	return tbl;
>>>>>>>>> +}
>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_spapr_tce_table_get_ext);
>>>>>>>>> +
>>>>>>>>>  static int __init tce_iommu_init(void)
>>>>>>>>>  {
>>>>>>>>>  	return vfio_register_iommu_driver(&tce_iommu_driver_ops);
>>>>>>>>> @@ -1348,4 +1363,3 @@ MODULE_VERSION(DRIVER_VERSION);
>>>>>>>>>  MODULE_LICENSE("GPL v2");
>>>>>>>>>  MODULE_AUTHOR(DRIVER_AUTHOR);
>>>>>>>>>  MODULE_DESCRIPTION(DRIVER_DESC);
>>>>>>>>> -
>>>>>>>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
>>>>>>>>> index 0ecae0b..1c2138a 100644
>>>>>>>>> --- a/include/linux/vfio.h
>>>>>>>>> +++ b/include/linux/vfio.h
>>>>>>>>> @@ -91,6 +91,12 @@ extern void vfio_group_put_external_user(struct vfio_group *group);
>>>>>>>>>  extern int vfio_external_user_iommu_id(struct vfio_group *group);
>>>>>>>>>  extern long vfio_external_check_extension(struct vfio_group *group,
>>>>>>>>>  					  unsigned long arg);
>>>>>>>>> +extern struct vfio_container *vfio_container_get_ext(struct file *filep);
>>>>>>>>> +extern void vfio_container_put_ext(struct vfio_container *container);
>>>>>>>>> +extern void *vfio_container_get_iommu_data_ext(
>>>>>>>>> +		struct vfio_container *container);
>>>>>>>>> +extern struct iommu_table *vfio_container_spapr_tce_table_get_ext(
>>>>>>>>> +		void *iommu_data, u64 offset);
>>>>>>>>>  
>>>>>>>>>  /*
>>>>>>>>>   * Sub-module helpers      
>>>>>>>>
>>>>>>>>
>>>>>>>> I think you need to take a closer look at the lifecycle of a container,
>>>>>>>> having a reference means the container itself won't go away, but only
>>>>>>>> having a group set within that container holds the actual IOMMU
>>>>>>>> references.  container->iommu_data is going to be NULL once the
>>>>>>>> groups are lost.  Thanks,      
>>>>>>>
>>>>>>>
>>>>>>> Container owns the iommu tables and this is what I care about here, groups
>>>>>>> attached or not - this is handled separately via IOMMU group list in a
>>>>>>> specific iommu_table struct, these groups get detached from iommu_table
>>>>>>> when they are removed from a container.    
>>>>>>
>>>>>> The container doesn't own anything, the container is privileged by the
>>>>>> groups being attached to it.  When groups are closed, they detach from
>>>>>> the container and once the container group list is empty the iommu
>>>>>> backend is released and iommu_data is NULL.  A container reference
>>>>>> doesn't give you what you're looking for.  It implies nothing about the
>>>>>> iommu backend.    
>>>>>
>>>>>
>>>>> Well. Backend is a part of a container and since a backend owns tables, a
>>>>> container owns them too.  
>>>>
>>>> The IOMMU backend is accessed through the container, but that backend
>>>> is privileged by the groups it contains.  Once those groups are gone,
>>>> the IOMMU backend is released, regardless of whatever reference you
>>>> have to the container itself such as you're attempting to do here.  In
>>>> that sense, the container does not own those tables.  
>>>
>>> So, the thing is that what KVM fundamentally needs is a handle on the
>>> container.  KVM is essentially modelling the DMA address space of a
>>> single guest bus, and the container is what's attached to that.
>>>
>>> The first part of the problem is that KVM wants to basically invoke
>>> vfio_dma_map() operations without bouncing via qemu.  Because
>>> vfio_dma_map() works on the container level, that's the handle that
>>> KVM needs to hold.
>>>
>>> The second part of the problem is that in order to reduce overhead
>>> further, we want to operate in real mode, which means bypassing most
>>> of the usual VFIO structure and going directly(ish) from the KVM
>>> hcall emulation to the IOMMU backend behind VFIO.  This complicates
>>> matters a fair bit.  Because it is, explicitly, a performance hack,
>>> some degree of ugliness is probably inevitable.
>>>
>>> Alexey - actually implementing this in two stages might make this
>>> clearer.  The first stage wouldn't allow real mode, and would call
>>> through the same vfio_dma_map() path as qemu calls through now.  The
>>> second stage would then put in place the necessary hacks to add real
>>> mode support.
>>>
>>>>> The problem I am trying to solve here is when KVM may release the
>>>>> iommu_table objects.
>>>>>
>>>>> "Set" ioctl() to KVM-spapr-tce-table (or KVM itself, does not really
>>>>> matter) makes a link between KVM-spapr-tce-table and container and KVM can
>>>>> start using tables (with referencing them).
>>>>>
>>>>> First I tried adding an "unset" ioctl to KVM-spapr-tce-table, called it
>>>>> from region_del() and this works if QEMU removes a window. However if QEMU
>>>>> removes a vfio-pci device, region_del() is not called and KVM does not get
>>>>> notified that it can release the iommu_table's because the
>>>>> KVM-spapr-tce-table remains alive and does not get destroyed (as it is
>>>>> still used by emulated devices or other containers).
>>>>>
>>>>> So it was suggested that we could do such "unset" somehow later assuming,
>>>>> for example, on every "set" I could check if some of currently attached
>>>>> containers are no longer used - and this is where being able to know if there
>>>>> is no backend helps - KVM remembers a container pointer and can check this
>>>>> via vfio_container_get_iommu_data_ext().
>>>>>
>>>>> The other option would be changing vfio_container_get_ext() to take a
>>>>> callback+opaque which container would call when it destroys iommu_data.
>>>>> This looks more intrusive and not very intuitive how to make it right -
>>>>> container would have to keep track of all registered external users and
>>>>> vfio_container_put_ext() would have to pass the same callback+opaque to
>>>>> unregister the exact external user.  
>>>>
>>>> I'm not in favor of anything resembling the code above or extensions
>>>> beyond it, the container is the wrong place to do this.
>>>>   
>>>>> Or I could store container file* in KVM. Then iommu_data would never be
>>>>> released until KVM-spapr-tce-table is destroyed.  
>>>>
>>>> See above, holding a file pointer to the container doesn't do squat.
>>>> The groups that are held by the container empower the IOMMU backend,
>>>> references to the container itself don't matter.  Those references will
>>>> not maintain the IOMMU data.
>>>>    
>>>>> Recreating KVM-spapr-tce-table on every vfio-pci hotunplug (closing its fd
>>>>> would "unset" container from KVM-spapr-tce-table) is not an option as there
>>>>> still may be devices using this KVM-spapr-tce-table.
>>>>>
>>>>> What obvious and nice solution am I missing here? Thanks.  
>>>>
>>>> The interactions with the IOMMU backend that seem relevant are
>>>> vfio_iommu_drivers_ops.{detach_group,release}.  The kvm-vfio pseudo
>>>> device is also used to tell kvm about groups as they come and go and
>>>> has a way to check extensions, and thus properties of the IOMMU
>>>> backend.  All of these are available for your {ab}use.  Thanks,  
>>>
>>> So, Alexey started trying to do this via the KVM-VFIO device, but it's
>>> a really bad fit.  As noted above, fundamentally it's a container we
>>> need to attach to the kvm-spapr-tce-table object, since what that
>>> represents is a guest bus DMA address space, and by definition all the
>>> groups in a container must have the same DMA address space.
>>
>> That's all fine and good, but the point remains that a reference to the
>> container is no assurance of the iommu state.  The iommu state is
>> maintained by the user and the groups attached to the container.  If
>> the groups are removed, your container reference no longer has any iommu
>> backing and iommu_data is worthless.  The user can do this as well by
>> un-setting the iommu.  I understand what you're trying to do, it's just
>> wrong.  Thanks,
> 
> I'm trying to figure out how to do this right, and it's not at all
> obvious.  The container may be wrong, but that doesn't make the
> KVM-VFIO device any more useful.  Attempting to do this at the group
> level is at least as wrong for the reasons I've mentioned elsewhere.
> 

I could create a new fd, one per iommu_table; the fd would reference the
iommu_table only (not touching an iommu_table_group or a container). The VFIO
SPAPR TCE backend would return it from VFIO_IOMMU_SPAPR_TCE_CREATE (the ioctl
which creates windows), or I could add a VFIO_IOMMU_SPAPR_TCE_GET_FD_BY_OFFSET
ioctl; I'd then pass this new fd to KVM or to the KVM-spapr-tce-table to hook
them up. To release the reference, the KVM-spapr-tce-table would have an
"unset" ioctl(), and/or on every "set" I would check whether each attached
table still has at least one iommu_table_group attached and release any table
which has none.

This would make no change to generic VFIO code and very little change in the
SPAPR TCE backend. Would that be acceptable, or is it horrible again? Thanks.




-- 
Alexey




* Re: [PATCH kernel 14/15] vfio/spapr_tce: Export container API for external users
  2016-08-18  0:22                   ` Alexey Kardashevskiy
@ 2016-08-29  6:35                     ` Alexey Kardashevskiy
  2016-08-29 13:27                       ` David Gibson
  0 siblings, 1 reply; 60+ messages in thread
From: Alexey Kardashevskiy @ 2016-08-29  6:35 UTC (permalink / raw)
  To: David Gibson, Alex Williamson; +Cc: linuxppc-dev, Paul Mackerras



On 18/08/16 10:22, Alexey Kardashevskiy wrote:
> On 17/08/16 13:17, David Gibson wrote:
>> On Fri, Aug 12, 2016 at 09:22:01AM -0600, Alex Williamson wrote:
>>> On Fri, 12 Aug 2016 15:46:01 +1000
>>> David Gibson <david@gibson.dropbear.id.au> wrote:
>>>
>>>> On Wed, Aug 10, 2016 at 10:46:30AM -0600, Alex Williamson wrote:
>>>>> On Wed, 10 Aug 2016 15:37:17 +1000
>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>>   
>>>>>> On 09/08/16 22:16, Alex Williamson wrote:  
>>>>>>> On Tue, 9 Aug 2016 15:19:39 +1000
>>>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>>>>     
>>>>>>>> On 09/08/16 02:43, Alex Williamson wrote:    
>>>>>>>>> On Wed,  3 Aug 2016 18:40:55 +1000
>>>>>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>>>>>>       
>>>>>>>>>> This exports helpers which are needed to keep a VFIO container in
>>>>>>>>>> memory while there are external users such as KVM.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>>>> ---
>>>>>>>>>>  drivers/vfio/vfio.c                 | 30 ++++++++++++++++++++++++++++++
>>>>>>>>>>  drivers/vfio/vfio_iommu_spapr_tce.c | 16 +++++++++++++++-
>>>>>>>>>>  include/linux/vfio.h                |  6 ++++++
>>>>>>>>>>  3 files changed, 51 insertions(+), 1 deletion(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>>>>>>>>>> index d1d70e0..baf6a9c 100644
>>>>>>>>>> --- a/drivers/vfio/vfio.c
>>>>>>>>>> +++ b/drivers/vfio/vfio.c
>>>>>>>>>> @@ -1729,6 +1729,36 @@ long vfio_external_check_extension(struct vfio_group *group, unsigned long arg)
>>>>>>>>>>  EXPORT_SYMBOL_GPL(vfio_external_check_extension);
>>>>>>>>>>  
>>>>>>>>>>  /**
>>>>>>>>>> + * External user API for containers, exported by symbols to be linked
>>>>>>>>>> + * dynamically.
>>>>>>>>>> + *
>>>>>>>>>> + */
>>>>>>>>>> +struct vfio_container *vfio_container_get_ext(struct file *filep)
>>>>>>>>>> +{
>>>>>>>>>> +	struct vfio_container *container = filep->private_data;
>>>>>>>>>> +
>>>>>>>>>> +	if (filep->f_op != &vfio_fops)
>>>>>>>>>> +		return ERR_PTR(-EINVAL);
>>>>>>>>>> +
>>>>>>>>>> +	vfio_container_get(container);
>>>>>>>>>> +
>>>>>>>>>> +	return container;
>>>>>>>>>> +}
>>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_get_ext);
>>>>>>>>>> +
>>>>>>>>>> +void vfio_container_put_ext(struct vfio_container *container)
>>>>>>>>>> +{
>>>>>>>>>> +	vfio_container_put(container);
>>>>>>>>>> +}
>>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_put_ext);
>>>>>>>>>> +
>>>>>>>>>> +void *vfio_container_get_iommu_data_ext(struct vfio_container *container)
>>>>>>>>>> +{
>>>>>>>>>> +	return container->iommu_data;
>>>>>>>>>> +}
>>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_get_iommu_data_ext);
>>>>>>>>>> +
>>>>>>>>>> +/**
>>>>>>>>>>   * Sub-module support
>>>>>>>>>>   */
>>>>>>>>>>  /*
>>>>>>>>>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
>>>>>>>>>> index 3594ad3..fceea3d 100644
>>>>>>>>>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
>>>>>>>>>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
>>>>>>>>>> @@ -1331,6 +1331,21 @@ const struct vfio_iommu_driver_ops tce_iommu_driver_ops = {
>>>>>>>>>>  	.detach_group	= tce_iommu_detach_group,
>>>>>>>>>>  };
>>>>>>>>>>  
>>>>>>>>>> +struct iommu_table *vfio_container_spapr_tce_table_get_ext(void *iommu_data,
>>>>>>>>>> +		u64 offset)
>>>>>>>>>> +{
>>>>>>>>>> +	struct tce_container *container = iommu_data;
>>>>>>>>>> +	struct iommu_table *tbl = NULL;
>>>>>>>>>> +
>>>>>>>>>> +	if (tce_iommu_find_table(container, offset, &tbl) < 0)
>>>>>>>>>> +		return NULL;
>>>>>>>>>> +
>>>>>>>>>> +	iommu_table_get(tbl);
>>>>>>>>>> +
>>>>>>>>>> +	return tbl;
>>>>>>>>>> +}
>>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_spapr_tce_table_get_ext);
>>>>>>>>>> +
>>>>>>>>>>  static int __init tce_iommu_init(void)
>>>>>>>>>>  {
>>>>>>>>>>  	return vfio_register_iommu_driver(&tce_iommu_driver_ops);
>>>>>>>>>> @@ -1348,4 +1363,3 @@ MODULE_VERSION(DRIVER_VERSION);
>>>>>>>>>>  MODULE_LICENSE("GPL v2");
>>>>>>>>>>  MODULE_AUTHOR(DRIVER_AUTHOR);
>>>>>>>>>>  MODULE_DESCRIPTION(DRIVER_DESC);
>>>>>>>>>> -
>>>>>>>>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
>>>>>>>>>> index 0ecae0b..1c2138a 100644
>>>>>>>>>> --- a/include/linux/vfio.h
>>>>>>>>>> +++ b/include/linux/vfio.h
>>>>>>>>>> @@ -91,6 +91,12 @@ extern void vfio_group_put_external_user(struct vfio_group *group);
>>>>>>>>>>  extern int vfio_external_user_iommu_id(struct vfio_group *group);
>>>>>>>>>>  extern long vfio_external_check_extension(struct vfio_group *group,
>>>>>>>>>>  					  unsigned long arg);
>>>>>>>>>> +extern struct vfio_container *vfio_container_get_ext(struct file *filep);
>>>>>>>>>> +extern void vfio_container_put_ext(struct vfio_container *container);
>>>>>>>>>> +extern void *vfio_container_get_iommu_data_ext(
>>>>>>>>>> +		struct vfio_container *container);
>>>>>>>>>> +extern struct iommu_table *vfio_container_spapr_tce_table_get_ext(
>>>>>>>>>> +		void *iommu_data, u64 offset);
>>>>>>>>>>  
>>>>>>>>>>  /*
>>>>>>>>>>   * Sub-module helpers      
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I think you need to take a closer look of the lifecycle of a container,
>>>>>>>>> having a reference means the container itself won't go away, but only
>>>>>>>>> having a group set within that container holds the actual IOMMU
>>>>>>>>> references.  container->iommu_data is going to be NULL once the
>>>>>>>>> groups are lost.  Thanks,      
>>>>>>>>
>>>>>>>>
>>>>>>>> Container owns the iommu tables and this is what I care about here, groups
>>>>>>>> attached or not - this is handled separately via IOMMU group list in a
>>>>>>>> specific iommu_table struct, these groups get detached from iommu_table
>>>>>>>> when they are removed from a container.    
>>>>>>>
>>>>>>> The container doesn't own anything, the container is privileged by the
>>>>>>> groups being attached to it.  When groups are closed, they detach from
>>>>>>> the container and once the container group list is empty the iommu
>>>>>>> backend is released and iommu_data is NULL.  A container reference
>>>>>>> doesn't give you what you're looking for.  It implies nothing about the
>>>>>>> iommu backend.    
>>>>>>
>>>>>>
>>>>>> Well. Backend is a part of a container and since a backend owns tables, a
>>>>>> container owns them too.  
>>>>>
>>>>> The IOMMU backend is accessed through the container, but that backend
>>>>> is privileged by the groups it contains.  Once those groups are gone,
>>>>> the IOMMU backend is released, regardless of whatever reference you
>>>>> have to the container itself such as you're attempting to do here.  In
>>>>> that sense, the container does not own those tables.  
>>>>
>>>> So, the thing is that what KVM fundamentally needs is a handle on the
>>>> container.  KVM is essentially modelling the DMA address space of a
>>>> single guest bus, and the container is what's attached to that.
>>>>
>>>> The first part of the problem is that KVM wants to basically invoke
>>>> vfio_dma_map() operations without bouncing via qemu.  Because
>>>> vfio_dma_map() works on the container level, that's the handle that
>>>> KVM needs to hold.
>>>>
>>>> The second part of the problem is that in order to reduce overhead
>>>> further, we want to operate in real mode, which means bypassing most
>>>> of the usual VFIO structure and going directly(ish) from the KVM
>>>> hcall emulation to the IOMMU backend behind VFIO.  This complicates
>>>> matters a fair bit.  Because it is, explicitly, a performance hack,
>>>> some degree of ugliness is probably inevitable.
>>>>
>>>> Alexey - actually implementing this in two stages might make this
>>>> clearer.  The first stage wouldn't allow real mode, and would call
>>>> through the same vfio_dma_map() path as qemu calls through now.  The
>>>> second stage would then put in place the necessary hacks to add real
>>>> mode support.
>>>>
>>>>>> The problem I am trying to solve here is when KVM may release the
>>>>>> iommu_table objects.
>>>>>>
>>>>>> "Set" ioctl() to KVM-spapr-tce-table (or KVM itself, does not really
>>>>>> matter) makes a link between KVM-spapr-tce-table and container and KVM can
>>>>>> start using tables (with referencing them).
>>>>>>
>>>>>> First I tried adding an "unset" ioctl to KVM-spapr-tce-table, called it
>>>>>> from region_del() and this works if QEMU removes a window. However if QEMU
>>>>>> removes a vfio-pci device, region_del() is not called and KVM does not get
>>>>>> notified that it can release the iommu_table's because the
>>>>>> KVM-spapr-tce-table remains alive and does not get destroyed (as it is
>>>>>> still used by emulated devices or other containers).
>>>>>>
>>>>>> So it was suggested that we could do such "unset" somehow later assuming,
>>>>>> for example, on every "set" I could check if some of currently attached
>>>>>> containers are no longer used - and this is where being able to know if there
>>>>>> is no backend helps - KVM remembers a container pointer and can check this
>>>>>> via vfio_container_get_iommu_data_ext().
>>>>>>
>>>>>> The other option would be changing vfio_container_get_ext() to take a
>>>>>> callback+opaque which container would call when it destroys iommu_data.
>>>>>> This looks more intrusive and not very intuitive how to make it right -
>>>>>> container would have to keep track of all registered external users and
>>>>>> vfio_container_put_ext() would have to pass the same callback+opaque to
>>>>>> unregister the exact external user.  
>>>>>
>>>>> I'm not in favor of anything resembling the code above or extensions
>>>>> beyond it, the container is the wrong place to do this.
>>>>>   
>>>>>> Or I could store container file* in KVM. Then iommu_data would never be
>>>>>> released until KVM-spapr-tce-table is destroyed.  
>>>>>
>>>>> See above, holding a file pointer to the container doesn't do squat.
>>>>> The groups that are held by the container empower the IOMMU backend,
>>>>> references to the container itself don't matter.  Those references will
>>>>> not maintain the IOMMU data.
>>>>>    
>>>>>> Recreating KVM-spapr-tce-table on every vfio-pci hotunplug (closing its fd
>>>>>> would "unset" container from KVM-spapr-tce-table) is not an option as there
>>>>>> still may be devices using this KVM-spapr-tce-table.
>>>>>>
>>>>>> What obvious and nice solution am I missing here? Thanks.  
>>>>>
>>>>> The interactions with the IOMMU backend that seem relevant are
>>>>> vfio_iommu_drivers_ops.{detach_group,release}.  The kvm-vfio pseudo
>>>>> device is also used to tell kvm about groups as they come and go and
>>>>> has a way to check extensions, and thus properties of the IOMMU
>>>>> backend.  All of these are available for your {ab}use.  Thanks,  
>>>>
>>>> So, Alexey started trying to do this via the KVM-VFIO device, but it's
>>>> a really bad fit.  As noted above, fundamentally it's a container we
>>>> need to attach to the kvm-spapr-tce-table object, since what that
>>>> represents is a guest bus DMA address space, and by definition all the
>>>> groups in a container must have the same DMA address space.
>>>
>>> That's all fine and good, but the point remains that a reference to the
>>> container is no assurance of the iommu state.  The iommu state is
>>> maintained by the user and the groups attached to the container.  If
>>> the groups are removed, your container reference no longer has any iommu
>>> backing and iommu_data is worthless.  The user can do this as well by
>>> un-setting the iommu.  I understand what you're trying to do, it's just
>>> wrong.  Thanks,
>>
>> I'm trying to figure out how to do this right, and it's not at all
>> obvious.  The container may be wrong, but that doesn't make the
>> KVM-VFIO device any more useful.  Attempting to do this at the group
>> level is at least as wrong for the reasons I've mentioned elsewhere.
>>
> 
> I could create a new fd, one per iommu_table; the fd would reference the
> iommu_table only (not touching an iommu_table_group or a container). The VFIO
> SPAPR TCE backend would return it from VFIO_IOMMU_SPAPR_TCE_CREATE (the ioctl
> which creates windows), or I could add a VFIO_IOMMU_SPAPR_TCE_GET_FD_BY_OFFSET
> ioctl; I'd then pass this new fd to KVM or to the KVM-spapr-tce-table to hook
> them up. To release the reference, the KVM-spapr-tce-table would have an
> "unset" ioctl(), and/or on every "set" I would check whether each attached
> table still has at least one iommu_table_group attached and release any table
> which has none.
> 
> This would make no change to generic VFIO code and very little change in the
> SPAPR TCE backend. Would that be acceptable, or is it horrible again? Thanks.


Ping?



-- 
Alexey




* Re: [PATCH kernel 14/15] vfio/spapr_tce: Export container API for external users
  2016-08-29  6:35                     ` Alexey Kardashevskiy
@ 2016-08-29 13:27                       ` David Gibson
  2016-09-07  9:09                         ` Alexey Kardashevskiy
  0 siblings, 1 reply; 60+ messages in thread
From: David Gibson @ 2016-08-29 13:27 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, linuxppc-dev, Paul Mackerras


On Mon, Aug 29, 2016 at 04:35:15PM +1000, Alexey Kardashevskiy wrote:
> On 18/08/16 10:22, Alexey Kardashevskiy wrote:
> > On 17/08/16 13:17, David Gibson wrote:
> >> On Fri, Aug 12, 2016 at 09:22:01AM -0600, Alex Williamson wrote:
> >>> On Fri, 12 Aug 2016 15:46:01 +1000
> >>> David Gibson <david@gibson.dropbear.id.au> wrote:
> >>>
> >>>> On Wed, Aug 10, 2016 at 10:46:30AM -0600, Alex Williamson wrote:
> >>>>> On Wed, 10 Aug 2016 15:37:17 +1000
> >>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >>>>>   
> >>>>>> On 09/08/16 22:16, Alex Williamson wrote:  
> >>>>>>> On Tue, 9 Aug 2016 15:19:39 +1000
> >>>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >>>>>>>     
> >>>>>>>> On 09/08/16 02:43, Alex Williamson wrote:    
> >>>>>>>>> On Wed,  3 Aug 2016 18:40:55 +1000
> >>>>>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >>>>>>>>>       
> >>>>>>>>>> This exports helpers which are needed to keep a VFIO container in
> >>>>>>>>>> memory while there are external users such as KVM.
> >>>>>>>>>>
> >>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>>>>>>>> ---
> >>>>>>>>>>  drivers/vfio/vfio.c                 | 30 ++++++++++++++++++++++++++++++
> >>>>>>>>>>  drivers/vfio/vfio_iommu_spapr_tce.c | 16 +++++++++++++++-
> >>>>>>>>>>  include/linux/vfio.h                |  6 ++++++
> >>>>>>>>>>  3 files changed, 51 insertions(+), 1 deletion(-)
> >>>>>>>>>>
> >>>>>>>>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> >>>>>>>>>> index d1d70e0..baf6a9c 100644
> >>>>>>>>>> --- a/drivers/vfio/vfio.c
> >>>>>>>>>> +++ b/drivers/vfio/vfio.c
> >>>>>>>>>> @@ -1729,6 +1729,36 @@ long vfio_external_check_extension(struct vfio_group *group, unsigned long arg)
> >>>>>>>>>>  EXPORT_SYMBOL_GPL(vfio_external_check_extension);
> >>>>>>>>>>  
> >>>>>>>>>>  /**
> >>>>>>>>>> + * External user API for containers, exported by symbols to be linked
> >>>>>>>>>> + * dynamically.
> >>>>>>>>>> + *
> >>>>>>>>>> + */
> >>>>>>>>>> +struct vfio_container *vfio_container_get_ext(struct file *filep)
> >>>>>>>>>> +{
> >>>>>>>>>> +	struct vfio_container *container = filep->private_data;
> >>>>>>>>>> +
> >>>>>>>>>> +	if (filep->f_op != &vfio_fops)
> >>>>>>>>>> +		return ERR_PTR(-EINVAL);
> >>>>>>>>>> +
> >>>>>>>>>> +	vfio_container_get(container);
> >>>>>>>>>> +
> >>>>>>>>>> +	return container;
> >>>>>>>>>> +}
> >>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_get_ext);
> >>>>>>>>>> +
> >>>>>>>>>> +void vfio_container_put_ext(struct vfio_container *container)
> >>>>>>>>>> +{
> >>>>>>>>>> +	vfio_container_put(container);
> >>>>>>>>>> +}
> >>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_put_ext);
> >>>>>>>>>> +
> >>>>>>>>>> +void *vfio_container_get_iommu_data_ext(struct vfio_container *container)
> >>>>>>>>>> +{
> >>>>>>>>>> +	return container->iommu_data;
> >>>>>>>>>> +}
> >>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_get_iommu_data_ext);
> >>>>>>>>>> +
> >>>>>>>>>> +/**
> >>>>>>>>>>   * Sub-module support
> >>>>>>>>>>   */
> >>>>>>>>>>  /*
> >>>>>>>>>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> >>>>>>>>>> index 3594ad3..fceea3d 100644
> >>>>>>>>>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> >>>>>>>>>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> >>>>>>>>>> @@ -1331,6 +1331,21 @@ const struct vfio_iommu_driver_ops tce_iommu_driver_ops = {
> >>>>>>>>>>  	.detach_group	= tce_iommu_detach_group,
> >>>>>>>>>>  };
> >>>>>>>>>>  
> >>>>>>>>>> +struct iommu_table *vfio_container_spapr_tce_table_get_ext(void *iommu_data,
> >>>>>>>>>> +		u64 offset)
> >>>>>>>>>> +{
> >>>>>>>>>> +	struct tce_container *container = iommu_data;
> >>>>>>>>>> +	struct iommu_table *tbl = NULL;
> >>>>>>>>>> +
> >>>>>>>>>> +	if (tce_iommu_find_table(container, offset, &tbl) < 0)
> >>>>>>>>>> +		return NULL;
> >>>>>>>>>> +
> >>>>>>>>>> +	iommu_table_get(tbl);
> >>>>>>>>>> +
> >>>>>>>>>> +	return tbl;
> >>>>>>>>>> +}
> >>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_spapr_tce_table_get_ext);
> >>>>>>>>>> +
> >>>>>>>>>>  static int __init tce_iommu_init(void)
> >>>>>>>>>>  {
> >>>>>>>>>>  	return vfio_register_iommu_driver(&tce_iommu_driver_ops);
> >>>>>>>>>> @@ -1348,4 +1363,3 @@ MODULE_VERSION(DRIVER_VERSION);
> >>>>>>>>>>  MODULE_LICENSE("GPL v2");
> >>>>>>>>>>  MODULE_AUTHOR(DRIVER_AUTHOR);
> >>>>>>>>>>  MODULE_DESCRIPTION(DRIVER_DESC);
> >>>>>>>>>> -
> >>>>>>>>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> >>>>>>>>>> index 0ecae0b..1c2138a 100644
> >>>>>>>>>> --- a/include/linux/vfio.h
> >>>>>>>>>> +++ b/include/linux/vfio.h
> >>>>>>>>>> @@ -91,6 +91,12 @@ extern void vfio_group_put_external_user(struct vfio_group *group);
> >>>>>>>>>>  extern int vfio_external_user_iommu_id(struct vfio_group *group);
> >>>>>>>>>>  extern long vfio_external_check_extension(struct vfio_group *group,
> >>>>>>>>>>  					  unsigned long arg);
> >>>>>>>>>> +extern struct vfio_container *vfio_container_get_ext(struct file *filep);
> >>>>>>>>>> +extern void vfio_container_put_ext(struct vfio_container *container);
> >>>>>>>>>> +extern void *vfio_container_get_iommu_data_ext(
> >>>>>>>>>> +		struct vfio_container *container);
> >>>>>>>>>> +extern struct iommu_table *vfio_container_spapr_tce_table_get_ext(
> >>>>>>>>>> +		void *iommu_data, u64 offset);
> >>>>>>>>>>  
> >>>>>>>>>>  /*
> >>>>>>>>>>   * Sub-module helpers      
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> I think you need to take a closer look of the lifecycle of a container,
> >>>>>>>>> having a reference means the container itself won't go away, but only
> >>>>>>>>> having a group set within that container holds the actual IOMMU
> >>>>>>>>> references.  container->iommu_data is going to be NULL once the
> >>>>>>>>> groups are lost.  Thanks,      
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Container owns the iommu tables and this is what I care about here, groups
> >>>>>>>> attached or not - this is handled separately via IOMMU group list in a
> >>>>>>>> specific iommu_table struct, these groups get detached from iommu_table
> >>>>>>>> when they are removed from a container.    
> >>>>>>>
> >>>>>>> The container doesn't own anything, the container is privileged by the
> >>>>>>> groups being attached to it.  When groups are closed, they detach from
> >>>>>>> the container and once the container group list is empty the iommu
> >>>>>>> backend is released and iommu_data is NULL.  A container reference
> >>>>>>> doesn't give you what you're looking for.  It implies nothing about the
> >>>>>>> iommu backend.    
> >>>>>>
> >>>>>>
> >>>>>> Well. Backend is a part of a container and since a backend owns tables, a
> >>>>>> container owns them too.  
> >>>>>
> >>>>> The IOMMU backend is accessed through the container, but that backend
> >>>>> is privileged by the groups it contains.  Once those groups are gone,
> >>>>> the IOMMU backend is released, regardless of whatever reference you
> >>>>> have to the container itself such as you're attempting to do here.  In
> >>>>> that sense, the container does not own those tables.  
> >>>>
> >>>> So, the thing is that what KVM fundamentally needs is a handle on the
> >>>> container.  KVM is essentially modelling the DMA address space of a
> >>>> single guest bus, and the container is what's attached to that.
> >>>>
> >>>> The first part of the problem is that KVM wants to basically invoke
> >>>> vfio_dma_map() operations without bouncing via qemu.  Because
> >>>> vfio_dma_map() works on the container level, that's the handle that
> >>>> KVM needs to hold.
> >>>>
> >>>> The second part of the problem is that in order to reduce overhead
> >>>> further, we want to operate in real mode, which means bypassing most
> >>>> of the usual VFIO structure and going directly(ish) from the KVM
> >>>> hcall emulation to the IOMMU backend behind VFIO.  This complicates
> >>>> matters a fair bit.  Because it is, explicitly, a performance hack,
> >>>> some degree of ugliness is probably inevitable.
> >>>>
> >>>> Alexey - actually implementing this in two stages might make this
> >>>> clearer.  The first stage wouldn't allow real mode, and would call
> >>>> through the same vfio_dma_map() path as qemu calls through now.  The
> >>>> second stage would then put in place the necessary hacks to add real
> >>>> mode support.
> >>>>
> >>>>>> The problem I am trying to solve here is when KVM may release the
> >>>>>> iommu_table objects.
> >>>>>>
> >>>>>> "Set" ioctl() to KVM-spapr-tce-table (or KVM itself, does not really
> >>>>>> matter) makes a link between KVM-spapr-tce-table and container and KVM can
> >>>>>> start using tables (with referencing them).
> >>>>>>
> >>>>>> First I tried adding an "unset" ioctl to KVM-spapr-tce-table, called it
> >>>>>> from region_del() and this works if QEMU removes a window. However if QEMU
> >>>>>> removes a vfio-pci device, region_del() is not called and KVM does not get
> >>>>>> notified that it can release the iommu_table's because the
> >>>>>> KVM-spapr-tce-table remains alive and does not get destroyed (as it is
> >>>>>> still used by emulated devices or other containers).
> >>>>>>
> >>>>>> So it was suggested that we could do such "unset" somehow later assuming,
> >>>>>> for example, on every "set" I could check if some of currently attached
> >>>>>> containers are no longer used - and this is where being able to know if there
> >>>>>> is no backend helps - KVM remembers a container pointer and can check this
> >>>>>> via vfio_container_get_iommu_data_ext().
> >>>>>>
> >>>>>> The other option would be changing vfio_container_get_ext() to take a
> >>>>>> callback+opaque which container would call when it destroys iommu_data.
> >>>>>> This looks more intrusive and not very intuitive how to make it right -
> >>>>>> container would have to keep track of all registered external users and
> >>>>>> vfio_container_put_ext() would have to pass the same callback+opaque to
> >>>>>> unregister the exact external user.  
> >>>>>
> >>>>> I'm not in favor of anything resembling the code above or extensions
> >>>>> beyond it, the container is the wrong place to do this.
> >>>>>   
> >>>>>> Or I could store container file* in KVM. Then iommu_data would never be
> >>>>>> released until KVM-spapr-tce-table is destroyed.  
> >>>>>
> >>>>> See above, holding a file pointer to the container doesn't do squat.
> >>>>> The groups that are held by the container empower the IOMMU backend,
> >>>>> references to the container itself don't matter.  Those references will
> >>>>> not maintain the IOMMU data.
> >>>>>    
> >>>>>> Recreating KVM-spapr-tce-table on every vfio-pci hotunplug (closing its fd
> >>>>>> would "unset" container from KVM-spapr-tce-table) is not an option as there
> >>>>>> still may be devices using this KVM-spapr-tce-table.
> >>>>>>
> >>>>>> What obvious and nice solution am I missing here? Thanks.  
> >>>>>
> >>>>> The interactions with the IOMMU backend that seem relevant are
> >>>>> vfio_iommu_drivers_ops.{detach_group,release}.  The kvm-vfio pseudo
> >>>>> device is also used to tell kvm about groups as they come and go and
> >>>>> has a way to check extensions, and thus properties of the IOMMU
> >>>>> backend.  All of these are available for your {ab}use.  Thanks,  
> >>>>
> >>>> So, Alexey started trying to do this via the KVM-VFIO device, but it's
> >>>> a really bad fit.  As noted above, fundamentally it's a container we
> >>>> need to attach to the kvm-spapr-tce-table object, since what that
> >>>> represents is a guest bus DMA address space, and by definition all the
> >>>> groups in a container must have the same DMA address space.
> >>>
> >>> That's all fine and good, but the point remains that a reference to the
> >>> container is no assurance of the iommu state.  The iommu state is
> >>> maintained by the user and the groups attached to the container.  If
> >>> the groups are removed, your container reference no longer has any iommu
> >>> backing and iommu_data is worthless.  The user can do this as well by
> >>> un-setting the iommu.  I understand what you're trying to do, it's just
> >>> wrong.  Thanks,
> >>
> >> I'm trying to figure out how to do this right, and it's not at all
> >> obvious.  The container may be wrong, but that doesn't make the
> >> KVM-VFIO device any more useful.  Attempting to do this at the group
> >> level is at least as wrong for the reasons I've mentioned elsewhere.
> >>
> > 
> > I could create a new fd, one per iommu_table; the fd would reference the
> > iommu_table only (not touching an iommu_table_group or a container). The VFIO
> > SPAPR TCE backend would return it from VFIO_IOMMU_SPAPR_TCE_CREATE (the ioctl
> > which creates windows), or I could add a VFIO_IOMMU_SPAPR_TCE_GET_FD_BY_OFFSET
> > ioctl; I'd then pass this new fd to KVM or to the KVM-spapr-tce-table to hook
> > them up. To release the reference, the KVM-spapr-tce-table would have an
> > "unset" ioctl(), and/or on every "set" I would check whether each attached
> > table still has at least one iommu_table_group attached and release any table
> > which has none.
> > 
> > This would make no change to generic VFIO code and very little change in the
> > SPAPR TCE backend. Would that be acceptable, or is it horrible again? Thanks.
> 
> 
> Ping?

I'm still in Toronto after KVM Forum.  I had a detailed discussion
about this with Alex W, which I'll write up once I get back.

The short version is that Alex more-or-less convinced me that we do
need to go back to doing this with an interface based on linking
groups to LIOBNs.  That leads to an interface that's kind of weird and
has some fairly counter-intuitive properties, but in the end it works
out better than doing it with containers.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [PATCH kernel 14/15] vfio/spapr_tce: Export container API for external users
  2016-08-29 13:27                       ` David Gibson
@ 2016-09-07  9:09                         ` Alexey Kardashevskiy
  2016-09-21  6:56                           ` Alexey Kardashevskiy
  0 siblings, 1 reply; 60+ messages in thread
From: Alexey Kardashevskiy @ 2016-09-07  9:09 UTC (permalink / raw)
  To: David Gibson; +Cc: Alex Williamson, linuxppc-dev, Paul Mackerras



On 29/08/16 23:27, David Gibson wrote:
> On Mon, Aug 29, 2016 at 04:35:15PM +1000, Alexey Kardashevskiy wrote:
>> On 18/08/16 10:22, Alexey Kardashevskiy wrote:
>>> On 17/08/16 13:17, David Gibson wrote:
>>>> On Fri, Aug 12, 2016 at 09:22:01AM -0600, Alex Williamson wrote:
>>>>> On Fri, 12 Aug 2016 15:46:01 +1000
>>>>> David Gibson <david@gibson.dropbear.id.au> wrote:
>>>>>
>>>>>> On Wed, Aug 10, 2016 at 10:46:30AM -0600, Alex Williamson wrote:
>>>>>>> On Wed, 10 Aug 2016 15:37:17 +1000
>>>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>>>>   
>>>>>>>> On 09/08/16 22:16, Alex Williamson wrote:  
>>>>>>>>> On Tue, 9 Aug 2016 15:19:39 +1000
>>>>>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>>>>>>     
>>>>>>>>>> On 09/08/16 02:43, Alex Williamson wrote:    
>>>>>>>>>>> On Wed,  3 Aug 2016 18:40:55 +1000
>>>>>>>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>>>>>>>>       
>>>>>>>>>>>> This exports helpers which are needed to keep a VFIO container in
>>>>>>>>>>>> memory while there are external users such as KVM.
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>>>>>> ---
>>>>>>>>>>>>  drivers/vfio/vfio.c                 | 30 ++++++++++++++++++++++++++++++
>>>>>>>>>>>>  drivers/vfio/vfio_iommu_spapr_tce.c | 16 +++++++++++++++-
>>>>>>>>>>>>  include/linux/vfio.h                |  6 ++++++
>>>>>>>>>>>>  3 files changed, 51 insertions(+), 1 deletion(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>>>>>>>>>>>> index d1d70e0..baf6a9c 100644
>>>>>>>>>>>> --- a/drivers/vfio/vfio.c
>>>>>>>>>>>> +++ b/drivers/vfio/vfio.c
>>>>>>>>>>>> @@ -1729,6 +1729,36 @@ long vfio_external_check_extension(struct vfio_group *group, unsigned long arg)
>>>>>>>>>>>>  EXPORT_SYMBOL_GPL(vfio_external_check_extension);
>>>>>>>>>>>>  
>>>>>>>>>>>>  /**
>>>>>>>>>>>> + * External user API for containers, exported by symbols to be linked
>>>>>>>>>>>> + * dynamically.
>>>>>>>>>>>> + *
>>>>>>>>>>>> + */
>>>>>>>>>>>> +struct vfio_container *vfio_container_get_ext(struct file *filep)
>>>>>>>>>>>> +{
>>>>>>>>>>>> +	struct vfio_container *container = filep->private_data;
>>>>>>>>>>>> +
>>>>>>>>>>>> +	if (filep->f_op != &vfio_fops)
>>>>>>>>>>>> +		return ERR_PTR(-EINVAL);
>>>>>>>>>>>> +
>>>>>>>>>>>> +	vfio_container_get(container);
>>>>>>>>>>>> +
>>>>>>>>>>>> +	return container;
>>>>>>>>>>>> +}
>>>>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_get_ext);
>>>>>>>>>>>> +
>>>>>>>>>>>> +void vfio_container_put_ext(struct vfio_container *container)
>>>>>>>>>>>> +{
>>>>>>>>>>>> +	vfio_container_put(container);
>>>>>>>>>>>> +}
>>>>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_put_ext);
>>>>>>>>>>>> +
>>>>>>>>>>>> +void *vfio_container_get_iommu_data_ext(struct vfio_container *container)
>>>>>>>>>>>> +{
>>>>>>>>>>>> +	return container->iommu_data;
>>>>>>>>>>>> +}
>>>>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_get_iommu_data_ext);
>>>>>>>>>>>> +
>>>>>>>>>>>> +/**
>>>>>>>>>>>>   * Sub-module support
>>>>>>>>>>>>   */
>>>>>>>>>>>>  /*
>>>>>>>>>>>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
>>>>>>>>>>>> index 3594ad3..fceea3d 100644
>>>>>>>>>>>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
>>>>>>>>>>>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
>>>>>>>>>>>> @@ -1331,6 +1331,21 @@ const struct vfio_iommu_driver_ops tce_iommu_driver_ops = {
>>>>>>>>>>>>  	.detach_group	= tce_iommu_detach_group,
>>>>>>>>>>>>  };
>>>>>>>>>>>>  
>>>>>>>>>>>> +struct iommu_table *vfio_container_spapr_tce_table_get_ext(void *iommu_data,
>>>>>>>>>>>> +		u64 offset)
>>>>>>>>>>>> +{
>>>>>>>>>>>> +	struct tce_container *container = iommu_data;
>>>>>>>>>>>> +	struct iommu_table *tbl = NULL;
>>>>>>>>>>>> +
>>>>>>>>>>>> +	if (tce_iommu_find_table(container, offset, &tbl) < 0)
>>>>>>>>>>>> +		return NULL;
>>>>>>>>>>>> +
>>>>>>>>>>>> +	iommu_table_get(tbl);
>>>>>>>>>>>> +
>>>>>>>>>>>> +	return tbl;
>>>>>>>>>>>> +}
>>>>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_spapr_tce_table_get_ext);
>>>>>>>>>>>> +
>>>>>>>>>>>>  static int __init tce_iommu_init(void)
>>>>>>>>>>>>  {
>>>>>>>>>>>>  	return vfio_register_iommu_driver(&tce_iommu_driver_ops);
>>>>>>>>>>>> @@ -1348,4 +1363,3 @@ MODULE_VERSION(DRIVER_VERSION);
>>>>>>>>>>>>  MODULE_LICENSE("GPL v2");
>>>>>>>>>>>>  MODULE_AUTHOR(DRIVER_AUTHOR);
>>>>>>>>>>>>  MODULE_DESCRIPTION(DRIVER_DESC);
>>>>>>>>>>>> -
>>>>>>>>>>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
>>>>>>>>>>>> index 0ecae0b..1c2138a 100644
>>>>>>>>>>>> --- a/include/linux/vfio.h
>>>>>>>>>>>> +++ b/include/linux/vfio.h
>>>>>>>>>>>> @@ -91,6 +91,12 @@ extern void vfio_group_put_external_user(struct vfio_group *group);
>>>>>>>>>>>>  extern int vfio_external_user_iommu_id(struct vfio_group *group);
>>>>>>>>>>>>  extern long vfio_external_check_extension(struct vfio_group *group,
>>>>>>>>>>>>  					  unsigned long arg);
>>>>>>>>>>>> +extern struct vfio_container *vfio_container_get_ext(struct file *filep);
>>>>>>>>>>>> +extern void vfio_container_put_ext(struct vfio_container *container);
>>>>>>>>>>>> +extern void *vfio_container_get_iommu_data_ext(
>>>>>>>>>>>> +		struct vfio_container *container);
>>>>>>>>>>>> +extern struct iommu_table *vfio_container_spapr_tce_table_get_ext(
>>>>>>>>>>>> +		void *iommu_data, u64 offset);
>>>>>>>>>>>>  
>>>>>>>>>>>>  /*
>>>>>>>>>>>>   * Sub-module helpers      
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I think you need to take a closer look of the lifecycle of a container,
>>>>>>>>>>> having a reference means the container itself won't go away, but only
>>>>>>>>>>> having a group set within that container holds the actual IOMMU
>>>>>>>>>>> references.  container->iommu_data is going to be NULL once the
>>>>>>>>>>> groups are lost.  Thanks,      
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Container owns the iommu tables and this is what I care about here, groups
>>>>>>>>>> attached or not - this is handled separately via IOMMU group list in a
>>>>>>>>>> specific iommu_table struct, these groups get detached from iommu_table
>>>>>>>>>> when they are removed from a container.    
>>>>>>>>>
>>>>>>>>> The container doesn't own anything, the container is privileged by the
>>>>>>>>> groups being attached to it.  When groups are closed, they detach from
>>>>>>>>> the container and once the container group list is empty the iommu
>>>>>>>>> backend is released and iommu_data is NULL.  A container reference
>>>>>>>>> doesn't give you what you're looking for.  It implies nothing about the
>>>>>>>>> iommu backend.    
>>>>>>>>
>>>>>>>>
>>>>>>>> Well. Backend is a part of a container and since a backend owns tables, a
>>>>>>>> container owns them too.  
>>>>>>>
>>>>>>> The IOMMU backend is accessed through the container, but that backend
>>>>>>> is privileged by the groups it contains.  Once those groups are gone,
>>>>>>> the IOMMU backend is released, regardless of whatever reference you
>>>>>>> have to the container itself such as you're attempting to do here.  In
>>>>>>> that sense, the container does not own those tables.  
>>>>>>
>>>>>> So, the thing is that what KVM fundamentally needs is a handle on the
>>>>>> container.  KVM is essentially modelling the DMA address space of a
>>>>>> single guest bus, and the container is what's attached to that.
>>>>>>
>>>>>> The first part of the problem is that KVM wants to basically invoke
>>>>>> vfio_dma_map() operations without bouncing via qemu.  Because
>>>>>> vfio_dma_map() works on the container level, that's the handle that
>>>>>> KVM needs to hold.
>>>>>>
>>>>>> The second part of the problem is that in order to reduce overhead
>>>>>> further, we want to operate in real mode, which means bypassing most
>>>>>> of the usual VFIO structure and going directly(ish) from the KVM
>>>>>> hcall emulation to the IOMMU backend behind VFIO.  This complicates
>>>>>> matters a fair bit.  Because it is, explicitly, a performance hack,
>>>>>> some degree of ugliness is probably inevitable.
>>>>>>
>>>>>> Alexey - actually implementing this in two stages might make this
>>>>>> clearer.  The first stage wouldn't allow real mode, and would call
>>>>>> through the same vfio_dma_map() path as qemu calls through now.  The
>>>>>> second stage would then put in place the necessary hacks to add real
>>>>>> mode support.
>>>>>>
>>>>>>>> The problem I am trying to solve here is when KVM may release the
>>>>>>>> iommu_table objects.
>>>>>>>>
>>>>>>>> "Set" ioctl() to KVM-spapr-tce-table (or KVM itself, does not really
>>>>>>>> matter) makes a link between KVM-spapr-tce-table and container and KVM can
>>>>>>>> start using tables (with referencing them).
>>>>>>>>
>>>>>>>> First I tried adding an "unset" ioctl to KVM-spapr-tce-table, called it
>>>>>>>> from region_del() and this works if QEMU removes a window. However if QEMU
>>>>>>>> removes a vfio-pci device, region_del() is not called and KVM does not get
>>>>>>>> notified that it can release the iommu_table's because the
>>>>>>>> KVM-spapr-tce-table remains alive and does not get destroyed (as it is
>>>>>>>> still used by emulated devices or other containers).
>>>>>>>>
>>>>>>>> So it was suggested that we could do such "unset" somehow later assuming,
>>>>>>>> for example, on every "set" I could check if some of currently attached
>>>>>>>> containers are no longer used - and this is where being able to know if there
>>>>>>>> is no backend helps - KVM remembers a container pointer and can check this
>>>>>>>> via vfio_container_get_iommu_data_ext().
>>>>>>>>
>>>>>>>> The other option would be changing vfio_container_get_ext() to take a
>>>>>>>> callback+opaque which container would call when it destroys iommu_data.
>>>>>>>> This looks more intrusive, and it is not obvious how to make it right -
>>>>>>>> container would have to keep track of all registered external users and
>>>>>>>> vfio_container_put_ext() would have to pass the same callback+opaque to
>>>>>>>> unregister the exact external user.  
>>>>>>>
>>>>>>> I'm not in favor of anything resembling the code above or extensions
>>>>>>> beyond it, the container is the wrong place to do this.
>>>>>>>   
>>>>>>>> Or I could store container file* in KVM. Then iommu_data would never be
>>>>>>>> released until KVM-spapr-tce-table is destroyed.  
>>>>>>>
>>>>>>> See above, holding a file pointer to the container doesn't do squat.
>>>>>>> The groups that are held by the container empower the IOMMU backend,
>>>>>>> references to the container itself don't matter.  Those references will
>>>>>>> not maintain the IOMMU data.
>>>>>>>    
>>>>>>>> Recreating KVM-spapr-tce-table on every vfio-pci hotunplug (closing its fd
>>>>>>>> would "unset" container from KVM-spapr-tce-table) is not an option as there
>>>>>>>> still may be devices using this KVM-spapr-tce-table.
>>>>>>>>
>>>>>>>> What obvious and nice solution am I missing here? Thanks.  
>>>>>>>
>>>>>>> The interactions with the IOMMU backend that seem relevant are
>>>>>>> vfio_iommu_drivers_ops.{detach_group,release}.  The kvm-vfio pseudo
>>>>>>> device is also used to tell kvm about groups as they come and go and
>>>>>>> has a way to check extensions, and thus properties of the IOMMU
>>>>>>> backend.  All of these are available for your {ab}use.  Thanks,  
>>>>>>
>>>>>> So, Alexey started trying to do this via the KVM-VFIO device, but it's
>>>>>> a really bad fit.  As noted above, fundamentally it's a container we
>>>>>> need to attach to the kvm-spapr-tce-table object, since what that
>>>>>> represents is a guest bus DMA address space, and by definition all the
>>>>>> groups in a container must have the same DMA address space.
>>>>>
>>>>> That's all fine and good, but the point remains that a reference to the
>>>>> container is no assurance of the iommu state.  The iommu state is
>>>>> maintained by the user and the groups attached to the container.  If
>>>>> the groups are removed, your container reference no longer has any iommu
>>>>> backing and iommu_data is worthless.  The user can do this as well by
>>>>> un-setting the iommu.  I understand what you're trying to do, it's just
>>>>> wrong.  Thanks,
>>>>
>>>> I'm trying to figure out how to do this right, and it's not at all
>>>> obvious.  The container may be wrong, but that doesn't make the
>>>> KVM-VFIO device any more useful.  Attempting to do this at the group
>>>> level is at least as wrong for the reasons I've mentioned elsewhere.
>>>>
>>>
>>> I could create a new fd, one per iommu_table, the fd would reference the
>>> iommu_table (not touching an iommu_table_group or a container), VFIO SPAPR
>>> TCE backend would return it in VFIO_IOMMU_SPAPR_TCE_CREATE (ioctl which
>>> creates windows) or I could add VFIO_IOMMU_SPAPR_TCE_GET_FD_BY_OFFSET; then
>>> I'd pass this new fd to the KVM or KVM-spapr-tce-table to hook them up. To
>>> release the reference, KVM-spapr-tce-table would have "unset" ioctl()
>>> and/or on every "set" I would check whether all attached tables have at least one
>>> iommu_table_group attached, if none - release the table.
>>>
>>> This would make no change to generic VFIO code and very little change in
>>> SPAPR TCE backend. Would that be acceptable or is it horrible again? Thanks.
>>
>>
>> Ping?
> 
> I'm still in Toronto after KVM Forum.  I had a detailed discussion
> about this with Alex W, which I'll write up once I get back.
> 
> The short version is that Alex more-or-less convinced me that we do
> need to go back to doing this with an interface based on linking
> groups to LIOBNs.  That leads to an interface that's kind of weird and
> has some fairly counter-intuitive properties, but in the end it works
> out better than doing it with containers.
> 




Soooo? :)




-- 
Alexey




* Re: [PATCH kernel 14/15] vfio/spapr_tce: Export container API for external users
  2016-09-07  9:09                         ` Alexey Kardashevskiy
@ 2016-09-21  6:56                           ` Alexey Kardashevskiy
  2016-09-23  7:12                             ` David Gibson
  0 siblings, 1 reply; 60+ messages in thread
From: Alexey Kardashevskiy @ 2016-09-21  6:56 UTC (permalink / raw)
  To: David Gibson; +Cc: Alex Williamson, linuxppc-dev, Paul Mackerras



On 07/09/16 19:09, Alexey Kardashevskiy wrote:
> On 29/08/16 23:27, David Gibson wrote:
>> On Mon, Aug 29, 2016 at 04:35:15PM +1000, Alexey Kardashevskiy wrote:
>>> On 18/08/16 10:22, Alexey Kardashevskiy wrote:
>>>> On 17/08/16 13:17, David Gibson wrote:
>>>>> On Fri, Aug 12, 2016 at 09:22:01AM -0600, Alex Williamson wrote:
>>>>>> On Fri, 12 Aug 2016 15:46:01 +1000
>>>>>> David Gibson <david@gibson.dropbear.id.au> wrote:
>>>>>>
>>>>>>> On Wed, Aug 10, 2016 at 10:46:30AM -0600, Alex Williamson wrote:
>>>>>>>> On Wed, 10 Aug 2016 15:37:17 +1000
>>>>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>>>>>   
>>>>>>>>> On 09/08/16 22:16, Alex Williamson wrote:  
>>>>>>>>>> On Tue, 9 Aug 2016 15:19:39 +1000
>>>>>>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>>>>>>>     
>>>>>>>>>>> On 09/08/16 02:43, Alex Williamson wrote:    
>>>>>>>>>>>> On Wed,  3 Aug 2016 18:40:55 +1000
>>>>>>>>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>>>>>>>>>       
>>>>>>>>>>>>> This exports helpers which are needed to keep a VFIO container in
>>>>>>>>>>>>> memory while there are external users such as KVM.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>>>>>>> ---
>>>>>>>>>>>>>  drivers/vfio/vfio.c                 | 30 ++++++++++++++++++++++++++++++
>>>>>>>>>>>>>  drivers/vfio/vfio_iommu_spapr_tce.c | 16 +++++++++++++++-
>>>>>>>>>>>>>  include/linux/vfio.h                |  6 ++++++
>>>>>>>>>>>>>  3 files changed, 51 insertions(+), 1 deletion(-)
>>>>>>>>>>>>>
>>>>>>>>>>>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>>>>>>>>>>>>> index d1d70e0..baf6a9c 100644
>>>>>>>>>>>>> --- a/drivers/vfio/vfio.c
>>>>>>>>>>>>> +++ b/drivers/vfio/vfio.c
>>>>>>>>>>>>> @@ -1729,6 +1729,36 @@ long vfio_external_check_extension(struct vfio_group *group, unsigned long arg)
>>>>>>>>>>>>>  EXPORT_SYMBOL_GPL(vfio_external_check_extension);
>>>>>>>>>>>>>  
>>>>>>>>>>>>>  /**
>>>>>>>>>>>>> + * External user API for containers, exported by symbols to be linked
>>>>>>>>>>>>> + * dynamically.
>>>>>>>>>>>>> + *
>>>>>>>>>>>>> + */
>>>>>>>>>>>>> +struct vfio_container *vfio_container_get_ext(struct file *filep)
>>>>>>>>>>>>> +{
>>>>>>>>>>>>> +	struct vfio_container *container = filep->private_data;
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +	if (filep->f_op != &vfio_fops)
>>>>>>>>>>>>> +		return ERR_PTR(-EINVAL);
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +	vfio_container_get(container);
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +	return container;
>>>>>>>>>>>>> +}
>>>>>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_get_ext);
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +void vfio_container_put_ext(struct vfio_container *container)
>>>>>>>>>>>>> +{
>>>>>>>>>>>>> +	vfio_container_put(container);
>>>>>>>>>>>>> +}
>>>>>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_put_ext);
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +void *vfio_container_get_iommu_data_ext(struct vfio_container *container)
>>>>>>>>>>>>> +{
>>>>>>>>>>>>> +	return container->iommu_data;
>>>>>>>>>>>>> +}
>>>>>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_get_iommu_data_ext);
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>   * Sub-module support
>>>>>>>>>>>>>   */
>>>>>>>>>>>>>  /*
>>>>>>>>>>>>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
>>>>>>>>>>>>> index 3594ad3..fceea3d 100644
>>>>>>>>>>>>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
>>>>>>>>>>>>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
>>>>>>>>>>>>> @@ -1331,6 +1331,21 @@ const struct vfio_iommu_driver_ops tce_iommu_driver_ops = {
>>>>>>>>>>>>>  	.detach_group	= tce_iommu_detach_group,
>>>>>>>>>>>>>  };
>>>>>>>>>>>>>  
>>>>>>>>>>>>> +struct iommu_table *vfio_container_spapr_tce_table_get_ext(void *iommu_data,
>>>>>>>>>>>>> +		u64 offset)
>>>>>>>>>>>>> +{
>>>>>>>>>>>>> +	struct tce_container *container = iommu_data;
>>>>>>>>>>>>> +	struct iommu_table *tbl = NULL;
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +	if (tce_iommu_find_table(container, offset, &tbl) < 0)
>>>>>>>>>>>>> +		return NULL;
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +	iommu_table_get(tbl);
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +	return tbl;
>>>>>>>>>>>>> +}
>>>>>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_spapr_tce_table_get_ext);
>>>>>>>>>>>>> +
>>>>>>>>>>>>>  static int __init tce_iommu_init(void)
>>>>>>>>>>>>>  {
>>>>>>>>>>>>>  	return vfio_register_iommu_driver(&tce_iommu_driver_ops);
>>>>>>>>>>>>> @@ -1348,4 +1363,3 @@ MODULE_VERSION(DRIVER_VERSION);
>>>>>>>>>>>>>  MODULE_LICENSE("GPL v2");
>>>>>>>>>>>>>  MODULE_AUTHOR(DRIVER_AUTHOR);
>>>>>>>>>>>>>  MODULE_DESCRIPTION(DRIVER_DESC);
>>>>>>>>>>>>> -
>>>>>>>>>>>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
>>>>>>>>>>>>> index 0ecae0b..1c2138a 100644
>>>>>>>>>>>>> --- a/include/linux/vfio.h
>>>>>>>>>>>>> +++ b/include/linux/vfio.h
>>>>>>>>>>>>> @@ -91,6 +91,12 @@ extern void vfio_group_put_external_user(struct vfio_group *group);
>>>>>>>>>>>>>  extern int vfio_external_user_iommu_id(struct vfio_group *group);
>>>>>>>>>>>>>  extern long vfio_external_check_extension(struct vfio_group *group,
>>>>>>>>>>>>>  					  unsigned long arg);
>>>>>>>>>>>>> +extern struct vfio_container *vfio_container_get_ext(struct file *filep);
>>>>>>>>>>>>> +extern void vfio_container_put_ext(struct vfio_container *container);
>>>>>>>>>>>>> +extern void *vfio_container_get_iommu_data_ext(
>>>>>>>>>>>>> +		struct vfio_container *container);
>>>>>>>>>>>>> +extern struct iommu_table *vfio_container_spapr_tce_table_get_ext(
>>>>>>>>>>>>> +		void *iommu_data, u64 offset);
>>>>>>>>>>>>>  
>>>>>>>>>>>>>  /*
>>>>>>>>>>>>>   * Sub-module helpers      
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I think you need to take a closer look of the lifecycle of a container,
>>>>>>>>>>>> having a reference means the container itself won't go away, but only
>>>>>>>>>>>> having a group set within that container holds the actual IOMMU
>>>>>>>>>>>> references.  container->iommu_data is going to be NULL once the
>>>>>>>>>>>> groups are lost.  Thanks,      
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Container owns the iommu tables and this is what I care about here, groups
>>>>>>>>>>> attached or not - this is handled separately via IOMMU group list in a
>>>>>>>>>>> specific iommu_table struct, these groups get detached from iommu_table
>>>>>>>>>>> when they are removed from a container.    
>>>>>>>>>>
>>>>>>>>>> The container doesn't own anything, the container is privileged by the
>>>>>>>>>> groups being attached to it.  When groups are closed, they detach from
>>>>>>>>>> the container and once the container group list is empty the iommu
>>>>>>>>>> backend is released and iommu_data is NULL.  A container reference
>>>>>>>>>> doesn't give you what you're looking for.  It implies nothing about the
>>>>>>>>>> iommu backend.    
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Well. Backend is a part of a container and since a backend owns tables, a
>>>>>>>>> container owns them too.  
>>>>>>>>
>>>>>>>> The IOMMU backend is accessed through the container, but that backend
>>>>>>>> is privileged by the groups it contains.  Once those groups are gone,
>>>>>>>> the IOMMU backend is released, regardless of whatever reference you
>>>>>>>> have to the container itself such as you're attempting to do here.  In
>>>>>>>> that sense, the container does not own those tables.  
>>>>>>>
>>>>>>> So, the thing is that what KVM fundamentally needs is a handle on the
>>>>>>> container.  KVM is essentially modelling the DMA address space of a
>>>>>>> single guest bus, and the container is what's attached to that.
>>>>>>>
>>>>>>> The first part of the problem is that KVM wants to basically invoke
>>>>>>> vfio_dma_map() operations without bouncing via qemu.  Because
>>>>>>> vfio_dma_map() works on the container level, that's the handle that
>>>>>>> KVM needs to hold.
>>>>>>>
>>>>>>> The second part of the problem is that in order to reduce overhead
>>>>>>> further, we want to operate in real mode, which means bypassing most
>>>>>>> of the usual VFIO structure and going directly(ish) from the KVM
>>>>>>> hcall emulation to the IOMMU backend behind VFIO.  This complicates
>>>>>>> matters a fair bit.  Because it is, explicitly, a performance hack,
>>>>>>> some degree of ugliness is probably inevitable.
>>>>>>>
>>>>>>> Alexey - actually implementing this in two stages might make this
>>>>>>> clearer.  The first stage wouldn't allow real mode, and would call
>>>>>>> through the same vfio_dma_map() path as qemu calls through now.  The
>>>>>>> second stage would then put in place the necessary hacks to add real
>>>>>>> mode support.
>>>>>>>
>>>>>>>>> The problem I am trying to solve here is when KVM may release the
>>>>>>>>> iommu_table objects.
>>>>>>>>>
>>>>>>>>> "Set" ioctl() to KVM-spapr-tce-table (or KVM itself, does not really
>>>>>>>>> matter) makes a link between KVM-spapr-tce-table and container and KVM can
>>>>>>>>> start using tables (with referencing them).
>>>>>>>>>
>>>>>>>>> First I tried adding an "unset" ioctl to KVM-spapr-tce-table, called it
>>>>>>>>> from region_del() and this works if QEMU removes a window. However if QEMU
>>>>>>>>> removes a vfio-pci device, region_del() is not called and KVM does not get
>>>>>>>>> notified that it can release the iommu_table's because the
>>>>>>>>> KVM-spapr-tce-table remains alive and does not get destroyed (as it is
>>>>>>>>> still used by emulated devices or other containers).
>>>>>>>>>
>>>>>>>>> So it was suggested that we could do such "unset" somehow later assuming,
>>>>>>>>> for example, on every "set" I could check if some of currently attached
>>>>>>>>> containers are no longer used - and this is where being able to know if there
>>>>>>>>> is no backend helps - KVM remembers a container pointer and can check this
>>>>>>>>> via vfio_container_get_iommu_data_ext().
>>>>>>>>>
>>>>>>>>> The other option would be changing vfio_container_get_ext() to take a
>>>>>>>>> callback+opaque which container would call when it destroys iommu_data.
>>>>>>>>> This looks more intrusive, and it is not obvious how to make it right -
>>>>>>>>> container would have to keep track of all registered external users and
>>>>>>>>> vfio_container_put_ext() would have to pass the same callback+opaque to
>>>>>>>>> unregister the exact external user.  
>>>>>>>>
>>>>>>>> I'm not in favor of anything resembling the code above or extensions
>>>>>>>> beyond it, the container is the wrong place to do this.
>>>>>>>>   
>>>>>>>>> Or I could store container file* in KVM. Then iommu_data would never be
>>>>>>>>> released until KVM-spapr-tce-table is destroyed.  
>>>>>>>>
>>>>>>>> See above, holding a file pointer to the container doesn't do squat.
>>>>>>>> The groups that are held by the container empower the IOMMU backend,
>>>>>>>> references to the container itself don't matter.  Those references will
>>>>>>>> not maintain the IOMMU data.
>>>>>>>>    
>>>>>>>>> Recreating KVM-spapr-tce-table on every vfio-pci hotunplug (closing its fd
>>>>>>>>> would "unset" container from KVM-spapr-tce-table) is not an option as there
>>>>>>>>> still may be devices using this KVM-spapr-tce-table.
>>>>>>>>>
>>>>>>>>> What obvious and nice solution am I missing here? Thanks.  
>>>>>>>>
>>>>>>>> The interactions with the IOMMU backend that seem relevant are
>>>>>>>> vfio_iommu_drivers_ops.{detach_group,release}.  The kvm-vfio pseudo
>>>>>>>> device is also used to tell kvm about groups as they come and go and
>>>>>>>> has a way to check extensions, and thus properties of the IOMMU
>>>>>>>> backend.  All of these are available for your {ab}use.  Thanks,  
>>>>>>>
>>>>>>> So, Alexey started trying to do this via the KVM-VFIO device, but it's
>>>>>>> a really bad fit.  As noted above, fundamentally it's a container we
>>>>>>> need to attach to the kvm-spapr-tce-table object, since what that
>>>>>>> represents is a guest bus DMA address space, and by definition all the
>>>>>>> groups in a container must have the same DMA address space.
>>>>>>
>>>>>> That's all fine and good, but the point remains that a reference to the
>>>>>> container is no assurance of the iommu state.  The iommu state is
>>>>>> maintained by the user and the groups attached to the container.  If
>>>>>> the groups are removed, your container reference no longer has any iommu
>>>>>> backing and iommu_data is worthless.  The user can do this as well by
>>>>>> un-setting the iommu.  I understand what you're trying to do, it's just
>>>>>> wrong.  Thanks,
>>>>>
>>>>> I'm trying to figure out how to do this right, and it's not at all
>>>>> obvious.  The container may be wrong, but that doesn't make the
>>>>> KVM-VFIO device any more useful.  Attempting to do this at the group
>>>>> level is at least as wrong for the reasons I've mentioned elsewhere.
>>>>>
>>>>
>>>> I could create a new fd, one per iommu_table, the fd would reference the
>>>> iommu_table (not touching an iommu_table_group or a container), VFIO SPAPR
>>>> TCE backend would return it in VFIO_IOMMU_SPAPR_TCE_CREATE (ioctl which
>>>> creates windows) or I could add VFIO_IOMMU_SPAPR_TCE_GET_FD_BY_OFFSET; then
>>>> I'd pass this new fd to the KVM or KVM-spapr-tce-table to hook them up. To
>>>> release the reference, KVM-spapr-tce-table would have "unset" ioctl()
>>>> and/or on every "set" I would check whether all attached tables have at least one
>>>> iommu_table_group attached, if none - release the table.
>>>>
>>>> This would make no change to generic VFIO code and very little change in
>>>> SPAPR TCE backend. Would that be acceptable or is it horrible again? Thanks.
>>>
>>>
>>> Ping?
>>
>> I'm still in Toronto after KVM Forum.  I had a detailed discussion
>> about this with Alex W, which I'll write up once I get back.
>>
>> The short version is that Alex more-or-less convinced me that we do
>> need to go back to doing this with an interface based on linking
>> groups to LIOBNs.  That leads to an interface that's kind of weird and
>> has some fairly counter-intuitive properties, but in the end it works
>> out better than doing it with containers.
>>
> 
> 
> 
> 
> Soooo? :)


When can I expect a full version of how to do this in-kernel thingy? Thanks.




-- 
Alexey




* Re: [PATCH kernel 14/15] vfio/spapr_tce: Export container API for external users
  2016-09-21  6:56                           ` Alexey Kardashevskiy
@ 2016-09-23  7:12                             ` David Gibson
  2016-10-17  6:06                               ` Alexey Kardashevskiy
  0 siblings, 1 reply; 60+ messages in thread
From: David Gibson @ 2016-09-23  7:12 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, linuxppc-dev, Paul Mackerras


On Wed, Sep 21, 2016 at 04:56:52PM +1000, Alexey Kardashevskiy wrote:
> On 07/09/16 19:09, Alexey Kardashevskiy wrote:
> > On 29/08/16 23:27, David Gibson wrote:
> >> On Mon, Aug 29, 2016 at 04:35:15PM +1000, Alexey Kardashevskiy wrote:
> >>> On 18/08/16 10:22, Alexey Kardashevskiy wrote:
> >>>> On 17/08/16 13:17, David Gibson wrote:
> >>>>> On Fri, Aug 12, 2016 at 09:22:01AM -0600, Alex Williamson wrote:
> >>>>>> On Fri, 12 Aug 2016 15:46:01 +1000
> >>>>>> David Gibson <david@gibson.dropbear.id.au> wrote:
> >>>>>>
> >>>>>>> On Wed, Aug 10, 2016 at 10:46:30AM -0600, Alex Williamson wrote:
> >>>>>>>> On Wed, 10 Aug 2016 15:37:17 +1000
> >>>>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >>>>>>>>   
> >>>>>>>>> On 09/08/16 22:16, Alex Williamson wrote:  
> >>>>>>>>>> On Tue, 9 Aug 2016 15:19:39 +1000
> >>>>>>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >>>>>>>>>>     
> >>>>>>>>>>> On 09/08/16 02:43, Alex Williamson wrote:    
> >>>>>>>>>>>> On Wed,  3 Aug 2016 18:40:55 +1000
> >>>>>>>>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >>>>>>>>>>>>       
> >>>>>>>>>>>>> This exports helpers which are needed to keep a VFIO container in
> >>>>>>>>>>>>> memory while there are external users such as KVM.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>>>>>>>>>>> ---
> >>>>>>>>>>>>>  drivers/vfio/vfio.c                 | 30 ++++++++++++++++++++++++++++++
> >>>>>>>>>>>>>  drivers/vfio/vfio_iommu_spapr_tce.c | 16 +++++++++++++++-
> >>>>>>>>>>>>>  include/linux/vfio.h                |  6 ++++++
> >>>>>>>>>>>>>  3 files changed, 51 insertions(+), 1 deletion(-)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> >>>>>>>>>>>>> index d1d70e0..baf6a9c 100644
> >>>>>>>>>>>>> --- a/drivers/vfio/vfio.c
> >>>>>>>>>>>>> +++ b/drivers/vfio/vfio.c
> >>>>>>>>>>>>> @@ -1729,6 +1729,36 @@ long vfio_external_check_extension(struct vfio_group *group, unsigned long arg)
> >>>>>>>>>>>>>  EXPORT_SYMBOL_GPL(vfio_external_check_extension);
> >>>>>>>>>>>>>  
> >>>>>>>>>>>>>  /**
> >>>>>>>>>>>>> + * External user API for containers, exported by symbols to be linked
> >>>>>>>>>>>>> + * dynamically.
> >>>>>>>>>>>>> + *
> >>>>>>>>>>>>> + */
> >>>>>>>>>>>>> +struct vfio_container *vfio_container_get_ext(struct file *filep)
> >>>>>>>>>>>>> +{
> >>>>>>>>>>>>> +	struct vfio_container *container = filep->private_data;
> >>>>>>>>>>>>> +
> >>>>>>>>>>>>> +	if (filep->f_op != &vfio_fops)
> >>>>>>>>>>>>> +		return ERR_PTR(-EINVAL);
> >>>>>>>>>>>>> +
> >>>>>>>>>>>>> +	vfio_container_get(container);
> >>>>>>>>>>>>> +
> >>>>>>>>>>>>> +	return container;
> >>>>>>>>>>>>> +}
> >>>>>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_get_ext);
> >>>>>>>>>>>>> +
> >>>>>>>>>>>>> +void vfio_container_put_ext(struct vfio_container *container)
> >>>>>>>>>>>>> +{
> >>>>>>>>>>>>> +	vfio_container_put(container);
> >>>>>>>>>>>>> +}
> >>>>>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_put_ext);
> >>>>>>>>>>>>> +
> >>>>>>>>>>>>> +void *vfio_container_get_iommu_data_ext(struct vfio_container *container)
> >>>>>>>>>>>>> +{
> >>>>>>>>>>>>> +	return container->iommu_data;
> >>>>>>>>>>>>> +}
> >>>>>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_get_iommu_data_ext);
> >>>>>>>>>>>>> +
> >>>>>>>>>>>>> +/**
> >>>>>>>>>>>>>   * Sub-module support
> >>>>>>>>>>>>>   */
> >>>>>>>>>>>>>  /*
> >>>>>>>>>>>>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> >>>>>>>>>>>>> index 3594ad3..fceea3d 100644
> >>>>>>>>>>>>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> >>>>>>>>>>>>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> >>>>>>>>>>>>> @@ -1331,6 +1331,21 @@ const struct vfio_iommu_driver_ops tce_iommu_driver_ops = {
> >>>>>>>>>>>>>  	.detach_group	= tce_iommu_detach_group,
> >>>>>>>>>>>>>  };
> >>>>>>>>>>>>>  
> >>>>>>>>>>>>> +struct iommu_table *vfio_container_spapr_tce_table_get_ext(void *iommu_data,
> >>>>>>>>>>>>> +		u64 offset)
> >>>>>>>>>>>>> +{
> >>>>>>>>>>>>> +	struct tce_container *container = iommu_data;
> >>>>>>>>>>>>> +	struct iommu_table *tbl = NULL;
> >>>>>>>>>>>>> +
> >>>>>>>>>>>>> +	if (tce_iommu_find_table(container, offset, &tbl) < 0)
> >>>>>>>>>>>>> +		return NULL;
> >>>>>>>>>>>>> +
> >>>>>>>>>>>>> +	iommu_table_get(tbl);
> >>>>>>>>>>>>> +
> >>>>>>>>>>>>> +	return tbl;
> >>>>>>>>>>>>> +}
> >>>>>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_spapr_tce_table_get_ext);
> >>>>>>>>>>>>> +
> >>>>>>>>>>>>>  static int __init tce_iommu_init(void)
> >>>>>>>>>>>>>  {
> >>>>>>>>>>>>>  	return vfio_register_iommu_driver(&tce_iommu_driver_ops);
> >>>>>>>>>>>>> @@ -1348,4 +1363,3 @@ MODULE_VERSION(DRIVER_VERSION);
> >>>>>>>>>>>>>  MODULE_LICENSE("GPL v2");
> >>>>>>>>>>>>>  MODULE_AUTHOR(DRIVER_AUTHOR);
> >>>>>>>>>>>>>  MODULE_DESCRIPTION(DRIVER_DESC);
> >>>>>>>>>>>>> -
> >>>>>>>>>>>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> >>>>>>>>>>>>> index 0ecae0b..1c2138a 100644
> >>>>>>>>>>>>> --- a/include/linux/vfio.h
> >>>>>>>>>>>>> +++ b/include/linux/vfio.h
> >>>>>>>>>>>>> @@ -91,6 +91,12 @@ extern void vfio_group_put_external_user(struct vfio_group *group);
> >>>>>>>>>>>>>  extern int vfio_external_user_iommu_id(struct vfio_group *group);
> >>>>>>>>>>>>>  extern long vfio_external_check_extension(struct vfio_group *group,
> >>>>>>>>>>>>>  					  unsigned long arg);
> >>>>>>>>>>>>> +extern struct vfio_container *vfio_container_get_ext(struct file *filep);
> >>>>>>>>>>>>> +extern void vfio_container_put_ext(struct vfio_container *container);
> >>>>>>>>>>>>> +extern void *vfio_container_get_iommu_data_ext(
> >>>>>>>>>>>>> +		struct vfio_container *container);
> >>>>>>>>>>>>> +extern struct iommu_table *vfio_container_spapr_tce_table_get_ext(
> >>>>>>>>>>>>> +		void *iommu_data, u64 offset);
> >>>>>>>>>>>>>  
> >>>>>>>>>>>>>  /*
> >>>>>>>>>>>>>   * Sub-module helpers      
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> I think you need to take a closer look of the lifecycle of a container,
> >>>>>>>>>>>> having a reference means the container itself won't go away, but only
> >>>>>>>>>>>> having a group set within that container holds the actual IOMMU
> >>>>>>>>>>>> references.  container->iommu_data is going to be NULL once the
> >>>>>>>>>>>> groups are lost.  Thanks,      
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Container owns the iommu tables and this is what I care about here, groups
> >>>>>>>>>>> attached or not - this is handled separately via IOMMU group list in a
> >>>>>>>>>>> specific iommu_table struct, these groups get detached from iommu_table
> >>>>>>>>>>> when they are removed from a container.    
> >>>>>>>>>>
> >>>>>>>>>> The container doesn't own anything, the container is privileged by the
> >>>>>>>>>> groups being attached to it.  When groups are closed, they detach from
> >>>>>>>>>> the container and once the container group list is empty the iommu
> >>>>>>>>>> backend is released and iommu_data is NULL.  A container reference
> >>>>>>>>>> doesn't give you what you're looking for.  It implies nothing about the
> >>>>>>>>>> iommu backend.    
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Well. Backend is a part of a container and since a backend owns tables, a
> >>>>>>>>> container owns them too.  
> >>>>>>>>
> >>>>>>>> The IOMMU backend is accessed through the container, but that backend
> >>>>>>>> is privileged by the groups it contains.  Once those groups are gone,
> >>>>>>>> the IOMMU backend is released, regardless of whatever reference you
> >>>>>>>> have to the container itself such as you're attempting to do here.  In
> >>>>>>>> that sense, the container does not own those tables.  
> >>>>>>>
> >>>>>>> So, the thing is that what KVM fundamentally needs is a handle on the
> >>>>>>> container.  KVM is essentially modelling the DMA address space of a
> >>>>>>> single guest bus, and the container is what's attached to that.
> >>>>>>>
> >>>>>>> The first part of the problem is that KVM wants to basically invoke
> >>>>>>> vfio_dma_map() operations without bouncing via qemu.  Because
> >>>>>>> vfio_dma_map() works on the container level, that's the handle that
> >>>>>>> KVM needs to hold.
> >>>>>>>
> >>>>>>> The second part of the problem is that in order to reduce overhead
> >>>>>>> further, we want to operate in real mode, which means bypassing most
> >>>>>>> of the usual VFIO structure and going directly(ish) from the KVM
> >>>>>>> hcall emulation to the IOMMU backend behind VFIO.  This complicates
> >>>>>>> matters a fair bit.  Because it is, explicitly, a performance hack,
> >>>>>>> some degree of ugliness is probably inevitable.
> >>>>>>>
> >>>>>>> Alexey - actually implementing this in two stages might make this
> >>>>>>> clearer.  The first stage wouldn't allow real mode, and would call
> >>>>>>> through the same vfio_dma_map() path as qemu calls through now.  The
> >>>>>>> second stage would then put in place the necessary hacks to add real
> >>>>>>> mode support.
> >>>>>>>
> >>>>>>>>> The problem I am trying to solve here is when KVM may release the
> >>>>>>>>> iommu_table objects.
> >>>>>>>>>
> >>>>>>>>> "Set" ioctl() to KVM-spapr-tce-table (or KVM itself, does not really
> >>>>>>>>> matter) makes a link between KVM-spapr-tce-table and container and KVM can
> >>>>>>>>> start using tables (with referencing them).
> >>>>>>>>>
> >>>>>>>>> First I tried adding an "unset" ioctl to KVM-spapr-tce-table, called it
> >>>>>>>>> from region_del() and this works if QEMU removes a window. However if QEMU
> >>>>>>>>> removes a vfio-pci device, region_del() is not called and KVM does not get
> >>>>>>>>> notified that it can release the iommu_table's because the
> >>>>>>>>> KVM-spapr-tce-table remains alive and does not get destroyed (as it is
> >>>>>>>>> still used by emulated devices or other containers).
> >>>>>>>>>
> >>>>>>>>> So it was suggested that we could do such "unset" somehow later assuming,
> >>>>>>>>> for example, on every "set" I could check if some of currently attached
> >>>>>>>>> containers are no more used - and this is where being able to know if there
> >>>>>>>>> is no backend helps - KVM remembers a container pointer and can check this
> >>>>>>>>> via vfio_container_get_iommu_data_ext().
> >>>>>>>>>
> >>>>>>>>> The other option would be changing vfio_container_get_ext() to take a
> >>>>>>>>> callback+opaque which container would call when it destroys iommu_data.
> >>>>>>>>> This looks more intrusive and not very intuitive how to make it right -
> >>>>>>>>> container would have to keep track of all registered external users and
> >>>>>>>>> vfio_container_put_ext() would have to pass the same callback+opaque to
> >>>>>>>>> unregister the exact external user.  
> >>>>>>>>
> >>>>>>>> I'm not in favor of anything resembling the code above or extensions
> >>>>>>>> beyond it, the container is the wrong place to do this.
> >>>>>>>>   
> >>>>>>>>> Or I could store container file* in KVM. Then iommu_data would never be
> >>>>>>>>> released until KVM-spapr-tce-table is destroyed.  
> >>>>>>>>
> >>>>>>>> See above, holding a file pointer to the container doesn't do squat.
> >>>>>>>> The groups that are held by the container empower the IOMMU backend,
> >>>>>>>> references to the container itself don't matter.  Those references will
> >>>>>>>> not maintain the IOMMU data.
> >>>>>>>>    
> >>>>>>>>> Recreating KVM-spapr-tce-table on every vfio-pci hotunplug (closing its fd
> >>>>>>>>> would "unset" container from KVM-spapr-tce-table) is not an option as there
> >>>>>>>>> still may be devices using this KVM-spapr-tce-table.
> >>>>>>>>>
> >>>>>>>>> What obvious and nice solution am I missing here? Thanks.  
> >>>>>>>>
> >>>>>>>> The interactions with the IOMMU backend that seem relevant are
> >>>>>>>> vfio_iommu_drivers_ops.{detach_group,release}.  The kvm-vfio pseudo
> >>>>>>>> device is also used to tell kvm about groups as they come and go and
> >>>>>>>> has a way to check extensions, and thus properties of the IOMMU
> >>>>>>>> backend.  All of these are available for your {ab}use.  Thanks,  
> >>>>>>>
> >>>>>>> So, Alexey started trying to do this via the KVM-VFIO device, but it's
> >>>>>>> a really bad fit.  As noted above, fundamentally it's a container we
> >>>>>>> need to attach to the kvm-spapr-tce-table object, since what that
> >>>>>>> represents is a guest bus DMA address space, and by definition all the
> >>>>>>> groups in a container must have the same DMA address space.
> >>>>>>
> >>>>>> That's all fine and good, but the point remains that a reference to the
> >>>>>> container is no assurance of the iommu state.  The iommu state is
> >>>>>> maintained by the user and the groups attached to the container.  If
> >>>>>> the groups are removed, your container reference no longer has any iommu
> >>>>>> backing and iommu_data is worthless.  The user can do this as well by
> >>>>>> un-setting the iommu.  I understand what you're trying to do, it's just
> >>>>>> wrong.  Thanks,
> >>>>>
> >>>>> I'm trying to figure out how to do this right, and it's not at all
> >>>>> obvious.  The container may be wrong, but that doesn't make the
> >>>>> KVM-VFIO device any more useful.  Attempting to do this at the group
> >>>>> level is at least as wrong for the reasons I've mentioned elsewhere.
> >>>>>
> >>>>
> >>>> I could create a new fd, one per iommu_table, the fd would reference the
> >>>> iommu_table (not touching an iommu_table_group or a container), VFIO SPAPR
> >>>> TCE backend would return it in VFIO_IOMMU_SPAPR_TCE_CREATE (ioctl which
> >>>> creates windows) or I could add VFIO_IOMMU_SPAPR_TCE_GET_FD_BY_OFFSET; then
> >>>> I'd pass this new fd to the KVM or KVM-spapr-tce-table to hook them up. To
> >>>> release the reference, KVM-spapr-tce-table would have "unset" ioctl()
> >>>> or/and on every "set" I would look if all attached tables have at least one
> >>>> iommu_table_group attached, if none - release the table.
> >>>>
> >>>> This would make no change to generic VFIO code and very little change in
> >>>> SPAPR TCE backend. Would that be acceptable or is it horrible again? Thanks.
> >>>
> >>>
> >>> Ping?
> >>
> >> I'm still in Toronto after KVM Forum.  I had a detailed discussion
> >> about this with Alex W, which I'll write up once I get back.
> >>
> >> The short version is that Alex more-or-less convinced me that we do
> >> need to go back to doing this with an interface based on linking
> >> groups to LIOBNs.  That leads to an interface that's kind of weird and
> >> has some fairly counter-intuitive properties, but in the end it works
> >> out better than doing it with containers.
> >>
> > 
> > 
> > 
> > 
> > Soooo? :)
> 
> 
> When can I expect a full version of how to do this in-kernel thingy?
> Thanks.

When I can dig myself out from under other things in my queue.  Which
turns out to be now.

Ok.. here's hoping I can remember enough of the conclusions I came to
with Alex W.

User <-> KVM interface
----------------------

This needs to take an LIOBN and a group fd and associate (or
disassociate) them.  This should be possible to do by adding each
group to the vfio-kvm device as on x86, then setting an attribute on
the device to mark the associated liobn.

Attaching different (overlapping) LIOBNs to different groups in the
same container is boundedly undefined (i.e. it mustn't break the host,
but can do anything to the guest).
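
Since the LIOBN attribute on the vfio-kvm device doesn't exist yet, here is only a toy userspace C model of the associate/disassociate bookkeeping the device would have to keep — all names (`kvm_vfio_set_liobn`, `kvm_vfio_unset_liobn`, the struct layout) are hypothetical, not a proposed uapi:

```c
#include <assert.h>

#define MAX_ASSOC 16

/* Toy model of the proposed vfio-kvm device state: liobn <-> group pairs. */
struct kvm_vfio_model {
	struct { unsigned long liobn; int group_fd; } assoc[MAX_ASSOC];
	int n;
};

/* Hypothetical "set attribute": associate a group fd with a liobn. */
int kvm_vfio_set_liobn(struct kvm_vfio_model *d, int group_fd,
		       unsigned long liobn)
{
	if (d->n == MAX_ASSOC)
		return -1;
	d->assoc[d->n].liobn = liobn;
	d->assoc[d->n].group_fd = group_fd;
	d->n++;
	return 0;
}

/* Hypothetical "unset attribute": break the association again. */
int kvm_vfio_unset_liobn(struct kvm_vfio_model *d, int group_fd,
			 unsigned long liobn)
{
	int i;

	for (i = 0; i < d->n; i++) {
		if (d->assoc[i].group_fd == group_fd &&
		    d->assoc[i].liobn == liobn) {
			d->assoc[i] = d->assoc[--d->n];	/* swap-remove */
			return 0;
		}
	}
	return -1;					/* was never associated */
}
```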

KVM <-> VFIO (in kernel) interface
----------------------------------

You'll need a special function which takes a vfio group fd and returns
a reference to an iommu_table object.  It would also return an error
if the group isn't backed by the spapr_tce iommu driver (including any
calls to it on a non-ppc host).  This should probably also increment
the iommu table's ref count (on success).
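
A rough shape for that helper, modelled in plain C with stand-in types — the function name, the model structs, and the driver-name check are all hypothetical here, and the real kernel implementation would differ:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Stand-in types; the real kernel structures are much richer. */
struct iommu_table { int refcount; };

struct vfio_group_model {
	const char *iommu_driver;	/* name of the backing iommu driver */
	struct iommu_table *table;
};

/*
 * Hypothetical helper: return the iommu_table behind a group with an
 * elevated refcount, or NULL + error if the group is not backed by the
 * spapr_tce driver (which also covers calls on a non-ppc host).
 */
struct iommu_table *spapr_tce_table_from_group(struct vfio_group_model *grp,
					       int *err)
{
	if (!grp || strcmp(grp->iommu_driver, "vfio_iommu_spapr_tce") != 0) {
		*err = -ENOTTY;
		return NULL;
	}
	grp->table->refcount++;		/* caller now owns this reference */
	*err = 0;
	return grp->table;
}
```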

Implementation notes
--------------------

When a device in a new group is hotplugged, qemu would need to add the
group to the container *then* tell KVM to attach the group to the
correct liobn(s).

KVM would add the group to a list for that liobn.  It would call the
vfio hook to get the associated iommu table.  If there's an error,
then it's unable to enable acceleration, and would either return an
error immediately or ensure that later attempts to PUT_TCE will be
punted to qemu.

Assuming it is able to accelerate, it would add the iommu table to a
list of iommu tables associated with the liobn.  It will need to
de-dupe here, since with multiple groups per container you'd expect
multiple groups with the same iommu table.
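
The de-dupe step could look like the following toy model (hypothetical names; a fixed-size array stands in for a kernel list): attaching a table that is already tracked is a no-op, so the liobn ends up holding exactly one reference per distinct iommu_table.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TBLS 8

struct iommu_table { int refcount; };

/* Per-LIOBN list of distinct iommu tables. */
struct liobn_model {
	struct iommu_table *tbls[MAX_TBLS];
	int ntbls;
};

/*
 * Attach the table behind a newly added group.  Groups sharing one
 * container share one iommu_table, so de-dupe before taking a reference.
 */
int liobn_attach_table(struct liobn_model *l, struct iommu_table *tbl)
{
	int i;

	for (i = 0; i < l->ntbls; i++)
		if (l->tbls[i] == tbl)
			return 0;	/* already tracked: nothing to do */
	if (l->ntbls == MAX_TBLS)
		return -1;
	tbl->refcount++;		/* one reference per distinct table */
	l->tbls[l->ntbls++] = tbl;
	return 0;
}
```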

H_PUT_TCE would walk the list of attached iommu tables and update them
using the ppc kernel iommu interfaces.
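
The update walk itself is then a simple loop; sketched here with a toy table (the real H_PUT_TCE would go through iommu_table_ops and return PAPR hcall status codes — the -4 below merely stands in for H_PARAMETER):

```c
#include <assert.h>
#include <stddef.h>

#define TBL_ENTRIES 16
#define MAX_TBLS 8

/* Toy table: just an array of TCEs. */
struct iommu_table { unsigned long tces[TBL_ENTRIES]; };

struct liobn_model {
	struct iommu_table *tbls[MAX_TBLS];
	int ntbls;
};

/* Walk every table attached to the liobn and store the TCE in each. */
long h_put_tce_model(struct liobn_model *l, unsigned long idx,
		     unsigned long tce)
{
	int i;

	if (idx >= TBL_ENTRIES)
		return -4;		/* stands in for H_PARAMETER */
	for (i = 0; i < l->ntbls; i++)
		l->tbls[i]->tces[idx] = tce;
	return 0;			/* stands in for H_SUCCESS */
}
```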

When a group is removed from a liobn, kvm would need to recalculate
the list of iommu tables, in case that was the last group attached to
the table.  It would need to decrement the refcount on the iommu table
and, obviously, make sure everything is synchronized with the real mode
PUT_TCE.
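
The recalculation on group removal might be modelled like this (again all names hypothetical, arrays standing in for kernel lists): rebuild the de-duped table list from the remaining groups, drop the reference on any table no group uses any more, and take one on any table that newly appears.

```c
#include <assert.h>
#include <stddef.h>

#define MAX 8

struct iommu_table { int refcount; };

struct liobn_model {
	struct iommu_table *grp_tbl[MAX];	/* table behind each group */
	int ngroups;
	struct iommu_table *tbls[MAX];		/* de-duped, one ref each */
	int ntbls;
};

/* Recompute the liobn's table list after a group comes or goes. */
void liobn_recalc(struct liobn_model *l)
{
	struct iommu_table *old[MAX], *fresh[MAX];
	int nold = l->ntbls, n = 0, i, j, found;

	for (i = 0; i < nold; i++)
		old[i] = l->tbls[i];

	/* de-dupe the tables of the remaining groups */
	for (i = 0; i < l->ngroups; i++) {
		for (found = 0, j = 0; j < n; j++)
			if (fresh[j] == l->grp_tbl[i])
				found = 1;
		if (!found)
			fresh[n++] = l->grp_tbl[i];
	}
	/* drop the reference on tables no group uses any more */
	for (i = 0; i < nold; i++) {
		for (found = 0, j = 0; j < n; j++)
			if (fresh[j] == old[i])
				found = 1;
		if (!found)
			old[i]->refcount--;
	}
	/* take a reference on tables that newly appear */
	for (i = 0; i < n; i++) {
		for (found = 0, j = 0; j < nold; j++)
			if (old[j] == fresh[i])
				found = 1;
		if (!found)
			fresh[i]->refcount++;
		l->tbls[i] = fresh[i];
	}
	l->ntbls = n;
}
```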

When a group is hot unplugged, it's qemu's responsibility to tell kvm
that the group is no longer associated with the liobn, before it
removes the group from the container.  If it doesn't, there may be a
stale iommu table attached to the liobn.  That could certainly mess up
DMA on the guest for other devices, but shouldn't damage the host -
the group now belongs to the host again, but because the group was
detached from the container, the HW is no longer using the container's
iommu table (which KVM is touching) to actually serve the group.

If all the groups are unplugged, so the container becomes quiescent,
KVM's refcount(s) on the iommu table stop it going away.  It won't be
looked at by the hardware any more, so updates will be useless, but
again that's only a problem for the guest, not the host.


Hope that covers it.

Alex, please let me know if I missed something from our discussion.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [PATCH kernel 14/15] vfio/spapr_tce: Export container API for external users
  2016-09-23  7:12                             ` David Gibson
@ 2016-10-17  6:06                               ` Alexey Kardashevskiy
  2016-10-18  1:42                                 ` David Gibson
  0 siblings, 1 reply; 60+ messages in thread
From: Alexey Kardashevskiy @ 2016-10-17  6:06 UTC (permalink / raw)
  To: David Gibson; +Cc: Alex Williamson, linuxppc-dev, Paul Mackerras



So far I got one question, below.


On 23/09/16 17:12, David Gibson wrote:
> On Wed, Sep 21, 2016 at 04:56:52PM +1000, Alexey Kardashevskiy wrote:
>> On 07/09/16 19:09, Alexey Kardashevskiy wrote:
>>> On 29/08/16 23:27, David Gibson wrote:
>>>> On Mon, Aug 29, 2016 at 04:35:15PM +1000, Alexey Kardashevskiy wrote:
>>>>> On 18/08/16 10:22, Alexey Kardashevskiy wrote:
>>>>>> On 17/08/16 13:17, David Gibson wrote:
>>>>>>> On Fri, Aug 12, 2016 at 09:22:01AM -0600, Alex Williamson wrote:
>>>>>>>> On Fri, 12 Aug 2016 15:46:01 +1000
>>>>>>>> David Gibson <david@gibson.dropbear.id.au> wrote:
>>>>>>>>
>>>>>>>>> On Wed, Aug 10, 2016 at 10:46:30AM -0600, Alex Williamson wrote:
>>>>>>>>>> On Wed, 10 Aug 2016 15:37:17 +1000
>>>>>>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>>>>>>>   
>>>>>>>>>>> On 09/08/16 22:16, Alex Williamson wrote:  
>>>>>>>>>>>> On Tue, 9 Aug 2016 15:19:39 +1000
>>>>>>>>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>>>>>>>>>     
>>>>>>>>>>>>> On 09/08/16 02:43, Alex Williamson wrote:    
>>>>>>>>>>>>>> On Wed,  3 Aug 2016 18:40:55 +1000
>>>>>>>>>>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>>>>>>>>>>>       
>>>>>>>>>>>>>>> This exports helpers which are needed to keep a VFIO container in
>>>>>>>>>>>>>>> memory while there are external users such as KVM.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>  drivers/vfio/vfio.c                 | 30 ++++++++++++++++++++++++++++++
>>>>>>>>>>>>>>>  drivers/vfio/vfio_iommu_spapr_tce.c | 16 +++++++++++++++-
>>>>>>>>>>>>>>>  include/linux/vfio.h                |  6 ++++++
>>>>>>>>>>>>>>>  3 files changed, 51 insertions(+), 1 deletion(-)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>>>>>>>>>>>>>>> index d1d70e0..baf6a9c 100644
>>>>>>>>>>>>>>> --- a/drivers/vfio/vfio.c
>>>>>>>>>>>>>>> +++ b/drivers/vfio/vfio.c
>>>>>>>>>>>>>>> @@ -1729,6 +1729,36 @@ long vfio_external_check_extension(struct vfio_group *group, unsigned long arg)
>>>>>>>>>>>>>>>  EXPORT_SYMBOL_GPL(vfio_external_check_extension);
>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>  /**
>>>>>>>>>>>>>>> + * External user API for containers, exported by symbols to be linked
>>>>>>>>>>>>>>> + * dynamically.
>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>> +struct vfio_container *vfio_container_get_ext(struct file *filep)
>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>> +	struct vfio_container *container = filep->private_data;
>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>> +	if (filep->f_op != &vfio_fops)
>>>>>>>>>>>>>>> +		return ERR_PTR(-EINVAL);
>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>> +	vfio_container_get(container);
>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>> +	return container;
>>>>>>>>>>>>>>> +}
>>>>>>>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_get_ext);
>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>> +void vfio_container_put_ext(struct vfio_container *container)
>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>> +	vfio_container_put(container);
>>>>>>>>>>>>>>> +}
>>>>>>>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_put_ext);
>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>> +void *vfio_container_get_iommu_data_ext(struct vfio_container *container)
>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>> +	return container->iommu_data;
>>>>>>>>>>>>>>> +}
>>>>>>>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_get_iommu_data_ext);
>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>>   * Sub-module support
>>>>>>>>>>>>>>>   */
>>>>>>>>>>>>>>>  /*
>>>>>>>>>>>>>>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
>>>>>>>>>>>>>>> index 3594ad3..fceea3d 100644
>>>>>>>>>>>>>>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
>>>>>>>>>>>>>>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
>>>>>>>>>>>>>>> @@ -1331,6 +1331,21 @@ const struct vfio_iommu_driver_ops tce_iommu_driver_ops = {
>>>>>>>>>>>>>>>  	.detach_group	= tce_iommu_detach_group,
>>>>>>>>>>>>>>>  };
>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>> +struct iommu_table *vfio_container_spapr_tce_table_get_ext(void *iommu_data,
>>>>>>>>>>>>>>> +		u64 offset)
>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>> +	struct tce_container *container = iommu_data;
>>>>>>>>>>>>>>> +	struct iommu_table *tbl = NULL;
>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>> +	if (tce_iommu_find_table(container, offset, &tbl) < 0)
>>>>>>>>>>>>>>> +		return NULL;
>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>> +	iommu_table_get(tbl);
>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>> +	return tbl;
>>>>>>>>>>>>>>> +}
>>>>>>>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_spapr_tce_table_get_ext);
>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>  static int __init tce_iommu_init(void)
>>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>>  	return vfio_register_iommu_driver(&tce_iommu_driver_ops);
>>>>>>>>>>>>>>> @@ -1348,4 +1363,3 @@ MODULE_VERSION(DRIVER_VERSION);
>>>>>>>>>>>>>>>  MODULE_LICENSE("GPL v2");
>>>>>>>>>>>>>>>  MODULE_AUTHOR(DRIVER_AUTHOR);
>>>>>>>>>>>>>>>  MODULE_DESCRIPTION(DRIVER_DESC);
>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
>>>>>>>>>>>>>>> index 0ecae0b..1c2138a 100644
>>>>>>>>>>>>>>> --- a/include/linux/vfio.h
>>>>>>>>>>>>>>> +++ b/include/linux/vfio.h
>>>>>>>>>>>>>>> @@ -91,6 +91,12 @@ extern void vfio_group_put_external_user(struct vfio_group *group);
>>>>>>>>>>>>>>>  extern int vfio_external_user_iommu_id(struct vfio_group *group);
>>>>>>>>>>>>>>>  extern long vfio_external_check_extension(struct vfio_group *group,
>>>>>>>>>>>>>>>  					  unsigned long arg);
>>>>>>>>>>>>>>> +extern struct vfio_container *vfio_container_get_ext(struct file *filep);
>>>>>>>>>>>>>>> +extern void vfio_container_put_ext(struct vfio_container *container);
>>>>>>>>>>>>>>> +extern void *vfio_container_get_iommu_data_ext(
>>>>>>>>>>>>>>> +		struct vfio_container *container);
>>>>>>>>>>>>>>> +extern struct iommu_table *vfio_container_spapr_tce_table_get_ext(
>>>>>>>>>>>>>>> +		void *iommu_data, u64 offset);
>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>  /*
>>>>>>>>>>>>>>>   * Sub-module helpers      
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think you need to take a closer look of the lifecycle of a container,
>>>>>>>>>>>>>> having a reference means the container itself won't go away, but only
>>>>>>>>>>>>>> having a group set within that container holds the actual IOMMU
>>>>>>>>>>>>>> references.  container->iommu_data is going to be NULL once the
>>>>>>>>>>>>>> groups are lost.  Thanks,      
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Container owns the iommu tables and this is what I care about here, groups
>>>>>>>>>>>>> attached or not - this is handled separately via IOMMU group list in a
>>>>>>>>>>>>> specific iommu_table struct, these groups get detached from iommu_table
>>>>>>>>>>>>> when they are removed from a container.    
>>>>>>>>>>>>
>>>>>>>>>>>> The container doesn't own anything, the container is privileged by the
>>>>>>>>>>>> groups being attached to it.  When groups are closed, they detach from
>>>>>>>>>>>> the container and once the container group list is empty the iommu
>>>>>>>>>>>> backend is released and iommu_data is NULL.  A container reference
>>>>>>>>>>>> doesn't give you what you're looking for.  It implies nothing about the
>>>>>>>>>>>> iommu backend.    
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Well. Backend is a part of a container and since a backend owns tables, a
>>>>>>>>>>> container owns them too.  
>>>>>>>>>>
>>>>>>>>>> The IOMMU backend is accessed through the container, but that backend
>>>>>>>>>> is privileged by the groups it contains.  Once those groups are gone,
>>>>>>>>>> the IOMMU backend is released, regardless of whatever reference you
>>>>>>>>>> have to the container itself such as you're attempting to do here.  In
>>>>>>>>>> that sense, the container does not own those tables.  
>>>>>>>>>
>>>>>>>>> So, the thing is that what KVM fundamentally needs is a handle on the
>>>>>>>>> container.  KVM is essentially modelling the DMA address space of a
>>>>>>>>> single guest bus, and the container is what's attached to that.
>>>>>>>>>
>>>>>>>>> The first part of the problem is that KVM wants to basically invoke
>>>>>>>>> vfio_dma_map() operations without bouncing via qemu.  Because
>>>>>>>>> vfio_dma_map() works on the container level, that's the handle that
>>>>>>>>> KVM needs to hold.
>>>>>>>>>
>>>>>>>>> The second part of the problem is that in order to reduce overhead
>>>>>>>>> further, we want to operate in real mode, which means bypassing most
>>>>>>>>> of the usual VFIO structure and going directly(ish) from the KVM
>>>>>>>>> hcall emulation to the IOMMU backend behind VFIO.  This complicates
>>>>>>>>> matters a fair bit.  Because it is, explicitly, a performance hack,
>>>>>>>>> some degree of ugliness is probably inevitable.
>>>>>>>>>
>>>>>>>>> Alexey - actually implementing this in two stages might make this
>>>>>>>>> clearer.  The first stage wouldn't allow real mode, and would call
>>>>>>>>> through the same vfio_dma_map() path as qemu calls through now.  The
>>>>>>>>> second stage would then put in place the necessary hacks to add real
>>>>>>>>> mode support.
>>>>>>>>>
>>>>>>>>>>> The problem I am trying to solve here is when KVM may release the
>>>>>>>>>>> iommu_table objects.
>>>>>>>>>>>
>>>>>>>>>>> "Set" ioctl() to KVM-spapr-tce-table (or KVM itself, does not really
>>>>>>>>>>> matter) makes a link between KVM-spapr-tce-table and container and KVM can
>>>>>>>>>>> start using tables (with referencing them).
>>>>>>>>>>>
>>>>>>>>>>> First I tried adding an "unset" ioctl to KVM-spapr-tce-table, called it
>>>>>>>>>>> from region_del() and this works if QEMU removes a window. However if QEMU
>>>>>>>>>>> removes a vfio-pci device, region_del() is not called and KVM does not get
>>>>>>>>>>> notified that it can release the iommu_table's because the
>>>>>>>>>>> KVM-spapr-tce-table remains alive and does not get destroyed (as it is
>>>>>>>>>>> still used by emulated devices or other containers).
>>>>>>>>>>>
>>>>>>>>>>> So it was suggested that we could do such "unset" somehow later assuming,
>>>>>>>>>>> for example, on every "set" I could check if some of currently attached
>>>>>>>>>>> containers are no more used - and this is where being able to know if there
>>>>>>>>>>> is no backend helps - KVM remembers a container pointer and can check this
>>>>>>>>>>> via vfio_container_get_iommu_data_ext().
>>>>>>>>>>>
>>>>>>>>>>> The other option would be changing vfio_container_get_ext() to take a
>>>>>>>>>>> callback+opaque which container would call when it destroys iommu_data.
>>>>>>>>>>> This looks more intrusive and not very intuitive how to make it right -
>>>>>>>>>>> container would have to keep track of all registered external users and
>>>>>>>>>>> vfio_container_put_ext() would have to pass the same callback+opaque to
>>>>>>>>>>> unregister the exact external user.  
>>>>>>>>>>
>>>>>>>>>> I'm not in favor of anything resembling the code above or extensions
>>>>>>>>>> beyond it, the container is the wrong place to do this.
>>>>>>>>>>   
>>>>>>>>>>> Or I could store container file* in KVM. Then iommu_data would never be
>>>>>>>>>>> released until KVM-spapr-tce-table is destroyed.  
>>>>>>>>>>
>>>>>>>>>> See above, holding a file pointer to the container doesn't do squat.
>>>>>>>>>> The groups that are held by the container empower the IOMMU backend,
>>>>>>>>>> references to the container itself don't matter.  Those references will
>>>>>>>>>> not maintain the IOMMU data.
>>>>>>>>>>    
>>>>>>>>>>> Recreating KVM-spapr-tce-table on every vfio-pci hotunplug (closing its fd
>>>>>>>>>>> would "unset" container from KVM-spapr-tce-table) is not an option as there
>>>>>>>>>>> still may be devices using this KVM-spapr-tce-table.
>>>>>>>>>>>
>>>>>>>>>>> What obvious and nice solution am I missing here? Thanks.  
>>>>>>>>>>
>>>>>>>>>> The interactions with the IOMMU backend that seem relevant are
>>>>>>>>>> vfio_iommu_drivers_ops.{detach_group,release}.  The kvm-vfio pseudo
>>>>>>>>>> device is also used to tell kvm about groups as they come and go and
>>>>>>>>>> has a way to check extensions, and thus properties of the IOMMU
>>>>>>>>>> backend.  All of these are available for your {ab}use.  Thanks,  
>>>>>>>>>
>>>>>>>>> So, Alexey started trying to do this via the KVM-VFIO device, but it's
>>>>>>>>> a really bad fit.  As noted above, fundamentally it's a container we
>>>>>>>>> need to attach to the kvm-spapr-tce-table object, since what that
>>>>>>>>> represents is a guest bus DMA address space, and by definition all the
>>>>>>>>> groups in a container must have the same DMA address space.
>>>>>>>>
>>>>>>>> That's all fine and good, but the point remains that a reference to the
>>>>>>>> container is no assurance of the iommu state.  The iommu state is
>>>>>>>> maintained by the user and the groups attached to the container.  If
>>>>>>>> the groups are removed, your container reference no longer has any iommu
>>>>>>>> backing and iommu_data is worthless.  The user can do this as well by
>>>>>>>> un-setting the iommu.  I understand what you're trying to do, it's just
>>>>>>>> wrong.  Thanks,
>>>>>>>
>>>>>>> I'm trying to figure out how to do this right, and it's not at all
>>>>>>> obvious.  The container may be wrong, but that doesn't make the
>>>>>>> KVM-VFIO device any more useful.  Attempting to do this at the group
>>>>>>> level is at least as wrong for the reasons I've mentioned elsewhere.
>>>>>>>
>>>>>>
>>>>>> I could create a new fd, one per iommu_table, the fd would reference the
>>>>>> iommu_table (not touching an iommu_table_group or a container), VFIO SPAPR
>>>>>> TCE backend would return it in VFIO_IOMMU_SPAPR_TCE_CREATE (ioctl which
>>>>>> creates windows) or I could add VFIO_IOMMU_SPAPR_TCE_GET_FD_BY_OFFSET; then
>>>>>> I'd pass this new fd to the KVM or KVM-spapr-tce-table to hook them up. To
>>>>>> release the reference, KVM-spapr-tce-table would have "unset" ioctl()
>>>>>> or/and on every "set" I would look if all attached tables have at least one
>>>>>> iommu_table_group attached, if none - release the table.
>>>>>>
>>>>>> This would make no change to generic VFIO code and very little change in
>>>>>> SPAPR TCE backend. Would that be acceptable or is it horrible again? Thanks.
>>>>>
>>>>>
>>>>> Ping?
>>>>
>>>> I'm still in Toronto after KVM Forum.  I had a detailed discussion
>>>> about this with Alex W, which I'll write up once I get back.
>>>>
>>>> The short version is that Alex more-or-less convinced me that we do
>>>> need to go back to doing this with an interface based on linking
>>>> groups to LIOBNs.  That leads to an interface that's kind of weird and
>>>> has some fairly counter-intuitive properties, but in the end it works
>>>> out better than doing it with containers.
>>>>
>>>
>>>
>>>
>>>
>>> Soooo? :)
>>
>>
>> When can I expect a full version of how to do this in-kernel thingy?
>> Thanks.
> 
> When I can dig myself out from under other things in my queue.  Which
> turns out to be now.
> 
> Ok.. here's hoping I can remember enough of the conclusions I came to
> with Alex W.
> 
> User <-> KVM interface
> ----------------------
> 
> This needs to take an LIOBN and a group fd and associate (or
> disassociate) them.  This should be possible to do by adding each
> group to the vfio-kvm device as on x86, then setting an attribute on
> the device to mark the associated liobn.
> 
> Attaching different (overlapping) LIOBNs to different groups in the
> same container is boundedly undefined (i.e. it mustn't break the host,
> but can do anything to the guest).
> 
> KVM <-> VFIO (in kernel) interface
> ----------------------------------
> 
> You'll need a special function which takes a vfio group fd and returns
> a reference to an iommu_table object.  It would also return an error
> if the group isn't backed by the spapr_tce iommu driver (including any
> calls to it on a non-ppc host).  This should probably also increment
> the iommu table's ref count (on success).
> 
> Implementation notes
> --------------------
> 
> When a device in a new group is hotplugged, qemu would need to add the
> group to the container *then* tell KVM to attach the group to the
> correct liobn(s).
> 
> KVM would add the group to a list for that liobn.  It would call the
> vfio hook to get the associated iommu table.  If there's an error,
> then it's unable to enable acceleration, and would either return an
> error immediately or ensure that later attempts to PUT_TCE will be
> punted to qemu.
> 
> Assuming it is able to accelerate, it would add the iommu table to a
> list of iommu tables associated with the liobn.  It will need to
> de-dupe here, since with multiple groups per container you'd expect
> multiple groups with the same iommu table.
> 
> H_PUT_TCE would walk the list of attached iommu tables and update them
> using the ppc kernel iommu interfaces.
> 
> When a group is removed from a liobn, kvm would need to recalculate
> the list of iommu tables, in case that was the last group attached to
> the table.  It would need to decrement the refcount on the iommu table
> and, obviously, make sure everything is synchronized with the real mode
> PUT_TCE.
> 
> When a group is hot unplugged, it's qemu's responsibility to tell kvm
> that the group is no longer associated with the liobn, before it
> removes the group from the container. 


Can't the VFIO KVM device just release this extra reference when QEMU calls
KVM_DEV_VFIO_GROUP_DEL from vfio_instance_finalize->vfio_put_group?



> If it doesn't there may be a
> stale iommu table attached to the liobn.  That could certainly mess up
> DMA on the guest for other devices, but shouldn't damage the host -
> the group now belongs to the host again, but because the group was
> detached from the container, the HW is no longer using the container's
> iommu table (which KVM is touching) to actually serve the group.
> 
> If all the groups are unplugged, so the container becomes quiescent,
> KVM's refcount(s) on the iommu table stop it going away.  It won't be
> looked at by the hardware any more, so updates will be useless, but
> again that's only a problem for the guest, not the host.
> 
> 
> Hope that covers it.
> 
> Alex, please let me know if I missed something from our discussion.
> 


-- 
Alexey


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH kernel 14/15] vfio/spapr_tce: Export container API for external users
  2016-10-17  6:06                               ` Alexey Kardashevskiy
@ 2016-10-18  1:42                                 ` David Gibson
  0 siblings, 0 replies; 60+ messages in thread
From: David Gibson @ 2016-10-18  1:42 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, linuxppc-dev, Paul Mackerras

[-- Attachment #1: Type: text/plain, Size: 19696 bytes --]

On Mon, Oct 17, 2016 at 05:06:28PM +1100, Alexey Kardashevskiy wrote:
> So far I got one question, below.
> 
> 
> On 23/09/16 17:12, David Gibson wrote:
> > On Wed, Sep 21, 2016 at 04:56:52PM +1000, Alexey Kardashevskiy wrote:
> >> On 07/09/16 19:09, Alexey Kardashevskiy wrote:
> >>> On 29/08/16 23:27, David Gibson wrote:
> >>>> On Mon, Aug 29, 2016 at 04:35:15PM +1000, Alexey Kardashevskiy wrote:
> >>>>> On 18/08/16 10:22, Alexey Kardashevskiy wrote:
> >>>>>> On 17/08/16 13:17, David Gibson wrote:
> >>>>>>> On Fri, Aug 12, 2016 at 09:22:01AM -0600, Alex Williamson wrote:
> >>>>>>>> On Fri, 12 Aug 2016 15:46:01 +1000
> >>>>>>>> David Gibson <david@gibson.dropbear.id.au> wrote:
> >>>>>>>>
> >>>>>>>>> On Wed, Aug 10, 2016 at 10:46:30AM -0600, Alex Williamson wrote:
> >>>>>>>>>> On Wed, 10 Aug 2016 15:37:17 +1000
> >>>>>>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >>>>>>>>>>   
> >>>>>>>>>>> On 09/08/16 22:16, Alex Williamson wrote:  
> >>>>>>>>>>>> On Tue, 9 Aug 2016 15:19:39 +1000
> >>>>>>>>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >>>>>>>>>>>>     
> >>>>>>>>>>>>> On 09/08/16 02:43, Alex Williamson wrote:    
> >>>>>>>>>>>>>> On Wed,  3 Aug 2016 18:40:55 +1000
> >>>>>>>>>>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >>>>>>>>>>>>>>       
> >>>>>>>>>>>>>>> This exports helpers which are needed to keep a VFIO container in
> >>>>>>>>>>>>>>> memory while there are external users such as KVM.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>>>>>>>>>>>>> ---
> >>>>>>>>>>>>>>>  drivers/vfio/vfio.c                 | 30 ++++++++++++++++++++++++++++++
> >>>>>>>>>>>>>>>  drivers/vfio/vfio_iommu_spapr_tce.c | 16 +++++++++++++++-
> >>>>>>>>>>>>>>>  include/linux/vfio.h                |  6 ++++++
> >>>>>>>>>>>>>>>  3 files changed, 51 insertions(+), 1 deletion(-)
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> >>>>>>>>>>>>>>> index d1d70e0..baf6a9c 100644
> >>>>>>>>>>>>>>> --- a/drivers/vfio/vfio.c
> >>>>>>>>>>>>>>> +++ b/drivers/vfio/vfio.c
> >>>>>>>>>>>>>>> @@ -1729,6 +1729,36 @@ long vfio_external_check_extension(struct vfio_group *group, unsigned long arg)
> >>>>>>>>>>>>>>>  EXPORT_SYMBOL_GPL(vfio_external_check_extension);
> >>>>>>>>>>>>>>>  
> >>>>>>>>>>>>>>>  /**
> >>>>>>>>>>>>>>> + * External user API for containers, exported by symbols to be linked
> >>>>>>>>>>>>>>> + * dynamically.
> >>>>>>>>>>>>>>> + *
> >>>>>>>>>>>>>>> + */
> >>>>>>>>>>>>>>> +struct vfio_container *vfio_container_get_ext(struct file *filep)
> >>>>>>>>>>>>>>> +{
> >>>>>>>>>>>>>>> +	struct vfio_container *container = filep->private_data;
> >>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>> +	if (filep->f_op != &vfio_fops)
> >>>>>>>>>>>>>>> +		return ERR_PTR(-EINVAL);
> >>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>> +	vfio_container_get(container);
> >>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>> +	return container;
> >>>>>>>>>>>>>>> +}
> >>>>>>>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_get_ext);
> >>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>> +void vfio_container_put_ext(struct vfio_container *container)
> >>>>>>>>>>>>>>> +{
> >>>>>>>>>>>>>>> +	vfio_container_put(container);
> >>>>>>>>>>>>>>> +}
> >>>>>>>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_put_ext);
> >>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>> +void *vfio_container_get_iommu_data_ext(struct vfio_container *container)
> >>>>>>>>>>>>>>> +{
> >>>>>>>>>>>>>>> +	return container->iommu_data;
> >>>>>>>>>>>>>>> +}
> >>>>>>>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_get_iommu_data_ext);
> >>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>> +/**
> >>>>>>>>>>>>>>>   * Sub-module support
> >>>>>>>>>>>>>>>   */
> >>>>>>>>>>>>>>>  /*
> >>>>>>>>>>>>>>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> >>>>>>>>>>>>>>> index 3594ad3..fceea3d 100644
> >>>>>>>>>>>>>>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> >>>>>>>>>>>>>>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> >>>>>>>>>>>>>>> @@ -1331,6 +1331,21 @@ const struct vfio_iommu_driver_ops tce_iommu_driver_ops = {
> >>>>>>>>>>>>>>>  	.detach_group	= tce_iommu_detach_group,
> >>>>>>>>>>>>>>>  };
> >>>>>>>>>>>>>>>  
> >>>>>>>>>>>>>>> +struct iommu_table *vfio_container_spapr_tce_table_get_ext(void *iommu_data,
> >>>>>>>>>>>>>>> +		u64 offset)
> >>>>>>>>>>>>>>> +{
> >>>>>>>>>>>>>>> +	struct tce_container *container = iommu_data;
> >>>>>>>>>>>>>>> +	struct iommu_table *tbl = NULL;
> >>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>> +	if (tce_iommu_find_table(container, offset, &tbl) < 0)
> >>>>>>>>>>>>>>> +		return NULL;
> >>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>> +	iommu_table_get(tbl);
> >>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>> +	return tbl;
> >>>>>>>>>>>>>>> +}
> >>>>>>>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_spapr_tce_table_get_ext);
> >>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>  static int __init tce_iommu_init(void)
> >>>>>>>>>>>>>>>  {
> >>>>>>>>>>>>>>>  	return vfio_register_iommu_driver(&tce_iommu_driver_ops);
> >>>>>>>>>>>>>>> @@ -1348,4 +1363,3 @@ MODULE_VERSION(DRIVER_VERSION);
> >>>>>>>>>>>>>>>  MODULE_LICENSE("GPL v2");
> >>>>>>>>>>>>>>>  MODULE_AUTHOR(DRIVER_AUTHOR);
> >>>>>>>>>>>>>>>  MODULE_DESCRIPTION(DRIVER_DESC);
> >>>>>>>>>>>>>>> -
> >>>>>>>>>>>>>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> >>>>>>>>>>>>>>> index 0ecae0b..1c2138a 100644
> >>>>>>>>>>>>>>> --- a/include/linux/vfio.h
> >>>>>>>>>>>>>>> +++ b/include/linux/vfio.h
> >>>>>>>>>>>>>>> @@ -91,6 +91,12 @@ extern void vfio_group_put_external_user(struct vfio_group *group);
> >>>>>>>>>>>>>>>  extern int vfio_external_user_iommu_id(struct vfio_group *group);
> >>>>>>>>>>>>>>>  extern long vfio_external_check_extension(struct vfio_group *group,
> >>>>>>>>>>>>>>>  					  unsigned long arg);
> >>>>>>>>>>>>>>> +extern struct vfio_container *vfio_container_get_ext(struct file *filep);
> >>>>>>>>>>>>>>> +extern void vfio_container_put_ext(struct vfio_container *container);
> >>>>>>>>>>>>>>> +extern void *vfio_container_get_iommu_data_ext(
> >>>>>>>>>>>>>>> +		struct vfio_container *container);
> >>>>>>>>>>>>>>> +extern struct iommu_table *vfio_container_spapr_tce_table_get_ext(
> >>>>>>>>>>>>>>> +		void *iommu_data, u64 offset);
> >>>>>>>>>>>>>>>  
> >>>>>>>>>>>>>>>  /*
> >>>>>>>>>>>>>>>   * Sub-module helpers      
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I think you need to take a closer look of the lifecycle of a container,
> >>>>>>>>>>>>>> having a reference means the container itself won't go away, but only
> >>>>>>>>>>>>>> having a group set within that container holds the actual IOMMU
> >>>>>>>>>>>>>> references.  container->iommu_data is going to be NULL once the
> >>>>>>>>>>>>>> groups are lost.  Thanks,      
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Container owns the iommu tables and this is what I care about here, groups
> >>>>>>>>>>>>> attached or not - this is handled separately via IOMMU group list in a
> >>>>>>>>>>>>> specific iommu_table struct, these groups get detached from iommu_table
> >>>>>>>>>>>>> when they are removed from a container.    
> >>>>>>>>>>>>
> >>>>>>>>>>>> The container doesn't own anything, the container is privileged by the
> >>>>>>>>>>>> groups being attached to it.  When groups are closed, they detach from
> >>>>>>>>>>>> the container and once the container group list is empty the iommu
> >>>>>>>>>>>> backend is released and iommu_data is NULL.  A container reference
> >>>>>>>>>>>> doesn't give you what you're looking for.  It implies nothing about the
> >>>>>>>>>>>> iommu backend.    
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Well. Backend is a part of a container and since a backend owns tables, a
> >>>>>>>>>>> container owns them too.  
> >>>>>>>>>>
> >>>>>>>>>> The IOMMU backend is accessed through the container, but that backend
> >>>>>>>>>> is privileged by the groups it contains.  Once those groups are gone,
> >>>>>>>>>> the IOMMU backend is released, regardless of whatever reference you
> >>>>>>>>>> have to the container itself such as you're attempting to do here.  In
> >>>>>>>>>> that sense, the container does not own those tables.  
> >>>>>>>>>
> >>>>>>>>> So, the thing is that what KVM fundamentally needs is a handle on the
> >>>>>>>>> container.  KVM is essentially modelling the DMA address space of a
> >>>>>>>>> single guest bus, and the container is what's attached to that.
> >>>>>>>>>
> >>>>>>>>> The first part of the problem is that KVM wants to basically invoke
> >>>>>>>>> vfio_dma_map() operations without bouncing via qemu.  Because
> >>>>>>>>> vfio_dma_map() works on the container level, that's the handle that
> >>>>>>>>> KVM needs to hold.
> >>>>>>>>>
> >>>>>>>>> The second part of the problem is that in order to reduce overhead
> >>>>>>>>> further, we want to operate in real mode, which means bypassing most
> >>>>>>>>> of the usual VFIO structure and going directly(ish) from the KVM
> >>>>>>>>> hcall emulation to the IOMMU backend behind VFIO.  This complicates
> >>>>>>>>> matters a fair bit.  Because it is, explicitly, a performance hack,
> >>>>>>>>> some degree of ugliness is probably inevitable.
> >>>>>>>>>
> >>>>>>>>> Alexey - actually implementing this in two stages might make this
> >>>>>>>>> clearer.  The first stage wouldn't allow real mode, and would call
> >>>>>>>>> through the same vfio_dma_map() path as qemu calls through now.  The
> >>>>>>>>> second stage would then put in place the necessary hacks to add real
> >>>>>>>>> mode support.
> >>>>>>>>>
> >>>>>>>>>>> The problem I am trying to solve here is when KVM may release the
> >>>>>>>>>>> iommu_table objects.
> >>>>>>>>>>>
> >>>>>>>>>>> "Set" ioctl() to KVM-spapr-tce-table (or KVM itself, does not really
> >>>>>>>>>>> matter) makes a link between KVM-spapr-tce-table and container and KVM can
> >>>>>>>>>>> start using tables (with referencing them).
> >>>>>>>>>>>
> >>>>>>>>>>> First I tried adding an "unset" ioctl to KVM-spapr-tce-table, called it
> >>>>>>>>>>> from region_del() and this works if QEMU removes a window. However if QEMU
> >>>>>>>>>>> removes a vfio-pci device, region_del() is not called and KVM does not get
> >>>>>>>>>>> notified that it can release the iommu_table's because the
> >>>>>>>>>>> KVM-spapr-tce-table remains alive and does not get destroyed (as it is
> >>>>>>>>>>> still used by emulated devices or other containers).
> >>>>>>>>>>>
> >>>>>>>>>>> So it was suggested that we could do such "unset" somehow later assuming,
> >>>>>>>>>>> for example, on every "set" I could check if some of currently attached
> >>>>>>>>>>> containers are no more used - and this is where being able to know if there
> >>>>>>>>>>> is no backend helps - KVM remembers a container pointer and can check this
> >>>>>>>>>>> via vfio_container_get_iommu_data_ext().
> >>>>>>>>>>>
> >>>>>>>>>>> The other option would be changing vfio_container_get_ext() to take a
> >>>>>>>>>>> callback+opaque which container would call when it destroys iommu_data.
> >>>>>>>>>>> This looks more intrusive and not very intuitive how to make it right -
> >>>>>>>>>>> container would have to keep track of all registered external users and
> >>>>>>>>>>> vfio_container_put_ext() would have to pass the same callback+opaque to
> >>>>>>>>>>> unregister the exact external user.  
> >>>>>>>>>>
> >>>>>>>>>> I'm not in favor of anything resembling the code above or extensions
> >>>>>>>>>> beyond it, the container is the wrong place to do this.
> >>>>>>>>>>   
> >>>>>>>>>>> Or I could store container file* in KVM. Then iommu_data would never be
> >>>>>>>>>>> released until KVM-spapr-tce-table is destroyed.  
> >>>>>>>>>>
> >>>>>>>>>> See above, holding a file pointer to the container doesn't do squat.
> >>>>>>>>>> The groups that are held by the container empower the IOMMU backend,
> >>>>>>>>>> references to the container itself don't matter.  Those references will
> >>>>>>>>>> not maintain the IOMMU data.
> >>>>>>>>>>    
> >>>>>>>>>>> Recreating KVM-spapr-tce-table on every vfio-pci hotunplug (closing its fd
> >>>>>>>>>>> would "unset" container from KVM-spapr-tce-table) is not an option as there
> >>>>>>>>>>> still may be devices using this KVM-spapr-tce-table.
> >>>>>>>>>>>
> >>>>>>>>>>> What obvious and nice solution am I missing here? Thanks.  
> >>>>>>>>>>
> >>>>>>>>>> The interactions with the IOMMU backend that seem relevant are
> >>>>>>>>>> vfio_iommu_drivers_ops.{detach_group,release}.  The kvm-vfio pseudo
> >>>>>>>>>> device is also used to tell kvm about groups as they come and go and
> >>>>>>>>>> has a way to check extensions, and thus properties of the IOMMU
> >>>>>>>>>> backend.  All of these are available for your {ab}use.  Thanks,  
> >>>>>>>>>
> >>>>>>>>> So, Alexey started trying to do this via the KVM-VFIO device, but it's
> >>>>>>>>> a really bad fit.  As noted above, fundamentally it's a container we
> >>>>>>>>> need to attach to the kvm-spapr-tce-table object, since what that
> >>>>>>>>> represents is a guest bus DMA address space, and by definition all the
> >>>>>>>>> groups in a container must have the same DMA address space.
> >>>>>>>>
> >>>>>>>> That's all fine and good, but the point remains that a reference to the
> >>>>>>>> container is no assurance of the iommu state.  The iommu state is
> >>>>>>>> maintained by the user and the groups attached to the container.  If
> >>>>>>>> the groups are removed, your container reference no long has any iommu
> >>>>>>>> backing and iommu_data is worthless.  The user can do this as well by
> >>>>>>>> un-setting the iommu.  I understand what you're trying to do, it's just
> >>>>>>>> wrong.  Thanks,
> >>>>>>>
> >>>>>>> I'm trying to figure out how to do this right, and it's not at all
> >>>>>>> obvious.  The container may be wrong, but that doesn't make the
> >>>>>>> KVM-VFIO device any more useful.  Attempting to do this at the group
> >>>>>>> level is at least as wrong for the reasons I've mentioned elsewhere.
> >>>>>>>
> >>>>>>
> >>>>>> I could create a new fd, one per iommu_table, the fd would reference the
> >>>>>> iommu_table (not touching an iommu_table_group or a container), VFIO SPAPR
> >>>>>> TCE backend would return it in VFIO_IOMMU_SPAPR_TCE_CREATE (ioctl which
> >>>>>> creates windows) or I could add VFIO_IOMMU_SPAPR_TCE_GET_FD_BY_OFFSET; then
> >>>>>> I'd pass this new fd to the KVM or KVM-spapr-tce-table to hook them up. To
> >>>>>> release the reference, KVM-spapr-tce-table would have "unset" ioctl()
> >>>>>> or/and on every "set" I would look if all attached tables have at least one
> >>>>>> iommu_table_group attached, if none - release the table.
> >>>>>>
> >>>>>> This would make no change to generic VFIO code and very little change in
> >>>>>> SPAPR TCE backend. Would that be acceptable or is it horrible again? Thanks.
> >>>>>
> >>>>>
> >>>>> Ping?
> >>>>
> >>>> I'm still in Toronto after KVM Forum.  I had a detailed discussion
> >>>> about this with Alex W, which I'll write up once I get back.
> >>>>
> >>>> The short version is that Alex more-or-less convinced me that we do
> >>>> need to go back to doing this with an interface based on linking
> >>>> groups to LIOBNs.  That leads to an interface that's kind of weird and
> >>>> has some fairly counter-intuitive properties, but in the end it works
> >>>> out better than doing it with containers.
> >>>>
> >>>
> >>>
> >>>
> >>>
> >>> Soooo? :)
> >>
> >>
> >> When can I expect a full version of how to do this in-kernel thingy?
> >> Thanks.
> > 
> > When I can dig myself out from under other things in my queue.  Which
> > turns out to be now.
> > 
> > Ok.. here's hoping I can remember enough of the conclusions I came to
> > with Alex W.
> > 
> > User <-> KVM interface
> > ----------------------
> > 
> > This needs to take an LIOBN and a group fd and associate (or
> > disassociate) them.  This should be possible to do by adding each
> > group to the vfio-kvm device as on x86, then setting an attribute on
> > the device to mark the associated liobn.
> > 
> > Attaching different (overlapping) LIOBNs to different groups in the
> > same container is boundedly undefined (i.e. it mustn't break the host,
> > but can do anything to the guest).
> > 
> > KVM <-> VFIO (in kernel) interface
> > ----------------------------------
> > 
> > You'll need a special function which takes a vfio group fd and returns
> > a reference to an iommu_table object.  It would also return an error
> > if the group isn't backed by the spapr_tce iommu driver (including any
> > calls to it on a non-ppc host).  This should probably also increment
> > the iommu table's ref count (on success).
> > 
> > Implementation notes
> > --------------------
> > 
> > When a device in a new group is hotplugged, qemu would need to add the
> > group to the container *then* tell KVM to attach the group to the
> > correct liobn(s).
> > 
> > KVM would add the group to a list for that liobn.  It would call the
> > vfio hook to get the associated iommu table.  If there's an error,
> > then it's unable to enable acceleration, and would either return an
> > error immediately or ensure that later attempts to PUT_TCE will be
> > punted to qemu.
> > 
> > Assuming it is able to accelerate, it would add the iommu table to a
> > list of iommu tables associated with the liobn.  It will need to
> > de-dupe here, since with multiple groups per container you'd expect
> > multiple groups with the same iommu table.
> > 
> > H_PUT_TCE would walk the list of attached iommu tables and update them
> > using the ppc kernel iommu interfaces.
> > 
> > When a group is removed from a liobn, kvm would need to recalculate
> > the list of iommu tables, in case that was the last group attached to
> > the table.  It would need to decrement the refcount on the iommu table
> > and, obviously, make sure everything is synchronized with the real mode
> > PUT_TCE.
> > 
> > When a group is hot unplugged, it's qemu's responsibility to tell kvm
> > that the group is no longer associated with the liobn, before it
> > removes the group from the container. 
> 
> 
> Can't the VFIO KVM device just release this extra reference when QEMU calls
> KVM_DEV_VFIO_GROUP_DEL from vfio_instance_finalize->vfio_put_group?

Yes, that should do it.  That doesn't contradict the statement above,
it's just that we already have it - removing the group from the
vfio-kvm device is the mechanism by which qemu informs the kernel the
group is no longer associated with a liobn.

> > If it doesn't there may be a
> > stale iommu table attached to the liobn.  That could certainly mess up
> > DMA on the guest for other devices, but shouldn't damage the host -
> > the group now belongs to the host again, but because the group was
> > detached from the container, the HW is no longer using the container's
> > iommu table (which KVM is touching) to actually serve the group.
> > 
> > If all the groups are unplugged, so the container becomes quiescent,
> > KVM's refcount(s) on the iommu table stop it going away.  It won't be
> > looked at by the hardware any more, so updates will be useless, but
> > again that's only a problem for the guest, not the host.
> > 
> > 
> > Hope that covers it.
> > 
> > Alex, please let me know if I missed something from our discussion.
> > 
> 
> 




-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 60+ messages in thread

end of thread, other threads:[~2016-10-18  1:46 UTC | newest]

Thread overview: 60+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-03  8:40 [PATCH kernel 00/15] powerpc/kvm/vfio: Enable in-kernel acceleration Alexey Kardashevskiy
2016-08-03  8:40 ` [PATCH kernel 01/15] Revert "iommu: Add a function to find an iommu group by id" Alexey Kardashevskiy
2016-08-15  4:58   ` Paul Mackerras
2016-08-03  8:40 ` [PATCH kernel 02/15] KVM: PPC: Finish enabling VFIO KVM device on POWER Alexey Kardashevskiy
2016-08-04  5:21   ` David Gibson
2016-08-03  8:40 ` [PATCH kernel 03/15] KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number Alexey Kardashevskiy
2016-08-03  8:40 ` [PATCH kernel 04/15] powerpc/powernv/ioda: Fix TCE invalidate to work in real mode again Alexey Kardashevskiy
2016-08-04  5:23   ` David Gibson
2016-08-09 11:26   ` [kernel, " Michael Ellerman
2016-08-03  8:40 ` [PATCH kernel 05/15] powerpc/iommu: Stop using @current in mm_iommu_xxx Alexey Kardashevskiy
2016-08-03 10:10   ` Nicholas Piggin
2016-08-05  7:00   ` Michael Ellerman
2016-08-09  5:29     ` Alexey Kardashevskiy
2016-08-09  4:43   ` Balbir Singh
2016-08-09  6:04     ` Nicholas Piggin
2016-08-09  6:17       ` Balbir Singh
2016-08-12  2:57   ` David Gibson
2016-08-12  4:56     ` Alexey Kardashevskiy
2016-08-15 10:58       ` David Gibson
2016-08-03  8:40 ` [PATCH kernel 06/15] powerpc/mm/iommu: Put pages on process exit Alexey Kardashevskiy
2016-08-03 10:11   ` Nicholas Piggin
2016-08-12  3:13   ` David Gibson
2016-08-03  8:40 ` [PATCH kernel 07/15] powerpc/iommu: Cleanup iommu_table disposal Alexey Kardashevskiy
2016-08-12  3:18   ` David Gibson
2016-08-03  8:40 ` [PATCH kernel 08/15] powerpc/vfio_spapr_tce: Add reference counting to iommu_table Alexey Kardashevskiy
2016-08-12  3:29   ` David Gibson
2016-08-03  8:40 ` [PATCH kernel 09/15] powerpc/mmu: Add real mode support for IOMMU preregistered memory Alexey Kardashevskiy
2016-08-12  4:02   ` David Gibson
2016-08-03  8:40 ` [PATCH kernel 10/15] KVM: PPC: Use preregistered memory API to access TCE list Alexey Kardashevskiy
2016-08-12  4:17   ` David Gibson
2016-08-03  8:40 ` [PATCH kernel 11/15] powerpc/powernv/iommu: Add real mode version of iommu_table_ops::exchange() Alexey Kardashevskiy
2016-08-12  4:29   ` David Gibson
2016-08-03  8:40 ` [PATCH kernel 12/15] KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently Alexey Kardashevskiy
2016-08-12  4:34   ` David Gibson
2016-08-03  8:40 ` [PATCH kernel 13/15] KVM: PPC: Pass kvm* to kvmppc_find_table() Alexey Kardashevskiy
2016-08-12  4:45   ` David Gibson
2016-08-03  8:40 ` [PATCH kernel 14/15] vfio/spapr_tce: Export container API for external users Alexey Kardashevskiy
2016-08-08 16:43   ` Alex Williamson
2016-08-09  5:19     ` Alexey Kardashevskiy
2016-08-09 12:16       ` Alex Williamson
2016-08-10  5:37         ` Alexey Kardashevskiy
2016-08-10 16:46           ` Alex Williamson
2016-08-12  5:46             ` David Gibson
2016-08-12  6:12               ` Alexey Kardashevskiy
2016-08-15 11:07                 ` David Gibson
2016-08-17  8:31                   ` Alexey Kardashevskiy
2016-08-12 15:22               ` Alex Williamson
2016-08-17  3:17                 ` David Gibson
2016-08-18  0:22                   ` Alexey Kardashevskiy
2016-08-29  6:35                     ` Alexey Kardashevskiy
2016-08-29 13:27                       ` David Gibson
2016-09-07  9:09                         ` Alexey Kardashevskiy
2016-09-21  6:56                           ` Alexey Kardashevskiy
2016-09-23  7:12                             ` David Gibson
2016-10-17  6:06                               ` Alexey Kardashevskiy
2016-10-18  1:42                                 ` David Gibson
2016-08-15  3:59         ` Paul Mackerras
2016-08-15 15:32           ` Alex Williamson
2016-08-12  5:25   ` David Gibson
2016-08-03  8:40 ` [PATCH kernel 15/15] KVM: PPC: Add in-kernel acceleration for VFIO Alexey Kardashevskiy
