* [PATCH 0/4 v3] KVM: PPC: IOMMU in-kernel handling
@ 2013-06-05  6:11 Alexey Kardashevskiy
  2013-06-05  6:11 ` [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls Alexey Kardashevskiy
From: Alexey Kardashevskiy @ 2013-06-05  6:11 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alexey Kardashevskiy, linuxppc-dev, David Gibson, Alexander Graf,
	Paul Mackerras, kvm, linux-kernel, kvm-ppc

Ben, ping! :)

This series contains small fixes (capability and ioctl numbers,
updated documentation, compile errors in some configurations).
More details are in the commit messages.
Rebased on v3.10-rc4.


Alexey Kardashevskiy (4):
  KVM: PPC: Add support for multiple-TCE hcalls
  powerpc: Prepare to support kernel handling of IOMMU map/unmap
  KVM: PPC: Add support for IOMMU in-kernel handling
  KVM: PPC: Add hugepage support for IOMMU in-kernel handling

 Documentation/virtual/kvm/api.txt        |   45 +++
 arch/powerpc/include/asm/kvm_host.h      |    7 +
 arch/powerpc/include/asm/kvm_ppc.h       |   40 ++-
 arch/powerpc/include/asm/pgtable-ppc64.h |    4 +
 arch/powerpc/include/uapi/asm/kvm.h      |    7 +
 arch/powerpc/kvm/book3s_64_vio.c         |  398 ++++++++++++++++++++++++-
 arch/powerpc/kvm/book3s_64_vio_hv.c      |  471 ++++++++++++++++++++++++++++--
 arch/powerpc/kvm/book3s_hv.c             |   39 +++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S  |    6 +
 arch/powerpc/kvm/book3s_pr_papr.c        |   37 ++-
 arch/powerpc/kvm/powerpc.c               |   15 +
 arch/powerpc/mm/init_64.c                |   77 ++++-
 include/uapi/linux/kvm.h                 |    3 +
 13 files changed, 1121 insertions(+), 28 deletions(-)

-- 
1.7.10.4



* [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls
  2013-06-05  6:11 [PATCH 0/4 v3] KVM: PPC: IOMMU in-kernel handling Alexey Kardashevskiy
@ 2013-06-05  6:11 ` Alexey Kardashevskiy
  2013-06-16  4:20   ` Benjamin Herrenschmidt
  2013-06-16 22:06   ` Alexander Graf
  2013-06-05  6:11 ` [PATCH 2/4] powerpc: Prepare to support kernel handling of IOMMU map/unmap Alexey Kardashevskiy
From: Alexey Kardashevskiy @ 2013-06-05  6:11 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alexey Kardashevskiy, linuxppc-dev, David Gibson, Alexander Graf,
	Paul Mackerras, kvm, linux-kernel, kvm-ppc

This adds real mode handlers for the H_PUT_TCE_INDIRECT and
H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO
devices or emulated PCI.  These calls allow adding multiple entries
(up to 512) into the TCE table in one call which saves time on
transition to/from real mode.
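
For illustration, a guest-side user of these hcalls would look roughly
like the sketch below (not part of this patch; it assumes the Linux
guest's plpar_hcall_norets() wrapper and the PAPR convention of one
64-bit TCE per entry in a 4K list page):

/* Illustrative sketch only, error handling omitted */
static long guest_map_batch(unsigned long liobn, unsigned long ioba,
		u64 *tce_list, unsigned long npages)
{
	/* tce_list holds up to 512 TCEs: guest physical addresses
	 * ORed with the TCE_PCI_READ/TCE_PCI_WRITE permission bits */
	return plpar_hcall_norets(H_PUT_TCE_INDIRECT, liobn, ioba,
			__pa(tce_list), npages);
}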

This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
(copied from user and verified) before writing the whole list into
the TCE table. This cache will be utilized more in the upcoming
VFIO/IOMMU support to continue TCE list processing in virtual
mode in case the real mode handler fails for some reason.

This adds a guest physical to host real address converter
and calls the existing H_PUT_TCE handler. The converting function
is going to be fully utilized by upcoming VFIO supporting patches.

This also implements the KVM_CAP_SPAPR_MULTITCE capability.
In order to benefit from this patch, QEMU needs to query for this
capability and set the "hcall-multi-tce" hypertas property only
if the capability is present; otherwise there will be serious
performance degradation.
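
The expected userspace check is straightforward; a minimal sketch,
assuming a raw /dev/kvm file descriptor (QEMU would use its own KVM
helpers instead):

/* Illustrative sketch only */
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int multitce_supported(int kvm_fd)
{
	/* > 0 means H_PUT_TCE_INDIRECT/H_STUFF_TCE are handled by the
	 * kernel, so "hcall-multi-tce" may be advertised to the guest
	 * via the ibm,hypertas-functions property */
	return ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_SPAPR_MULTITCE) > 0;
}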

Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>

---
Changelog:
2013/06/05:
* fixed a typo about IBMVIO in the commit message
* updated doc and moved it to another section
* changed capability number

2013/05/21:
* added kvm_vcpu_arch::tce_tmp
* removed cleanup if put_indirect failed; instead we do not even start
writing to the TCE table if we cannot get TCEs from the user or they are
invalid
* kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
and kvmppc_emulated_validate_tce (for the previous item)
* fixed bug with fallthrough for H_IPI
* removed all get_user() from real mode handlers
* kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
---
 Documentation/virtual/kvm/api.txt       |   17 ++
 arch/powerpc/include/asm/kvm_host.h     |    2 +
 arch/powerpc/include/asm/kvm_ppc.h      |   16 +-
 arch/powerpc/kvm/book3s_64_vio.c        |  118 ++++++++++++++
 arch/powerpc/kvm/book3s_64_vio_hv.c     |  266 +++++++++++++++++++++++++++----
 arch/powerpc/kvm/book3s_hv.c            |   39 +++++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |    6 +
 arch/powerpc/kvm/book3s_pr_papr.c       |   37 ++++-
 arch/powerpc/kvm/powerpc.c              |    3 +
 include/uapi/linux/kvm.h                |    1 +
 10 files changed, 473 insertions(+), 32 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 5f91eda..6c082ff 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2362,6 +2362,23 @@ calls by the guest for that service will be passed to userspace to be
 handled.
 
 
+4.83 KVM_CAP_SPAPR_MULTITCE
+
+Capability: KVM_CAP_SPAPR_MULTITCE
+Architectures: ppc
+Type: vm
+
+This capability means the kernel supports handling of hypercalls which add
+or remove multiple TCE entries at once. This significantly accelerates DMA
+operations for PPC KVM guests.
+
+Unlike other capabilities in this section, this one does not have an ioctl.
+Instead, when the capability is present, the H_PUT_TCE_INDIRECT and
+H_STUFF_TCE hypercalls are handled in the host kernel and not passed to
+userspace. Otherwise it might be better for the guest to continue using the
+H_PUT_TCE hypercall (if KVM_CAP_SPAPR_TCE or KVM_CAP_SPAPR_TCE_IOMMU is present).
+
+
 5. The kvm_run structure
 ------------------------
 
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index af326cd..85d8f26 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -609,6 +609,8 @@ struct kvm_vcpu_arch {
 	spinlock_t tbacct_lock;
 	u64 busy_stolen;
 	u64 busy_preempt;
+
+	unsigned long *tce_tmp;    /* TCE cache for H_PUT_TCE_INDIRECT hcall */
 #endif
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index a5287fe..e852921b 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -133,8 +133,20 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
 
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce *args);
-extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
-			     unsigned long ioba, unsigned long tce);
+extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
+		struct kvm_vcpu *vcpu, unsigned long liobn);
+extern long kvmppc_emulated_validate_tce(unsigned long tce);
+extern void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
+		unsigned long ioba, unsigned long tce);
+extern long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce);
+extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_list, unsigned long npages);
+extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages);
 extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
 				struct kvm_allocate_rma *rma);
 extern struct kvmppc_linear_info *kvm_alloc_rma(void);
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index b2d3f3b..06b7b20 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -14,6 +14,7 @@
  *
  * Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
  * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com>
+ * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com>
  */
 
 #include <linux/types.h>
@@ -36,8 +37,11 @@
 #include <asm/ppc-opcode.h>
 #include <asm/kvm_host.h>
 #include <asm/udbg.h>
+#include <asm/iommu.h>
+#include <asm/tce.h>
 
 #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
+#define ERROR_ADDR      ((void *)~(unsigned long)0x0)
 
 static long kvmppc_stt_npages(unsigned long window_size)
 {
@@ -148,3 +152,117 @@ fail:
 	}
 	return ret;
 }
+
+/* Converts guest physical address into host virtual */
+static void __user *kvmppc_virtmode_gpa_to_hva(struct kvm_vcpu *vcpu,
+		unsigned long gpa)
+{
+	unsigned long hva, gfn = gpa >> PAGE_SHIFT;
+	struct kvm_memory_slot *memslot;
+
+	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
+	if (!memslot)
+		return ERROR_ADDR;
+
+	hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa & ~PAGE_MASK);
+	return (void *) hva;
+}
+
+long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce)
+{
+	long ret;
+	struct kvmppc_spapr_tce_table *tt;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, punt it to userspace */
+	if (!tt)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	if (ioba >= tt->window_size)
+		return H_PARAMETER;
+
+	ret = kvmppc_emulated_validate_tce(tce);
+	if (ret)
+		return ret;
+
+	kvmppc_emulated_put_tce(tt, ioba, tce);
+
+	return H_SUCCESS;
+}
+
+long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_list, unsigned long npages)
+{
+	struct kvmppc_spapr_tce_table *tt;
+	long i, ret;
+	unsigned long __user *tces;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, punt it to userspace */
+	if (!tt)
+		return H_TOO_HARD;
+
+	/*
+	 * The spec says that the maximum size of the list is 512 TCEs,
+	 * so the whole list fits in a 4K page
+	 */
+	if (npages > 512)
+		return H_PARAMETER;
+
+	if (tce_list & ~IOMMU_PAGE_MASK)
+		return H_PARAMETER;
+
+	tces = kvmppc_virtmode_gpa_to_hva(vcpu, tce_list);
+	if (tces == ERROR_ADDR)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i) {
+		if (get_user(vcpu->arch.tce_tmp[i], tces + i))
+			return H_PARAMETER;
+
+		ret = kvmppc_emulated_validate_tce(vcpu->arch.tce_tmp[i]);
+		if (ret)
+			return ret;
+	}
+
+	for (i = 0; i < npages; ++i)
+		kvmppc_emulated_put_tce(tt,
+				ioba + (i << IOMMU_PAGE_SHIFT),
+				vcpu->arch.tce_tmp[i]);
+
+	return H_SUCCESS;
+}
+
+long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	struct kvmppc_spapr_tce_table *tt;
+	long i, ret;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, punt it to userspace */
+	if (!tt)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+		return H_PARAMETER;
+
+	ret = kvmppc_emulated_validate_tce(tce_value);
+	if (ret || (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ)))
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
+		kvmppc_emulated_put_tce(tt, ioba, tce_value);
+
+	return H_SUCCESS;
+}
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 30c2f3b..c68d538 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -14,6 +14,7 @@
  *
  * Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
  * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com>
+ * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com>
  */
 
 #include <linux/types.h>
@@ -35,42 +36,249 @@
 #include <asm/ppc-opcode.h>
 #include <asm/kvm_host.h>
 #include <asm/udbg.h>
+#include <asm/iommu.h>
+#include <asm/tce.h>
 
 #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
+#define ERROR_ADDR      (~(unsigned long)0x0)
 
-/* WARNING: This will be called in real-mode on HV KVM and virtual
- *          mode on PR KVM
+/* Finds a TCE table descriptor by LIOBN */
+struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm_vcpu *vcpu,
+		unsigned long liobn)
+{
+	struct kvmppc_spapr_tce_table *tt;
+
+	list_for_each_entry(tt, &vcpu->kvm->arch.spapr_tce_tables, list) {
+		if (tt->liobn == liobn)
+			return tt;
+	}
+
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(kvmppc_find_tce_table);
+
+/*
+ * Validate TCE address.
+ * At the moment only flags are validated,
+ * as other checks would significantly slow down
+ * or even make it impossible to handle TCE requests
+ * in real mode.
+ */
+long kvmppc_emulated_validate_tce(unsigned long tce)
+{
+	if (tce & ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ))
+		return H_PARAMETER;
+
+	return H_SUCCESS;
+}
+EXPORT_SYMBOL_GPL(kvmppc_emulated_validate_tce);
+
+/*
+ * kvmppc_emulated_put_tce() handles TCE requests for devices emulated
+ * by QEMU. It puts guest TCE values into the table and expects
+ * QEMU to convert them later in the QEMU device implementation.
+ * Works in both real and virtual modes.
+ * It cannot fail, so kvmppc_emulated_validate_tce must be called before it.
  */
+void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
+		unsigned long ioba, unsigned long tce)
+{
+	unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
+	struct page *page;
+	u64 *tbl;
+
+	/*
+	 * Note on the use of page_address() in real mode,
+	 *
+	 * It is safe to use page_address() in real mode on ppc64 because
+	 * page_address() is always defined as lowmem_page_address()
+	 * which returns __va(PFN_PHYS(page_to_pfn(page))), which is an
+	 * arithmetic operation and does not access the page struct.
+	 *
+	 * Theoretically page_address() could be defined differently,
+	 * but then either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL
+	 * would have to be enabled.
+	 * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64,
+	 * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only
+	 * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP
+	 * is not expected to be enabled on ppc32, page_address()
+	 * is safe for ppc32 as well.
+	 */
+#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
+#error TODO: fix to avoid page_address() here
+#endif
+	page = tt->pages[idx / TCES_PER_PAGE];
+	tbl = (u64 *)page_address(page);
+
+	/* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
+	tbl[idx % TCES_PER_PAGE] = tce;
+}
+EXPORT_SYMBOL_GPL(kvmppc_emulated_put_tce);
+
+#ifdef CONFIG_KVM_BOOK3S_64_HV
+
+static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
+			unsigned long *pte_sizep)
+{
+	pte_t *ptep;
+	unsigned int shift = 0;
+	pte_t pte, tmp, ret;
+
+	ptep = find_linux_pte_or_hugepte(pgdir, hva, &shift);
+	if (!ptep)
+		return __pte(0);
+	if (shift)
+		*pte_sizep = 1ul << shift;
+	else
+		*pte_sizep = PAGE_SIZE;
+
+	if (!pte_present(*ptep))
+		return __pte(0);
+
+	/* wait until _PAGE_BUSY is clear then set it atomically */
+	__asm__ __volatile__ (
+		"1:	ldarx	%0,0,%3\n"
+		"	andi.	%1,%0,%4\n"
+		"	bne-	1b\n"
+		"	ori	%1,%0,%4\n"
+		"	stdcx.	%1,0,%3\n"
+		"	bne-	1b"
+		: "=&r" (pte), "=&r" (tmp), "=m" (*ptep)
+		: "r" (ptep), "i" (_PAGE_BUSY)
+		: "cc");
+
+	ret = pte;
+
+	return ret;
+}
+
+/*
+ * Converts guest physical address into host physical address.
+ * Also looks up the pte and page size if the page is present in the page table.
+ */
+static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
+		unsigned long gpa)
+{
+	struct kvm_memory_slot *memslot;
+	pte_t pte;
+	unsigned long hva, hpa, pg_size = 0, offset;
+	unsigned long gfn = gpa >> PAGE_SHIFT;
+	bool writing = gpa & TCE_PCI_WRITE;
+
+	/* Find a KVM memslot */
+	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
+	if (!memslot)
+		return ERROR_ADDR;
+
+	/* Convert guest physical address to host virtual */
+	hva = __gfn_to_hva_memslot(memslot, gfn);
+
+	/* Find a PTE and determine the size */
+	pte = kvmppc_lookup_pte(vcpu->arch.pgdir, hva,
+			writing, &pg_size);
+	if (!pte)
+		return ERROR_ADDR;
+
+	/* Calculate host phys address keeping flags and offset in the page */
+	offset = gpa & (pg_size - 1);
+
+	/* pte_pfn(pte) should return an address aligned to pg_size */
+	hpa = (pte_pfn(pte) << PAGE_SHIFT) + offset;
+
+	return hpa;
+}
+
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba, unsigned long tce)
 {
-	struct kvm *kvm = vcpu->kvm;
-	struct kvmppc_spapr_tce_table *stt;
-
-	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
-	/* 	    liobn, ioba, tce); */
-
-	list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
-		if (stt->liobn == liobn) {
-			unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
-			struct page *page;
-			u64 *tbl;
-
-			/* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p  window_size=0x%x\n", */
-			/* 	    liobn, stt, stt->window_size); */
-			if (ioba >= stt->window_size)
-				return H_PARAMETER;
-
-			page = stt->pages[idx / TCES_PER_PAGE];
-			tbl = (u64 *)page_address(page);
-
-			/* FIXME: Need to validate the TCE itself */
-			/* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
-			tbl[idx % TCES_PER_PAGE] = tce;
-			return H_SUCCESS;
-		}
+	long ret;
+	struct kvmppc_spapr_tce_table *tt;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, punt it to virtual mode */
+	if (!tt)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	if (ioba >= tt->window_size)
+		return H_PARAMETER;
+
+	ret = kvmppc_emulated_validate_tce(tce);
+	if (ret)
+		return ret;
+
+	kvmppc_emulated_put_tce(tt, ioba, tce);
+
+	return H_SUCCESS;
+}
+
+long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_list,	unsigned long npages)
+{
+	struct kvmppc_spapr_tce_table *tt;
+	long i, ret;
+	unsigned long *tces;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, punt it to virtual mode */
+	if (!tt)
+		return H_TOO_HARD;
+
+	/*
+	 * The spec says that the maximum size of the list is 512 TCEs,
+	 * so the whole list fits in a 4K page
+	 */
+	if (npages > 512)
+		return H_PARAMETER;
+
+	if (tce_list & ~IOMMU_PAGE_MASK)
+		return H_PARAMETER;
+
+	tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu, tce_list);
+	if ((unsigned long)tces == ERROR_ADDR)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i) {
+		ret = kvmppc_emulated_validate_tce(tces[i]);
+		if (ret)
+			return ret;
 	}
 
-	/* Didn't find the liobn, punt it to userspace */
-	return H_TOO_HARD;
+	for (i = 0; i < npages; ++i)
+		kvmppc_emulated_put_tce(tt, ioba + (i << IOMMU_PAGE_SHIFT),
+				tces[i]);
+
+	return H_SUCCESS;
+}
+
+long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	struct kvmppc_spapr_tce_table *tt;
+	long i, ret;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, punt it to virtual mode */
+	if (!tt)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+		return H_PARAMETER;
+
+	ret = kvmppc_emulated_validate_tce(tce_value);
+	if (ret || (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ)))
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
+		kvmppc_emulated_put_tce(tt, ioba, tce_value);
+
+	return H_SUCCESS;
 }
+#endif /* CONFIG_KVM_BOOK3S_64_HV */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 550f592..a39039a 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -568,6 +568,30 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
 			ret = kvmppc_xics_hcall(vcpu, req);
 			break;
 		} /* fallthrough */
+		return RESUME_HOST;
+	case H_PUT_TCE:
+		ret = kvmppc_virtmode_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
+						kvmppc_get_gpr(vcpu, 5),
+						kvmppc_get_gpr(vcpu, 6));
+		if (ret == H_TOO_HARD)
+			return RESUME_HOST;
+		break;
+	case H_PUT_TCE_INDIRECT:
+		ret = kvmppc_virtmode_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4),
+						kvmppc_get_gpr(vcpu, 5),
+						kvmppc_get_gpr(vcpu, 6),
+						kvmppc_get_gpr(vcpu, 7));
+		if (ret == H_TOO_HARD)
+			return RESUME_HOST;
+		break;
+	case H_STUFF_TCE:
+		ret = kvmppc_virtmode_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
+						kvmppc_get_gpr(vcpu, 5),
+						kvmppc_get_gpr(vcpu, 6),
+						kvmppc_get_gpr(vcpu, 7));
+		if (ret == H_TOO_HARD)
+			return RESUME_HOST;
+		break;
 	default:
 		return RESUME_HOST;
 	}
@@ -958,6 +982,20 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
 	vcpu->arch.cpu_type = KVM_CPU_3S_64;
 	kvmppc_sanity_check(vcpu);
 
+	/*
+	 * As we want to minimize the chance of having H_PUT_TCE_INDIRECT
+	 * half executed, we first read TCEs from the user, check them and
+	 * return error if something went wrong and only then put TCEs into
+	 * the TCE table.
+	 *
+	 * tce_tmp is a cache for TCEs to avoid stack allocation or
+	 * per-call kmalloc as the whole TCE list can take up to 512 items
+	 * of 8 bytes each (4096 bytes).
+	 */
+	vcpu->arch.tce_tmp = kmalloc(4096, GFP_KERNEL);
+	if (!vcpu->arch.tce_tmp)
+		goto free_vcpu;
+
 	return vcpu;
 
 free_vcpu:
@@ -980,6 +1018,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
 	unpin_vpa(vcpu->kvm, &vcpu->arch.slb_shadow);
 	unpin_vpa(vcpu->kvm, &vcpu->arch.vpa);
 	spin_unlock(&vcpu->arch.vpa_update_lock);
+	kfree(vcpu->arch.tce_tmp);
 	kvm_vcpu_uninit(vcpu);
 	kmem_cache_free(kvm_vcpu_cache, vcpu);
 }
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index b02f91e..d35554e 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1490,6 +1490,12 @@ hcall_real_table:
 	.long	0		/* 0x11c */
 	.long	0		/* 0x120 */
 	.long	.kvmppc_h_bulk_remove - hcall_real_table
+	.long	0		/* 0x128 */
+	.long	0		/* 0x12c */
+	.long	0		/* 0x130 */
+	.long	0		/* 0x134 */
+	.long	.kvmppc_h_stuff_tce - hcall_real_table
+	.long	.kvmppc_h_put_tce_indirect - hcall_real_table
 hcall_real_table_end:
 
 ignore_hdec:
diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c
index da0e0bc..91d4b45 100644
--- a/arch/powerpc/kvm/book3s_pr_papr.c
+++ b/arch/powerpc/kvm/book3s_pr_papr.c
@@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu)
 	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
 	long rc;
 
-	rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
+	rc = kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce);
+	if (rc == H_TOO_HARD)
+		return EMULATE_FAIL;
+	kvmppc_set_gpr(vcpu, 3, rc);
+	return EMULATE_DONE;
+}
+
+static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu)
+{
+	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
+	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
+	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
+	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
+	long rc;
+
+	rc = kvmppc_virtmode_h_put_tce_indirect(vcpu, liobn, ioba,
+			tce, npages);
+	if (rc == H_TOO_HARD)
+		return EMULATE_FAIL;
+	kvmppc_set_gpr(vcpu, 3, rc);
+	return EMULATE_DONE;
+}
+
+static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu)
+{
+	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
+	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
+	unsigned long tce_value = kvmppc_get_gpr(vcpu, 6);
+	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
+	long rc;
+
+	rc = kvmppc_virtmode_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages);
 	if (rc == H_TOO_HARD)
 		return EMULATE_FAIL;
 	kvmppc_set_gpr(vcpu, 3, rc);
@@ -247,6 +278,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd)
 		return kvmppc_h_pr_bulk_remove(vcpu);
 	case H_PUT_TCE:
 		return kvmppc_h_pr_put_tce(vcpu);
+	case H_PUT_TCE_INDIRECT:
+		return kvmppc_h_pr_put_tce_indirect(vcpu);
+	case H_STUFF_TCE:
+		return kvmppc_h_pr_stuff_tce(vcpu);
 	case H_CEDE:
 		vcpu->arch.shared->msr |= MSR_EE;
 		kvm_vcpu_block(vcpu);
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 6316ee3..8465c2a 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -395,6 +395,9 @@ int kvm_dev_ioctl_check_extension(long ext)
 		r = 1;
 		break;
 #endif
+	case KVM_CAP_SPAPR_MULTITCE:
+		r = 1;
+		break;
 	default:
 		r = 0;
 		break;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index a5c86fc..fc0d6b9 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -666,6 +666,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_IRQ_MPIC 90
 #define KVM_CAP_PPC_RTAS 91
 #define KVM_CAP_IRQ_XICS 92
+#define KVM_CAP_SPAPR_MULTITCE 93
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
1.7.10.4



* [PATCH 2/4] powerpc: Prepare to support kernel handling of IOMMU map/unmap
  2013-06-05  6:11 [PATCH 0/4 v3] KVM: PPC: IOMMU in-kernel handling Alexey Kardashevskiy
  2013-06-05  6:11 ` [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls Alexey Kardashevskiy
@ 2013-06-05  6:11 ` Alexey Kardashevskiy
  2013-06-16  4:26   ` Benjamin Herrenschmidt
  2013-06-05  6:11 ` [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling Alexey Kardashevskiy
From: Alexey Kardashevskiy @ 2013-06-05  6:11 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alexey Kardashevskiy, linuxppc-dev, David Gibson, Alexander Graf,
	Paul Mackerras, kvm, linux-kernel, kvm-ppc

The current VFIO-on-POWER implementation supports only user mode
driven mapping, i.e. QEMU is sending requests to map/unmap pages.
However this approach is really slow, so we want to move that to KVM.
Since H_PUT_TCE can be extremely performance sensitive (especially with
network adapters where each packet needs to be mapped/unmapped) we chose
to implement that as a "fast" hypercall directly in "real
mode" (processor still in the guest context but MMU off).

To be able to do that, we need to provide some facilities to
access the struct page count within that real mode environment as things
like the sparsemem vmemmap mappings aren't accessible.

This adds an API to increment/decrement the page reference counter, as
the get_user_pages API used for user mode mapping does not work
in real mode.

CONFIG_SPARSEMEM_VMEMMAP and CONFIG_FLATMEM are supported.
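
The intended calling pattern (a sketch only; the real user of this API
is added by the next patch) is to bail out to virtual mode whenever the
API cannot complete the operation in real mode:

/* Illustrative sketch only */
static long realmode_ref_page(unsigned long pfn)
{
	struct page *page = realmode_pfn_to_page(pfn);

	if (!page)	/* page struct is split between sparsemem blocks */
		return H_TOO_HARD;

	if (realmode_get_page(page))	/* -EAGAIN on compound pages */
		return H_TOO_HARD;	/* redo the operation in virtual mode */

	return H_SUCCESS;
}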

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Paul Mackerras <paulus@samba.org>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>

---

Changes:
2013-05-20:
* PageTail() is replaced by PageCompound() in order to have the same checks
for whether the page is huge in realmode_get_page() and realmode_put_page()
---
 arch/powerpc/include/asm/pgtable-ppc64.h |    4 ++
 arch/powerpc/mm/init_64.c                |   77 +++++++++++++++++++++++++++++-
 2 files changed, 80 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index e3d55f6f..7b46e5f 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -376,6 +376,10 @@ static inline pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
 }
 #endif /* !CONFIG_HUGETLB_PAGE */
 
+struct page *realmode_pfn_to_page(unsigned long pfn);
+int realmode_get_page(struct page *page);
+int realmode_put_page(struct page *page);
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* _ASM_POWERPC_PGTABLE_PPC64_H_ */
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index a90b9c4..ce3d8d4 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -297,5 +297,80 @@ void vmemmap_free(unsigned long start, unsigned long end)
 {
 }
 
-#endif /* CONFIG_SPARSEMEM_VMEMMAP */
+/*
+ * We do not have access to the sparsemem vmemmap, so we fall back to
+ * walking the list of sparsemem blocks which we already maintain for
+ * the sake of crashdump. In the long run, we might want to maintain
+ * a tree if performance of that linear walk becomes a problem.
+ *
+ * Any of realmode_XXXX functions can fail due to:
+ * 1) As real sparsemem blocks do not lie in RAM contiguously (they
+ * are in virtual address space which is not available in real mode),
+ * the requested page struct can be split between blocks so get_page/put_page
+ * may fail.
+ * 2) When huge pages are used, the get_page/put_page API will fail
+ * in real mode as the linked addresses in the page struct are virtual
+ * too.
+ * When 1) or 2) takes place, the API returns an error code to cause
+ * an exit to kernel virtual mode where the operation will be completed.
+ */
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+	struct vmemmap_backing *vmem_back;
+	struct page *page;
+	unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;
+	unsigned long pg_va = (unsigned long) pfn_to_page(pfn);
+
+	for (vmem_back = vmemmap_list; vmem_back; vmem_back = vmem_back->list) {
+		if (pg_va < vmem_back->virt_addr)
+			continue;
 
+		/* Check that page struct is not split between real pages */
+		if ((pg_va + sizeof(struct page)) >
+				(vmem_back->virt_addr + page_size))
+			return NULL;
+
+		page = (struct page *) (vmem_back->phys + pg_va -
+				vmem_back->virt_addr);
+		return page;
+	}
+
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#elif defined(CONFIG_FLATMEM)
+
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+	struct page *page = pfn_to_page(pfn);
+	return page;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#endif /* CONFIG_SPARSEMEM_VMEMMAP/CONFIG_FLATMEM */
+
+#if defined(CONFIG_SPARSEMEM_VMEMMAP) || defined(CONFIG_FLATMEM)
+int realmode_get_page(struct page *page)
+{
+	if (PageCompound(page))
+		return -EAGAIN;
+
+	get_page(page);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(realmode_get_page);
+
+int realmode_put_page(struct page *page)
+{
+	if (PageCompound(page))
+		return -EAGAIN;
+
+	if (!atomic_add_unless(&page->_count, -1, 1))
+		return -EAGAIN;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(realmode_put_page);
+#endif
-- 
1.7.10.4



* [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-05  6:11 [PATCH 0/4 v3] KVM: PPC: IOMMU in-kernel handling Alexey Kardashevskiy
  2013-06-05  6:11 ` [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls Alexey Kardashevskiy
  2013-06-05  6:11 ` [PATCH 2/4] powerpc: Prepare to support kernel handling of IOMMU map/unmap Alexey Kardashevskiy
@ 2013-06-05  6:11 ` Alexey Kardashevskiy
  2013-06-16  4:39   ` Benjamin Herrenschmidt
  2013-06-05  6:11 ` [PATCH 4/4] KVM: PPC: Add hugepage " Alexey Kardashevskiy
  2013-06-12  3:14 ` [PATCH 0/4 v3] KVM: PPC: " Benjamin Herrenschmidt
From: Alexey Kardashevskiy @ 2013-06-05  6:11 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alexey Kardashevskiy, linuxppc-dev, David Gibson, Alexander Graf,
	Paul Mackerras, kvm, linux-kernel, kvm-ppc

This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests without passing them to QEMU, which should
save time on switching to QEMU and back.

Both real and virtual modes are supported - whenever the real mode
handler fails to handle a TCE request, it passes the request to the
virtual mode handler. If the virtual mode handler fails as well, the
request is passed to user mode, for example, to QEMU.

This adds a new KVM_CREATE_SPAPR_TCE_IOMMU ioctl (and the
KVM_CAP_SPAPR_TCE_IOMMU capability) to associate a virtual PCI bus
ID (LIOBN) with an IOMMU group, which enables in-kernel handling
of IOMMU map/unmap.

Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on 10Gb network (Chelsio CXGB3 10Gb ethernet card).
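
A sketch of the expected userspace flow (illustrative only; the IOMMU
group ID would come from the VFIO group which the device belongs to):

/* Illustrative sketch only */
struct kvm_create_spapr_tce_iommu args = {
	.liobn = liobn,		/* LIOBN the guest uses in TCE hcalls */
	.iommu_id = grp_id,	/* e.g. parsed from /sys/kernel/iommu_groups */
	.flags = 0,		/* no flags are defined yet */
};
int tablefd = ioctl(vm_fd, KVM_CREATE_SPAPR_TCE_IOMMU, &args);
/* tablefd < 0 means TCE hcalls for this liobn stay in userspace */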

Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>

---

Changes:
2013/06/05:
* changed capability number
* changed ioctl number
* update the doc article number

2013/05/20:
* removed get_user() from real mode handlers
* kvm_vcpu_arch::tce_tmp usage extended. Now the real mode handler puts
translated TCEs there, tries realmode_get_page() on those and, if that
fails, passes control to the virtual mode handler which tries to finish
handling the request
* kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit
on a page
* The only reason to pass the request to user mode now is when user mode
did not register a TCE table in the kernel; in all other cases the virtual
mode handler is expected to do the job
---
 Documentation/virtual/kvm/api.txt   |   28 +++++
 arch/powerpc/include/asm/kvm_host.h |    3 +
 arch/powerpc/include/asm/kvm_ppc.h  |    2 +
 arch/powerpc/include/uapi/asm/kvm.h |    7 ++
 arch/powerpc/kvm/book3s_64_vio.c    |  198 ++++++++++++++++++++++++++++++++++-
 arch/powerpc/kvm/book3s_64_vio_hv.c |  193 +++++++++++++++++++++++++++++++++-
 arch/powerpc/kvm/powerpc.c          |   12 +++
 include/uapi/linux/kvm.h            |    2 +
 8 files changed, 439 insertions(+), 6 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 6c082ff..e962e3b 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2379,6 +2379,34 @@ userspace. Otherwise it might be better for the guest to continue using the
 H_PUT_TCE hypercall (if KVM_CAP_SPAPR_TCE or KVM_CAP_SPAPR_TCE_IOMMU is present).
 
 
+4.84 KVM_CREATE_SPAPR_TCE_IOMMU
+
+Capability: KVM_CAP_SPAPR_TCE_IOMMU
+Architectures: powerpc
+Type: vm ioctl
+Parameters: struct kvm_create_spapr_tce_iommu (in)
+Returns: 0 on success, -1 on error
+
+This creates a link between an IOMMU group and a hardware TCE (translation
+control entry) table. This link lets the host kernel know which IOMMU
+group (i.e. TCE table) to use for the LIOBN number passed with the
+H_PUT_TCE, H_PUT_TCE_INDIRECT and H_STUFF_TCE hypercalls.
+
+/* for KVM_CAP_SPAPR_TCE_IOMMU */
+struct kvm_create_spapr_tce_iommu {
+	__u64 liobn;
+	__u32 iommu_id;
+	__u32 flags;
+};
+
+No flags are supported at the moment.
+
+When the guest issues a TCE call on a liobn for which a TCE table has been
+registered, the kernel will handle it in real mode, updating the hardware
+TCE table. TCE table calls for other liobns will cause a vm exit and must
+be handled by userspace.
+
+
 5. The kvm_run structure
 ------------------------
 
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 85d8f26..ac0e2fe 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -180,6 +180,7 @@ struct kvmppc_spapr_tce_table {
 	struct kvm *kvm;
 	u64 liobn;
 	u32 window_size;
+	struct iommu_group *grp;    /* used for IOMMU groups */
 	struct page *pages[0];
 };
 
@@ -611,6 +612,8 @@ struct kvm_vcpu_arch {
 	u64 busy_preempt;
 
 	unsigned long *tce_tmp;    /* TCE cache for H_PUT_TCE_INDIRECT hcall */
+	unsigned long tce_tmp_num; /* Number of handled TCEs in the cache */
+	unsigned long tce_reason;  /* Reason for switching to virtual mode */
 #endif
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index e852921b..934e01d 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -133,6 +133,8 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
 
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce *args);
+extern long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
+				struct kvm_create_spapr_tce_iommu *args);
 extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
 		struct kvm_vcpu *vcpu, unsigned long liobn);
 extern long kvmppc_emulated_validate_tce(unsigned long tce);
diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 0fb1a6e..cf82af4 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -319,6 +319,13 @@ struct kvm_create_spapr_tce {
 	__u32 window_size;
 };
 
+/* for KVM_CAP_SPAPR_TCE_IOMMU */
+struct kvm_create_spapr_tce_iommu {
+	__u64 liobn;
+	__u32 iommu_id;
+	__u32 flags;
+};
+
 /* for KVM_ALLOCATE_RMA */
 struct kvm_allocate_rma {
 	__u64 rma_size;
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 06b7b20..ffb4698 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -27,6 +27,8 @@
 #include <linux/hugetlb.h>
 #include <linux/list.h>
 #include <linux/anon_inodes.h>
+#include <linux/pci.h>
+#include <linux/iommu.h>
 
 #include <asm/tlbflush.h>
 #include <asm/kvm_ppc.h>
@@ -56,8 +58,13 @@ static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
 
 	mutex_lock(&kvm->lock);
 	list_del(&stt->list);
-	for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
-		__free_page(stt->pages[i]);
+#ifdef CONFIG_IOMMU_API
+	if (stt->grp) {
+		iommu_group_put(stt->grp);
+	} else
+#endif
+		for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
+			__free_page(stt->pages[i]);
 	kfree(stt);
 	mutex_unlock(&kvm->lock);
 
@@ -153,6 +160,62 @@ fail:
 	return ret;
 }
 
+#ifdef CONFIG_IOMMU_API
+static const struct file_operations kvm_spapr_tce_iommu_fops = {
+	.release	= kvm_spapr_tce_release,
+};
+
+long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
+		struct kvm_create_spapr_tce_iommu *args)
+{
+	struct kvmppc_spapr_tce_table *tt = NULL;
+	struct iommu_group *grp;
+	struct iommu_table *tbl;
+
+	/* Find an IOMMU table for the given ID */
+	grp = iommu_group_get_by_id(args->iommu_id);
+	if (!grp)
+		return -ENXIO;
+
+	tbl = iommu_group_get_iommudata(grp);
+	if (!tbl)
+		return -ENXIO;
+
+	/* Check this LIOBN hasn't been previously allocated */
+	list_for_each_entry(tt, &kvm->arch.spapr_tce_tables, list) {
+		if (tt->liobn == args->liobn)
+			return -EBUSY;
+	}
+
+	tt = kzalloc(sizeof(*tt), GFP_KERNEL);
+	if (!tt)
+		return -ENOMEM;
+
+	tt->liobn = args->liobn;
+	tt->kvm = kvm;
+	tt->grp = grp;
+
+	kvm_get_kvm(kvm);
+
+	mutex_lock(&kvm->lock);
+	list_add(&tt->list, &kvm->arch.spapr_tce_tables);
+
+	mutex_unlock(&kvm->lock);
+
+	pr_debug("LIOBN=%llX hooked to IOMMU %d, flags=%u\n",
+			args->liobn, args->iommu_id, args->flags);
+
+	return anon_inode_getfd("kvm-spapr-tce-iommu",
+			&kvm_spapr_tce_iommu_fops, tt, O_RDWR);
+}
+#else
+long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
+		struct kvm_create_spapr_tce_iommu *args)
+{
+	return -ENOSYS;
+}
+#endif /* CONFIG_IOMMU_API */
+
 /* Converts guest physical address into host virtual */
 static void __user *kvmppc_virtmode_gpa_to_hva(struct kvm_vcpu *vcpu,
 		unsigned long gpa)
@@ -180,6 +243,46 @@ long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
 	if (!tt)
 		return H_TOO_HARD;
 
+#ifdef CONFIG_IOMMU_API
+	if (tt->grp) {
+		unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
+		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+		/* Return error if the group is being destroyed */
+		if (!tbl)
+			return H_RESCINDED;
+
+		if (vcpu->arch.tce_reason == H_HARDWARE) {
+			iommu_clear_tces_and_put_pages(tbl, entry, 1);
+			return H_HARDWARE;
+
+		} else if (!(tce & (TCE_PCI_READ | TCE_PCI_WRITE))) {
+			if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
+				return H_PARAMETER;
+
+			ret = iommu_clear_tces_and_put_pages(tbl, entry, 1);
+		} else {
+			void *hva;
+
+			if (iommu_tce_put_param_check(tbl, ioba, tce))
+				return H_PARAMETER;
+
+			hva = kvmppc_virtmode_gpa_to_hva(vcpu, tce);
+			if (hva == ERROR_ADDR)
+				return H_HARDWARE;
+
+			ret = iommu_put_tce_user_mode(tbl,
+					ioba >> IOMMU_PAGE_SHIFT,
+					(unsigned long) hva);
+		}
+		iommu_flush_tce(tbl);
+
+		if (ret)
+			return H_HARDWARE;
+
+		return H_SUCCESS;
+	}
+#endif
 	/* Emulated IO */
 	if (ioba >= tt->window_size)
 		return H_PARAMETER;
@@ -220,6 +323,70 @@ long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	if (tces == ERROR_ADDR)
 		return H_TOO_HARD;
 
+#ifdef CONFIG_IOMMU_API
+	if (tt->grp) {
+		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+		/* Return error if the group is being destroyed */
+		if (!tbl)
+			return H_RESCINDED;
+
+		/* Something bad happened, do cleanup and exit */
+		if (vcpu->arch.tce_reason == H_HARDWARE) {
+			i = vcpu->arch.tce_tmp_num;
+			goto fail_clear_tce;
+		} else if (vcpu->arch.tce_reason != H_TOO_HARD) {
+			/*
+			 * We get here only in PR KVM mode, otherwise
+			 * the real mode handler would have checked TCEs
+			 * already and failed on guest TCE translation.
+			 */
+			for (i = 0; i < npages; ++i) {
+				if (get_user(vcpu->arch.tce_tmp[i], tces + i))
+					return H_HARDWARE;
+
+				if (iommu_tce_put_param_check(tbl, ioba +
+						(i << IOMMU_PAGE_SHIFT),
+						vcpu->arch.tce_tmp[i]))
+					return H_PARAMETER;
+			}
+		} /* else: The real mode handler checked TCEs already */
+
+		/* Translate TCEs */
+		for (i = vcpu->arch.tce_tmp_num; i < npages; ++i) {
+			void *hva = kvmppc_virtmode_gpa_to_hva(vcpu,
+					vcpu->arch.tce_tmp[i]);
+
+			if (hva == ERROR_ADDR)
+				goto fail_clear_tce;
+
+			vcpu->arch.tce_tmp[i] = (unsigned long) hva;
+		}
+
+		/* Do get_page and put TCEs for all pages */
+		for (i = 0; i < npages; ++i) {
+			if (iommu_put_tce_user_mode(tbl,
+					(ioba >> IOMMU_PAGE_SHIFT) + i,
+					vcpu->arch.tce_tmp[i])) {
+				i = npages;
+				goto fail_clear_tce;
+			}
+		}
+
+		iommu_flush_tce(tbl);
+
+		return H_SUCCESS;
+
+fail_clear_tce:
+		/* Cannot complete the translation, clean up and exit */
+		iommu_clear_tces_and_put_pages(tbl,
+				ioba >> IOMMU_PAGE_SHIFT, i);
+
+		iommu_flush_tce(tbl);
+
+		return H_HARDWARE;
+	}
+#endif
 	/* Emulated IO */
 	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
 		return H_PARAMETER;
@@ -253,6 +420,33 @@ long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (!tt)
 		return H_TOO_HARD;
 
+#ifdef CONFIG_IOMMU_API
+	if (tt->grp) {
+		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+		unsigned long tmp, entry = ioba >> IOMMU_PAGE_SHIFT;
+
+		vcpu->arch.tce_tmp_num = 0;
+
+		/* Return error if the group is being destroyed */
+		if (!tbl)
+			return H_RESCINDED;
+
+		/* PR KVM? */
+		if (!vcpu->arch.tce_tmp_num &&
+				(vcpu->arch.tce_reason != H_TOO_HARD) &&
+				iommu_tce_clear_param_check(tbl, ioba,
+						tce_value, npages))
+			return H_PARAMETER;
+
+		/* Do actual cleanup */
+		tmp = vcpu->arch.tce_tmp_num;
+		if (iommu_clear_tces_and_put_pages(tbl, entry + tmp,
+				npages - tmp))
+			return H_PARAMETER;
+
+		return H_SUCCESS;
+	}
+#endif
 	/* Emulated IO */
 	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
 		return H_PARAMETER;
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index c68d538..dc4ae32 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -26,6 +26,7 @@
 #include <linux/slab.h>
 #include <linux/hugetlb.h>
 #include <linux/list.h>
+#include <linux/iommu.h>
 
 #include <asm/tlbflush.h>
 #include <asm/kvm_ppc.h>
@@ -118,7 +119,7 @@ EXPORT_SYMBOL_GPL(kvmppc_emulated_put_tce);
 #ifdef CONFIG_KVM_BOOK3S_64_HV
 
 static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
-			unsigned long *pte_sizep)
+			unsigned long *pte_sizep, bool do_get_page)
 {
 	pte_t *ptep;
 	unsigned int shift = 0;
@@ -135,6 +136,14 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
 	if (!pte_present(*ptep))
 		return __pte(0);
 
+	/*
+	 * Defer huge page handling to the virtual mode.
+	 * The only exception is TCE list pages, for which we
+	 * do need to call get_page().
+	 */
+	if ((*pte_sizep > PAGE_SIZE) && do_get_page)
+		return __pte(0);
+
 	/* wait until _PAGE_BUSY is clear then set it atomically */
 	__asm__ __volatile__ (
 		"1:	ldarx	%0,0,%3\n"
@@ -148,6 +157,18 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
 		: "cc");
 
 	ret = pte;
+	if (do_get_page && pte_present(pte) && (!writing || pte_write(pte))) {
+		struct page *pg = NULL;
+		pg = realmode_pfn_to_page(pte_pfn(pte));
+		if (realmode_get_page(pg)) {
+			ret = __pte(0);
+		} else {
+			pte = pte_mkyoung(pte);
+			if (writing)
+				pte = pte_mkdirty(pte);
+		}
+	}
+	*ptep = pte;	/* clears _PAGE_BUSY */
 
 	return ret;
 }
@@ -157,7 +178,7 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
  * Also returns pte and page size if the page is present in page table.
  */
 static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
-		unsigned long gpa)
+		unsigned long gpa, bool do_get_page)
 {
 	struct kvm_memory_slot *memslot;
 	pte_t pte;
@@ -175,7 +196,7 @@ static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
 
 	/* Find a PTE and determine the size */
 	pte = kvmppc_lookup_pte(vcpu->arch.pgdir, hva,
-			writing, &pg_size);
+			writing, &pg_size, do_get_page);
 	if (!pte)
 		return ERROR_ADDR;
 
@@ -188,6 +209,52 @@ static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
 	return hpa;
 }
 
+#ifdef CONFIG_IOMMU_API
+static long kvmppc_clear_tce_real_mode(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	long ret = 0, i;
+	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
+
+	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i) {
+		struct page *page;
+		unsigned long oldtce;
+
+		oldtce = iommu_clear_tce(tbl, entry + i);
+		if (!oldtce)
+			continue;
+
+		page = realmode_pfn_to_page(oldtce >> PAGE_SHIFT);
+		if (!page) {
+			ret = H_TOO_HARD;
+			break;
+		}
+
+		if (oldtce & TCE_PCI_WRITE)
+			SetPageDirty(page);
+
+		if (realmode_put_page(page)) {
+			ret = H_TOO_HARD;
+			break;
+		}
+	}
+
+	if (ret == H_TOO_HARD) {
+		vcpu->arch.tce_tmp_num = i;
+		vcpu->arch.tce_reason = H_TOO_HARD;
+	}
+	/* if (ret < 0)
+		pr_err("iommu_tce: %s failed ioba=%lx, tce_value=%lx ret=%d\n",
+				__func__, ioba, tce_value, ret); */
+
+	return ret;
+}
+#endif /* CONFIG_IOMMU_API */
+
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba, unsigned long tce)
 {
@@ -199,6 +266,52 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (!tt)
 		return H_TOO_HARD;
 
+#ifdef CONFIG_IOMMU_API
+	if (tt->grp) {
+		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+		/* Return error if the group is being destroyed */
+		if (!tbl)
+			return H_RESCINDED;
+
+		vcpu->arch.tce_reason = 0;
+
+		if (tce & (TCE_PCI_READ | TCE_PCI_WRITE)) {
+			unsigned long hpa, hva;
+
+			if (iommu_tce_put_param_check(tbl, ioba, tce))
+				return H_PARAMETER;
+
+			hpa = kvmppc_realmode_gpa_to_hpa(vcpu, tce, true);
+			if (hpa == ERROR_ADDR) {
+				vcpu->arch.tce_reason = H_TOO_HARD;
+				return H_TOO_HARD;
+			}
+
+			hva = (unsigned long) __va(hpa);
+			ret = iommu_tce_build(tbl,
+					ioba >> IOMMU_PAGE_SHIFT,
+					hva, iommu_tce_direction(hva));
+			if (unlikely(ret)) {
+				struct page *pg = realmode_pfn_to_page(hpa);
+				BUG_ON(!pg);
+				if (realmode_put_page(pg)) {
+					vcpu->arch.tce_reason = H_HARDWARE;
+					return H_TOO_HARD;
+				}
+				return H_HARDWARE;
+			}
+		} else {
+			ret = kvmppc_clear_tce_real_mode(vcpu, tbl, ioba, 0, 1);
+			if (ret)
+				return ret;
+		}
+
+		iommu_flush_tce(tbl);
+
+		return H_SUCCESS;
+	}
+#endif
 	/* Emulated IO */
 	if (ioba >= tt->window_size)
 		return H_PARAMETER;
@@ -235,10 +348,62 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	if (tce_list & ~IOMMU_PAGE_MASK)
 		return H_PARAMETER;
 
-	tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu, tce_list);
+	vcpu->arch.tce_tmp_num = 0;
+	vcpu->arch.tce_reason = 0;
+
+	tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu,
+			tce_list, false);
 	if ((unsigned long)tces == ERROR_ADDR)
 		return H_TOO_HARD;
 
+#ifdef CONFIG_IOMMU_API
+	if (tt->grp) {
+		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+		/* Return error if the group is being destroyed */
+		if (!tbl)
+			return H_RESCINDED;
+
+		/* Check all TCEs */
+		for (i = 0; i < npages; ++i) {
+			if (iommu_tce_put_param_check(tbl, ioba +
+					(i << IOMMU_PAGE_SHIFT), tces[i]))
+				return H_PARAMETER;
+			vcpu->arch.tce_tmp[i] = tces[i];
+		}
+
+		/* Translate TCEs and go get_page */
+		for (i = 0; i < npages; ++i) {
+			unsigned long hpa = kvmppc_realmode_gpa_to_hpa(vcpu,
+					vcpu->arch.tce_tmp[i], true);
+			if (hpa == ERROR_ADDR) {
+				vcpu->arch.tce_tmp_num = i;
+				vcpu->arch.tce_reason = H_TOO_HARD;
+				return H_TOO_HARD;
+			}
+			vcpu->arch.tce_tmp[i] = hpa;
+		}
+
+		/* Put TCEs to the table */
+		for (i = 0; i < npages; ++i) {
+			unsigned long hva = (unsigned long)
+					__va(vcpu->arch.tce_tmp[i]);
+
+			ret = iommu_tce_build(tbl,
+					(ioba >> IOMMU_PAGE_SHIFT) + i,
+					hva, iommu_tce_direction(hva));
+			if (ret) {
+				/* Failed, exit to virtual mode for cleanup */
+				vcpu->arch.tce_reason = H_HARDWARE;
+				return H_TOO_HARD;
+			}
+		}
+
+		iommu_flush_tce(tbl);
+
+		return H_SUCCESS;
+	}
+#endif
 	/* Emulated IO */
 	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
 		return H_PARAMETER;
@@ -268,6 +433,26 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (!tt)
 		return H_TOO_HARD;
 
+#ifdef CONFIG_IOMMU_API
+	if (tt->grp) {
+		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+		/* Return error if the group is being destroyed */
+		if (!tbl)
+			return H_RESCINDED;
+
+		vcpu->arch.tce_reason = 0;
+
+		ret = kvmppc_clear_tce_real_mode(vcpu, tbl, ioba,
+				tce_value, npages);
+		if (ret)
+			return ret;
+
+		iommu_flush_tce(tbl);
+
+		return H_SUCCESS;
+	}
+#endif
 	/* Emulated IO */
 	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
 		return H_PARAMETER;
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 8465c2a..da6bf61 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -396,6 +396,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 		break;
 #endif
 	case KVM_CAP_SPAPR_MULTITCE:
+	case KVM_CAP_SPAPR_TCE_IOMMU:
 		r = 1;
 		break;
 	default:
@@ -1025,6 +1026,17 @@ long kvm_arch_vm_ioctl(struct file *filp,
 		r = kvm_vm_ioctl_create_spapr_tce(kvm, &create_tce);
 		goto out;
 	}
+	case KVM_CREATE_SPAPR_TCE_IOMMU: {
+		struct kvm_create_spapr_tce_iommu create_tce_iommu;
+		struct kvm *kvm = filp->private_data;
+
+		r = -EFAULT;
+		if (copy_from_user(&create_tce_iommu, argp,
+				sizeof(create_tce_iommu)))
+			goto out;
+		r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm, &create_tce_iommu);
+		goto out;
+	}
 #endif /* CONFIG_PPC_BOOK3S_64 */
 
 #ifdef CONFIG_KVM_BOOK3S_64_HV
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index fc0d6b9..8cf86dc 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_RTAS 91
 #define KVM_CAP_IRQ_XICS 92
 #define KVM_CAP_SPAPR_MULTITCE 93
+#define KVM_CAP_SPAPR_TCE_IOMMU 94
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -922,6 +923,7 @@ struct kvm_s390_ucas_mapping {
 /* Available with KVM_CAP_PPC_ALLOC_HTAB */
 #define KVM_PPC_ALLOCATE_HTAB	  _IOWR(KVMIO, 0xa7, __u32)
 #define KVM_CREATE_SPAPR_TCE	  _IOW(KVMIO,  0xa8, struct kvm_create_spapr_tce)
+#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xaf, struct kvm_create_spapr_tce_iommu)
 /* Available with KVM_CAP_RMA */
 #define KVM_ALLOCATE_RMA	  _IOR(KVMIO,  0xa9, struct kvm_allocate_rma)
 /* Available with KVM_CAP_PPC_HTAB_FD */
-- 
1.7.10.4



* [PATCH 4/4] KVM: PPC: Add hugepage support for IOMMU in-kernel handling
  2013-06-05  6:11 [PATCH 0/4 v3] KVM: PPC: IOMMU in-kernel handling Alexey Kardashevskiy
  2013-06-05  6:11 ` [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling Alexey Kardashevskiy
@ 2013-06-05  6:11 ` Alexey Kardashevskiy
  2013-06-16  4:46   ` Benjamin Herrenschmidt
  2013-06-17 16:35   ` Paolo Bonzini
  2013-06-12  3:14 ` [PATCH 0/4 v3] KVM: PPC: " Benjamin Herrenschmidt
From: Alexey Kardashevskiy @ 2013-06-05  6:11 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alexey Kardashevskiy, linuxppc-dev, David Gibson, Alexander Graf,
	Paul Mackerras, kvm, linux-kernel, kvm-ppc

This adds special support for huge pages (16MB).  The reference
counting cannot be easily done for such pages in real mode (when
MMU is off) so we added a list of huge pages.  It is populated in
virtual mode and get_page is called just once per huge page.
Real mode handlers check whether the requested page is huge and already
in the list; if so, no reference counting is done, otherwise an exit to
virtual mode happens.  The list is released at KVM exit.  At the moment the fastest
card available for tests uses up to 9 huge pages so walking through this
list is not very expensive.  However this can change and we may want
to optimize this.
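
Conceptually the real mode lookup is just a list walk by guest physical
address; a simplified sketch (the actual handler also takes
hugepages_lock and handles the TCE permission bits):

/* Simplified sketch, not the exact patch code */
static unsigned long hugepage_gpa_to_hpa(struct kvmppc_spapr_tce_table *tt,
		unsigned long gpa)
{
	struct kvmppc_iommu_hugepage *hp;

	list_for_each_entry(hp, &tt->hugepages, list) {
		if ((gpa >= hp->gpa) && (gpa < hp->gpa + hp->size))
			/* already pinned, no get_page() in real mode */
			return __pa(page_address(hp->page)) +
					(gpa & (hp->size - 1));
	}
	return -1UL;	/* not in the list yet - exit to virtual mode */
}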

Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>

---

Changes:
2013/06/05:
* fixed compile error when CONFIG_IOMMU_API=n

2013/05/20:
* the real mode handler now searches for a huge page by gpa (used to be pte)
* the virtual mode handler prints a warning if it is called twice for the
same huge page, as the real mode handler is expected to fail just once -
when a huge page is not in the list yet.
* the huge page is refcounted twice - when added to the hugepage list and
when used in the virtual mode hcall handler (can be optimized but it will
make the patch less nice).
---
 arch/powerpc/include/asm/kvm_host.h |    2 +
 arch/powerpc/include/asm/kvm_ppc.h  |   22 +++++++++
 arch/powerpc/kvm/book3s_64_vio.c    |   88 +++++++++++++++++++++++++++++++++--
 arch/powerpc/kvm/book3s_64_vio_hv.c |   40 ++++++++++++++--
 4 files changed, 146 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index ac0e2fe..4fc0865 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -181,6 +181,8 @@ struct kvmppc_spapr_tce_table {
 	u64 liobn;
 	u32 window_size;
 	struct iommu_group *grp;    /* used for IOMMU groups */
+	struct list_head hugepages; /* used for IOMMU groups */
+	spinlock_t hugepages_lock;  /* used for IOMMU groups */
 	struct page *pages[0];
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 934e01d..9054df0 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -149,6 +149,28 @@ extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
 		unsigned long liobn, unsigned long ioba,
 		unsigned long tce_value, unsigned long npages);
+
+/*
+ * The KVM guest can be backed with 16MB pages (qemu switch
+ * -mem-path /var/lib/hugetlbfs/global/pagesize-16MB/).
+ * In this case, we cannot do page counting from real mode
+ * as compound pages are used - they are linked in a list
+ * with pointers which are virtual addresses, inaccessible
+ * in real mode.
+ *
+ * The code below keeps a list of 16MB pages and uses the page
+ * struct in real mode if the page is already locked in RAM and
+ * inserted into the list, or switches to virtual mode where it
+ * can be handled in the usual manner.
+ */
+struct kvmppc_iommu_hugepage {
+	struct list_head list;
+	pte_t pte;		/* Huge page PTE */
+	unsigned long gpa;	/* Guest physical address */
+	struct page *page;	/* page struct of the very first subpage */
+	unsigned long size;	/* Huge page size (always 16MB at the moment) */
+};
+
 extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
 				struct kvm_allocate_rma *rma);
 extern struct kvmppc_linear_info *kvm_alloc_rma(void);
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index ffb4698..9e2ba4d 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -45,6 +45,71 @@
 #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
 #define ERROR_ADDR      ((void *)~(unsigned long)0x0)
 
+#ifdef CONFIG_IOMMU_API
+/* Adds a new huge page descriptor to the list */
+static long kvmppc_iommu_hugepage_try_add(
+		struct kvmppc_spapr_tce_table *tt,
+		pte_t pte, unsigned long hva, unsigned long gpa,
+		unsigned long pg_size)
+{
+	long ret = 0;
+	struct kvmppc_iommu_hugepage *hp;
+	struct page *p;
+
+	spin_lock(&tt->hugepages_lock);
+	list_for_each_entry(hp, &tt->hugepages, list) {
+		if (hp->pte == pte)
+			goto unlock_exit;
+	}
+
+	hva = hva & ~(pg_size - 1);
+	ret = get_user_pages_fast(hva, 1, true/*write*/, &p);
+	if ((ret != 1) || !p) {
+		ret = -EFAULT;
+		goto unlock_exit;
+	}
+	ret = 0;
+
+	hp = kzalloc(sizeof(*hp), GFP_KERNEL);
+	if (!hp) {
+		ret = -ENOMEM;
+		goto unlock_exit;
+	}
+
+	hp->page = p;
+	hp->pte = pte;
+	hp->gpa = gpa & ~(pg_size - 1);
+	hp->size = pg_size;
+
+	list_add(&hp->list, &tt->hugepages);
+
+unlock_exit:
+	spin_unlock(&tt->hugepages_lock);
+
+	return ret;
+}
+
+static void kvmppc_iommu_hugepages_init(struct kvmppc_spapr_tce_table *tt)
+{
+	INIT_LIST_HEAD(&tt->hugepages);
+	spin_lock_init(&tt->hugepages_lock);
+}
+
+static void kvmppc_iommu_hugepages_cleanup(struct kvmppc_spapr_tce_table *tt)
+{
+	struct kvmppc_iommu_hugepage *hp, *tmp;
+
+	spin_lock(&tt->hugepages_lock);
+	list_for_each_entry_safe(hp, tmp, &tt->hugepages, list) {
+		list_del(&hp->list);
+		put_page(hp->page); /* one for iommu_put_tce_user_mode */
+		put_page(hp->page); /* one for kvmppc_iommu_hugepage_try_add */
+		kfree(hp);
+	}
+	spin_unlock(&tt->hugepages_lock);
+}
+#endif /* CONFIG_IOMMU_API */
+
 static long kvmppc_stt_npages(unsigned long window_size)
 {
 	return ALIGN((window_size >> SPAPR_TCE_SHIFT)
@@ -61,6 +126,7 @@ static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
 #ifdef CONFIG_IOMMU_API
 	if (stt->grp) {
 		iommu_group_put(stt->grp);
+		kvmppc_iommu_hugepages_cleanup(stt);
 	} else
 #endif
 		for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
@@ -198,6 +264,7 @@ long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
 	kvm_get_kvm(kvm);
 
 	mutex_lock(&kvm->lock);
+	kvmppc_iommu_hugepages_init(tt);
 	list_add(&tt->list, &kvm->arch.spapr_tce_tables);
 
 	mutex_unlock(&kvm->lock);
@@ -218,16 +285,31 @@ long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
 
 /* Converts guest physical address into host virtual */
 static void __user *kvmppc_virtmode_gpa_to_hva(struct kvm_vcpu *vcpu,
+		struct kvmppc_spapr_tce_table *tt,
 		unsigned long gpa)
 {
 	unsigned long hva, gfn = gpa >> PAGE_SHIFT;
 	struct kvm_memory_slot *memslot;
+	pte_t *ptep;
+	unsigned int shift = 0;
 
 	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
 	if (!memslot)
 		return ERROR_ADDR;
 
 	hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa & ~PAGE_MASK);
+
+	ptep = find_linux_pte_or_hugepte(vcpu->arch.pgdir, hva, &shift);
+	WARN_ON(!ptep);
+	if (!ptep)
+		return ERROR_ADDR;
+#ifdef CONFIG_IOMMU_API
+	if (tt && (shift > PAGE_SHIFT)) {
+		if (kvmppc_iommu_hugepage_try_add(tt, *ptep,
+				hva, gpa, 1 << shift))
+			return ERROR_ADDR;
+	}
+#endif
 	return (void *) hva;
 }
 
@@ -267,7 +349,7 @@ long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
 			if (iommu_tce_put_param_check(tbl, ioba, tce))
 				return H_PARAMETER;
 
-			hva = kvmppc_virtmode_gpa_to_hva(vcpu, tce);
+			hva = kvmppc_virtmode_gpa_to_hva(vcpu, tt, tce);
 			if (hva == ERROR_ADDR)
 				return H_HARDWARE;
 
@@ -319,7 +401,7 @@ long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	if (tce_list & ~IOMMU_PAGE_MASK)
 		return H_PARAMETER;
 
-	tces = kvmppc_virtmode_gpa_to_hva(vcpu, tce_list);
+	tces = kvmppc_virtmode_gpa_to_hva(vcpu, NULL, tce_list);
 	if (tces == ERROR_ADDR)
 		return H_TOO_HARD;
 
@@ -354,7 +436,7 @@ long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 
 		/* Translate TCEs */
 		for (i = vcpu->arch.tce_tmp_num; i < npages; ++i) {
-			void *hva = kvmppc_virtmode_gpa_to_hva(vcpu,
+			void *hva = kvmppc_virtmode_gpa_to_hva(vcpu, tt,
 					vcpu->arch.tce_tmp[i]);
 
 			if (hva == ERROR_ADDR)
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index dc4ae32..6245365 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -178,6 +178,7 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
  * Also returns pte and page size if the page is present in page table.
  */
 static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
+		struct kvmppc_spapr_tce_table *tt,
 		unsigned long gpa, bool do_get_page)
 {
 	struct kvm_memory_slot *memslot;
@@ -185,7 +186,31 @@ static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
 	unsigned long hva, hpa, pg_size = 0, offset;
 	unsigned long gfn = gpa >> PAGE_SHIFT;
 	bool writing = gpa & TCE_PCI_WRITE;
+	struct kvmppc_iommu_hugepage *hp;
 
+	/*
+	 * Try to find an already used hugepage.
+	 * If it is not there, kvmppc_lookup_pte() will return zero
+	 * as it won't do get_page() on a huge page in real mode
+	 * and therefore the request will be passed to virtual mode.
+	 */
+	if (tt) {
+		spin_lock(&tt->hugepages_lock);
+		list_for_each_entry(hp, &tt->hugepages, list) {
+			if ((gpa < hp->gpa) || (gpa >= hp->gpa + hp->size))
+				continue;
+
+			/* Calculate host phys address keeping flags and offset in the page */
+			offset = gpa & (hp->size - 1);
+
+			/* pte_pfn(pte) should return a pfn aligned to pg_size */
+			hpa = (pte_pfn(hp->pte) << PAGE_SHIFT) + offset;
+			spin_unlock(&tt->hugepages_lock);
+
+			return hpa;
+		}
+		spin_unlock(&tt->hugepages_lock);
+	}
 	/* Find a KVM memslot */
 	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
 	if (!memslot)
@@ -237,6 +262,10 @@ static long kvmppc_clear_tce_real_mode(struct kvm_vcpu *vcpu,
 		if (oldtce & TCE_PCI_WRITE)
 			SetPageDirty(page);
 
+		/* Do not put a huge page and continue without error */
+		if (PageCompound(page))
+			continue;
+
 		if (realmode_put_page(page)) {
 			ret = H_TOO_HARD;
 			break;
@@ -282,7 +311,7 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 			if (iommu_tce_put_param_check(tbl, ioba, tce))
 				return H_PARAMETER;
 
-			hpa = kvmppc_realmode_gpa_to_hpa(vcpu, tce, true);
+			hpa = kvmppc_realmode_gpa_to_hpa(vcpu, tt, tce, true);
 			if (hpa == ERROR_ADDR) {
 				vcpu->arch.tce_reason = H_TOO_HARD;
 				return H_TOO_HARD;
@@ -295,6 +324,11 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 			if (unlikely(ret)) {
 				struct page *pg = realmode_pfn_to_page(hpa);
 				BUG_ON(!pg);
+
+				/* Do not put a huge page and return an error */
+				if (!PageCompound(pg))
+					return H_HARDWARE;
+
 				if (realmode_put_page(pg)) {
 					vcpu->arch.tce_reason = H_HARDWARE;
 					return H_TOO_HARD;
@@ -351,7 +385,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	vcpu->arch.tce_tmp_num = 0;
 	vcpu->arch.tce_reason = 0;
 
-	tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu,
+	tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu, NULL,
 			tce_list, false);
 	if ((unsigned long)tces == ERROR_ADDR)
 		return H_TOO_HARD;
@@ -374,7 +408,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 
 		/* Translate TCEs and go get_page */
 		for (i = 0; i < npages; ++i) {
-			unsigned long hpa = kvmppc_realmode_gpa_to_hpa(vcpu,
+			unsigned long hpa = kvmppc_realmode_gpa_to_hpa(vcpu, tt,
 					vcpu->arch.tce_tmp[i], true);
 			if (hpa == ERROR_ADDR) {
 				vcpu->arch.tce_tmp_num = i;
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/4 v3] KVM: PPC: IOMMU in-kernel handling
  2013-06-05  6:11 [PATCH 0/4 v3] KVM: PPC: IOMMU in-kernel handling Alexey Kardashevskiy
                   ` (3 preceding siblings ...)
  2013-06-05  6:11 ` [PATCH 4/4] KVM: PPC: Add hugepage " Alexey Kardashevskiy
@ 2013-06-12  3:14 ` Benjamin Herrenschmidt
  4 siblings, 0 replies; 69+ messages in thread
From: Benjamin Herrenschmidt @ 2013-06-12  3:14 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, David Gibson, Alexander Graf, Paul Mackerras, kvm,
	linux-kernel, kvm-ppc

On Wed, 2013-06-05 at 16:11 +1000, Alexey Kardashevskiy wrote:
> Ben, ping! :)
> 
> This series has tiny fixes (capability and ioctl numbers,
> changed documentation, compile errors in some configuration).
> More details are in the commit messages.
> Rebased on v3.10-rc4.

Alex, I assume you'll merge that once I ack it ?

Cheers,
Ben.

> 
> Alexey Kardashevskiy (4):
>   KVM: PPC: Add support for multiple-TCE hcalls
>   powerpc: Prepare to support kernel handling of IOMMU map/unmap
>   KVM: PPC: Add support for IOMMU in-kernel handling
>   KVM: PPC: Add hugepage support for IOMMU in-kernel handling
> 
>  Documentation/virtual/kvm/api.txt        |   45 +++
>  arch/powerpc/include/asm/kvm_host.h      |    7 +
>  arch/powerpc/include/asm/kvm_ppc.h       |   40 ++-
>  arch/powerpc/include/asm/pgtable-ppc64.h |    4 +
>  arch/powerpc/include/uapi/asm/kvm.h      |    7 +
>  arch/powerpc/kvm/book3s_64_vio.c         |  398 ++++++++++++++++++++++++-
>  arch/powerpc/kvm/book3s_64_vio_hv.c      |  471 ++++++++++++++++++++++++++++--
>  arch/powerpc/kvm/book3s_hv.c             |   39 +++
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S  |    6 +
>  arch/powerpc/kvm/book3s_pr_papr.c        |   37 ++-
>  arch/powerpc/kvm/powerpc.c               |   15 +
>  arch/powerpc/mm/init_64.c                |   77 ++++-
>  include/uapi/linux/kvm.h                 |    3 +
>  13 files changed, 1121 insertions(+), 28 deletions(-)
> 



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls
  2013-06-05  6:11 ` [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls Alexey Kardashevskiy
@ 2013-06-16  4:20   ` Benjamin Herrenschmidt
  2013-06-16 22:06   ` Alexander Graf
  1 sibling, 0 replies; 69+ messages in thread
From: Benjamin Herrenschmidt @ 2013-06-16  4:20 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, David Gibson, Alexander Graf, Paul Mackerras, kvm,
	linux-kernel, kvm-ppc

On Wed, 2013-06-05 at 16:11 +1000, Alexey Kardashevskiy wrote:
> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
> H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO
> devices or emulated PCI.  These calls allow adding multiple entries
> (up to 512) into the TCE table in one call which saves time on
> transition to/from real mode.
> 
> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
> (copied from user and verified) before writing the whole list into
> the TCE table. This cache will be utilized more in the upcoming
> VFIO/IOMMU support to continue TCE list processing in the virtual
> mode in the case if the real mode handler failed for some reason.
> 
> This adds a guest physical to host real address converter
> and calls the existing H_PUT_TCE handler. The converting function
> is going to be fully utilized by upcoming VFIO supporting patches.
> 
> This also implements the KVM_CAP_PPC_MULTITCE capability,
> so in order to support the functionality of this patch, QEMU
> needs to query for this capability and set the "hcall-multi-tce"
> hypertas property only if the capability is present, otherwise
> there will be serious performance degradation.
> 
> Cc: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Signed-off-by: Paul Mackerras <paulus@samba.org>
> 
> ---
> Changelog:
> 2013/06/05:
> * fixed typo about IBMVIO in the commit message
> * updated doc and moved it to another section
> * changed capability number
> 
> 2013/05/21:
> * added kvm_vcpu_arch::tce_tmp
> * removed cleanup if put_indirect failed, instead we do not even start
> writing to the TCE table if we cannot get TCEs from the user or they are
> invalid
> * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
> and kvmppc_emulated_validate_tce (for the previous item)
> * fixed bug with fallthrough for H_IPI
> * removed all get_user() from real mode handlers
> * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
> ---
>  Documentation/virtual/kvm/api.txt       |   17 ++
>  arch/powerpc/include/asm/kvm_host.h     |    2 +
>  arch/powerpc/include/asm/kvm_ppc.h      |   16 +-
>  arch/powerpc/kvm/book3s_64_vio.c        |  118 ++++++++++++++
>  arch/powerpc/kvm/book3s_64_vio_hv.c     |  266 +++++++++++++++++++++++++++----
>  arch/powerpc/kvm/book3s_hv.c            |   39 +++++
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S |    6 +
>  arch/powerpc/kvm/book3s_pr_papr.c       |   37 ++++-
>  arch/powerpc/kvm/powerpc.c              |    3 +
>  include/uapi/linux/kvm.h                |    1 +
>  10 files changed, 473 insertions(+), 32 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index 5f91eda..6c082ff 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -2362,6 +2362,23 @@ calls by the guest for that service will be passed to userspace to be
>  handled.
>  
> 
> +4.83 KVM_CAP_PPC_MULTITCE
> +
> +Capability: KVM_CAP_PPC_MULTITCE
> +Architectures: ppc
> +Type: vm
> +
> +This capability tells the guest that the kernel supports handling of hypercalls
> +that add/remove multiple TCE entries. This significantly accelerates DMA
> +operations for PPC KVM guests.
> +
> +Unlike other capabilities in this section, this one does not have an ioctl.
> +Instead, when the capability is present, the H_PUT_TCE_INDIRECT and
> +H_STUFF_TCE hypercalls are to be handled in the host kernel and not passed to
> +userspace. Otherwise it might be better for the guest to continue using the
> +H_PUT_TCE hypercall (if KVM_CAP_SPAPR_TCE or KVM_CAP_SPAPR_TCE_IOMMU is present).
> +
> +
>  5. The kvm_run structure
>  ------------------------
>  
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index af326cd..85d8f26 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -609,6 +609,8 @@ struct kvm_vcpu_arch {
>  	spinlock_t tbacct_lock;
>  	u64 busy_stolen;
>  	u64 busy_preempt;
> +
> +	unsigned long *tce_tmp;    /* TCE cache for H_PUT_TCE_INDIRECT hcall */
>  #endif
>  };
>  
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> index a5287fe..e852921b 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -133,8 +133,20 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
>  
>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  				struct kvm_create_spapr_tce *args);
> -extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> -			     unsigned long ioba, unsigned long tce);
> +extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
> +		struct kvm_vcpu *vcpu, unsigned long liobn);
> +extern long kvmppc_emulated_validate_tce(unsigned long tce);
> +extern void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
> +		unsigned long ioba, unsigned long tce);
> +extern long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce);
> +extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_list, unsigned long npages);
> +extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_value, unsigned long npages);
>  extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
>  				struct kvm_allocate_rma *rma);
>  extern struct kvmppc_linear_info *kvm_alloc_rma(void);
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> index b2d3f3b..06b7b20 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -14,6 +14,7 @@
>   *
>   * Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
>   * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com>
> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com>
>   */
>  
>  #include <linux/types.h>
> @@ -36,8 +37,11 @@
>  #include <asm/ppc-opcode.h>
>  #include <asm/kvm_host.h>
>  #include <asm/udbg.h>
> +#include <asm/iommu.h>
> +#include <asm/tce.h>
>  
>  #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
> +#define ERROR_ADDR      ((void *)~(unsigned long)0x0)
>  
>  static long kvmppc_stt_npages(unsigned long window_size)
>  {
> @@ -148,3 +152,117 @@ fail:
>  	}
>  	return ret;
>  }
> +
> +/* Converts guest physical address into host virtual */
> +static void __user *kvmppc_virtmode_gpa_to_hva(struct kvm_vcpu *vcpu,
> +		unsigned long gpa)
> +{
> +	unsigned long hva, gfn = gpa >> PAGE_SHIFT;
> +	struct kvm_memory_slot *memslot;
> +
> +	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
> +	if (!memslot)
> +		return ERROR_ADDR;
> +
> +	hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa & ~PAGE_MASK);
> +	return (void *) hva;
> +}
> +
> +long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce)
> +{
> +	long ret;
> +	struct kvmppc_spapr_tce_table *tt;
> +
> +	tt = kvmppc_find_tce_table(vcpu, liobn);
> +	/* Didn't find the liobn, punt it to userspace */
> +	if (!tt)
> +		return H_TOO_HARD;
> +
> +	/* Emulated IO */
> +	if (ioba >= tt->window_size)
> +		return H_PARAMETER;
> +
> +	ret = kvmppc_emulated_validate_tce(tce);
> +	if (ret)
> +		return ret;
> +
> +	kvmppc_emulated_put_tce(tt, ioba, tce);
> +
> +	return H_SUCCESS;
> +}
> +
> +long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_list, unsigned long npages)
> +{
> +	struct kvmppc_spapr_tce_table *tt;
> +	long i, ret;
> +	unsigned long __user *tces;
> +
> +	tt = kvmppc_find_tce_table(vcpu, liobn);
> +	/* Didn't find the liobn, punt it to userspace */
> +	if (!tt)
> +		return H_TOO_HARD;
> +
> +	/*
> +	 * The spec says that the maximum size of the list is 512 TCEs, so
> +	 * the whole table fits within a single 4K page.
> +	 */
> +	if (npages > 512)
> +		return H_PARAMETER;
> +
> +	if (tce_list & ~IOMMU_PAGE_MASK)
> +		return H_PARAMETER;
> +
> +	tces = kvmppc_virtmode_gpa_to_hva(vcpu, tce_list);
> +	if (tces == ERROR_ADDR)
> +		return H_TOO_HARD;
> +
> +	/* Emulated IO */
> +	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
> +		return H_PARAMETER;
> +
> +	for (i = 0; i < npages; ++i) {
> +		if (get_user(vcpu->arch.tce_tmp[i], tces + i))
> +			return H_PARAMETER;
> +
> +		ret = kvmppc_emulated_validate_tce(vcpu->arch.tce_tmp[i]);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	for (i = 0; i < npages; ++i)
> +		kvmppc_emulated_put_tce(tt,
> +				ioba + (i << IOMMU_PAGE_SHIFT),
> +				vcpu->arch.tce_tmp[i]);
> +
> +	return H_SUCCESS;
> +}
> +
> +long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_value, unsigned long npages)
> +{
> +	struct kvmppc_spapr_tce_table *tt;
> +	long i, ret;
> +
> +	tt = kvmppc_find_tce_table(vcpu, liobn);
> +	/* Didn't find the liobn, punt it to userspace */
> +	if (!tt)
> +		return H_TOO_HARD;
> +
> +	/* Emulated IO */
> +	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
> +		return H_PARAMETER;
> +
> +	ret = kvmppc_emulated_validate_tce(tce_value);
> +	if (ret || (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ)))
> +		return H_PARAMETER;
> +
> +	for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
> +		kvmppc_emulated_put_tce(tt, ioba, tce_value);
> +
> +	return H_SUCCESS;
> +}
> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> index 30c2f3b..c68d538 100644
> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -14,6 +14,7 @@
>   *
>   * Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
>   * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com>
> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com>
>   */
>  
>  #include <linux/types.h>
> @@ -35,42 +36,249 @@
>  #include <asm/ppc-opcode.h>
>  #include <asm/kvm_host.h>
>  #include <asm/udbg.h>
> +#include <asm/iommu.h>
> +#include <asm/tce.h>
>  
>  #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
> +#define ERROR_ADDR      (~(unsigned long)0x0)
>  
> -/* WARNING: This will be called in real-mode on HV KVM and virtual
> - *          mode on PR KVM
> +/* Finds a TCE table descriptor by LIOBN */
> +struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm_vcpu *vcpu,
> +		unsigned long liobn)
> +{
> +	struct kvmppc_spapr_tce_table *tt;
> +
> +	list_for_each_entry(tt, &vcpu->kvm->arch.spapr_tce_tables, list) {
> +		if (tt->liobn == liobn)
> +			return tt;
> +	}
> +
> +	return NULL;
> +}
> +EXPORT_SYMBOL_GPL(kvmppc_find_tce_table);
> +
> +/*
> + * Validate TCE address.
> + * At the moment only flags are validated
> + * as other checks would significantly slow down
> + * or even make it impossible to handle TCE requests
> + * in real mode.
> + */
> +long kvmppc_emulated_validate_tce(unsigned long tce)
> +{
> +	if (tce & ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ))
> +		return H_PARAMETER;
> +
> +	return H_SUCCESS;
> +}
> +EXPORT_SYMBOL_GPL(kvmppc_emulated_validate_tce);
> +
> +/*
> + * kvmppc_emulated_put_tce() handles TCE requests for devices emulated
> + * by QEMU. It puts guest TCE values into the table and expects
> + * QEMU to convert them later in the QEMU device implementation.
> + * Works in both real and virtual modes.
> + * It cannot fail, so kvmppc_emulated_validate_tce() must be called before it.
>   */
> +void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
> +		unsigned long ioba, unsigned long tce)
> +{
> +	unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
> +	struct page *page;
> +	u64 *tbl;
> +
> +	/*
> +	 * Note on the use of page_address() in real mode,
> +	 *
> +	 * It is safe to use page_address() in real mode on ppc64 because
> +	 * page_address() is always defined as lowmem_page_address()
> +	 * which returns __va(PFN_PHYS(page_to_pfn(page))), a purely arithmetic
> +	 * operation that does not access the page struct.
> +	 *
> +	 * Theoretically page_address() could be defined differently,
> +	 * but then either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL
> +	 * would have to be enabled.
> +	 * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64,
> +	 * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only
> +	 * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP
> +	 * is not expected to be enabled on ppc32, page_address()
> +	 * is safe for ppc32 as well.
> +	 */
> +#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
> +#error TODO: fix to avoid page_address() here
> +#endif
> +	page = tt->pages[idx / TCES_PER_PAGE];
> +	tbl = (u64 *)page_address(page);
> +
> +	/* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
> +	tbl[idx % TCES_PER_PAGE] = tce;
> +}
> +EXPORT_SYMBOL_GPL(kvmppc_emulated_put_tce);
> +
> +#ifdef CONFIG_KVM_BOOK3S_64_HV
> +
> +static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
> +			unsigned long *pte_sizep)
> +{
> +	pte_t *ptep;
> +	unsigned int shift = 0;
> +	pte_t pte, tmp, ret;
> +
> +	ptep = find_linux_pte_or_hugepte(pgdir, hva, &shift);
> +	if (!ptep)
> +		return __pte(0);
> +	if (shift)
> +		*pte_sizep = 1ul << shift;
> +	else
> +		*pte_sizep = PAGE_SIZE;
> +
> +	if (!pte_present(*ptep))
> +		return __pte(0);
> +
> +	/* wait until _PAGE_BUSY is clear then set it atomically */
> +	__asm__ __volatile__ (
> +		"1:	ldarx	%0,0,%3\n"
> +		"	andi.	%1,%0,%4\n"
> +		"	bne-	1b\n"
> +		"	ori	%1,%0,%4\n"
> +		"	stdcx.	%1,0,%3\n"
> +		"	bne-	1b"
> +		: "=&r" (pte), "=&r" (tmp), "=m" (*ptep)
> +		: "r" (ptep), "i" (_PAGE_BUSY)
> +		: "cc");
> +
> +	ret = pte;
> +
> +	return ret;
> +}

The test for pte_present() needs to be done again after you lock the PTE
since potentially you could have raced with an invalidation.

More worrisome: You set _PAGE_BUSY above (lock the PTE). But when do you
clear it ?

It looks like you rely on _PAGE_BUSY being set to protect yourself
against any concurrent invalidation (or other change to the PTE) while
you access the underlying page. This is *somewhat* ok, though frowned
upon since you end up locking the PTE a lot longer (in real mode) than
we normally do, but in any case, you need to ensure that you release
that lock.
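
For reference, releasing it is just a matter of storing the saved PTE value
back once the access is complete, e.g. (a sketch, matching what patch 3 of
this series later does):

	/*
	 * 'pte' was read by the ldarx before _PAGE_BUSY was set, so
	 * writing it back releases the lock.
	 */
	*ptep = pte;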

Also you must *not* continue using the resulting physical address after
releasing the lock since the page might be invalidated/freed/swapped_out
etc... at any point once you clear busy.

It might be better to use the MMU notifiers here to catch concurrent
invalidations rather than locking the PTE for a long time. If a
concurrent invalidation happens, just return TOO_HARD.
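
Just to illustrate, the usual shape of that with the existing
kvm->mmu_notifier_seq machinery (same pattern as the HV page fault path;
untested, and it assumes the vcpu is visible at this level):

	unsigned long mmu_seq = vcpu->kvm->mmu_notifier_seq;
	smp_rmb();	/* read the seq before walking the page table */

	ptep = find_linux_pte_or_hugepte(pgdir, hva, &shift);
	if (!ptep || !pte_present(*ptep))
		return __pte(0);
	/* ... compute and use the translation ... */

	/* raced with an invalidate? then punt to virtual mode */
	if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
		return __pte(0);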

> +
> +/*
> + * Converts guest physical address into host physical address.
> + * Also returns pte and page size if the page is present in page table.
> + */
> +static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
> +		unsigned long gpa)
> +{
> +	struct kvm_memory_slot *memslot;
> +	pte_t pte;
> +	unsigned long hva, hpa, pg_size = 0, offset;
> +	unsigned long gfn = gpa >> PAGE_SHIFT;
> +	bool writing = gpa & TCE_PCI_WRITE;
> +
> +	/* Find a KVM memslot */
> +	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
> +	if (!memslot)
> +		return ERROR_ADDR;
> +
> +	/* Convert guest physical address to host virtual */
> +	hva = __gfn_to_hva_memslot(memslot, gfn);
> +
> +	/* Find a PTE and determine the size */
> +	pte = kvmppc_lookup_pte(vcpu->arch.pgdir, hva,
> +			writing, &pg_size);
> +	if (!pte)
> +		return ERROR_ADDR;
> +
> +	/* Calculate host phys address keeping flags and offset in the page */
> +	offset = gpa & (pg_size - 1);
> +
> +	/* pte_pfn(pte) should return a pfn aligned to pg_size */
> +	hpa = (pte_pfn(pte) << PAGE_SHIFT) + offset;
> +
> +	return hpa;
> +}

Do you ever test whether the page protection on the PTE allows for
access ? At the moment you only use that to read from TCEs so chances
that this is wrong are slim (you wouldn't have PROT_NONE on TCE tables),
but it's still a worry to have code like that.

Also you do not set _PAGE_ACCESSED either, which means the VM doesn't
know the page is being accessed. Not necessarily a huge deal in this
specific case, but still.
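
Something along these lines before using the translation would cover both
points ('pte' and 'writing' as in the quoted code; a sketch only, and the
young/dirty bits only stick if the updated value is written back to the
page table):

	/* refuse to hand out a translation a read-only PTE doesn't allow */
	if (writing && !pte_write(pte))
		return __pte(0);

	/* let the VM know the page was accessed (and possibly dirtied) */
	pte = pte_mkyoung(pte);
	if (writing)
		pte = pte_mkdirty(pte);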

>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		      unsigned long ioba, unsigned long tce)
>  {
> -	struct kvm *kvm = vcpu->kvm;
> -	struct kvmppc_spapr_tce_table *stt;
> -
> -	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
> -	/* 	    liobn, ioba, tce); */
> -
> -	list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
> -		if (stt->liobn == liobn) {
> -			unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
> -			struct page *page;
> -			u64 *tbl;
> -
> -			/* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p  window_size=0x%x\n", */
> -			/* 	    liobn, stt, stt->window_size); */
> -			if (ioba >= stt->window_size)
> -				return H_PARAMETER;
> -
> -			page = stt->pages[idx / TCES_PER_PAGE];
> -			tbl = (u64 *)page_address(page);
> -
> -			/* FIXME: Need to validate the TCE itself */
> -			/* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
> -			tbl[idx % TCES_PER_PAGE] = tce;
> -			return H_SUCCESS;
> -		}
> +	long ret;
> +	struct kvmppc_spapr_tce_table *tt;
> +
> +	tt = kvmppc_find_tce_table(vcpu, liobn);
> +	/* Didn't find the liobn, punt it to virtual mode */
> +	if (!tt)
> +		return H_TOO_HARD;
> +
> +	/* Emulated IO */
> +	if (ioba >= tt->window_size)
> +		return H_PARAMETER;
> +
> +	ret = kvmppc_emulated_validate_tce(tce);
> +	if (ret)
> +		return ret;
> +
> +	kvmppc_emulated_put_tce(tt, ioba, tce);
> +
> +	return H_SUCCESS;
> +}
> +
> +long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_list,	unsigned long npages)
> +{
> +	struct kvmppc_spapr_tce_table *tt;
> +	long i, ret;
> +	unsigned long *tces;
> +
> +	tt = kvmppc_find_tce_table(vcpu, liobn);
> +	/* Didn't find the liobn, punt it to virtual mode */
> +	if (!tt)
> +		return H_TOO_HARD;
> +
> +	/*
> +	 * The spec says that the maximum size of the list is 512 TCEs, so
> +	 * the whole table fits within a single 4K page.
> +	 */
> +	if (npages > 512)
> +		return H_PARAMETER;
> +
> +	if (tce_list & ~IOMMU_PAGE_MASK)
> +		return H_PARAMETER;
> +
> +	tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu, tce_list);
> +	if ((unsigned long)tces == ERROR_ADDR)
> +		return H_TOO_HARD;
> +
> +	/* Emulated IO */
> +	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
> +		return H_PARAMETER;
> +
> +	for (i = 0; i < npages; ++i) {
> +		ret = kvmppc_emulated_validate_tce(tces[i]);
> +		if (ret)
> +			return ret;
>  	}
>  
> -	/* Didn't find the liobn, punt it to userspace */
> -	return H_TOO_HARD;
> +	for (i = 0; i < npages; ++i)
> +		kvmppc_emulated_put_tce(tt, ioba + (i << IOMMU_PAGE_SHIFT),
> +				tces[i]);
> +
> +	return H_SUCCESS;
> +}
> +
> +long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_value, unsigned long npages)
> +{
> +	struct kvmppc_spapr_tce_table *tt;
> +	long i, ret;
> +
> +	tt = kvmppc_find_tce_table(vcpu, liobn);
> +	/* Didn't find the liobn, punt it to virtual mode */
> +	if (!tt)
> +		return H_TOO_HARD;
> +
> +	/* Emulated IO */
> +	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
> +		return H_PARAMETER;
> +
> +	ret = kvmppc_emulated_validate_tce(tce_value);
> +	if (ret || (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ)))
> +		return H_PARAMETER;
> +
> +	for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
> +		kvmppc_emulated_put_tce(tt, ioba, tce_value);
> +
> +	return H_SUCCESS;
>  }
> +#endif /* CONFIG_KVM_BOOK3S_64_HV */
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 550f592..a39039a 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -568,6 +568,30 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
>  			ret = kvmppc_xics_hcall(vcpu, req);
>  			break;
>  		} /* fallthrough */
> +		return RESUME_HOST;
> +	case H_PUT_TCE:
> +		ret = kvmppc_virtmode_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
> +						kvmppc_get_gpr(vcpu, 5),
> +						kvmppc_get_gpr(vcpu, 6));
> +		if (ret == H_TOO_HARD)
> +			return RESUME_HOST;
> +		break;
> +	case H_PUT_TCE_INDIRECT:
> +		ret = kvmppc_virtmode_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4),
> +						kvmppc_get_gpr(vcpu, 5),
> +						kvmppc_get_gpr(vcpu, 6),
> +						kvmppc_get_gpr(vcpu, 7));
> +		if (ret == H_TOO_HARD)
> +			return RESUME_HOST;
> +		break;
> +	case H_STUFF_TCE:
> +		ret = kvmppc_virtmode_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
> +						kvmppc_get_gpr(vcpu, 5),
> +						kvmppc_get_gpr(vcpu, 6),
> +						kvmppc_get_gpr(vcpu, 7));
> +		if (ret == H_TOO_HARD)
> +			return RESUME_HOST;
> +		break;
>  	default:
>  		return RESUME_HOST;
>  	}
> @@ -958,6 +982,20 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
>  	vcpu->arch.cpu_type = KVM_CPU_3S_64;
>  	kvmppc_sanity_check(vcpu);
>  
> +	/*
> +	 * As we want to minimize the chance of having H_PUT_TCE_INDIRECT
> +	 * half executed, we first read TCEs from the user, check them and
> +	 * return an error if something went wrong, and only then put TCEs into
> +	 * the TCE table.
> +	 *
> +	 * tce_tmp is a cache for TCEs to avoid stack allocation or
> +	 * kmalloc, as the whole TCE list can take up to 512 items of 8 bytes
> +	 * each (4096 bytes).
> +	 */
> +	vcpu->arch.tce_tmp = kmalloc(4096, GFP_KERNEL);
> +	if (!vcpu->arch.tce_tmp)
> +		goto free_vcpu;
> +
>  	return vcpu;
>  
>  free_vcpu:
> @@ -980,6 +1018,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
>  	unpin_vpa(vcpu->kvm, &vcpu->arch.slb_shadow);
>  	unpin_vpa(vcpu->kvm, &vcpu->arch.vpa);
>  	spin_unlock(&vcpu->arch.vpa_update_lock);
> +	kfree(vcpu->arch.tce_tmp);
>  	kvm_vcpu_uninit(vcpu);
>  	kmem_cache_free(kvm_vcpu_cache, vcpu);
>  }
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index b02f91e..d35554e 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -1490,6 +1490,12 @@ hcall_real_table:
>  	.long	0		/* 0x11c */
>  	.long	0		/* 0x120 */
>  	.long	.kvmppc_h_bulk_remove - hcall_real_table
> +	.long	0		/* 0x128 */
> +	.long	0		/* 0x12c */
> +	.long	0		/* 0x130 */
> +	.long	0		/* 0x134 */
> +	.long	.kvmppc_h_stuff_tce - hcall_real_table
> +	.long	.kvmppc_h_put_tce_indirect - hcall_real_table
>  hcall_real_table_end:
>  
>  ignore_hdec:
> diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c
> index da0e0bc..91d4b45 100644
> --- a/arch/powerpc/kvm/book3s_pr_papr.c
> +++ b/arch/powerpc/kvm/book3s_pr_papr.c
> @@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu)
>  	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
>  	long rc;
>  
> -	rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
> +	rc = kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce);
> +	if (rc == H_TOO_HARD)
> +		return EMULATE_FAIL;
> +	kvmppc_set_gpr(vcpu, 3, rc);
> +	return EMULATE_DONE;
> +}
> +
> +static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
> +	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
> +	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
> +	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
> +	long rc;
> +
> +	rc = kvmppc_virtmode_h_put_tce_indirect(vcpu, liobn, ioba,
> +			tce, npages);
> +	if (rc == H_TOO_HARD)
> +		return EMULATE_FAIL;
> +	kvmppc_set_gpr(vcpu, 3, rc);
> +	return EMULATE_DONE;
> +}
> +
> +static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
> +	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
> +	unsigned long tce_value = kvmppc_get_gpr(vcpu, 6);
> +	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
> +	long rc;
> +
> +	rc = kvmppc_virtmode_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages);
>  	if (rc == H_TOO_HARD)
>  		return EMULATE_FAIL;
>  	kvmppc_set_gpr(vcpu, 3, rc);
> @@ -247,6 +278,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd)
>  		return kvmppc_h_pr_bulk_remove(vcpu);
>  	case H_PUT_TCE:
>  		return kvmppc_h_pr_put_tce(vcpu);
> +	case H_PUT_TCE_INDIRECT:
> +		return kvmppc_h_pr_put_tce_indirect(vcpu);
> +	case H_STUFF_TCE:
> +		return kvmppc_h_pr_stuff_tce(vcpu);
>  	case H_CEDE:
>  		vcpu->arch.shared->msr |= MSR_EE;
>  		kvm_vcpu_block(vcpu);
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index 6316ee3..8465c2a 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -395,6 +395,9 @@ int kvm_dev_ioctl_check_extension(long ext)
>  		r = 1;
>  		break;
>  #endif
> +	case KVM_CAP_SPAPR_MULTITCE:
> +		r = 1;
> +		break;
>  	default:
>  		r = 0;
>  		break;
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index a5c86fc..fc0d6b9 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -666,6 +666,7 @@ struct kvm_ppc_smmu_info {
>  #define KVM_CAP_IRQ_MPIC 90
>  #define KVM_CAP_PPC_RTAS 91
>  #define KVM_CAP_IRQ_XICS 92
> +#define KVM_CAP_SPAPR_MULTITCE 93
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/4] powerpc: Prepare to support kernel handling of IOMMU map/unmap
  2013-06-05  6:11 ` [PATCH 2/4] powerpc: Prepare to support kernel handling of IOMMU map/unmap Alexey Kardashevskiy
@ 2013-06-16  4:26   ` Benjamin Herrenschmidt
  2013-06-16  4:31     ` Benjamin Herrenschmidt
  2013-06-17  9:17     ` Alexey Kardashevskiy
  0 siblings, 2 replies; 69+ messages in thread
From: Benjamin Herrenschmidt @ 2013-06-16  4:26 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, David Gibson, Alexander Graf, Paul Mackerras, kvm,
	linux-kernel, kvm-ppc, linux-mm

> +#if defined(CONFIG_SPARSEMEM_VMEMMAP) || defined(CONFIG_FLATMEM)
> +int realmode_get_page(struct page *page)
> +{
> +	if (PageCompound(page))
> +		return -EAGAIN;
> +
> +	get_page(page);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(realmode_get_page);
> +
> +int realmode_put_page(struct page *page)
> +{
> +	if (PageCompound(page))
> +		return -EAGAIN;
> +
> +	if (!atomic_add_unless(&page->_count, -1, 1))
> +		return -EAGAIN;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(realmode_put_page);
> +#endif

Several worries here, mostly that if the generic code ever changes
(something gets added to get_page() that makes it no longer safe for use
in real mode for example, or some other condition gets added to
put_page()), we go out of sync and potentially end up with very hard and
very subtle bugs.

It might be worth making sure that:

 - This is reviewed by some generic VM people (and make sure they
understand why we need to do that)

 - A comment is added to get_page() and put_page() to make sure that if
they are changed in any way, double check the impact on our
realmode_get_page() (or "ping" us to make sure things are still ok).
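
For illustration, the comment could be as simple as (wording is only a
suggestion):

/*
 * NOTE: arch/powerpc KVM has simplified real-mode-safe variants of the
 * get_page()/put_page() logic (realmode_get_page()/realmode_put_page()).
 * If the semantics here change, check whether those variants need the
 * same change, or ping the powerpc maintainers.
 */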

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/4] powerpc: Prepare to support kernel handling of IOMMU map/unmap
  2013-06-16  4:26   ` Benjamin Herrenschmidt
@ 2013-06-16  4:31     ` Benjamin Herrenschmidt
  2013-06-17  9:17     ` Alexey Kardashevskiy
  1 sibling, 0 replies; 69+ messages in thread
From: Benjamin Herrenschmidt @ 2013-06-16  4:31 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, David Gibson, Alexander Graf, Paul Mackerras, kvm,
	linux-kernel, kvm-ppc, linux-mm

On Sun, 2013-06-16 at 14:26 +1000, Benjamin Herrenschmidt wrote:
> > +int realmode_get_page(struct page *page)
> > +{
> > +     if (PageCompound(page))
> > +             return -EAGAIN;
> > +
> > +     get_page(page);
> > +
> > +     return 0;
> > +}

Shouldn't it be get_page_unless_zero ?
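
i.e. something like this (sketch):

int realmode_get_page(struct page *page)
{
	if (PageCompound(page))
		return -EAGAIN;

	/* speculative ref: fails instead of resurrecting a freed page */
	if (!get_page_unless_zero(page))
		return -EAGAIN;

	return 0;
}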

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-05  6:11 ` [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling Alexey Kardashevskiy
@ 2013-06-16  4:39   ` Benjamin Herrenschmidt
  2013-06-19  3:17     ` Alexey Kardashevskiy
  2013-06-16 22:25   ` Alexander Graf
  2013-06-16 22:39   ` Benjamin Herrenschmidt
  2 siblings, 1 reply; 69+ messages in thread
From: Benjamin Herrenschmidt @ 2013-06-16  4:39 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, David Gibson, Alexander Graf, Paul Mackerras, kvm,
	linux-kernel, kvm-ppc

>  static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
> -			unsigned long *pte_sizep)
> +			unsigned long *pte_sizep, bool do_get_page)
>  {
>  	pte_t *ptep;
>  	unsigned int shift = 0;
> @@ -135,6 +136,14 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
>  	if (!pte_present(*ptep))
>  		return __pte(0);
>  
> +	/*
> +	 * Leave huge page handling to virtual mode.
> +	 * The only exception is TCE list pages, which we
> +	 * do need to call get_page() for.
> +	 */
> +	if ((*pte_sizep > PAGE_SIZE) && do_get_page)
> +		return __pte(0);
> +
>  	/* wait until _PAGE_BUSY is clear then set it atomically */
>  	__asm__ __volatile__ (
>  		"1:	ldarx	%0,0,%3\n"
> @@ -148,6 +157,18 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
>  		: "cc");
>  
>  	ret = pte;
> +	if (do_get_page && pte_present(pte) && (!writing || pte_write(pte))) {
> +		struct page *pg = NULL;
> +		pg = realmode_pfn_to_page(pte_pfn(pte));
> +		if (realmode_get_page(pg)) {
> +			ret = __pte(0);
> +		} else {
> +			pte = pte_mkyoung(pte);
> +			if (writing)
> +				pte = pte_mkdirty(pte);
> +		}
> +	}
> +	*ptep = pte;	/* clears _PAGE_BUSY */
>  
>  	return ret;
>  }

So now you are adding the clearing of _PAGE_BUSY that was missing from
your first patch, except that this is not enough since that means that
in the "emulated" case (ie, !do_get_page) you will in essence return
and then use a PTE that is not locked without any synchronization to
ensure that the underlying page doesn't go away... then you'll
dereference that page.

So either make everything use speculative get_page, or make the emulated
case use the MMU notifier to drop the operation in case of collision.

The former looks easier.

Also, any specific reason why you do:

  - Lock the PTE
  - get_page()
  - Unlock the PTE

Instead of

  - Read the PTE
  - get_page_unless_zero
  - re-check PTE

Like get_user_pages_fast() does ?

The former will be two atomic ops, the latter only one (faster), but
maybe you have a good reason why that can't work...
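
A rough sketch of the latter, reusing the realmode helpers from patch 2
(untested, and note there is no atomic op on the PTE at all):

	pte_t pte = ACCESS_ONCE(*ptep);
	struct page *pg;

	if (!pte_present(pte) || (writing && !pte_write(pte)))
		return __pte(0);

	pg = realmode_pfn_to_page(pte_pfn(pte));
	if (!pg || realmode_get_page(pg))
		return __pte(0);

	/* re-check the PTE: if it changed under us, drop the ref and bail */
	if (unlikely(pte_val(pte) != pte_val(ACCESS_ONCE(*ptep)))) {
		realmode_put_page(pg);
		return __pte(0);
	}

	return pte;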

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 4/4] KVM: PPC: Add hugepage support for IOMMU in-kernel handling
  2013-06-05  6:11 ` [PATCH 4/4] KVM: PPC: Add hugepage " Alexey Kardashevskiy
@ 2013-06-16  4:46   ` Benjamin Herrenschmidt
  2013-06-17 16:35   ` Paolo Bonzini
  1 sibling, 0 replies; 69+ messages in thread
From: Benjamin Herrenschmidt @ 2013-06-16  4:46 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, David Gibson, Alexander Graf, Paul Mackerras, kvm,
	linux-kernel, kvm-ppc

On Wed, 2013-06-05 at 16:11 +1000, Alexey Kardashevskiy wrote:

> @@ -185,7 +186,31 @@ static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
>  	unsigned long hva, hpa, pg_size = 0, offset;
>  	unsigned long gfn = gpa >> PAGE_SHIFT;
>  	bool writing = gpa & TCE_PCI_WRITE;
> +	struct kvmppc_iommu_hugepage *hp;
>  
> +	/*
> +	 * Try to find an already used hugepage.
> +	 * If it is not there, kvmppc_lookup_pte() will return zero
> +	 * as it won't do get_page() on a huge page in real mode
> +	 * and therefore the request will be passed to virtual mode.
> +	 */
> +	if (tt) {
> +		spin_lock(&tt->hugepages_lock);
> +		list_for_each_entry(hp, &tt->hugepages, list) {
> +			if ((gpa < hp->gpa) || (gpa >= hp->gpa + hp->size))
> +				continue;
> +
> +			/* Calculate host phys address keeping flags and offset in the page */
> +			offset = gpa & (hp->size - 1);
> +
> +			/* pte_pfn(pte) should return a pfn aligned to pg_size */
> +			hpa = (pte_pfn(hp->pte) << PAGE_SHIFT) + offset;
> +			spin_unlock(&tt->hugepages_lock);
> +
> +			return hpa;
> +		}
> +		spin_unlock(&tt->hugepages_lock);
> +	}

Wow .... this is run in real mode right ?

spin_lock() and spin_unlock() are a big no-no in real mode. If lockdep
and/or spinlock debugging are enabled and something goes pear-shaped
they are going to bring your whole system down in a blink in quite
horrible ways.

If you are going to do that, you need some kind of custom low-level
lock.

Also, I see that you are basically using a non-ordered list and doing a
linear search in it every time. That's going to COST !

You should really consider a more efficient data structure. You should
also be able to do something that doesn't require locks for readers.
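
For instance, an RCU list would make the real-mode reader lockless (a
sketch only - whether the RCU read side is actually safe to enter from
real mode would still need checking):

	/* writer, virtual mode only: */
	spin_lock(&tt->hugepages_lock);
	list_add_rcu(&hp->list, &tt->hugepages);
	spin_unlock(&tt->hugepages_lock);

	/* reader, real mode: */
	rcu_read_lock();
	list_for_each_entry_rcu(hp, &tt->hugepages, list) {
		if ((gpa >= hp->gpa) && (gpa < hp->gpa + hp->size)) {
			hpa = (pte_pfn(hp->pte) << PAGE_SHIFT) +
					(gpa & (hp->size - 1));
			break;
		}
	}
	rcu_read_unlock();

That still leaves the linear search, so a sorted array or a hash on gpa
would be needed on top of it.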

>  	/* Find a KVM memslot */
>  	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
>  	if (!memslot)
> @@ -237,6 +262,10 @@ static long kvmppc_clear_tce_real_mode(struct kvm_vcpu *vcpu,
>  		if (oldtce & TCE_PCI_WRITE)
>  			SetPageDirty(page);
>  
> +		/* Do not put a huge page and continue without error */
> +		if (PageCompound(page))
> +			continue;
> +
>  		if (realmode_put_page(page)) {
>  			ret = H_TOO_HARD;
>  			break;
> @@ -282,7 +311,7 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  			if (iommu_tce_put_param_check(tbl, ioba, tce))
>  				return H_PARAMETER;
>  
> -			hpa = kvmppc_realmode_gpa_to_hpa(vcpu, tce, true);
> +			hpa = kvmppc_realmode_gpa_to_hpa(vcpu, tt, tce, true);
>  			if (hpa == ERROR_ADDR) {
>  				vcpu->arch.tce_reason = H_TOO_HARD;
>  				return H_TOO_HARD;
> @@ -295,6 +324,11 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  			if (unlikely(ret)) {
>  				struct page *pg = realmode_pfn_to_page(hpa);
>  				BUG_ON(!pg);
> +
> +				/* Do not put a huge page and return an error */
> +				if (!PageCompound(pg))
> +					return H_HARDWARE;
> +
>  				if (realmode_put_page(pg)) {
>  					vcpu->arch.tce_reason = H_HARDWARE;
>  					return H_TOO_HARD;
> @@ -351,7 +385,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	vcpu->arch.tce_tmp_num = 0;
>  	vcpu->arch.tce_reason = 0;
>  
> -	tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu,
> +	tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu, NULL,
>  			tce_list, false);
>  	if ((unsigned long)tces == ERROR_ADDR)
>  		return H_TOO_HARD;
> @@ -374,7 +408,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  
>  		/* Translate TCEs and go get_page */
>  		for (i = 0; i < npages; ++i) {
> -			unsigned long hpa = kvmppc_realmode_gpa_to_hpa(vcpu,
> +			unsigned long hpa = kvmppc_realmode_gpa_to_hpa(vcpu, tt,
>  					vcpu->arch.tce_tmp[i], true);
>  			if (hpa == ERROR_ADDR) {
>  				vcpu->arch.tce_tmp_num = i;

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls
  2013-06-05  6:11 ` [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls Alexey Kardashevskiy
  2013-06-16  4:20   ` Benjamin Herrenschmidt
@ 2013-06-16 22:06   ` Alexander Graf
  2013-06-17  7:55     ` Alexey Kardashevskiy
  1 sibling, 1 reply; 69+ messages in thread
From: Alexander Graf @ 2013-06-16 22:06 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Benjamin Herrenschmidt, linuxppc-dev, David Gibson,
	Paul Mackerras, kvm, linux-kernel, kvm-ppc


On 05.06.2013, at 08:11, Alexey Kardashevskiy wrote:

> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
> H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO
> devices or emulated PCI.  These calls allow adding multiple entries
> (up to 512) into the TCE table in one call which saves time on
> transition to/from real mode.
> 
> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
> (copied from user and verified) before writing the whole list into
> the TCE table. This cache will be utilized more in the upcoming
> VFIO/IOMMU support to continue TCE list processing in the virtual
> mode in the case if the real mode handler failed for some reason.
> 
> This adds a guest physical to host real address converter
> and calls the existing H_PUT_TCE handler. The converting function
> is going to be fully utilized by upcoming VFIO supporting patches.
> 
> This also implements the KVM_CAP_PPC_MULTITCE capability,
> so in order to support the functionality of this patch, QEMU
> needs to query for this capability and set the "hcall-multi-tce"
> hypertas property only if the capability is present, otherwise
> there will be serious performance degradation.
> 
> Cc: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Signed-off-by: Paul Mackerras <paulus@samba.org>

Only a few minor nits. Ben already commented on implementation details.

> 
> ---
> Changelog:
> 2013/06/05:
> * fixed typo about IBMVIO in the commit message
> * updated doc and moved it to another section
> * changed capability number
> 
> 2013/05/21:
> * added kvm_vcpu_arch::tce_tmp
> * removed cleanup if put_indirect failed, instead we do not even start
> writing to the TCE table if we cannot get TCEs from the user or they are
> invalid
> * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
> and kvmppc_emulated_validate_tce (for the previous item)
> * fixed bug with fallthrough for H_IPI
> * removed all get_user() from real mode handlers
> * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
> ---
> Documentation/virtual/kvm/api.txt       |   17 ++
> arch/powerpc/include/asm/kvm_host.h     |    2 +
> arch/powerpc/include/asm/kvm_ppc.h      |   16 +-
> arch/powerpc/kvm/book3s_64_vio.c        |  118 ++++++++++++++
> arch/powerpc/kvm/book3s_64_vio_hv.c     |  266 +++++++++++++++++++++++++++----
> arch/powerpc/kvm/book3s_hv.c            |   39 +++++
> arch/powerpc/kvm/book3s_hv_rmhandlers.S |    6 +
> arch/powerpc/kvm/book3s_pr_papr.c       |   37 ++++-
> arch/powerpc/kvm/powerpc.c              |    3 +
> include/uapi/linux/kvm.h                |    1 +
> 10 files changed, 473 insertions(+), 32 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index 5f91eda..6c082ff 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -2362,6 +2362,23 @@ calls by the guest for that service will be passed to userspace to be
> handled.
> 
> 
> +4.83 KVM_CAP_PPC_MULTITCE
> +
> +Capability: KVM_CAP_PPC_MULTITCE
> +Architectures: ppc
> +Type: vm
> +
> +This capability tells the guest that the kernel supports handling of hypercalls
> +that add/remove multiple TCE entries. This significantly accelerates DMA
> +operations for PPC KVM guests.
> +
> +Unlike other capabilities in this section, this one does not have an ioctl.
> +Instead, when the capability is present, the H_PUT_TCE_INDIRECT and
> +H_STUFF_TCE hypercalls are to be handled in the host kernel and not passed to
> +userspace. Otherwise it might be better for the guest to continue using the
> +H_PUT_TCE hypercall (if KVM_CAP_SPAPR_TCE or KVM_CAP_SPAPR_TCE_IOMMU is present).

While this describes perfectly well what the consequences of the patches are, it does not properly describe what the CAP actually expresses. The CAP only says "this kernel is able to handle H_PUT_TCE_INDIRECT and H_STUFF_TCE hypercalls directly". All other consequences are nice to document, but the semantics of the CAP are missing.

We also usually try to keep KVM behavior unchanged with regard to older versions until a CAP is enabled. In this case I don't think it matters all that much, so I'm fine with declaring it as enabled by default. Please document that this is a change in behavior versus older KVM versions though.
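
For reference, the userspace check would look something like this (a
sketch; add_hypertas() here is a made-up stand-in for however QEMU builds
the ibm,hypertas-functions property):

	if (ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_SPAPR_MULTITCE) > 0)
		/* only now is it safe to advertise the hcalls to the guest */
		add_hypertas("hcall-multi-tce");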

> +
> +
> 5. The kvm_run structure
> ------------------------
> 
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index af326cd..85d8f26 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -609,6 +609,8 @@ struct kvm_vcpu_arch {
> 	spinlock_t tbacct_lock;
> 	u64 busy_stolen;
> 	u64 busy_preempt;
> +
> +	unsigned long *tce_tmp;    /* TCE cache for H_PUT_TCE_INDIRECT hcall */
> #endif
> };

[...]
> 
> 

[...]

> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 550f592..a39039a 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -568,6 +568,30 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
> 			ret = kvmppc_xics_hcall(vcpu, req);
> 			break;
> 		} /* fallthrough */

The fallthrough comment isn't accurate anymore.

> +		return RESUME_HOST;
> +	case H_PUT_TCE:
> +		ret = kvmppc_virtmode_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
> +						kvmppc_get_gpr(vcpu, 5),
> +						kvmppc_get_gpr(vcpu, 6));
> +		if (ret == H_TOO_HARD)
> +			return RESUME_HOST;
> +		break;
> +	case H_PUT_TCE_INDIRECT:
> +		ret = kvmppc_virtmode_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4),
> +						kvmppc_get_gpr(vcpu, 5),
> +						kvmppc_get_gpr(vcpu, 6),
> +						kvmppc_get_gpr(vcpu, 7));
> +		if (ret == H_TOO_HARD)
> +			return RESUME_HOST;
> +		break;
> +	case H_STUFF_TCE:
> +		ret = kvmppc_virtmode_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
> +						kvmppc_get_gpr(vcpu, 5),
> +						kvmppc_get_gpr(vcpu, 6),
> +						kvmppc_get_gpr(vcpu, 7));
> +		if (ret == H_TOO_HARD)
> +			return RESUME_HOST;
> +		break;
> 	default:
> 		return RESUME_HOST;
> 	}
> @@ -958,6 +982,20 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
> 	vcpu->arch.cpu_type = KVM_CPU_3S_64;
> 	kvmppc_sanity_check(vcpu);
> 
> +	/*
> +	 * As we want to minimize the chance of having H_PUT_TCE_INDIRECT
> +	 * half executed, we first read TCEs from the user, check them and
> +	 * return an error if something went wrong, and only then put TCEs into
> +	 * the TCE table.
> +	 *
> +	 * tce_tmp is a cache for TCEs to avoid stack allocation or
> +	 * kmalloc, as the whole TCE list can take up to 512 items of 8 bytes
> +	 * each (4096 bytes).
> +	 */
> +	vcpu->arch.tce_tmp = kmalloc(4096, GFP_KERNEL);
> +	if (!vcpu->arch.tce_tmp)
> +		goto free_vcpu;
> +
> 	return vcpu;
> 
> free_vcpu:
> @@ -980,6 +1018,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
> 	unpin_vpa(vcpu->kvm, &vcpu->arch.slb_shadow);
> 	unpin_vpa(vcpu->kvm, &vcpu->arch.vpa);
> 	spin_unlock(&vcpu->arch.vpa_update_lock);
> +	kfree(vcpu->arch.tce_tmp);
> 	kvm_vcpu_uninit(vcpu);
> 	kmem_cache_free(kvm_vcpu_cache, vcpu);
> }
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index b02f91e..d35554e 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -1490,6 +1490,12 @@ hcall_real_table:
> 	.long	0		/* 0x11c */
> 	.long	0		/* 0x120 */
> 	.long	.kvmppc_h_bulk_remove - hcall_real_table
> +	.long	0		/* 0x128 */
> +	.long	0		/* 0x12c */
> +	.long	0		/* 0x130 */
> +	.long	0		/* 0x134 */
> +	.long	.kvmppc_h_stuff_tce - hcall_real_table
> +	.long	.kvmppc_h_put_tce_indirect - hcall_real_table
> hcall_real_table_end:
> 
> ignore_hdec:
> diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c
> index da0e0bc..91d4b45 100644
> --- a/arch/powerpc/kvm/book3s_pr_papr.c
> +++ b/arch/powerpc/kvm/book3s_pr_papr.c
> @@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu)
> 	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
> 	long rc;
> 
> -	rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
> +	rc = kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce);
> +	if (rc == H_TOO_HARD)
> +		return EMULATE_FAIL;
> +	kvmppc_set_gpr(vcpu, 3, rc);
> +	return EMULATE_DONE;
> +}
> +
> +static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
> +	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
> +	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
> +	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
> +	long rc;
> +
> +	rc = kvmppc_virtmode_h_put_tce_indirect(vcpu, liobn, ioba,
> +			tce, npages);
> +	if (rc == H_TOO_HARD)
> +		return EMULATE_FAIL;
> +	kvmppc_set_gpr(vcpu, 3, rc);
> +	return EMULATE_DONE;
> +}
> +
> +static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
> +	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
> +	unsigned long tce_value = kvmppc_get_gpr(vcpu, 6);
> +	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
> +	long rc;
> +
> +	rc = kvmppc_virtmode_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages);
> 	if (rc == H_TOO_HARD)
> 		return EMULATE_FAIL;
> 	kvmppc_set_gpr(vcpu, 3, rc);
> @@ -247,6 +278,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd)
> 		return kvmppc_h_pr_bulk_remove(vcpu);
> 	case H_PUT_TCE:
> 		return kvmppc_h_pr_put_tce(vcpu);
> +	case H_PUT_TCE_INDIRECT:
> +		return kvmppc_h_pr_put_tce_indirect(vcpu);
> +	case H_STUFF_TCE:
> +		return kvmppc_h_pr_stuff_tce(vcpu);
> 	case H_CEDE:
> 		vcpu->arch.shared->msr |= MSR_EE;
> 		kvm_vcpu_block(vcpu);
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index 6316ee3..8465c2a 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -395,6 +395,9 @@ int kvm_dev_ioctl_check_extension(long ext)
> 		r = 1;
> 		break;
> #endif
> +	case KVM_CAP_SPAPR_MULTITCE:
> +		r = 1;

This should only be true for book3s.
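
i.e. move it under a book3s guard like the neighbouring capabilities,
something like:

#ifdef CONFIG_PPC_BOOK3S_64
	case KVM_CAP_SPAPR_MULTITCE:
		r = 1;
		break;
#endif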



Alex


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-05  6:11 ` [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling Alexey Kardashevskiy
  2013-06-16  4:39   ` Benjamin Herrenschmidt
@ 2013-06-16 22:25   ` Alexander Graf
  2013-06-16 22:39   ` Benjamin Herrenschmidt
  2 siblings, 0 replies; 69+ messages in thread
From: Alexander Graf @ 2013-06-16 22:25 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Benjamin Herrenschmidt, linuxppc-dev, David Gibson,
	Paul Mackerras, kvm, linux-kernel, kvm-ppc


On 05.06.2013, at 08:11, Alexey Kardashevskiy wrote:

> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> and H_STUFF_TCE requests without passing them to QEMU, which should
> save time on switching to QEMU and back.
> 
> Both real and virtual modes are supported - whenever the kernel
> fails to handle a TCE request, it passes it to the virtual mode.
> If the virtual mode handlers fail, then the request is passed
> to the user mode, for example, to QEMU.
> 
> This adds a new KVM_CAP_SPAPR_TCE_IOMMU ioctl to associate
> a virtual PCI bus ID (LIOBN) with an IOMMU group, which enables
> in-kernel handling of IOMMU map/unmap.
> 
> Tests show that this patch increases transmission speed from 220MB/s
> to 750..1020MB/s on 10Gb network (Chelsio CXGB3 10Gb ethernet card).
> 
> Cc: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Signed-off-by: Paul Mackerras <paulus@samba.org>
> 
> ---
> 
> Changes:
> 2013/06/05:
> * changed capability number
> * changed ioctl number
> * update the doc article number
> 
> 2013/05/20:
> * removed get_user() from real mode handlers
> * kvm_vcpu_arch::tce_tmp usage extended. Now real mode handler puts there
> translated TCEs, tries realmode_get_page() on those and if it fails, it
> passes control over to the virtual mode handler, which tries to finish
> the request handling
> * kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit
> on a page
> * The only reason to pass the request to user mode now is when the user mode
> did not register a TCE table in the kernel; in all other cases the virtual mode
> handler is expected to do the job
> ---
> Documentation/virtual/kvm/api.txt   |   28 +++++
> arch/powerpc/include/asm/kvm_host.h |    3 +
> arch/powerpc/include/asm/kvm_ppc.h  |    2 +
> arch/powerpc/include/uapi/asm/kvm.h |    7 ++
> arch/powerpc/kvm/book3s_64_vio.c    |  198 ++++++++++++++++++++++++++++++++++-
> arch/powerpc/kvm/book3s_64_vio_hv.c |  193 +++++++++++++++++++++++++++++++++-
> arch/powerpc/kvm/powerpc.c          |   12 +++
> include/uapi/linux/kvm.h            |    2 +
> 8 files changed, 439 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index 6c082ff..e962e3b 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -2379,6 +2379,34 @@ the guest. Othwerwise it might be better for the guest to continue using H_PUT_T
> hypercall (if KVM_CAP_SPAPR_TCE or KVM_CAP_SPAPR_TCE_IOMMU are present).
> 
> 
> +4.84 KVM_CREATE_SPAPR_TCE_IOMMU
> +
> +Capability: KVM_CAP_SPAPR_TCE_IOMMU
> +Architectures: powerpc
> +Type: vm ioctl
> +Parameters: struct kvm_create_spapr_tce_iommu (in)
> +Returns: 0 on success, -1 on error
> +
> +This creates a link between an IOMMU group and a hardware TCE (translation
> +control entry) table. This link lets the host kernel know what IOMMU
> +group (i.e. TCE table) to use for the LIOBN number passed with
> +H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls.
> +
> +/* for KVM_CAP_SPAPR_TCE_IOMMU */
> +struct kvm_create_spapr_tce_iommu {
> +	__u64 liobn;
> +	__u32 iommu_id;
> +	__u32 flags;
> +};
> +
> +No flag is supported at the moment.
> +
> +When the guest issues a TCE call on a liobn for which a TCE table has been
> +registered, the kernel will handle it in real mode, updating the hardware
> +TCE table. TCE table calls for other liobns will cause a vm exit and must
> +be handled by userspace.
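
To make this concrete, here is a minimal userspace sketch of the ioctl
(a sketch only: it assumes headers from a kernel with this patch applied,
so that <linux/kvm.h> provides the ioctl number and the struct; the liobn
and iommu_id values are invented for illustration):

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* vm_fd is the VM file descriptor returned by KVM_CREATE_VM */
static int link_liobn_to_group(int vm_fd)
{
	struct kvm_create_spapr_tce_iommu args = {
		.liobn    = 0x80000001,	/* hypothetical LIOBN of a guest PHB */
		.iommu_id = 3,		/* hypothetical IOMMU group ID */
		.flags    = 0,		/* no flags are defined yet */
	};

	if (ioctl(vm_fd, KVM_CREATE_SPAPR_TCE_IOMMU, &args) < 0) {
		perror("KVM_CREATE_SPAPR_TCE_IOMMU");
		return -1;
	}
	return 0;
}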

Ok, please walk me through the security model you have in mind here.

Basically what this ioctl does is that it creates a guest TCE table that reflects its changes into a host TCE table whenever it gets modified. So far so good.

Now I don't see any checks that verify whether iommu_id is actually good to use given that user's access rights. Just because I have access to /dev/kvm I don't necessarily have access to an iommu control device.

So the least I can see would be a local DoS attack where one user space program with only access to /dev/kvm can simply kill any access to another process's device by overflowing a host iommu TCE table with junk entries.

There's even a certain chance of an information disclosure exploit here where a malicious user space program could get itself all network traffic DMA'd from another VM.

How does this work on the host level? What is the security token to take control of a host TCE table?


Alex


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-05  6:11 ` [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling Alexey Kardashevskiy
  2013-06-16  4:39   ` Benjamin Herrenschmidt
  2013-06-16 22:25   ` Alexander Graf
@ 2013-06-16 22:39   ` Benjamin Herrenschmidt
  2013-06-17  3:13     ` Alex Williamson
  2 siblings, 1 reply; 69+ messages in thread
From: Benjamin Herrenschmidt @ 2013-06-16 22:39 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, David Gibson, Alexander Graf, Paul Mackerras, kvm,
	linux-kernel, kvm-ppc, Alex Williamson

On Wed, 2013-06-05 at 16:11 +1000, Alexey Kardashevskiy wrote:
> +long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
> +               struct kvm_create_spapr_tce_iommu *args)
> +{
> +       struct kvmppc_spapr_tce_table *tt = NULL;
> +       struct iommu_group *grp;
> +       struct iommu_table *tbl;
> +
> +       /* Find an IOMMU table for the given ID */
> +       grp = iommu_group_get_by_id(args->iommu_id);
> +       if (!grp)
> +               return -ENXIO;
> +
> +       tbl = iommu_group_get_iommudata(grp);
> +       if (!tbl)
> +               return -ENXIO;

So Alex Graf pointed out that there is a security issue here, or are we
missing something?

What prevents a malicious program that has access to /dev/kvm from
taking over random iommu groups (including host used ones) that way?

What is the security model of that whole iommu stuff to begin with?

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-16 22:39   ` Benjamin Herrenschmidt
@ 2013-06-17  3:13     ` Alex Williamson
  2013-06-17  3:56       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 69+ messages in thread
From: Alex Williamson @ 2013-06-17  3:13 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alexey Kardashevskiy, linuxppc-dev, David Gibson, Alexander Graf,
	Paul Mackerras, kvm, linux-kernel, kvm-ppc

On Mon, 2013-06-17 at 08:39 +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2013-06-05 at 16:11 +1000, Alexey Kardashevskiy wrote:
> > +long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
> > +               struct kvm_create_spapr_tce_iommu *args)
> > +{
> > +       struct kvmppc_spapr_tce_table *tt = NULL;
> > +       struct iommu_group *grp;
> > +       struct iommu_table *tbl;
> > +
> > +       /* Find an IOMMU table for the given ID */
> > +       grp = iommu_group_get_by_id(args->iommu_id);
> > +       if (!grp)
> > +               return -ENXIO;
> > +
> > +       tbl = iommu_group_get_iommudata(grp);
> > +       if (!tbl)
> > +               return -ENXIO;
> 
> So Alex Graf pointed out that there is a security issue here, or are we
> missing something?
> 
> What prevents a malicious program that has access to /dev/kvm from
> taking over random iommu groups (including host used ones) that way?
> 
> What is the security model of that whole iommu stuff to begin with?

IOMMU groups themselves don't provide security; they're accessed by
interfaces like VFIO, which provide the security.  Given a brief look, I
agree, this looks like a possible backdoor.  The typical VFIO way to
handle this would be to pass a VFIO file descriptor here to prove that
the process has access to the IOMMU group.  This is how /dev/vfio/vfio
gains the ability to set up an IOMMU domain and do mappings with the
SET_CONTAINER ioctl using a group fd.  Thanks,

Alex
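
For reference, the standard VFIO userspace flow being described looks
roughly like this (a sketch; the group number is invented, and the sPAPR
TCE backend constant is assumed to be the right one on this platform):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int vfio_setup(void)
{
	/* The container is anonymous until a group is attached to it */
	int container = open("/dev/vfio/vfio", O_RDWR);
	/* Opening the group is the security check: it requires access
	 * rights on /dev/vfio/26 (the group number is made up here) */
	int group = open("/dev/vfio/26", O_RDWR);

	if (container < 0 || group < 0)
		return -1;

	/* The group fd proves ownership of the IOMMU group */
	if (ioctl(group, VFIO_GROUP_SET_CONTAINER, &container) < 0)
		return -1;

	/* Only now may the container program the IOMMU and do mappings */
	if (ioctl(container, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU) < 0)
		return -1;

	return 0;
}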




^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-17  3:13     ` Alex Williamson
@ 2013-06-17  3:56       ` Benjamin Herrenschmidt
  2013-06-18  2:32         ` Alex Williamson
  0 siblings, 1 reply; 69+ messages in thread
From: Benjamin Herrenschmidt @ 2013-06-17  3:56 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, linuxppc-dev, David Gibson, Alexander Graf,
	Paul Mackerras, kvm, linux-kernel, kvm-ppc

On Sun, 2013-06-16 at 21:13 -0600, Alex Williamson wrote:

> IOMMU groups themselves don't provide security; they're accessed by
> interfaces like VFIO, which provide the security.  Given a brief look, I
> agree, this looks like a possible backdoor.  The typical VFIO way to
> handle this would be to pass a VFIO file descriptor here to prove that
> the process has access to the IOMMU group.  This is how /dev/vfio/vfio
> gains the ability to set up an IOMMU domain and do mappings with the
> SET_CONTAINER ioctl using a group fd.  Thanks,

How do you envision that in the kernel? I.e. I'm in KVM code, I get that
vfio fd, what do I do with it?

Basically, KVM needs to know that the user is allowed to use that iommu
group. I don't think we want KVM to call into VFIO directly though,
right?

Cheers,
Ben.
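
One shape the answer could take (a sketch only, not the merged design;
vfio_group_get_external_user() is an assumed VFIO-exported helper, not
something that exists in this series):

/* Sketch: userspace passes a VFIO group fd instead of a raw iommu_id,
 * so the fd itself becomes the security token. */
static long kvm_spapr_tce_attach_group(struct kvm *kvm, int group_fd,
				       unsigned long liobn)
{
	struct file *filp;
	struct vfio_group *grp;

	filp = fget(group_fd);
	if (!filp)
		return -EBADF;

	/* Assumed helper: VFIO checks that the file really is a VFIO
	 * group and takes a reference on it for the external user. */
	grp = vfio_group_get_external_user(filp);
	if (IS_ERR(grp)) {
		fput(filp);
		return PTR_ERR(grp);
	}

	/* ... look up the group's iommu_table and bind it to liobn ... */
	return 0;
}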



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls
  2013-06-16 22:06   ` Alexander Graf
@ 2013-06-17  7:55     ` Alexey Kardashevskiy
  2013-06-17  8:02       ` Alexander Graf
  2013-06-17  8:37       ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 69+ messages in thread
From: Alexey Kardashevskiy @ 2013-06-17  7:55 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Benjamin Herrenschmidt, linuxppc-dev, David Gibson,
	Paul Mackerras, kvm, linux-kernel, kvm-ppc

On 06/17/2013 08:06 AM, Alexander Graf wrote:
> 
> On 05.06.2013, at 08:11, Alexey Kardashevskiy wrote:
> 
>> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
>> H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO
>> devices or emulated PCI.  These calls allow adding multiple entries
>> (up to 512) into the TCE table in one call which saves time on
>> transition to/from real mode.
>>
>> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
>> (copied from user and verified) before writing the whole list into
>> the TCE table. This cache will be utilized more in the upcoming
>> VFIO/IOMMU support to continue TCE list processing in the virtual
>> mode in the case if the real mode handler failed for some reason.
>>
>> This adds a guest physical to host real address converter
>> and calls the existing H_PUT_TCE handler. The converting function
>> is going to be fully utilized by upcoming VFIO supporting patches.
>>
>> This also implements the KVM_CAP_PPC_MULTITCE capability,
>> so in order to support the functionality of this patch, QEMU
>> needs to query for this capability and set the "hcall-multi-tce"
>> hypertas property only if the capability is present, otherwise
>> there will be serious performance degradation.
>>
>> Cc: David Gibson <david@gibson.dropbear.id.au>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> Signed-off-by: Paul Mackerras <paulus@samba.org>
> 
> Only a few minor nits. Ben already commented on implementation details.
> 
>>
>> ---
>> Changelog:
>> 2013/06/05:
>> * fixed mistype about IBMVIO in the commit message
>> * updated doc and moved it to another section
>> * changed capability number
>>
>> 2013/05/21:
>> * added kvm_vcpu_arch::tce_tmp
>> * removed cleanup if put_indirect failed, instead we do not even start
>> writing to TCE table if we cannot get TCEs from the user and they are
>> invalid
>> * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
>> and kvmppc_emulated_validate_tce (for the previous item)
>> * fixed bug with fallthrough for H_IPI
>> * removed all get_user() from real mode handlers
>> * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
>> ---
>> Documentation/virtual/kvm/api.txt       |   17 ++
>> arch/powerpc/include/asm/kvm_host.h     |    2 +
>> arch/powerpc/include/asm/kvm_ppc.h      |   16 +-
>> arch/powerpc/kvm/book3s_64_vio.c        |  118 ++++++++++++++
>> arch/powerpc/kvm/book3s_64_vio_hv.c     |  266 +++++++++++++++++++++++++++----
>> arch/powerpc/kvm/book3s_hv.c            |   39 +++++
>> arch/powerpc/kvm/book3s_hv_rmhandlers.S |    6 +
>> arch/powerpc/kvm/book3s_pr_papr.c       |   37 ++++-
>> arch/powerpc/kvm/powerpc.c              |    3 +
>> include/uapi/linux/kvm.h                |    1 +
>> 10 files changed, 473 insertions(+), 32 deletions(-)
>>
>> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
>> index 5f91eda..6c082ff 100644
>> --- a/Documentation/virtual/kvm/api.txt
>> +++ b/Documentation/virtual/kvm/api.txt
>> @@ -2362,6 +2362,23 @@ calls by the guest for that service will be passed to userspace to be
>> handled.
>>
>>
>> +4.83 KVM_CAP_PPC_MULTITCE
>> +
>> +Capability: KVM_CAP_PPC_MULTITCE
>> +Architectures: ppc
>> +Type: vm
>> +
>> +This capability tells the guest that multiple TCE entry add/remove hypercalls
>> +handling is supported by the kernel. This significantly accelerates DMA
>> +operations for PPC KVM guests.
>> +
>> +Unlike other capabilities in this section, this one does not have an ioctl.
>> +Instead, when the capability is present, the H_PUT_TCE_INDIRECT and
>> +H_STUFF_TCE hypercalls are to be handled in the host kernel and not passed to
>> +the guest. Otherwise it might be better for the guest to continue using H_PUT_TCE
>> +hypercall (if KVM_CAP_SPAPR_TCE or KVM_CAP_SPAPR_TCE_IOMMU are present).
> 

> While this describes perfectly well what the consequences are of the
> patches, it does not describe properly what the CAP actually expresses.
> The CAP only says "this kernel is able to handle H_PUT_TCE_INDIRECT and
> H_STUFF_TCE hypercalls directly". All other consequences are nice to
> document, but the semantics of the CAP are missing.


? It expresses ability to handle 2 hcalls. What is missing?


> We also usually try to keep KVM behavior unchanged with regards to older
> versions until a CAP is enabled. In this case I don't think it matters
> all that much, so I'm fine with declaring it as enabled by default.
> Please document that this is a change in behavior versus older KVM
> versions though.


Ok!


>> +
>> +
>> 5. The kvm_run structure
>> ------------------------
>>
>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
>> index af326cd..85d8f26 100644
>> --- a/arch/powerpc/include/asm/kvm_host.h
>> +++ b/arch/powerpc/include/asm/kvm_host.h
>> @@ -609,6 +609,8 @@ struct kvm_vcpu_arch {
>> 	spinlock_t tbacct_lock;
>> 	u64 busy_stolen;
>> 	u64 busy_preempt;
>> +
>> +	unsigned long *tce_tmp;    /* TCE cache for H_PUT_TCE_INDIRECT hcall */
>> #endif
>> };
> 
> [...]
>>
>>
> 
> [...]
> 
>> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
>> index 550f592..a39039a 100644
>> --- a/arch/powerpc/kvm/book3s_hv.c
>> +++ b/arch/powerpc/kvm/book3s_hv.c
>> @@ -568,6 +568,30 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
>> 			ret = kvmppc_xics_hcall(vcpu, req);
>> 			break;
>> 		} /* fallthrough */
> 
> The fallthrough comment isn't accurate anymore.
> 
>> +		return RESUME_HOST;
>> +	case H_PUT_TCE:
>> +		ret = kvmppc_virtmode_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
>> +						kvmppc_get_gpr(vcpu, 5),
>> +						kvmppc_get_gpr(vcpu, 6));
>> +		if (ret == H_TOO_HARD)
>> +			return RESUME_HOST;
>> +		break;
>> +	case H_PUT_TCE_INDIRECT:
>> +		ret = kvmppc_virtmode_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4),
>> +						kvmppc_get_gpr(vcpu, 5),
>> +						kvmppc_get_gpr(vcpu, 6),
>> +						kvmppc_get_gpr(vcpu, 7));
>> +		if (ret == H_TOO_HARD)
>> +			return RESUME_HOST;
>> +		break;
>> +	case H_STUFF_TCE:
>> +		ret = kvmppc_virtmode_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
>> +						kvmppc_get_gpr(vcpu, 5),
>> +						kvmppc_get_gpr(vcpu, 6),
>> +						kvmppc_get_gpr(vcpu, 7));
>> +		if (ret == H_TOO_HARD)
>> +			return RESUME_HOST;
>> +		break;
>> 	default:
>> 		return RESUME_HOST;
>> 	}
>> @@ -958,6 +982,20 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
>> 	vcpu->arch.cpu_type = KVM_CPU_3S_64;
>> 	kvmppc_sanity_check(vcpu);
>>
>> +	/*
>> +	 * As we want to minimize the chance of having H_PUT_TCE_INDIRECT
>> +	 * half executed, we first read TCEs from the user, check them and
>> +	 * return error if something went wrong and only then put TCEs into
>> +	 * the TCE table.
>> +	 *
>> +	 * tce_tmp is a cache for TCEs to avoid stack allocation or
>> +	 * kmalloc as the whole TCE list can take up to 512 items of 8 bytes
>> +	 * each (4096 bytes).
>> +	 */
>> +	vcpu->arch.tce_tmp = kmalloc(4096, GFP_KERNEL);
>> +	if (!vcpu->arch.tce_tmp)
>> +		goto free_vcpu;
>> +
>> 	return vcpu;
>>
>> free_vcpu:
>> @@ -980,6 +1018,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
>> 	unpin_vpa(vcpu->kvm, &vcpu->arch.slb_shadow);
>> 	unpin_vpa(vcpu->kvm, &vcpu->arch.vpa);
>> 	spin_unlock(&vcpu->arch.vpa_update_lock);
>> +	kfree(vcpu->arch.tce_tmp);
>> 	kvm_vcpu_uninit(vcpu);
>> 	kmem_cache_free(kvm_vcpu_cache, vcpu);
>> }
>> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>> index b02f91e..d35554e 100644
>> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>> @@ -1490,6 +1490,12 @@ hcall_real_table:
>> 	.long	0		/* 0x11c */
>> 	.long	0		/* 0x120 */
>> 	.long	.kvmppc_h_bulk_remove - hcall_real_table
>> +	.long	0		/* 0x128 */
>> +	.long	0		/* 0x12c */
>> +	.long	0		/* 0x130 */
>> +	.long	0		/* 0x134 */
>> +	.long	.kvmppc_h_stuff_tce - hcall_real_table
>> +	.long	.kvmppc_h_put_tce_indirect - hcall_real_table
>> hcall_real_table_end:
>>
>> ignore_hdec:
>> diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c
>> index da0e0bc..91d4b45 100644
>> --- a/arch/powerpc/kvm/book3s_pr_papr.c
>> +++ b/arch/powerpc/kvm/book3s_pr_papr.c
>> @@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu)
>> 	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
>> 	long rc;
>>
>> -	rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
>> +	rc = kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce);
>> +	if (rc == H_TOO_HARD)
>> +		return EMULATE_FAIL;
>> +	kvmppc_set_gpr(vcpu, 3, rc);
>> +	return EMULATE_DONE;
>> +}
>> +
>> +static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu)
>> +{
>> +	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
>> +	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
>> +	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
>> +	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
>> +	long rc;
>> +
>> +	rc = kvmppc_virtmode_h_put_tce_indirect(vcpu, liobn, ioba,
>> +			tce, npages);
>> +	if (rc == H_TOO_HARD)
>> +		return EMULATE_FAIL;
>> +	kvmppc_set_gpr(vcpu, 3, rc);
>> +	return EMULATE_DONE;
>> +}
>> +
>> +static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu)
>> +{
>> +	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
>> +	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
>> +	unsigned long tce_value = kvmppc_get_gpr(vcpu, 6);
>> +	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
>> +	long rc;
>> +
>> +	rc = kvmppc_virtmode_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages);
>> 	if (rc == H_TOO_HARD)
>> 		return EMULATE_FAIL;
>> 	kvmppc_set_gpr(vcpu, 3, rc);
>> @@ -247,6 +278,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd)
>> 		return kvmppc_h_pr_bulk_remove(vcpu);
>> 	case H_PUT_TCE:
>> 		return kvmppc_h_pr_put_tce(vcpu);
>> +	case H_PUT_TCE_INDIRECT:
>> +		return kvmppc_h_pr_put_tce_indirect(vcpu);
>> +	case H_STUFF_TCE:
>> +		return kvmppc_h_pr_stuff_tce(vcpu);
>> 	case H_CEDE:
>> 		vcpu->arch.shared->msr |= MSR_EE;
>> 		kvm_vcpu_block(vcpu);
>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>> index 6316ee3..8465c2a 100644
>> --- a/arch/powerpc/kvm/powerpc.c
>> +++ b/arch/powerpc/kvm/powerpc.c
>> @@ -395,6 +395,9 @@ int kvm_dev_ioctl_check_extension(long ext)
>> 		r = 1;
>> 		break;
>> #endif
>> +	case KVM_CAP_SPAPR_MULTITCE:
>> +		r = 1;
> 
> This should only be true for book3s.


We had this discussion with v2.

David:
===
So, in the case of MULTITCE, that's not quite right.  PR KVM can
emulate a PAPR system on a BookE machine, and there's no reason not to
allow TCE acceleration as well.  We can't make it dependent on PAPR
mode being selected, because that's enabled per-vcpu, whereas these
capabilities are queried on the VM before the vcpus are created.
===

Wrong?


-- 
Alexey

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls
  2013-06-17  7:55     ` Alexey Kardashevskiy
@ 2013-06-17  8:02       ` Alexander Graf
  2013-06-17  8:34         ` Alexey Kardashevskiy
  2013-06-17  8:37       ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 69+ messages in thread
From: Alexander Graf @ 2013-06-17  8:02 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Benjamin Herrenschmidt, linuxppc-dev, David Gibson,
	Paul Mackerras, kvm, linux-kernel, kvm-ppc


On 17.06.2013, at 09:55, Alexey Kardashevskiy wrote:

> On 06/17/2013 08:06 AM, Alexander Graf wrote:
>> 
>> On 05.06.2013, at 08:11, Alexey Kardashevskiy wrote:
>> 
>>> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
>>> H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO
>>> devices or emulated PCI.  These calls allow adding multiple entries
>>> (up to 512) into the TCE table in one call which saves time on
>>> transition to/from real mode.
>>> 
>>> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
>>> (copied from user and verified) before writing the whole list into
>>> the TCE table. This cache will be utilized more in the upcoming
>>> VFIO/IOMMU support to continue TCE list processing in the virtual
>>> mode in the case if the real mode handler failed for some reason.
>>> 
>>> This adds a guest physical to host real address converter
>>> and calls the existing H_PUT_TCE handler. The converting function
>>> is going to be fully utilized by upcoming VFIO supporting patches.
>>> 
>>> This also implements the KVM_CAP_PPC_MULTITCE capability,
>>> so in order to support the functionality of this patch, QEMU
>>> needs to query for this capability and set the "hcall-multi-tce"
>>> hypertas property only if the capability is present, otherwise
>>> there will be serious performance degradation.
>>> 
>>> Cc: David Gibson <david@gibson.dropbear.id.au>
>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>> Signed-off-by: Paul Mackerras <paulus@samba.org>
>> 
>> Only a few minor nits. Ben already commented on implementation details.
>> 
>>> 
>>> ---
>>> Changelog:
>>> 2013/06/05:
>>> * fixed mistype about IBMVIO in the commit message
>>> * updated doc and moved it to another section
>>> * changed capability number
>>> 
>>> 2013/05/21:
>>> * added kvm_vcpu_arch::tce_tmp
>>> * removed cleanup if put_indirect failed, instead we do not even start
>>> writing to TCE table if we cannot get TCEs from the user and they are
>>> invalid
>>> * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
>>> and kvmppc_emulated_validate_tce (for the previous item)
>>> * fixed bug with fallthrough for H_IPI
>>> * removed all get_user() from real mode handlers
>>> * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
>>> ---
>>> Documentation/virtual/kvm/api.txt       |   17 ++
>>> arch/powerpc/include/asm/kvm_host.h     |    2 +
>>> arch/powerpc/include/asm/kvm_ppc.h      |   16 +-
>>> arch/powerpc/kvm/book3s_64_vio.c        |  118 ++++++++++++++
>>> arch/powerpc/kvm/book3s_64_vio_hv.c     |  266 +++++++++++++++++++++++++++----
>>> arch/powerpc/kvm/book3s_hv.c            |   39 +++++
>>> arch/powerpc/kvm/book3s_hv_rmhandlers.S |    6 +
>>> arch/powerpc/kvm/book3s_pr_papr.c       |   37 ++++-
>>> arch/powerpc/kvm/powerpc.c              |    3 +
>>> include/uapi/linux/kvm.h                |    1 +
>>> 10 files changed, 473 insertions(+), 32 deletions(-)
>>> 
>>> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
>>> index 5f91eda..6c082ff 100644
>>> --- a/Documentation/virtual/kvm/api.txt
>>> +++ b/Documentation/virtual/kvm/api.txt
>>> @@ -2362,6 +2362,23 @@ calls by the guest for that service will be passed to userspace to be
>>> handled.
>>> 
>>> 
>>> +4.83 KVM_CAP_PPC_MULTITCE
>>> +
>>> +Capability: KVM_CAP_PPC_MULTITCE
>>> +Architectures: ppc
>>> +Type: vm
>>> +
>>> +This capability tells the guest that multiple TCE entry add/remove hypercalls
>>> +handling is supported by the kernel. This significantly accelerates DMA
>>> +operations for PPC KVM guests.
>>> +
>>> +Unlike other capabilities in this section, this one does not have an ioctl.
>>> +Instead, when the capability is present, the H_PUT_TCE_INDIRECT and
>>> +H_STUFF_TCE hypercalls are to be handled in the host kernel and not passed to
>>> +the guest. Otherwise it might be better for the guest to continue using H_PUT_TCE
>>> +hypercall (if KVM_CAP_SPAPR_TCE or KVM_CAP_SPAPR_TCE_IOMMU are present).
>> 
> 
>> While this describes perfectly well what the consequences are of the
>> patches, it does not describe properly what the CAP actually expresses.
>> The CAP only says "this kernel is able to handle H_PUT_TCE_INDIRECT and
>> H_STUFF_TCE hypercalls directly". All other consequences are nice to
>> document, but the semantics of the CAP are missing.
> 
> 
> ? It expresses ability to handle 2 hcalls. What is missing?

You don't describe the kvm <-> qemu interface. You describe some decisions qemu can take from this cap.

> 
> 
>> We also usually try to keep KVM behavior unchanged with regards to older
>> versions until a CAP is enabled. In this case I don't think it matters
>> all that much, so I'm fine with declaring it as enabled by default.
>> Please document that this is a change in behavior versus older KVM
>> versions though.
> 
> 
> Ok!
> 
> 
>>> +
>>> +
>>> 5. The kvm_run structure
>>> ------------------------
>>> 
>>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
>>> index af326cd..85d8f26 100644
>>> --- a/arch/powerpc/include/asm/kvm_host.h
>>> +++ b/arch/powerpc/include/asm/kvm_host.h
>>> @@ -609,6 +609,8 @@ struct kvm_vcpu_arch {
>>> 	spinlock_t tbacct_lock;
>>> 	u64 busy_stolen;
>>> 	u64 busy_preempt;
>>> +
>>> +	unsigned long *tce_tmp;    /* TCE cache for H_PUT_TCE_INDIRECT hcall */
>>> #endif
>>> };
>> 
>> [...]
>>> 
>>> 
>> 
>> [...]
>> 
>>> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
>>> index 550f592..a39039a 100644
>>> --- a/arch/powerpc/kvm/book3s_hv.c
>>> +++ b/arch/powerpc/kvm/book3s_hv.c
>>> @@ -568,6 +568,30 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
>>> 			ret = kvmppc_xics_hcall(vcpu, req);
>>> 			break;
>>> 		} /* fallthrough */
>> 
>> The fallthrough comment isn't accurate anymore.
>> 
>>> +		return RESUME_HOST;
>>> +	case H_PUT_TCE:
>>> +		ret = kvmppc_virtmode_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
>>> +						kvmppc_get_gpr(vcpu, 5),
>>> +						kvmppc_get_gpr(vcpu, 6));
>>> +		if (ret == H_TOO_HARD)
>>> +			return RESUME_HOST;
>>> +		break;
>>> +	case H_PUT_TCE_INDIRECT:
>>> +		ret = kvmppc_virtmode_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4),
>>> +						kvmppc_get_gpr(vcpu, 5),
>>> +						kvmppc_get_gpr(vcpu, 6),
>>> +						kvmppc_get_gpr(vcpu, 7));
>>> +		if (ret == H_TOO_HARD)
>>> +			return RESUME_HOST;
>>> +		break;
>>> +	case H_STUFF_TCE:
>>> +		ret = kvmppc_virtmode_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
>>> +						kvmppc_get_gpr(vcpu, 5),
>>> +						kvmppc_get_gpr(vcpu, 6),
>>> +						kvmppc_get_gpr(vcpu, 7));
>>> +		if (ret == H_TOO_HARD)
>>> +			return RESUME_HOST;
>>> +		break;
>>> 	default:
>>> 		return RESUME_HOST;
>>> 	}
>>> @@ -958,6 +982,20 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
>>> 	vcpu->arch.cpu_type = KVM_CPU_3S_64;
>>> 	kvmppc_sanity_check(vcpu);
>>> 
>>> +	/*
>>> +	 * As we want to minimize the chance of having H_PUT_TCE_INDIRECT
>>> +	 * half executed, we first read TCEs from the user, check them and
>>> +	 * return error if something went wrong and only then put TCEs into
>>> +	 * the TCE table.
>>> +	 *
>>> +	 * tce_tmp is a cache for TCEs to avoid stack allocation or
>>> +	 * kmalloc as the whole TCE list can take up to 512 items of 8 bytes
>>> +	 * each (4096 bytes).
>>> +	 */
>>> +	vcpu->arch.tce_tmp = kmalloc(4096, GFP_KERNEL);
>>> +	if (!vcpu->arch.tce_tmp)
>>> +		goto free_vcpu;
>>> +
>>> 	return vcpu;
>>> 
>>> free_vcpu:
>>> @@ -980,6 +1018,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
>>> 	unpin_vpa(vcpu->kvm, &vcpu->arch.slb_shadow);
>>> 	unpin_vpa(vcpu->kvm, &vcpu->arch.vpa);
>>> 	spin_unlock(&vcpu->arch.vpa_update_lock);
>>> +	kfree(vcpu->arch.tce_tmp);
>>> 	kvm_vcpu_uninit(vcpu);
>>> 	kmem_cache_free(kvm_vcpu_cache, vcpu);
>>> }
>>> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>>> index b02f91e..d35554e 100644
>>> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>>> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>>> @@ -1490,6 +1490,12 @@ hcall_real_table:
>>> 	.long	0		/* 0x11c */
>>> 	.long	0		/* 0x120 */
>>> 	.long	.kvmppc_h_bulk_remove - hcall_real_table
>>> +	.long	0		/* 0x128 */
>>> +	.long	0		/* 0x12c */
>>> +	.long	0		/* 0x130 */
>>> +	.long	0		/* 0x134 */
>>> +	.long	.kvmppc_h_stuff_tce - hcall_real_table
>>> +	.long	.kvmppc_h_put_tce_indirect - hcall_real_table
>>> hcall_real_table_end:
>>> 
>>> ignore_hdec:
>>> diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c
>>> index da0e0bc..91d4b45 100644
>>> --- a/arch/powerpc/kvm/book3s_pr_papr.c
>>> +++ b/arch/powerpc/kvm/book3s_pr_papr.c
>>> @@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu)
>>> 	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
>>> 	long rc;
>>> 
>>> -	rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
>>> +	rc = kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce);
>>> +	if (rc == H_TOO_HARD)
>>> +		return EMULATE_FAIL;
>>> +	kvmppc_set_gpr(vcpu, 3, rc);
>>> +	return EMULATE_DONE;
>>> +}
>>> +
>>> +static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu)
>>> +{
>>> +	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
>>> +	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
>>> +	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
>>> +	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
>>> +	long rc;
>>> +
>>> +	rc = kvmppc_virtmode_h_put_tce_indirect(vcpu, liobn, ioba,
>>> +			tce, npages);
>>> +	if (rc == H_TOO_HARD)
>>> +		return EMULATE_FAIL;
>>> +	kvmppc_set_gpr(vcpu, 3, rc);
>>> +	return EMULATE_DONE;
>>> +}
>>> +
>>> +static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu)
>>> +{
>>> +	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
>>> +	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
>>> +	unsigned long tce_value = kvmppc_get_gpr(vcpu, 6);
>>> +	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
>>> +	long rc;
>>> +
>>> +	rc = kvmppc_virtmode_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages);
>>> 	if (rc == H_TOO_HARD)
>>> 		return EMULATE_FAIL;
>>> 	kvmppc_set_gpr(vcpu, 3, rc);
>>> @@ -247,6 +278,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd)
>>> 		return kvmppc_h_pr_bulk_remove(vcpu);
>>> 	case H_PUT_TCE:
>>> 		return kvmppc_h_pr_put_tce(vcpu);
>>> +	case H_PUT_TCE_INDIRECT:
>>> +		return kvmppc_h_pr_put_tce_indirect(vcpu);
>>> +	case H_STUFF_TCE:
>>> +		return kvmppc_h_pr_stuff_tce(vcpu);
>>> 	case H_CEDE:
>>> 		vcpu->arch.shared->msr |= MSR_EE;
>>> 		kvm_vcpu_block(vcpu);
>>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>>> index 6316ee3..8465c2a 100644
>>> --- a/arch/powerpc/kvm/powerpc.c
>>> +++ b/arch/powerpc/kvm/powerpc.c
>>> @@ -395,6 +395,9 @@ int kvm_dev_ioctl_check_extension(long ext)
>>> 		r = 1;
>>> 		break;
>>> #endif
>>> +	case KVM_CAP_SPAPR_MULTITCE:
>>> +		r = 1;
>> 
>> This should only be true for book3s.
> 
> 
> We had this discussion with v2.
> 
> David:
> ===
> So, in the case of MULTITCE, that's not quite right.  PR KVM can
> emulate a PAPR system on a BookE machine, and there's no reason not to
> allow TCE acceleration as well.  We can't make it dependent on PAPR
> mode being selected, because that's enabled per-vcpu, whereas these
> capabilities are queried on the VM before the vcpus are created.
> ===
> 
> Wrong?

Partially. BookE can not emulate a PAPR system as it stands today.

The code should of course be generic and be available generically. But your code only patches the hypercall's availability in the book3s hypercall handlers, so that specific kernel version can only handle these hypercalls on book3s.

Whether a later version of the kernel will be able to handle them is a different question.


Alex


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls
  2013-06-17  8:02       ` Alexander Graf
@ 2013-06-17  8:34         ` Alexey Kardashevskiy
  2013-06-17  8:40           ` Alexander Graf
  0 siblings, 1 reply; 69+ messages in thread
From: Alexey Kardashevskiy @ 2013-06-17  8:34 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Benjamin Herrenschmidt, linuxppc-dev, David Gibson,
	Paul Mackerras, kvm, linux-kernel, kvm-ppc

On 06/17/2013 06:02 PM, Alexander Graf wrote:
> 
> On 17.06.2013, at 09:55, Alexey Kardashevskiy wrote:
> 
>> On 06/17/2013 08:06 AM, Alexander Graf wrote:
>>>
>>> On 05.06.2013, at 08:11, Alexey Kardashevskiy wrote:
>>>
>>>> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
>>>> H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO
>>>> devices or emulated PCI.  These calls allow adding multiple entries
>>>> (up to 512) into the TCE table in one call which saves time on
>>>> transition to/from real mode.
>>>>
>>>> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
>>>> (copied from user and verified) before writing the whole list into
>>>> the TCE table. This cache will be utilized more in the upcoming
>>>> VFIO/IOMMU support to continue TCE list processing in the virtual
>>>> mode in the case if the real mode handler failed for some reason.
>>>>
>>>> This adds a guest physical to host real address converter
>>>> and calls the existing H_PUT_TCE handler. The converting function
>>>> is going to be fully utilized by upcoming VFIO supporting patches.
>>>>
>>>> This also implements the KVM_CAP_PPC_MULTITCE capability,
>>>> so in order to support the functionality of this patch, QEMU
>>>> needs to query for this capability and set the "hcall-multi-tce"
>>>> hypertas property only if the capability is present, otherwise
>>>> there will be serious performance degradation.
>>>>
>>>> Cc: David Gibson <david@gibson.dropbear.id.au>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> Signed-off-by: Paul Mackerras <paulus@samba.org>
>>>
>>> Only a few minor nits. Ben already commented on implementation details.
>>>
>>>>
>>>> ---
>>>> Changelog:
>>>> 2013/06/05:
>>>> * fixed mistype about IBMVIO in the commit message
>>>> * updated doc and moved it to another section
>>>> * changed capability number
>>>>
>>>> 2013/05/21:
>>>> * added kvm_vcpu_arch::tce_tmp
>>>> * removed cleanup if put_indirect failed, instead we do not even start
>>>> writing to TCE table if we cannot get TCEs from the user and they are
>>>> invalid
>>>> * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
>>>> and kvmppc_emulated_validate_tce (for the previous item)
>>>> * fixed bug with fallthrough for H_IPI
>>>> * removed all get_user() from real mode handlers
>>>> * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
>>>> ---
>>>> Documentation/virtual/kvm/api.txt       |   17 ++
>>>> arch/powerpc/include/asm/kvm_host.h     |    2 +
>>>> arch/powerpc/include/asm/kvm_ppc.h      |   16 +-
>>>> arch/powerpc/kvm/book3s_64_vio.c        |  118 ++++++++++++++
>>>> arch/powerpc/kvm/book3s_64_vio_hv.c     |  266 +++++++++++++++++++++++++++----
>>>> arch/powerpc/kvm/book3s_hv.c            |   39 +++++
>>>> arch/powerpc/kvm/book3s_hv_rmhandlers.S |    6 +
>>>> arch/powerpc/kvm/book3s_pr_papr.c       |   37 ++++-
>>>> arch/powerpc/kvm/powerpc.c              |    3 +
>>>> include/uapi/linux/kvm.h                |    1 +
>>>> 10 files changed, 473 insertions(+), 32 deletions(-)
>>>>
>>>> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
>>>> index 5f91eda..6c082ff 100644
>>>> --- a/Documentation/virtual/kvm/api.txt
>>>> +++ b/Documentation/virtual/kvm/api.txt
>>>> @@ -2362,6 +2362,23 @@ calls by the guest for that service will be passed to userspace to be
>>>> handled.
>>>>
>>>>
>>>> +4.83 KVM_CAP_PPC_MULTITCE
>>>> +
>>>> +Capability: KVM_CAP_PPC_MULTITCE
>>>> +Architectures: ppc
>>>> +Type: vm
>>>> +
>>>> +This capability tells the guest that multiple TCE entry add/remove hypercalls
>>>> +handling is supported by the kernel. This significantly accelerates DMA
>>>> +operations for PPC KVM guests.
>>>> +
>>>> +Unlike other capabilities in this section, this one does not have an ioctl.
>>>> +Instead, when the capability is present, the H_PUT_TCE_INDIRECT and
>>>> +H_STUFF_TCE hypercalls are to be handled in the host kernel and not passed to
>>>> +the guest. Otherwise it might be better for the guest to continue using H_PUT_TCE
>>>> +hypercall (if KVM_CAP_SPAPR_TCE or KVM_CAP_SPAPR_TCE_IOMMU are present).
>>>
>>
>>> While this describes perfectly well what the consequences are of the
>>> patches, it does not describe properly what the CAP actually expresses.
>>> The CAP only says "this kernel is able to handle H_PUT_TCE_INDIRECT and
>>> H_STUFF_TCE hypercalls directly". All other consequences are nice to
>>> document, but the semantics of the CAP are missing.
>>
>>
>> ? It expresses ability to handle 2 hcalls. What is missing?
> 
> You don't describe the kvm <-> qemu interface. You describe some decisions qemu can take from this cap.


This file does not mention qemu at all. And the interface is: qemu (or
kvmtool, which could do the same) just adds "hcall-multi-tce" to
"ibm,hypertas-functions", but this is for pseries Linux, and AIX could
always do it (no idea about that). Does it really have to be in this file?
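
A sketch of the userspace side of that interface (kvm_fd is the open
/dev/kvm fd; the capability constant assumes headers from a kernel with
this series applied; spapr_dt_add_hypertas() is an invented helper
standing in for however the caller builds "ibm,hypertas-functions"):

#include <sys/ioctl.h>
#include <linux/kvm.h>

extern void spapr_dt_add_hypertas(const char *prop);	/* invented */

static void probe_multitce(int kvm_fd)
{
	/* Advertise the hcalls to the guest only if the kernel can
	 * really accelerate them, otherwise performance suffers. */
	if (ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_SPAPR_MULTITCE) > 0)
		spapr_dt_add_hypertas("hcall-multi-tce");
}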



>>> We also usually try to keep KVM behavior unchanged with regards to older
>>> versions until a CAP is enabled. In this case I don't think it matters
>>> all that much, so I'm fine with declaring it as enabled by default.
>>> Please document that this is a change in behavior versus older KVM
>>> versions though.
>>
>>
>> Ok!
>>
>>
>>>> +
>>>> +
>>>> 5. The kvm_run structure
>>>> ------------------------
>>>>
>>>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
>>>> index af326cd..85d8f26 100644
>>>> --- a/arch/powerpc/include/asm/kvm_host.h
>>>> +++ b/arch/powerpc/include/asm/kvm_host.h
>>>> @@ -609,6 +609,8 @@ struct kvm_vcpu_arch {
>>>> 	spinlock_t tbacct_lock;
>>>> 	u64 busy_stolen;
>>>> 	u64 busy_preempt;
>>>> +
>>>> +	unsigned long *tce_tmp;    /* TCE cache for H_PUT_TCE_INDIRECT hcall */
>>>> #endif
>>>> };
>>>
>>> [...]
>>>>
>>>>
>>>
>>> [...]
>>>
>>>> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
>>>> index 550f592..a39039a 100644
>>>> --- a/arch/powerpc/kvm/book3s_hv.c
>>>> +++ b/arch/powerpc/kvm/book3s_hv.c
>>>> @@ -568,6 +568,30 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
>>>> 			ret = kvmppc_xics_hcall(vcpu, req);
>>>> 			break;
>>>> 		} /* fallthrough */
>>>
>>> The fallthrough comment isn't accurate anymore.
>>>
>>>> +		return RESUME_HOST;
>>>> +	case H_PUT_TCE:
>>>> +		ret = kvmppc_virtmode_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
>>>> +						kvmppc_get_gpr(vcpu, 5),
>>>> +						kvmppc_get_gpr(vcpu, 6));
>>>> +		if (ret == H_TOO_HARD)
>>>> +			return RESUME_HOST;
>>>> +		break;
>>>> +	case H_PUT_TCE_INDIRECT:
>>>> +		ret = kvmppc_virtmode_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4),
>>>> +						kvmppc_get_gpr(vcpu, 5),
>>>> +						kvmppc_get_gpr(vcpu, 6),
>>>> +						kvmppc_get_gpr(vcpu, 7));
>>>> +		if (ret == H_TOO_HARD)
>>>> +			return RESUME_HOST;
>>>> +		break;
>>>> +	case H_STUFF_TCE:
>>>> +		ret = kvmppc_virtmode_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
>>>> +						kvmppc_get_gpr(vcpu, 5),
>>>> +						kvmppc_get_gpr(vcpu, 6),
>>>> +						kvmppc_get_gpr(vcpu, 7));
>>>> +		if (ret == H_TOO_HARD)
>>>> +			return RESUME_HOST;
>>>> +		break;
>>>> 	default:
>>>> 		return RESUME_HOST;
>>>> 	}
>>>> @@ -958,6 +982,20 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
>>>> 	vcpu->arch.cpu_type = KVM_CPU_3S_64;
>>>> 	kvmppc_sanity_check(vcpu);
>>>>
>>>> +	/*
>>>> +	 * As we want to minimize the chance of having H_PUT_TCE_INDIRECT
>>>> +	 * half executed, we first read TCEs from the user, check them and
>>>> +	 * return error if something went wrong and only then put TCEs into
>>>> +	 * the TCE table.
>>>> +	 *
>>>> +	 * tce_tmp is a cache for TCEs to avoid stack allocation or
>>>> +	 * kmalloc as the whole TCE list can take up to 512 items of 8 bytes
>>>> +	 * each (4096 bytes).
>>>> +	 */
>>>> +	vcpu->arch.tce_tmp = kmalloc(4096, GFP_KERNEL);
>>>> +	if (!vcpu->arch.tce_tmp)
>>>> +		goto free_vcpu;
>>>> +
>>>> 	return vcpu;
>>>>
>>>> free_vcpu:
>>>> @@ -980,6 +1018,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
>>>> 	unpin_vpa(vcpu->kvm, &vcpu->arch.slb_shadow);
>>>> 	unpin_vpa(vcpu->kvm, &vcpu->arch.vpa);
>>>> 	spin_unlock(&vcpu->arch.vpa_update_lock);
>>>> +	kfree(vcpu->arch.tce_tmp);
>>>> 	kvm_vcpu_uninit(vcpu);
>>>> 	kmem_cache_free(kvm_vcpu_cache, vcpu);
>>>> }
>>>> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>>>> index b02f91e..d35554e 100644
>>>> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>>>> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>>>> @@ -1490,6 +1490,12 @@ hcall_real_table:
>>>> 	.long	0		/* 0x11c */
>>>> 	.long	0		/* 0x120 */
>>>> 	.long	.kvmppc_h_bulk_remove - hcall_real_table
>>>> +	.long	0		/* 0x128 */
>>>> +	.long	0		/* 0x12c */
>>>> +	.long	0		/* 0x130 */
>>>> +	.long	0		/* 0x134 */
>>>> +	.long	.kvmppc_h_stuff_tce - hcall_real_table
>>>> +	.long	.kvmppc_h_put_tce_indirect - hcall_real_table
>>>> hcall_real_table_end:
>>>>
>>>> ignore_hdec:
>>>> diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c
>>>> index da0e0bc..91d4b45 100644
>>>> --- a/arch/powerpc/kvm/book3s_pr_papr.c
>>>> +++ b/arch/powerpc/kvm/book3s_pr_papr.c
>>>> @@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu)
>>>> 	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
>>>> 	long rc;
>>>>
>>>> -	rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
>>>> +	rc = kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce);
>>>> +	if (rc == H_TOO_HARD)
>>>> +		return EMULATE_FAIL;
>>>> +	kvmppc_set_gpr(vcpu, 3, rc);
>>>> +	return EMULATE_DONE;
>>>> +}
>>>> +
>>>> +static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu)
>>>> +{
>>>> +	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
>>>> +	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
>>>> +	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
>>>> +	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
>>>> +	long rc;
>>>> +
>>>> +	rc = kvmppc_virtmode_h_put_tce_indirect(vcpu, liobn, ioba,
>>>> +			tce, npages);
>>>> +	if (rc == H_TOO_HARD)
>>>> +		return EMULATE_FAIL;
>>>> +	kvmppc_set_gpr(vcpu, 3, rc);
>>>> +	return EMULATE_DONE;
>>>> +}
>>>> +
>>>> +static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu)
>>>> +{
>>>> +	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
>>>> +	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
>>>> +	unsigned long tce_value = kvmppc_get_gpr(vcpu, 6);
>>>> +	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
>>>> +	long rc;
>>>> +
>>>> +	rc = kvmppc_virtmode_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages);
>>>> 	if (rc == H_TOO_HARD)
>>>> 		return EMULATE_FAIL;
>>>> 	kvmppc_set_gpr(vcpu, 3, rc);
>>>> @@ -247,6 +278,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd)
>>>> 		return kvmppc_h_pr_bulk_remove(vcpu);
>>>> 	case H_PUT_TCE:
>>>> 		return kvmppc_h_pr_put_tce(vcpu);
>>>> +	case H_PUT_TCE_INDIRECT:
>>>> +		return kvmppc_h_pr_put_tce_indirect(vcpu);
>>>> +	case H_STUFF_TCE:
>>>> +		return kvmppc_h_pr_stuff_tce(vcpu);
>>>> 	case H_CEDE:
>>>> 		vcpu->arch.shared->msr |= MSR_EE;
>>>> 		kvm_vcpu_block(vcpu);
>>>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>>>> index 6316ee3..8465c2a 100644
>>>> --- a/arch/powerpc/kvm/powerpc.c
>>>> +++ b/arch/powerpc/kvm/powerpc.c
>>>> @@ -395,6 +395,9 @@ int kvm_dev_ioctl_check_extension(long ext)
>>>> 		r = 1;
>>>> 		break;
>>>> #endif
>>>> +	case KVM_CAP_SPAPR_MULTITCE:
>>>> +		r = 1;
>>>
>>> This should only be true for book3s.
>>
>>
>> We had this discussion with v2.
>>
>> David:
>> ===
>> So, in the case of MULTITCE, that's not quite right.  PR KVM can
>> emulate a PAPR system on a BookE machine, and there's no reason not to
>> allow TCE acceleration as well.  We can't make it dependent on PAPR
>> mode being selected, because that's enabled per-vcpu, whereas these
>> capabilities are queried on the VM before the vcpus are created.
>> ===
>>
>> Wrong?

> Partially. BookE can not emulate a PAPR system as it stands today.

Oh.
Ok.
So - #ifdef CONFIG_PPC_BOOK3S_64 ? Or run-time check for book3s (how...)?




-- 
Alexey

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls
  2013-06-17  7:55     ` Alexey Kardashevskiy
  2013-06-17  8:02       ` Alexander Graf
@ 2013-06-17  8:37       ` Benjamin Herrenschmidt
  2013-06-17  8:42         ` Alexander Graf
  1 sibling, 1 reply; 69+ messages in thread
From: Benjamin Herrenschmidt @ 2013-06-17  8:37 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alexander Graf, linuxppc-dev, David Gibson, Paul Mackerras, kvm,
	linux-kernel, kvm-ppc

On Mon, 2013-06-17 at 17:55 +1000, Alexey Kardashevskiy wrote:
> David:
> ===
> So, in the case of MULTITCE, that's not quite right.  PR KVM can
> emulate a PAPR system on a BookE machine, and there's no reason not to
> allow TCE acceleration as well.  We can't make it dependent on PAPR
> mode being selected, because that's enabled per-vcpu, whereas these
> capabilities are queried on the VM before the vcpus are created.
> ===
> 
> Wrong?

The capability just tells qemu that the kernel supports it; it doesn't
have to depend on PAPR mode, qemu can sort things out, no?

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls
  2013-06-17  8:34         ` Alexey Kardashevskiy
@ 2013-06-17  8:40           ` Alexander Graf
  2013-06-17  8:51             ` Alexey Kardashevskiy
  0 siblings, 1 reply; 69+ messages in thread
From: Alexander Graf @ 2013-06-17  8:40 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Benjamin Herrenschmidt, linuxppc-dev, David Gibson,
	Paul Mackerras, kvm, linux-kernel, kvm-ppc


On 17.06.2013, at 10:34, Alexey Kardashevskiy wrote:

> On 06/17/2013 06:02 PM, Alexander Graf wrote:
>> 
>> On 17.06.2013, at 09:55, Alexey Kardashevskiy wrote:
>> 
>>> On 06/17/2013 08:06 AM, Alexander Graf wrote:
>>>> 
>>>> On 05.06.2013, at 08:11, Alexey Kardashevskiy wrote:
>>>> 
>>>>> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
>>>>> H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO
>>>>> devices or emulated PCI.  These calls allow adding multiple entries
>>>>> (up to 512) into the TCE table in one call which saves time on
>>>>> transition to/from real mode.
>>>>> 
>>>>> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
>>>>> (copied from user and verified) before writing the whole list into
>>>>> the TCE table. This cache will be utilized more in the upcoming
>>>>> VFIO/IOMMU support to continue TCE list processing in the virtual
>>>>> mode in the case if the real mode handler failed for some reason.
>>>>> 
>>>>> This adds a guest physical to host real address converter
>>>>> and calls the existing H_PUT_TCE handler. The converting function
>>>>> is going to be fully utilized by upcoming VFIO supporting patches.
>>>>> 
>>>>> This also implements the KVM_CAP_PPC_MULTITCE capability,
>>>>> so in order to support the functionality of this patch, QEMU
>>>>> needs to query for this capability and set the "hcall-multi-tce"
>>>>> hypertas property only if the capability is present, otherwise
>>>>> there will be serious performance degradation.
>>>>> 
>>>>> Cc: David Gibson <david@gibson.dropbear.id.au>
>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>> Signed-off-by: Paul Mackerras <paulus@samba.org>
>>>> 
>>>> Only a few minor nits. Ben already commented on implementation details.
>>>> 
>>>>> 
>>>>> ---
>>>>> Changelog:
>>>>> 2013/06/05:
>>>>> * fixed mistype about IBMVIO in the commit message
>>>>> * updated doc and moved it to another section
>>>>> * changed capability number
>>>>> 
>>>>> 2013/05/21:
>>>>> * added kvm_vcpu_arch::tce_tmp
>>>>> * removed cleanup if put_indirect failed, instead we do not even start
>>>>> writing to TCE table if we cannot get TCEs from the user and they are
>>>>> invalid
>>>>> * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
>>>>> and kvmppc_emulated_validate_tce (for the previous item)
>>>>> * fixed bug with fallthrough for H_IPI
>>>>> * removed all get_user() from real mode handlers
>>>>> * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
>>>>> ---
>>>>> Documentation/virtual/kvm/api.txt       |   17 ++
>>>>> arch/powerpc/include/asm/kvm_host.h     |    2 +
>>>>> arch/powerpc/include/asm/kvm_ppc.h      |   16 +-
>>>>> arch/powerpc/kvm/book3s_64_vio.c        |  118 ++++++++++++++
>>>>> arch/powerpc/kvm/book3s_64_vio_hv.c     |  266 +++++++++++++++++++++++++++----
>>>>> arch/powerpc/kvm/book3s_hv.c            |   39 +++++
>>>>> arch/powerpc/kvm/book3s_hv_rmhandlers.S |    6 +
>>>>> arch/powerpc/kvm/book3s_pr_papr.c       |   37 ++++-
>>>>> arch/powerpc/kvm/powerpc.c              |    3 +
>>>>> include/uapi/linux/kvm.h                |    1 +
>>>>> 10 files changed, 473 insertions(+), 32 deletions(-)
>>>>> 
>>>>> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
>>>>> index 5f91eda..6c082ff 100644
>>>>> --- a/Documentation/virtual/kvm/api.txt
>>>>> +++ b/Documentation/virtual/kvm/api.txt
>>>>> @@ -2362,6 +2362,23 @@ calls by the guest for that service will be passed to userspace to be
>>>>> handled.
>>>>> 
>>>>> 
>>>>> +4.83 KVM_CAP_PPC_MULTITCE
>>>>> +
>>>>> +Capability: KVM_CAP_PPC_MULTITCE
>>>>> +Architectures: ppc
>>>>> +Type: vm
>>>>> +
>>>>> +This capability tells the guest that multiple TCE entry add/remove hypercalls
>>>>> +handling is supported by the kernel. This significantly accelerates DMA
>>>>> +operations for PPC KVM guests.
>>>>> +
>>>>> +Unlike other capabilities in this section, this one does not have an ioctl.
>>>>> +Instead, when the capability is present, the H_PUT_TCE_INDIRECT and
>>>>> +H_STUFF_TCE hypercalls are to be handled in the host kernel and not passed to
>>>>> +the guest. Otherwise it might be better for the guest to continue using H_PUT_TCE
>>>>> +hypercall (if KVM_CAP_SPAPR_TCE or KVM_CAP_SPAPR_TCE_IOMMU are present).
>>>> 
>>> 
>>>> While this describes perfectly well what the consequences are of the
>>>> patches, it does not describe properly what the CAP actually expresses.
>>>> The CAP only says "this kernel is able to handle H_PUT_TCE_INDIRECT and
>>>> H_STUFF_TCE hypercalls directly". All other consequences are nice to
>>>> document, but the semantics of the CAP are missing.
>>> 
>>> 
>>> ? It expresses ability to handle 2 hcalls. What is missing?
>> 
>> You don't describe the kvm <-> qemu interface. You describe some decisions qemu can take from this cap.
> 
> 
> This file does not mention qemu at all. And the interface is: qemu (or
> kvmtool, which could do the same) just adds "hcall-multi-tce" to
> "ibm,hypertas-functions", but this is for pseries Linux, and AIX could
> always do it (no idea about that). Does it really have to be in this file?

Ok, let's go back a step. What does this CAP describe? Don't look at the description you wrote above. Just write a new one. What exactly can user space expect when it finds this CAP?

> 
> 
> 
>>>> We also usually try to keep KVM behavior unchanged with regards to older
>>>> versions until a CAP is enabled. In this case I don't think it matters
>>>> all that much, so I'm fine with declaring it as enabled by default.
>>>> Please document that this is a change in behavior versus older KVM
>>>> versions though.
>>> 
>>> 
>>> Ok!
>>> 
>>> 
>>>>> +
>>>>> +
>>>>> 5. The kvm_run structure
>>>>> ------------------------
>>>>> 
>>>>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
>>>>> index af326cd..85d8f26 100644
>>>>> --- a/arch/powerpc/include/asm/kvm_host.h
>>>>> +++ b/arch/powerpc/include/asm/kvm_host.h
>>>>> @@ -609,6 +609,8 @@ struct kvm_vcpu_arch {
>>>>> 	spinlock_t tbacct_lock;
>>>>> 	u64 busy_stolen;
>>>>> 	u64 busy_preempt;
>>>>> +
>>>>> +	unsigned long *tce_tmp;    /* TCE cache for H_PUT_TCE_INDIRECT hcall */
>>>>> #endif
>>>>> };
>>>> 
>>>> [...]
>>>>> 
>>>>> 
>>>> 
>>>> [...]
>>>> 
>>>>> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
>>>>> index 550f592..a39039a 100644
>>>>> --- a/arch/powerpc/kvm/book3s_hv.c
>>>>> +++ b/arch/powerpc/kvm/book3s_hv.c
>>>>> @@ -568,6 +568,30 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
>>>>> 			ret = kvmppc_xics_hcall(vcpu, req);
>>>>> 			break;
>>>>> 		} /* fallthrough */
>>>> 
>>>> The fallthrough comment isn't accurate anymore.
>>>> 
>>>>> +		return RESUME_HOST;
>>>>> +	case H_PUT_TCE:
>>>>> +		ret = kvmppc_virtmode_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
>>>>> +						kvmppc_get_gpr(vcpu, 5),
>>>>> +						kvmppc_get_gpr(vcpu, 6));
>>>>> +		if (ret == H_TOO_HARD)
>>>>> +			return RESUME_HOST;
>>>>> +		break;
>>>>> +	case H_PUT_TCE_INDIRECT:
>>>>> +		ret = kvmppc_virtmode_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4),
>>>>> +						kvmppc_get_gpr(vcpu, 5),
>>>>> +						kvmppc_get_gpr(vcpu, 6),
>>>>> +						kvmppc_get_gpr(vcpu, 7));
>>>>> +		if (ret == H_TOO_HARD)
>>>>> +			return RESUME_HOST;
>>>>> +		break;
>>>>> +	case H_STUFF_TCE:
>>>>> +		ret = kvmppc_virtmode_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
>>>>> +						kvmppc_get_gpr(vcpu, 5),
>>>>> +						kvmppc_get_gpr(vcpu, 6),
>>>>> +						kvmppc_get_gpr(vcpu, 7));
>>>>> +		if (ret == H_TOO_HARD)
>>>>> +			return RESUME_HOST;
>>>>> +		break;
>>>>> 	default:
>>>>> 		return RESUME_HOST;
>>>>> 	}
>>>>> @@ -958,6 +982,20 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
>>>>> 	vcpu->arch.cpu_type = KVM_CPU_3S_64;
>>>>> 	kvmppc_sanity_check(vcpu);
>>>>> 
>>>>> +	/*
>>>>> +	 * As we want to minimize the chance of having H_PUT_TCE_INDIRECT
>>>>> +	 * half executed, we first read TCEs from the user, check them and
>>>>> +	 * return error if something went wrong and only then put TCEs into
>>>>> +	 * the TCE table.
>>>>> +	 *
>>>>> +	 * tce_tmp is a cache for TCEs to avoid stack allocation or
>>>>> +	 * kmalloc as the whole TCE list can take up to 512 items of 8 bytes
>>>>> +	 * each (4096 bytes).
>>>>> +	 */
>>>>> +	vcpu->arch.tce_tmp = kmalloc(4096, GFP_KERNEL);
>>>>> +	if (!vcpu->arch.tce_tmp)
>>>>> +		goto free_vcpu;
>>>>> +
>>>>> 	return vcpu;
>>>>> 
>>>>> free_vcpu:
>>>>> @@ -980,6 +1018,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
>>>>> 	unpin_vpa(vcpu->kvm, &vcpu->arch.slb_shadow);
>>>>> 	unpin_vpa(vcpu->kvm, &vcpu->arch.vpa);
>>>>> 	spin_unlock(&vcpu->arch.vpa_update_lock);
>>>>> +	kfree(vcpu->arch.tce_tmp);
>>>>> 	kvm_vcpu_uninit(vcpu);
>>>>> 	kmem_cache_free(kvm_vcpu_cache, vcpu);
>>>>> }
>>>>> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>>>>> index b02f91e..d35554e 100644
>>>>> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>>>>> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>>>>> @@ -1490,6 +1490,12 @@ hcall_real_table:
>>>>> 	.long	0		/* 0x11c */
>>>>> 	.long	0		/* 0x120 */
>>>>> 	.long	.kvmppc_h_bulk_remove - hcall_real_table
>>>>> +	.long	0		/* 0x128 */
>>>>> +	.long	0		/* 0x12c */
>>>>> +	.long	0		/* 0x130 */
>>>>> +	.long	0		/* 0x134 */
>>>>> +	.long	.kvmppc_h_stuff_tce - hcall_real_table
>>>>> +	.long	.kvmppc_h_put_tce_indirect - hcall_real_table
>>>>> hcall_real_table_end:
>>>>> 
>>>>> ignore_hdec:
>>>>> diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c
>>>>> index da0e0bc..91d4b45 100644
>>>>> --- a/arch/powerpc/kvm/book3s_pr_papr.c
>>>>> +++ b/arch/powerpc/kvm/book3s_pr_papr.c
>>>>> @@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu)
>>>>> 	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
>>>>> 	long rc;
>>>>> 
>>>>> -	rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
>>>>> +	rc = kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce);
>>>>> +	if (rc == H_TOO_HARD)
>>>>> +		return EMULATE_FAIL;
>>>>> +	kvmppc_set_gpr(vcpu, 3, rc);
>>>>> +	return EMULATE_DONE;
>>>>> +}
>>>>> +
>>>>> +static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu)
>>>>> +{
>>>>> +	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
>>>>> +	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
>>>>> +	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
>>>>> +	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
>>>>> +	long rc;
>>>>> +
>>>>> +	rc = kvmppc_virtmode_h_put_tce_indirect(vcpu, liobn, ioba,
>>>>> +			tce, npages);
>>>>> +	if (rc == H_TOO_HARD)
>>>>> +		return EMULATE_FAIL;
>>>>> +	kvmppc_set_gpr(vcpu, 3, rc);
>>>>> +	return EMULATE_DONE;
>>>>> +}
>>>>> +
>>>>> +static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu)
>>>>> +{
>>>>> +	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
>>>>> +	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
>>>>> +	unsigned long tce_value = kvmppc_get_gpr(vcpu, 6);
>>>>> +	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
>>>>> +	long rc;
>>>>> +
>>>>> +	rc = kvmppc_virtmode_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages);
>>>>> 	if (rc == H_TOO_HARD)
>>>>> 		return EMULATE_FAIL;
>>>>> 	kvmppc_set_gpr(vcpu, 3, rc);
>>>>> @@ -247,6 +278,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd)
>>>>> 		return kvmppc_h_pr_bulk_remove(vcpu);
>>>>> 	case H_PUT_TCE:
>>>>> 		return kvmppc_h_pr_put_tce(vcpu);
>>>>> +	case H_PUT_TCE_INDIRECT:
>>>>> +		return kvmppc_h_pr_put_tce_indirect(vcpu);
>>>>> +	case H_STUFF_TCE:
>>>>> +		return kvmppc_h_pr_stuff_tce(vcpu);
>>>>> 	case H_CEDE:
>>>>> 		vcpu->arch.shared->msr |= MSR_EE;
>>>>> 		kvm_vcpu_block(vcpu);
>>>>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>>>>> index 6316ee3..8465c2a 100644
>>>>> --- a/arch/powerpc/kvm/powerpc.c
>>>>> +++ b/arch/powerpc/kvm/powerpc.c
>>>>> @@ -395,6 +395,9 @@ int kvm_dev_ioctl_check_extension(long ext)
>>>>> 		r = 1;
>>>>> 		break;
>>>>> #endif
>>>>> +	case KVM_CAP_SPAPR_MULTITCE:
>>>>> +		r = 1;
>>>> 
>>>> This should only be true for book3s.
>>> 
>>> 
>>> We had this discussion with v2.
>>> 
>>> David:
>>> ===
>>> So, in the case of MULTITCE, that's not quite right.  PR KVM can
>>> emulate a PAPR system on a BookE machine, and there's no reason not to
>>> allow TCE acceleration as well.  We can't make it dependent on PAPR
>>> mode being selected, because that's enabled per-vcpu, whereas these
>>> capabilities are queried on the VM before the vcpus are created.
>>> ===
>>> 
>>> Wrong?
> 
>> Partially. BookE can not emulate a PAPR system as it stands today.
> 
> Oh.
> Ok.
> So - #ifdef CONFIG_PPC_BOOK3S_64 ? Or run-time check for book3s (how...)?

The former.


Alex


^ permalink raw reply	[flat|nested] 69+ messages in thread
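
The fix agreed above (guarding the capability with CONFIG_PPC_BOOK3S_64 in
kvm_dev_ioctl_check_extension()) would look roughly like the following; this
is a sketch of the agreed direction, not the committed patch:

	switch (ext) {
	...
#ifdef CONFIG_PPC_BOOK3S_64
	case KVM_CAP_SPAPR_MULTITCE:
		r = 1;
		break;
#endif /* CONFIG_PPC_BOOK3S_64 */
	...
	}

This keeps the CAP visible on all book3s kernels (HV and PR alike) while
hiding it on BookE, matching the point below that the CAP describes what the
kernel can do regardless of the guest's mode.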

* Re: [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls
  2013-06-17  8:37       ` Benjamin Herrenschmidt
@ 2013-06-17  8:42         ` Alexander Graf
  0 siblings, 0 replies; 69+ messages in thread
From: Alexander Graf @ 2013-06-17  8:42 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alexey Kardashevskiy, linuxppc-dev, David Gibson, Paul Mackerras,
	kvm, linux-kernel, kvm-ppc


On 17.06.2013, at 10:37, Benjamin Herrenschmidt wrote:

> On Mon, 2013-06-17 at 17:55 +1000, Alexey Kardashevskiy wrote:
>> David:
>> ===
>> So, in the case of MULTITCE, that's not quite right.  PR KVM can
>> emulate a PAPR system on a BookE machine, and there's no reason not to
>> allow TCE acceleration as well.  We can't make it dependent on PAPR
>> mode being selected, because that's enabled per-vcpu, whereas these
>> capabilities are queried on the VM before the vcpus are created.
>> ===
>> 
>> Wrong?
> 
> The capability just tells qemu the kernel supports it, it doesn't have
> to depend on PAPR mode, qemu can sort things out no ?

Yes, this goes hand-in-hand with the documentation bit I'm trying to get through to Alexey atm. The CAP merely says that if in PAPR mode the kernel can handle hypercalls X and Y itself.

This is true for all book3s implementations as the patches stand. It is not true for BookE as the patches stand. Hence the CAP should be limited to book3s, regardless of its mode :).


Alex


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls
  2013-06-17  8:40           ` Alexander Graf
@ 2013-06-17  8:51             ` Alexey Kardashevskiy
  2013-06-17 10:46               ` Alexander Graf
  0 siblings, 1 reply; 69+ messages in thread
From: Alexey Kardashevskiy @ 2013-06-17  8:51 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Benjamin Herrenschmidt, linuxppc-dev, David Gibson,
	Paul Mackerras, kvm, linux-kernel, kvm-ppc

On 06/17/2013 06:40 PM, Alexander Graf wrote:
> 
> On 17.06.2013, at 10:34, Alexey Kardashevskiy wrote:
> 
>> On 06/17/2013 06:02 PM, Alexander Graf wrote:
>>> 
>>> On 17.06.2013, at 09:55, Alexey Kardashevskiy wrote:
>>> 
>>>> On 06/17/2013 08:06 AM, Alexander Graf wrote:
>>>>> 
>>>>> On 05.06.2013, at 08:11, Alexey Kardashevskiy wrote:
>>>>> 
>>>>>> This adds real mode handlers for the H_PUT_TCE_INDIRECT and 
>>>>>> H_STUFF_TCE hypercalls for QEMU emulated devices such as
>>>>>> IBMVIO devices or emulated PCI.  These calls allow adding
>>>>>> multiple entries (up to 512) into the TCE table in one call
>>>>>> which saves time on transition to/from real mode.
>>>>>> 
>>>>>> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs 
>>>>>> (copied from user and verified) before writing the whole list
>>>>>> into the TCE table. This cache will be utilized more in the
>>>>>> upcoming VFIO/IOMMU support to continue TCE list processing in
>>>>>> the virtual mode in the case if the real mode handler failed
>>>>>> for some reason.
>>>>>> 
>>>>>> This adds a guest physical to host real address converter and
>>>>>> calls the existing H_PUT_TCE handler. The converting function 
>>>>>> is going to be fully utilized by upcoming VFIO supporting
>>>>>> patches.
>>>>>> 
>>>>>> This also implements the KVM_CAP_PPC_MULTITCE capability, so
>>>>>> in order to support the functionality of this patch, QEMU 
>>>>>> needs to query for this capability and set the
>>>>>> "hcall-multi-tce" hypertas property only if the capability is
>>>>>> present, otherwise there will be serious performance
>>>>>> degradation.
>>>>>> 
>>>>>> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by:
>>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Paul
>>>>>> Mackerras <paulus@samba.org>
>>>>> 
>>>>> Only a few minor nits. Ben already commented on implementation
>>>>> details.
>>>>> 
>>>>>> 
>>>>>> --- Changelog: 2013/06/05: * fixed mistype about IBMVIO in the
>>>>>> commit message * updated doc and moved it to another section *
>>>>>> changed capability number
>>>>>> 
>>>>>> 2013/05/21: * added kvm_vcpu_arch::tce_tmp * removed cleanup
>>>>>> if put_indirect failed, instead we do not even start writing
>>>>>> to TCE table if we cannot get TCEs from the user and they are 
>>>>>> invalid * kvmppc_emulated_h_put_tce is split to
>>>>>> kvmppc_emulated_put_tce and kvmppc_emulated_validate_tce (for
>>>>>> the previous item) * fixed bug with failthrough for H_IPI *
>>>>>> removed all get_user() from real mode handlers *
>>>>>> kvmppc_lookup_pte() added (instead of making lookup_linux_pte
>>>>>> public) --- Documentation/virtual/kvm/api.txt       |   17 ++ 
>>>>>> arch/powerpc/include/asm/kvm_host.h     |    2 + 
>>>>>> arch/powerpc/include/asm/kvm_ppc.h      |   16 +- 
>>>>>> arch/powerpc/kvm/book3s_64_vio.c        |  118 ++++++++++++++ 
>>>>>> arch/powerpc/kvm/book3s_64_vio_hv.c     |  266
>>>>>> +++++++++++++++++++++++++++---- arch/powerpc/kvm/book3s_hv.c
>>>>>> |   39 +++++ arch/powerpc/kvm/book3s_hv_rmhandlers.S |    6 + 
>>>>>> arch/powerpc/kvm/book3s_pr_papr.c       |   37 ++++- 
>>>>>> arch/powerpc/kvm/powerpc.c              |    3 + 
>>>>>> include/uapi/linux/kvm.h                |    1 + 10 files
>>>>>> changed, 473 insertions(+), 32 deletions(-)
>>>>>> 
>>>>>> diff --git a/Documentation/virtual/kvm/api.txt
>>>>>> b/Documentation/virtual/kvm/api.txt index 5f91eda..6c082ff
>>>>>> 100644 --- a/Documentation/virtual/kvm/api.txt +++
>>>>>> b/Documentation/virtual/kvm/api.txt @@ -2362,6 +2362,23 @@
>>>>>> calls by the guest for that service will be passed to
>>>>>> userspace to be handled.
>>>>>> 
>>>>>> 
>>>>>> +4.83 KVM_CAP_PPC_MULTITCE + +Capability:
>>>>>> KVM_CAP_PPC_MULTITCE +Architectures: ppc +Type: vm + +This
>>>>>> capability tells the guest that multiple TCE entry add/remove
>>>>>> hypercalls +handling is supported by the kernel. This
>>>>>> significantly accelerates DMA +operations for PPC KVM guests.
>>>>>> + +Unlike other capabilities in this section, this one does
>>>>>> not have an ioctl. +Instead, when the capability is present,
>>>>>> the H_PUT_TCE_INDIRECT and +H_STUFF_TCE hypercalls are to be
>>>>>> handled in the host kernel and not passed to +the guest.
>>>>>> Otherwise it might be better for the guest to continue using
>>>>>> H_PUT_TCE +hypercall (if KVM_CAP_SPAPR_TCE or
>>>>>> KVM_CAP_SPAPR_TCE_IOMMU are present).
>>>>> 
>>>> 
>>>>> While this describes perfectly well what the consequences are of
>>>>> the patches, it does not describe properly what the CAP actually
>>>>> expresses. The CAP only says "this kernel is able to handle
>>>>> H_PUT_TCE_INDIRECT and H_STUFF_TCE hypercalls directly". All
>>>>> other consequences are nice to document, but the semantics of
>>>>> the CAP are missing.
>>>> 
>>>> 
>>>> ? It expresses ability to handle 2 hcalls. What is missing?
>>> 
>>> You don't describe the kvm <-> qemu interface. You describe some
>>> decisions qemu can take from this cap.
>> 
>> 
>> This file does not mention qemu at all. And the interface is - qemu
>> (or kvmtool could do that) just adds "hcall-multi-tce" to 
>> "ibm,hypertas-functions" but this is for pseries linux and AIX could
>> always do it (no idea about it). Does it really have to be in this
>> file?
> 

> Ok, let's go back a step. What does this CAP describe? Don't look at the
> description you wrote above. Just write a new one.

The CAP means the kernel is capable of handling hcalls A and B without
passing those into the user space. That accelerates DMA.


> What exactly can user space expect when it finds this CAP?

The user space can expect that its handlers for A and B are not going to be
called if it configures the guest appropriately.

Any better? :)


-- 
Alexey

^ permalink raw reply	[flat|nested] 69+ messages in thread
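
To make the semantics above concrete: user space discovers the CAP with
KVM_CHECK_EXTENSION and only then advertises "hcall-multi-tce". A minimal
sketch, assuming the KVM_CAP_SPAPR_MULTITCE number defined by this series
(error handling omitted):

	#include <fcntl.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	int kvm_fd = open("/dev/kvm", O_RDWR);

	/* returns > 0 when the kernel handles H_PUT_TCE_INDIRECT/H_STUFF_TCE */
	if (ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_SPAPR_MULTITCE) > 0) {
		/* safe to add "hcall-multi-tce" to ibm,hypertas-functions;
		 * the guest will then use the multi-TCE hcalls for DMA setup */
	}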

* Re: [PATCH 2/4] powerpc: Prepare to support kernel handling of IOMMU map/unmap
  2013-06-16  4:26   ` Benjamin Herrenschmidt
  2013-06-16  4:31     ` Benjamin Herrenschmidt
@ 2013-06-17  9:17     ` Alexey Kardashevskiy
  1 sibling, 0 replies; 69+ messages in thread
From: Alexey Kardashevskiy @ 2013-06-17  9:17 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linuxppc-dev, David Gibson, Alexander Graf, Paul Mackerras, kvm,
	linux-kernel, kvm-ppc, linux-mm

On 06/16/2013 02:26 PM, Benjamin Herrenschmidt wrote:
>> +#if defined(CONFIG_SPARSEMEM_VMEMMAP) || defined(CONFIG_FLATMEM)
>> +int realmode_get_page(struct page *page)
>> +{
>> +	if (PageCompound(page))
>> +		return -EAGAIN;
>> +
>> +	get_page(page);
>> +
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(realmode_get_page);
>> +
>> +int realmode_put_page(struct page *page)
>> +{
>> +	if (PageCompound(page))
>> +		return -EAGAIN;
>> +
>> +	if (!atomic_add_unless(&page->_count, -1, 1))
>> +		return -EAGAIN;
>> +
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(realmode_put_page);
>> +#endif
> 
> Several worries here, mostly that if the generic code ever changes
> (something gets added to get_page() that makes it no-longer safe for use
> in real mode for example, or some other condition gets added to
> put_page()), we go out of sync and potentially end up with very hard and
> very subtle bugs.
> 
> It might be worth making sure that:
> 
>  - This is reviewed by some generic VM people (and make sure they
> understand why we need to do that)
> 
>  - A comment is added to get_page() and put_page() to make sure that if
> they are changed in any way, dbl check the impact on our
> realmode_get_page() (or "ping" us to make sure things are still ok).

After changing get_page() to get_page_unless_zero(), the get_page API I use is:
get_page_unless_zero() - basically atomic_inc_not_zero()
atomic_add_unless() - just operated with the counter
PageCompound() - check if it is a huge page.

No usage of get_page or put_page.

If any of those changes, I would expect it to hit us immediately, no?

So it may only make sense to add a comment to PageCompound(). But the
comment says "PageCompound is generally not used in hot code paths", and
our path is hot. Heh.

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6d53675..c70a654 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -329,7 +329,8 @@ static inline void set_page_writeback(struct page *page)
  * System with lots of page flags available. This allows separate
  * flags for PageHead() and PageTail() checks of compound pages so that bit
  * tests can be used in performance sensitive paths. PageCompound is
- * generally not used in hot code paths.
+ * generally not used in hot code paths except arch/powerpc/mm/init_64.c
+ * which uses it to detect huge pages and avoid handling those in real mode.
  */
 __PAGEFLAG(Head, head) CLEARPAGEFLAG(Head, head)
 __PAGEFLAG(Tail, tail)


So?


-- 
Alexey

^ permalink raw reply related	[flat|nested] 69+ messages in thread
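
The key property of atomic_add_unless(&page->_count, -1, 1) in
realmode_put_page() above is that it never drops the last reference, since
freeing a page is unsafe in real mode. Expanded into a cmpxchg loop it is
roughly:

	int old, cur;

	old = atomic_read(&page->_count);
	while (old != 1) {
		cur = atomic_cmpxchg(&page->_count, old, old - 1);
		if (cur == old)
			return 0;	/* dropped one reference, page stays live */
		old = cur;		/* lost a race, retry with the fresh value */
	}
	return -EAGAIN;			/* last reference: defer to virtual mode */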

* Re: [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls
  2013-06-17  8:51             ` Alexey Kardashevskiy
@ 2013-06-17 10:46               ` Alexander Graf
  2013-06-17 10:48                 ` Alexander Graf
  0 siblings, 1 reply; 69+ messages in thread
From: Alexander Graf @ 2013-06-17 10:46 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Benjamin Herrenschmidt, linuxppc-dev, David Gibson,
	Paul Mackerras, kvm, linux-kernel, kvm-ppc

On 06/17/2013 10:51 AM, Alexey Kardashevskiy wrote:
> On 06/17/2013 06:40 PM, Alexander Graf wrote:
>> On 17.06.2013, at 10:34, Alexey Kardashevskiy wrote:
>>
>>> On 06/17/2013 06:02 PM, Alexander Graf wrote:
>>>> On 17.06.2013, at 09:55, Alexey Kardashevskiy wrote:
>>>>
>>>>> On 06/17/2013 08:06 AM, Alexander Graf wrote:
>>>>>> On 05.06.2013, at 08:11, Alexey Kardashevskiy wrote:
>>>>>>
>>>>>>> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
>>>>>>> H_STUFF_TCE hypercalls for QEMU emulated devices such as
>>>>>>> IBMVIO devices or emulated PCI.  These calls allow adding
>>>>>>> multiple entries (up to 512) into the TCE table in one call
>>>>>>> which saves time on transition to/from real mode.
>>>>>>>
>>>>>>> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
>>>>>>> (copied from user and verified) before writing the whole list
>>>>>>> into the TCE table. This cache will be utilized more in the
>>>>>>> upcoming VFIO/IOMMU support to continue TCE list processing in
>>>>>>> the virtual mode in the case if the real mode handler failed
>>>>>>> for some reason.
>>>>>>>
>>>>>>> This adds a guest physical to host real address converter and
>>>>>>> calls the existing H_PUT_TCE handler. The converting function
>>>>>>> is going to be fully utilized by upcoming VFIO supporting
>>>>>>> patches.
>>>>>>>
>>>>>>> This also implements the KVM_CAP_PPC_MULTITCE capability, so
>>>>>>> in order to support the functionality of this patch, QEMU
>>>>>>> needs to query for this capability and set the
>>>>>>> "hcall-multi-tce" hypertas property only if the capability is
>>>>>>> present, otherwise there will be serious performance
>>>>>>> degradation.
>>>>>>>
>>>>>>> Cc: David Gibson<david@gibson.dropbear.id.au>  Signed-off-by:
>>>>>>> Alexey Kardashevskiy<aik@ozlabs.ru>  Signed-off-by: Paul
>>>>>>> Mackerras<paulus@samba.org>
>>>>>> Only a few minor nits. Ben already commented on implementation
>>>>>> details.
>>>>>>
>>>>>>> --- Changelog: 2013/06/05: * fixed mistype about IBMVIO in the
>>>>>>> commit message * updated doc and moved it to another section *
>>>>>>> changed capability number
>>>>>>>
>>>>>>> 2013/05/21: * added kvm_vcpu_arch::tce_tmp * removed cleanup
>>>>>>> if put_indirect failed, instead we do not even start writing
>>>>>>> to TCE table if we cannot get TCEs from the user and they are
>>>>>>> invalid * kvmppc_emulated_h_put_tce is split to
>>>>>>> kvmppc_emulated_put_tce and kvmppc_emulated_validate_tce (for
>>>>>>> the previous item) * fixed bug with failthrough for H_IPI *
>>>>>>> removed all get_user() from real mode handlers *
>>>>>>> kvmppc_lookup_pte() added (instead of making lookup_linux_pte
>>>>>>> public) --- Documentation/virtual/kvm/api.txt       |   17 ++
>>>>>>> arch/powerpc/include/asm/kvm_host.h     |    2 +
>>>>>>> arch/powerpc/include/asm/kvm_ppc.h      |   16 +-
>>>>>>> arch/powerpc/kvm/book3s_64_vio.c        |  118 ++++++++++++++
>>>>>>> arch/powerpc/kvm/book3s_64_vio_hv.c     |  266
>>>>>>> +++++++++++++++++++++++++++---- arch/powerpc/kvm/book3s_hv.c
>>>>>>> |   39 +++++ arch/powerpc/kvm/book3s_hv_rmhandlers.S |    6 +
>>>>>>> arch/powerpc/kvm/book3s_pr_papr.c       |   37 ++++-
>>>>>>> arch/powerpc/kvm/powerpc.c              |    3 +
>>>>>>> include/uapi/linux/kvm.h                |    1 + 10 files
>>>>>>> changed, 473 insertions(+), 32 deletions(-)
>>>>>>>
>>>>>>> diff --git a/Documentation/virtual/kvm/api.txt
>>>>>>> b/Documentation/virtual/kvm/api.txt index 5f91eda..6c082ff
>>>>>>> 100644 --- a/Documentation/virtual/kvm/api.txt +++
>>>>>>> b/Documentation/virtual/kvm/api.txt @@ -2362,6 +2362,23 @@
>>>>>>> calls by the guest for that service will be passed to
>>>>>>> userspace to be handled.
>>>>>>>
>>>>>>>
>>>>>>> +4.83 KVM_CAP_PPC_MULTITCE + +Capability:
>>>>>>> KVM_CAP_PPC_MULTITCE +Architectures: ppc +Type: vm + +This
>>>>>>> capability tells the guest that multiple TCE entry add/remove
>>>>>>> hypercalls +handling is supported by the kernel. This
>>>>>>> significantly accelerates DMA +operations for PPC KVM guests.
>>>>>>> + +Unlike other capabilities in this section, this one does
>>>>>>> not have an ioctl. +Instead, when the capability is present,
>>>>>>> the H_PUT_TCE_INDIRECT and +H_STUFF_TCE hypercalls are to be
>>>>>>> handled in the host kernel and not passed to +the guest.
>>>>>>> Otherwise it might be better for the guest to continue using
>>>>>>> H_PUT_TCE +hypercall (if KVM_CAP_SPAPR_TCE or
>>>>>>> KVM_CAP_SPAPR_TCE_IOMMU are present).
>>>>>> While this describes perfectly well what the consequences are of
>>>>>> the patches, it does not describe properly what the CAP actually
>>>>>> expresses. The CAP only says "this kernel is able to handle
>>>>>> H_PUT_TCE_INDIRECT and H_STUFF_TCE hypercalls directly". All
>>>>>> other consequences are nice to document, but the semantics of
>>>>>> the CAP are missing.
>>>>>
>>>>> ? It expresses ability to handle 2 hcalls. What is missing?
>>>> You don't describe the kvm<->  qemu interface. You describe some
>>>> decisions qemu can take from this cap.
>>>
>>> This file does not mention qemu at all. And the interface is - qemu
>>> (or kvmtool could do that) just adds "hcall-multi-tce" to
>>> "ibm,hypertas-functions" but this is for pseries linux and AIX could
>>> always do it (no idea about it). Does it really have to be in this
>>> file?
>> Ok, let's go back a step. What does this CAP describe? Don't look at the
>> description you wrote above. Just write a new one.
> The CAP means the kernel is capable of handling hcalls A and B without
> passing those into the user space. That accelerates DMA.
>
>
>> What exactly can user space expect when it finds this CAP?
> The user space can expect that its handlers for A and B are not going to be
> called if it configures the guest appropriately.
>
> Any better? :)

A lot, yes. This is what the CAP actually means.

It's nice to give some guidance in the documentation of implications 
(should expose "ibm,hypertas-functions" to enable the guest to actually 
use these for example) but the first paragraph should only indicate what 
the CAP changes.


Alex


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls
  2013-06-17 10:46               ` Alexander Graf
@ 2013-06-17 10:48                 ` Alexander Graf
  0 siblings, 0 replies; 69+ messages in thread
From: Alexander Graf @ 2013-06-17 10:48 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Benjamin Herrenschmidt, linuxppc-dev, David Gibson,
	Paul Mackerras, kvm, linux-kernel, kvm-ppc

On 06/17/2013 12:46 PM, Alexander Graf wrote:
> On 06/17/2013 10:51 AM, Alexey Kardashevskiy wrote:
>> On 06/17/2013 06:40 PM, Alexander Graf wrote:
>>> On 17.06.2013, at 10:34, Alexey Kardashevskiy wrote:
>>>
>>>> On 06/17/2013 06:02 PM, Alexander Graf wrote:
>>>>> On 17.06.2013, at 09:55, Alexey Kardashevskiy wrote:
>>>>>
>>>>>> On 06/17/2013 08:06 AM, Alexander Graf wrote:
>>>>>>> On 05.06.2013, at 08:11, Alexey Kardashevskiy wrote:
>>>>>>>
>>>>>>>> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
>>>>>>>> H_STUFF_TCE hypercalls for QEMU emulated devices such as
>>>>>>>> IBMVIO devices or emulated PCI.  These calls allow adding
>>>>>>>> multiple entries (up to 512) into the TCE table in one call
>>>>>>>> which saves time on transition to/from real mode.
>>>>>>>>
>>>>>>>> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
>>>>>>>> (copied from user and verified) before writing the whole list
>>>>>>>> into the TCE table. This cache will be utilized more in the
>>>>>>>> upcoming VFIO/IOMMU support to continue TCE list processing in
>>>>>>>> the virtual mode in the case if the real mode handler failed
>>>>>>>> for some reason.
>>>>>>>>
>>>>>>>> This adds a guest physical to host real address converter and
>>>>>>>> calls the existing H_PUT_TCE handler. The converting function
>>>>>>>> is going to be fully utilized by upcoming VFIO supporting
>>>>>>>> patches.
>>>>>>>>
>>>>>>>> This also implements the KVM_CAP_PPC_MULTITCE capability, so
>>>>>>>> in order to support the functionality of this patch, QEMU
>>>>>>>> needs to query for this capability and set the
>>>>>>>> "hcall-multi-tce" hypertas property only if the capability is
>>>>>>>> present, otherwise there will be serious performance
>>>>>>>> degradation.
>>>>>>>>
>>>>>>>> Cc: David Gibson<david@gibson.dropbear.id.au>  Signed-off-by:
>>>>>>>> Alexey Kardashevskiy<aik@ozlabs.ru>  Signed-off-by: Paul
>>>>>>>> Mackerras<paulus@samba.org>
>>>>>>> Only a few minor nits. Ben already commented on implementation
>>>>>>> details.
>>>>>>>
>>>>>>>> --- Changelog: 2013/06/05: * fixed mistype about IBMVIO in the
>>>>>>>> commit message * updated doc and moved it to another section *
>>>>>>>> changed capability number
>>>>>>>>
>>>>>>>> 2013/05/21: * added kvm_vcpu_arch::tce_tmp * removed cleanup
>>>>>>>> if put_indirect failed, instead we do not even start writing
>>>>>>>> to TCE table if we cannot get TCEs from the user and they are
>>>>>>>> invalid * kvmppc_emulated_h_put_tce is split to
>>>>>>>> kvmppc_emulated_put_tce and kvmppc_emulated_validate_tce (for
>>>>>>>> the previous item) * fixed bug with failthrough for H_IPI *
>>>>>>>> removed all get_user() from real mode handlers *
>>>>>>>> kvmppc_lookup_pte() added (instead of making lookup_linux_pte
>>>>>>>> public) --- Documentation/virtual/kvm/api.txt       |   17 ++
>>>>>>>> arch/powerpc/include/asm/kvm_host.h     |    2 +
>>>>>>>> arch/powerpc/include/asm/kvm_ppc.h      |   16 +-
>>>>>>>> arch/powerpc/kvm/book3s_64_vio.c        |  118 ++++++++++++++
>>>>>>>> arch/powerpc/kvm/book3s_64_vio_hv.c     |  266
>>>>>>>> +++++++++++++++++++++++++++---- arch/powerpc/kvm/book3s_hv.c
>>>>>>>> |   39 +++++ arch/powerpc/kvm/book3s_hv_rmhandlers.S |    6 +
>>>>>>>> arch/powerpc/kvm/book3s_pr_papr.c       |   37 ++++-
>>>>>>>> arch/powerpc/kvm/powerpc.c              |    3 +
>>>>>>>> include/uapi/linux/kvm.h                |    1 + 10 files
>>>>>>>> changed, 473 insertions(+), 32 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/Documentation/virtual/kvm/api.txt
>>>>>>>> b/Documentation/virtual/kvm/api.txt index 5f91eda..6c082ff
>>>>>>>> 100644 --- a/Documentation/virtual/kvm/api.txt +++
>>>>>>>> b/Documentation/virtual/kvm/api.txt @@ -2362,6 +2362,23 @@
>>>>>>>> calls by the guest for that service will be passed to
>>>>>>>> userspace to be handled.
>>>>>>>>
>>>>>>>>
>>>>>>>> +4.83 KVM_CAP_PPC_MULTITCE + +Capability:
>>>>>>>> KVM_CAP_PPC_MULTITCE +Architectures: ppc +Type: vm + +This
>>>>>>>> capability tells the guest that multiple TCE entry add/remove
>>>>>>>> hypercalls +handling is supported by the kernel. This
>>>>>>>> significantly accelerates DMA +operations for PPC KVM guests.
>>>>>>>> + +Unlike other capabilities in this section, this one does
>>>>>>>> not have an ioctl. +Instead, when the capability is present,
>>>>>>>> the H_PUT_TCE_INDIRECT and +H_STUFF_TCE hypercalls are to be
>>>>>>>> handled in the host kernel and not passed to +the guest.
>>>>>>>> Otherwise it might be better for the guest to continue using
>>>>>>>> H_PUT_TCE +hypercall (if KVM_CAP_SPAPR_TCE or
>>>>>>>> KVM_CAP_SPAPR_TCE_IOMMU are present).
>>>>>>> While this describes perfectly well what the consequences are of
>>>>>>> the patches, it does not describe properly what the CAP actually
>>>>>>> expresses. The CAP only says "this kernel is able to handle
>>>>>>> H_PUT_TCE_INDIRECT and H_STUFF_TCE hypercalls directly". All
>>>>>>> other consequences are nice to document, but the semantics of
>>>>>>> the CAP are missing.
>>>>>>
>>>>>> ? It expresses ability to handle 2 hcalls. What is missing?
>>>>> You don't describe the kvm<->  qemu interface. You describe some
>>>>> decisions qemu can take from this cap.
>>>>
>>>> This file does not mention qemu at all. And the interface is - qemu
>>>> (or kvmtool could do that) just adds "hcall-multi-tce" to
>>>> "ibm,hypertas-functions" but this is for pseries linux and AIX could
>>>> always do it (no idea about it). Does it really have to be in this
>>>> file?
>>> Ok, let's go back a step. What does this CAP describe? Don't look at 
>>> the
>>> description you wrote above. Just write a new one.
>> The CAP means the kernel is capable of handling hcalls A and B without
>> passing those into the user space. That accelerates DMA.
>>
>>
>>> What exactly can user space expect when it finds this CAP?
>> The user space can expect that its handlers for A and B are not going 
>> to be
>> called if it configures the guest appropriately.

Actually a nitpick here too. User space can expect that its handlers for 
A and B are going to already be processed by KVM. Regardless of how user 
space configures the guest.


Alex


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 4/4] KVM: PPC: Add hugepage support for IOMMU in-kernel handling
  2013-06-05  6:11 ` [PATCH 4/4] KVM: PPC: Add hugepage " Alexey Kardashevskiy
  2013-06-16  4:46   ` Benjamin Herrenschmidt
@ 2013-06-17 16:35   ` Paolo Bonzini
  1 sibling, 0 replies; 69+ messages in thread
From: Paolo Bonzini @ 2013-06-17 16:35 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Benjamin Herrenschmidt, linuxppc-dev, David Gibson,
	Alexander Graf, Paul Mackerras, kvm, linux-kernel, kvm-ppc

On 05/06/2013 08:11, Alexey Kardashevskiy wrote:
> +/*
> + * The KVM guest can be backed with 16MB pages (qemu switch
> + * -mem-path /var/lib/hugetlbfs/global/pagesize-16MB/).

Nitpick: we try to avoid references to QEMU, so perhaps

s/qemu switch/for example, with QEMU you can use the command-line option/

Paolo

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-17  3:56       ` Benjamin Herrenschmidt
@ 2013-06-18  2:32         ` Alex Williamson
  2013-06-18  4:38           ` Benjamin Herrenschmidt
  2013-06-19  3:35           ` Rusty Russell
  0 siblings, 2 replies; 69+ messages in thread
From: Alex Williamson @ 2013-06-18  2:32 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alexey Kardashevskiy, linuxppc-dev, David Gibson, Alexander Graf,
	Paul Mackerras, kvm, linux-kernel, kvm-ppc

On Mon, 2013-06-17 at 13:56 +1000, Benjamin Herrenschmidt wrote:
> On Sun, 2013-06-16 at 21:13 -0600, Alex Williamson wrote:
> 
> > IOMMU groups themselves don't provide security, they're accessed by
> > interfaces like VFIO, which provide the security.  Given a brief look, I
> > agree, this looks like a possible backdoor.  The typical VFIO way to
> > handle this would be to pass a VFIO file descriptor here to prove that
> > the process has access to the IOMMU group.  This is how /dev/vfio/vfio
> > gains the ability to set up an IOMMU domain and do mappings with the
> > SET_CONTAINER ioctl using a group fd.  Thanks,
> 
> How do you envision that in the kernel ? IE. I'm in KVM code, gets that
> vfio fd, what do I do with it ?
> 
> Basically, KVM needs to know that the user is allowed to use that iommu
> group. I don't think we want KVM however to call into VFIO directly
> right ?

Right, we don't want to create dependencies across modules.  I don't
have a vision for how this should work.  This is effectively a complete
side-band to vfio, so we're really just dealing in the iommu group
space.  Maybe there needs to be some kind of registration of ownership
for the group using some kind of token.  It would need to include some
kind of notification when that ownership ends.  That might also be a
convenient tag to toggle driver probing off for devices in the group.
Other ideas?  Thanks,

Alex


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-18  2:32         ` Alex Williamson
@ 2013-06-18  4:38           ` Benjamin Herrenschmidt
  2013-06-18 14:48             ` Alex Williamson
  2013-06-19  3:35           ` Rusty Russell
  1 sibling, 1 reply; 69+ messages in thread
From: Benjamin Herrenschmidt @ 2013-06-18  4:38 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, linuxppc-dev, David Gibson, Alexander Graf,
	Paul Mackerras, kvm, linux-kernel, kvm-ppc, Rusty Russell

On Mon, 2013-06-17 at 20:32 -0600, Alex Williamson wrote:

> Right, we don't want to create dependencies across modules.  I don't
> have a vision for how this should work.  This is effectively a complete
> side-band to vfio, so we're really just dealing in the iommu group
> space.  Maybe there needs to be some kind of registration of ownership
> for the group using some kind of token.  It would need to include some
> kind of notification when that ownership ends.  That might also be a
> convenient tag to toggle driver probing off for devices in the group.
> Other ideas?  Thanks,

All of that smells nasty like it will need a pile of bloody
infrastructure.... which makes me think it's too complicated and not the
right approach.

How does access control work today on x86/VFIO ? Can you give me a bit
more details ? I didn't get a good grasp in your previous email....

From the look of it, the VFIO file descriptor is what has the "access
control" to the underlying iommu, is this right ? So we somewhat need to
transfer (or copy) that ownership from the VFIO fd to the KVM VM.

I don't see a way to do that without some cross-layering here...

Rusty, are you aware of some kernel mechanism we can use for that ?

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-18  4:38           ` Benjamin Herrenschmidt
@ 2013-06-18 14:48             ` Alex Williamson
  2013-06-18 21:58               ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 69+ messages in thread
From: Alex Williamson @ 2013-06-18 14:48 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alexey Kardashevskiy, linuxppc-dev, David Gibson, Alexander Graf,
	Paul Mackerras, kvm, linux-kernel, kvm-ppc, Rusty Russell

On Tue, 2013-06-18 at 14:38 +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2013-06-17 at 20:32 -0600, Alex Williamson wrote:
> 
> > Right, we don't want to create dependencies across modules.  I don't
> > have a vision for how this should work.  This is effectively a complete
> > side-band to vfio, so we're really just dealing in the iommu group
> > space.  Maybe there needs to be some kind of registration of ownership
> > for the group using some kind of token.  It would need to include some
> > kind of notification when that ownership ends.  That might also be a
> > convenient tag to toggle driver probing off for devices in the group.
> > Other ideas?  Thanks,
> 
> All of that smells nasty like it will need a pile of bloody
> infrastructure.... which makes me think it's too complicated and not the
> right approach.
> 
> How does access control work today on x86/VFIO ? Can you give me a bit
> more details ? I didn't get a good grasp in your previous email....

The current model is not x86 specific, but it only covers doing iommu
and device access through vfio.  The kink here is that we're trying to
do device access and setup through vfio, but iommu manipulation through
kvm.  We may want to revisit whether we can do the in-kernel iommu
manipulation through vfio rather than kvm.

For vfio in general, the group is the unit of ownership.  A user is
granted access to /dev/vfio/$GROUP through file permissions.  The user
opens the group and a container (/dev/vfio/vfio) and calls SET_CONTAINER
on the group.  If supported by the platform, multiple groups can be set
to the same container, which allows for iommu domain sharing.  Once a
group is associated with a container, an iommu backend can be
initialized for the container.  Only then can a device be accessed
through the group.

So even if we were to pass a vfio group file descriptor into kvm and it
matched as some kind of ownership token on the iommu group, it's not
clear that's sufficient to assume we can start programming the iommu.
Thanks,

Alex

> From the look of it, the VFIO file descriptor is what has the "access
> control" to the underlying iommu, is this right ? So we somewhat need to
> transfer (or copy) that ownership from the VFIO fd to the KVM VM.
> 
> I don't see a way to do that without some cross-layering here...
> 
> Rusty, are you aware of some kernel mechanism we can use for that ?
> 
> Cheers,
> Ben.
> 
> 




^ permalink raw reply	[flat|nested] 69+ messages in thread
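
For reference, the flow described above looks like this from user space (the
group number and device address are made up; on power the iommu type would be
the spapr one rather than type1):

	int container = open("/dev/vfio/vfio", O_RDWR);
	int group = open("/dev/vfio/26", O_RDWR);	/* IOMMU group 26 */
	int device;

	/* prove ownership of the group and attach it to the container */
	ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);

	/* only now may an IOMMU backend be initialized for the container */
	ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

	/* and only after that can a device in the group be accessed */
	device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");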

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-18 14:48             ` Alex Williamson
@ 2013-06-18 21:58               ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 69+ messages in thread
From: Benjamin Herrenschmidt @ 2013-06-18 21:58 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, linuxppc-dev, David Gibson, Alexander Graf,
	Paul Mackerras, kvm, linux-kernel, kvm-ppc, Rusty Russell

On Tue, 2013-06-18 at 08:48 -0600, Alex Williamson wrote:
> On Tue, 2013-06-18 at 14:38 +1000, Benjamin Herrenschmidt wrote:
> > On Mon, 2013-06-17 at 20:32 -0600, Alex Williamson wrote:
> > 
> > > Right, we don't want to create dependencies across modules.  I don't
> > > have a vision for how this should work.  This is effectively a complete
> > > side-band to vfio, so we're really just dealing in the iommu group
> > > space.  Maybe there needs to be some kind of registration of ownership
> > > for the group using some kind of token.  It would need to include some
> > > kind of notification when that ownership ends.  That might also be a
> > > convenient tag to toggle driver probing off for devices in the group.
> > > Other ideas?  Thanks,
> > 
> > All of that smells nasty like it will need a pile of bloody
> > infrastructure.... which makes me think it's too complicated and not the
> > right approach.
> > 
> > How does access control work today on x86/VFIO ? Can you give me a bit
> > more details ? I didn't get a good grasp in your previous email....
> 
> The current model is not x86 specific, but it only covers doing iommu
> and device access through vfio.  The kink here is that we're trying to
> do device access and setup through vfio, but iommu manipulation through
> kvm.  We may want to revisit whether we can do the in-kernel iommu
> manipulation through vfio rather than kvm.

How would that be possible ?

The hypercalls from the guest arrive in KVM... in a very very specific &
restricted environment which we call real mode (MMU off but still
running in guest context), where we try to do as much as possible, or in
virtual mode, where they get handled as normal KVM exits.

The only way we could handle them "in VFIO" would be if somehow VFIO
registered callbacks with KVM... if we have that sort of
cross-dependency, then we may as well have a simpler one where VFIO
tells KVM what iommu is available for the VM.

> For vfio in general, the group is the unit of ownership.  A user is
> granted access to /dev/vfio/$GROUP through file permissions.  The user
> opens the group and a container (/dev/vfio/vfio) and calls SET_CONTAINER
> on the group.  If supported by the platform, multiple groups can be set
> to the same container, which allows for iommu domain sharing.  Once a
> group is associated with a container, an iommu backend can be
> initialized for the container.  Only then can a device be accessed
> through the group.
> 
> So even if we were to pass a vfio group file descriptor into kvm and it
> matched as some kind of ownership token on the iommu group, it's not
> clear that's sufficient to assume we can start programming the iommu.
> Thanks,

It seems to me that your scheme would have the same problem if you
wanted to do a virtualized iommu....

In any case, this is a big deal. We have a requirement for pass-through.
It cannot work with any remotely usable performance level if we don't
implement the calls in KVM, so it needs to be sorted one way or another
and I'm at a loss how here...

Ben.

> Alex
> 
> > From the look of it, the VFIO file descriptor is what has the "access
> > control" to the underlying iommu, is this right ? So we somewhat need to
> > transfer (or copy) that ownership from the VFIO fd to the KVM VM.
> > 
> > I don't see a way to do that without some cross-layering here...
> > 
> > Rusty, are you aware of some kernel mechanism we can use for that ?
> > 
> > Cheers,
> > Ben.
> > 
> > 
> 
> 
> 



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-16  4:39   ` Benjamin Herrenschmidt
@ 2013-06-19  3:17     ` Alexey Kardashevskiy
  0 siblings, 0 replies; 69+ messages in thread
From: Alexey Kardashevskiy @ 2013-06-19  3:17 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linuxppc-dev, David Gibson, Alexander Graf, Paul Mackerras, kvm,
	linux-kernel, kvm-ppc

On 06/16/2013 02:39 PM, Benjamin Herrenschmidt wrote:
>>  static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
>> -			unsigned long *pte_sizep)
>> +			unsigned long *pte_sizep, bool do_get_page)
>>  {
>>  	pte_t *ptep;
>>  	unsigned int shift = 0;
>> @@ -135,6 +136,14 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
>>  	if (!pte_present(*ptep))
>>  		return __pte(0);
>>  
>> +	/*
>> +	 * Put huge pages handling to the virtual mode.
>> +	 * The only exception is for TCE list pages which we
>> +	 * do need to call get_page() for.
>> +	 */
>> +	if ((*pte_sizep > PAGE_SIZE) && do_get_page)
>> +		return __pte(0);
>> +
>>  	/* wait until _PAGE_BUSY is clear then set it atomically */
>>  	__asm__ __volatile__ (
>>  		"1:	ldarx	%0,0,%3\n"
>> @@ -148,6 +157,18 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
>>  		: "cc");
>>  
>>  	ret = pte;
>> +	if (do_get_page && pte_present(pte) && (!writing || pte_write(pte))) {
>> +		struct page *pg = NULL;
>> +		pg = realmode_pfn_to_page(pte_pfn(pte));
>> +		if (realmode_get_page(pg)) {
>> +			ret = __pte(0);
>> +		} else {
>> +			pte = pte_mkyoung(pte);
>> +			if (writing)
>> +				pte = pte_mkdirty(pte);
>> +		}
>> +	}
>> +	*ptep = pte;	/* clears _PAGE_BUSY */
>>  
>>  	return ret;
>>  }
> 
> So now you are adding the clearing of _PAGE_BUSY that was missing for
> your first patch, except that this is not enough since that means that
> in the "emulated" case (ie, !do_get_page) you will in essence return
> and then use a PTE that is not locked without any synchronization to
> ensure that the underlying page doesn't go away... then you'll
> dereference that page.
> 
> So either make everything use speculative get_page, or make the emulated
> case use the MMU notifier to drop the operation in case of collision.
> 
> The former looks easier.
> 
> Also, any specific reason why you do:
> 
>   - Lock the PTE
>   - get_page()
>   - Unlock the PTE
> 
> Instead of
> 
>   - Read the PTE
>   - get_page_unless_zero
>   - re-check PTE
> 
> Like get_user_pages_fast() does ?
> 
> The former will be two atomic ops, the latter only one (faster), but
> maybe you have a good reason why that can't work...



If we want to set "dirty" and "young" bits for pte then I do not know how
to avoid _PAGE_BUSY.



-- 
Alexey

^ permalink raw reply	[flat|nested] 69+ messages in thread
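
The lockless pattern referred to above, as used by get_user_pages_fast(), is
roughly the following sketch; note it does not touch the dirty/young bits,
which is exactly why Alexey still wants _PAGE_BUSY:

	pte_t pte = ACCESS_ONCE(*ptep);		/* read, do not lock */
	struct page *page;

	if (!pte_present(pte))
		return __pte(0);
	page = pte_page(pte);
	if (!get_page_unless_zero(page))	/* single atomic op */
		return __pte(0);
	if (pte_val(pte) != pte_val(*ptep)) {	/* re-check the PTE */
		put_page(page);			/* it changed under us */
		return __pte(0);
	}
	return pte;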

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-18  2:32         ` Alex Williamson
  2013-06-18  4:38           ` Benjamin Herrenschmidt
@ 2013-06-19  3:35           ` Rusty Russell
  2013-06-19  4:59             ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 69+ messages in thread
From: Rusty Russell @ 2013-06-19  3:35 UTC (permalink / raw)
  To: Alex Williamson, Benjamin Herrenschmidt
  Cc: Alexey Kardashevskiy, linuxppc-dev, David Gibson, Alexander Graf,
	Paul Mackerras, kvm, linux-kernel, kvm-ppc

Alex Williamson <alex.williamson@redhat.com> writes:
> On Mon, 2013-06-17 at 13:56 +1000, Benjamin Herrenschmidt wrote:
>> On Sun, 2013-06-16 at 21:13 -0600, Alex Williamson wrote:
>> 
>> > IOMMU groups themselves don't provide security, they're accessed by
>> > interfaces like VFIO, which provide the security.  Given a brief look, I
>> > agree, this looks like a possible backdoor.  The typical VFIO way to
>> > handle this would be to pass a VFIO file descriptor here to prove that
>> > the process has access to the IOMMU group.  This is how /dev/vfio/vfio
>> > gains the ability to set up an IOMMU domain and do mappings with the
>> > SET_CONTAINER ioctl using a group fd.  Thanks,
>> 
>> How do you envision that in the kernel ? IE. I'm in KVM code, gets that
>> vfio fd, what do I do with it ?
>> 
>> Basically, KVM needs to know that the user is allowed to use that iommu
>> group. I don't think we want KVM however to call into VFIO directly
>> right ?
>
> Right, we don't want to create dependencies across modules.  I don't
> have a vision for how this should work.  This is effectively a complete
> side-band to vfio, so we're really just dealing in the iommu group
> space.  Maybe there needs to be some kind of registration of ownership
> for the group using some kind of token.  It would need to include some
> kind of notification when that ownership ends.  That might also be a
> convenient tag to toggle driver probing off for devices in the group.
> Other ideas?  Thanks,

It's actually not that bad.

e.g.:

struct vfio_container *vfio_container_from_file(struct file *filp)
{
        if (filp->f_op != &vfio_device_fops)
                return ERR_PTR(-EINVAL);

        /* OK it really is a vfio fd, return the data. */
        ....
}
EXPORT_SYMBOL_GPL(vfio_container_from_file);

...

inside KVM_CREATE_SPAPR_TCE_IOMMU:

        struct file *vfio_filp;
        struct vfio_container *(*lookup)(struct file *filp);

        vfio_filp = fget(create_tce_iommu.fd);
        if (!vfio_filp)
                ret = -EBADF;
        lookup = symbol_get(vfio_container_from_file);
        if (!lookup)
                ret = -EINVAL;
        else {
                container = lookup(vfio_filp);
                if (IS_ERR(container))
                        ret = PTR_ERR(container);
                else
                        ...
                symbol_put(vfio_container_from_file);
        }

symbol_get() won't try to load a module; it'll just fail.  This is what
you want, since they must have vfio in the kernel to get a valid fd...

Hope that helps,
Rusty.
                

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-19  3:35           ` Rusty Russell
@ 2013-06-19  4:59             ` Benjamin Herrenschmidt
  2013-06-19  9:58               ` Alexander Graf
  0 siblings, 1 reply; 69+ messages in thread
From: Benjamin Herrenschmidt @ 2013-06-19  4:59 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, linuxppc-dev, David Gibson, Alexander Graf,
	Paul Mackerras, kvm, linux-kernel, kvm-ppc, Rusty Russell

On Wed, 2013-06-19 at 13:05 +0930, Rusty Russell wrote:
> symbol_get() won't try to load a module; it'll just fail.  This is what
> you want, since they must have vfio in the kernel to get a valid fd...

Ok, cool. I suppose what we want here Alexey is slightly higher level,
something like:

	vfio_validate_iommu_id(file, iommu_id)

Which verifies that the file that was passed in is allowed to use
that iommu_id.

That's a simple and flexible interface (ie, it will work even if we
support multiple iommu IDs in the future for a vfio, for example
for DDW windows etc...), the logic to know about the ID remains
in qemu, this is strictly a validation call.

That way we also don't have to expose the containing vfio struct etc...
just that simple function.

Alex, any objection ?

Do we need to make it a get/put interface instead ?

	vfio_validate_and_use_iommu(file, iommu_id);

	vfio_release_iommu(file, iommu_id);

To ensure that the resource remains owned by the process until KVM
is closed as well ?

Or do we want to register with VFIO with a callback so that VFIO can
call us if it needs us to give it up ?

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-19  4:59             ` Benjamin Herrenschmidt
@ 2013-06-19  9:58               ` Alexander Graf
  2013-06-19 14:50                 ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 69+ messages in thread
From: Alexander Graf @ 2013-06-19  9:58 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alex Williamson, Alexey Kardashevskiy, linuxppc-dev,
	David Gibson, Paul Mackerras, kvm@vger.kernel.org mailing list,
	open list, kvm-ppc, Rusty Russell, Joerg Roedel


On 19.06.2013, at 06:59, Benjamin Herrenschmidt wrote:

> On Wed, 2013-06-19 at 13:05 +0930, Rusty Russell wrote:
>> symbol_get() won't try to load a module; it'll just fail.  This is what
>> you want, since they must have vfio in the kernel to get a valid fd...
> 
> Ok, cool. I suppose what we want here Alexey is slightly higher level,
> something like:
> 
> 	vfio_validate_iommu_id(file, iommu_id)
> 
> Which verifies that the file that was passed in is allowed to use
> that iommu_id.
> 
> That's a simple and flexible interface (ie, it will work even if we
> support multiple iommu IDs in the future for a vfio, for example
> for DDW windows etc...), the logic to know about the ID remains
> in qemu, this is strictly a validation call.
> 
> That way we also don't have to expose the containing vfio struct etc...
> just that simple function.
> 
> Alex, any objection ?

Which Alex? :)

I think validate works, it keeps iteration logic out of the kernel which is a good thing. There still needs to be an interface for getting the iommu id in VFIO, but I suppose that one's for the other Alex and Jörg to comment on.

> 
> Do we need to make it a get/put interface instead ?
> 
> 	vfio_validate_and_use_iommu(file, iommu_id);
> 
> 	vfio_release_iommu(file, iommu_id);
> 
> To ensure that the resource remains owned by the process until KVM
> is closed as well ?
> 
> Or do we want to register with VFIO with a callback so that VFIO can
> call us if it needs us to give it up ?

Can't we just register a handler on the fd and get notified when it closes? Can you kill VFIO access without closing the fd?


Alex


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-19  9:58               ` Alexander Graf
@ 2013-06-19 14:50                 ` Benjamin Herrenschmidt
  2013-06-19 15:49                   ` Alex Williamson
  0 siblings, 1 reply; 69+ messages in thread
From: Benjamin Herrenschmidt @ 2013-06-19 14:50 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Alex Williamson, Alexey Kardashevskiy, linuxppc-dev,
	David Gibson, Paul Mackerras, kvm@vger.kernel.org mailing list,
	open list, kvm-ppc, Rusty Russell, Joerg Roedel

On Wed, 2013-06-19 at 11:58 +0200, Alexander Graf wrote:

> > Alex, any objection ?
> 
> Which Alex? :)

Heh, mostly Williamson in this specific case but your input is still
welcome :-)

> I think validate works, it keeps iteration logic out of the kernel
> which is a good thing. There still needs to be an interface for
> getting the iommu id in VFIO, but I suppose that one's for the other
> Alex and Jörg to comment on.

I think getting the iommu fd is already covered by separate patches from
Alexey.

> > 
> > Do we need to make it a get/put interface instead ?
> > 
> > 	vfio_validate_and_use_iommu(file, iommu_id);
> > 
> > 	vfio_release_iommu(file, iommu_id);
> > 
> > To ensure that the resource remains owned by the process until KVM
> > is closed as well ?
> > 
> > Or do we want to register with VFIO with a callback so that VFIO can
> > call us if it needs us to give it up ?
> 
> Can't we just register a handler on the fd and get notified when it
> closes? Can you kill VFIO access without closing the fd?

That sounds actually harder :-)

The question is basically: when we validate the relationship between a
specific VFIO struct file and an iommu, what is the lifetime of that
and how do we handle this lifetime properly.

There's two ways for that sort of situation: The notification model
where we get notified when the relationship is broken, and the refcount
model where we become a "user" and thus delay the breaking of the
relationship until we have been disposed of as well.

In this specific case, it's hard to tell what is the right model from my
perspective, which is why I would welcome Alex (W.) input.

In the end, the solution will end up being in the form of APIs exposed
by VFIO for use by KVM (via that symbol lookup mechanism) so Alex (W),
as owner of VFIO at this stage, what do you want those to look
like ? :-)

Cheers,
Ben.
 
> 
> Alex
> 



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-19 14:50                 ` Benjamin Herrenschmidt
@ 2013-06-19 15:49                   ` Alex Williamson
  2013-06-20  4:58                     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 69+ messages in thread
From: Alex Williamson @ 2013-06-19 15:49 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alexander Graf, Alexey Kardashevskiy, linuxppc-dev, David Gibson,
	Paul Mackerras, kvm@vger.kernel.org mailing list, open list,
	kvm-ppc, Rusty Russell, Joerg Roedel

On Thu, 2013-06-20 at 00:50 +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2013-06-19 at 11:58 +0200, Alexander Graf wrote:
> 
> > > Alex, any objection ?
> > 
> > Which Alex? :)
> 
> Heh, mostly Williamson in this specific case but your input is still
> welcome :-)
> 
> > I think validate works, it keeps iteration logic out of the kernel
> > which is a good thing. There still needs to be an interface for
> > getting the iommu id in VFIO, but I suppose that one's for the other
> > Alex and Jörg to comment on.
> 
> I think getting the iommu fd is already covered by separate patches from
> Alexey.
> 
> > > 
> > > Do we need to make it a get/put interface instead ?
> > > 
> > > 	vfio_validate_and_use_iommu(file, iommu_id);
> > > 
> > > 	vfio_release_iommu(file, iommu_id);
> > > 
> > > To ensure that the resource remains owned by the process until KVM
> > > is closed as well ?
> > > 
> > > Or do we want to register with VFIO with a callback so that VFIO can
> > > call us if it needs us to give it up ?
> > 
> > Can't we just register a handler on the fd and get notified when it
> > closes? Can you kill VFIO access without closing the fd?
> 
> That sounds actually harder :-)
> 
> The question is basically: when we validate the relationship between a
> specific VFIO struct file and an iommu, what is the lifetime of that
> and how do we handle this lifetime properly.
> 
> There's two ways for that sort of situation: The notification model
> where we get notified when the relationship is broken, and the refcount
> model where we become a "user" and thus delay the breaking of the
> relationship until we have been disposed of as well.
> 
> In this specific case, it's hard to tell what is the right model from my
> perspective, which is why I would welcome Alex (W.) input.
> 
> In the end, the solution will end up being in the form of APIs exposed
> by VFIO for use by KVM (via that symbol lookup mechanism) so Alex (W),
> as owner of VFIO at this stage, what do you want those to look
> like ? :-)

My first thought is that we should use the same reference counting as we
have for vfio devices (group->container_users).  An interface for that
might look like:

int vfio_group_add_external_user(struct file *filep)
{
	struct vfio_group *group = filep->private_data;

	if (filep->f_op != &vfio_group_fops)
		return -EINVAL;


	if (!atomic_inc_not_zero(&group->container_users))
		return -EINVAL;

	return 0;
}

void vfio_group_del_external_user(struct file *filep)
{
	struct vfio_group *group = filep->private_data;

	BUG_ON(filep->f_op != &vfio_group_fops);

	vfio_group_try_dissolve_container(group);
}

int vfio_group_iommu_id_from_file(struct file *filep)
{
	struct vfio_group *group = filep->private_data;

	BUG_ON(filep->f_op != &vfio_group_fops);

	return iommu_group_id(group->iommu_group);
}

Would that work?  Thanks,

Alex


^ permalink raw reply	[flat|nested] 69+ messages in thread
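
A sketch of how the KVM side might consume that interface, reusing the
create_tce_iommu argument from Rusty's earlier sketch; in practice the
vfio_group_* calls would be reached through symbol_get() rather than a
direct link, and error paths are trimmed:

	/* inside the proposed KVM_CREATE_SPAPR_TCE_IOMMU handler */
	struct file *vfio_filp = fget(create_tce_iommu.fd);
	int ret, iommu_id;

	if (!vfio_filp)
		return -EBADF;

	ret = vfio_group_add_external_user(vfio_filp);	/* pin the container */
	if (ret) {
		fput(vfio_filp);
		return ret;
	}

	iommu_id = vfio_group_iommu_id_from_file(vfio_filp);
	/* ... associate the guest TCE table with that IOMMU group ... */

	/* and on KVM teardown: */
	vfio_group_del_external_user(vfio_filp);
	fput(vfio_filp);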

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-19 15:49                   ` Alex Williamson
@ 2013-06-20  4:58                     ` Alexey Kardashevskiy
  2013-06-20  5:28                       ` David Gibson
  0 siblings, 1 reply; 69+ messages in thread
From: Alexey Kardashevskiy @ 2013-06-20  4:58 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Benjamin Herrenschmidt, Alexander Graf, linuxppc-dev,
	David Gibson, Paul Mackerras, kvm@vger.kernel.org mailing list,
	open list, kvm-ppc, Rusty Russell, Joerg Roedel

On 06/20/2013 01:49 AM, Alex Williamson wrote:
> On Thu, 2013-06-20 at 00:50 +1000, Benjamin Herrenschmidt wrote:
>> On Wed, 2013-06-19 at 11:58 +0200, Alexander Graf wrote:
>>
>>>> Alex, any objection ?
>>>
>>> Which Alex? :)
>>
>> Heh, mostly Williamson in this specific case but your input is still
>> welcome :-)
>>
>>> I think validate works, it keeps iteration logic out of the kernel
>>> which is a good thing. There still needs to be an interface for
>>> getting the iommu id in VFIO, but I suppose that one's for the other
>>> Alex and Jörg to comment on.
>>
>> I think getting the iommu fd is already covered by separate patches from
>> Alexey.
>>
>>>>
>>>> Do we need to make it a get/put interface instead ?
>>>>
>>>> 	vfio_validate_and_use_iommu(file, iommu_id);
>>>>
>>>> 	vfio_release_iommu(file, iommu_id);
>>>>
>>>> To ensure that the resource remains owned by the process until KVM
>>>> is closed as well ?
>>>>
>>>> Or do we want to register with VFIO with a callback so that VFIO can
>>>> call us if it needs us to give it up ?
>>>
>>> Can't we just register a handler on the fd and get notified when it
>>> closes? Can you kill VFIO access without closing the fd?
>>
>> That sounds actually harder :-)
>>
>> The question is basically: when we validate the relationship between a
>> specific VFIO struct file and an iommu, what is the lifetime of that
>> and how do we handle this lifetime properly.
>>
>> There's two ways for that sort of situation: The notification model
>> where we get notified when the relationship is broken, and the refcount
>> model where we become a "user" and thus delay the breaking of the
>> relationship until we have been disposed of as well.
>>
>> In this specific case, it's hard to tell what is the right model from my
>> perspective, which is why I would welcome Alex (W.) input.
>>
>> In the end, the solution will end up being in the form of APIs exposed
>> by VFIO for use by KVM (via that symbol lookup mechanism) so Alex (W),
>> as owner of VFIO at this stage, what do you want those to look
>> like ? :-)
> 
> My first thought is that we should use the same reference counting as we
> have for vfio devices (group->container_users).  An interface for that
> might look like:
> 
> int vfio_group_add_external_user(struct file *filep)
> {
> 	struct vfio_group *group = filep->private_data;
> 
> 	if (filep->f_op != &vfio_group_fops)
> 		return -EINVAL;
> 
> 
> 	if (!atomic_inc_not_zero(&group->container_users))
> 		return -EINVAL;
> 
> 	return 0;
> }
> 
> void vfio_group_del_external_user(struct file *filep)
> {
> 	struct vfio_group *group = filep->private_data;
> 
> 	BUG_ON(filep->f_op != &vfio_group_fops);
> 
> 	vfio_group_try_dissolve_container(group);
> }
> 
> int vfio_group_iommu_id_from_file(struct file *filep)
> {
> 	struct vfio_group *group = filep->private_data;
> 
> 	BUG_ON(filep->f_op != &vfio_group_fops);
> 
> 	return iommu_group_id(group->iommu_group);
> }
> 
> Would that work?  Thanks,


Just out of curiosity - would not get_file() and fput_atomic() on a group's
file* do the right job instead of vfio_group_add_external_user() and
vfio_group_del_external_user()?



-- 
Alexey
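
For illustration, the plain file-reference approach being asked about, as a
minimal sketch; note that fput_atomic() appears to have already been removed
from mainline by this point, so plain get_file()/fput() is assumed here:

#include <linux/file.h>

static void kvm_spapr_pin_group(struct file *filep)
{
	get_file(filep);	/* bump f_count: the file cannot be released */
}

static void kvm_spapr_unpin_group(struct file *filep)
{
	fput(filep);		/* drop the reference taken above */
}

As the rest of the thread shows, this only keeps the struct file (and hence
the vfio_group) alive; it does not stop the user from detaching the group
from its container.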

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-20  4:58                     ` Alexey Kardashevskiy
@ 2013-06-20  5:28                       ` David Gibson
  2013-06-20  7:47                         ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 69+ messages in thread
From: David Gibson @ 2013-06-20  5:28 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alex Williamson, Benjamin Herrenschmidt, Alexander Graf,
	linuxppc-dev, Paul Mackerras, kvm@vger.kernel.org mailing list,
	open list, kvm-ppc, Rusty Russell, Joerg Roedel

[-- Attachment #1: Type: text/plain, Size: 3827 bytes --]

On Thu, Jun 20, 2013 at 02:58:18PM +1000, Alexey Kardashevskiy wrote:
> On 06/20/2013 01:49 AM, Alex Williamson wrote:
> > On Thu, 2013-06-20 at 00:50 +1000, Benjamin Herrenschmidt wrote:
> >> On Wed, 2013-06-19 at 11:58 +0200, Alexander Graf wrote:
> >>
> >>>> Alex, any objection ?
> >>>
> >>> Which Alex? :)
> >>
> >> Heh, mostly Williamson in this specific case but your input is still
> >> welcome :-)
> >>
> >>> I think validate works, it keeps iteration logic out of the kernel
> >>> which is a good thing. There still needs to be an interface for
> >>> getting the iommu id in VFIO, but I suppose that one's for the other
> >>> Alex and Jörg to comment on.
> >>
> >> I think getting the iommu fd is already covered by separate patches from
> >> Alexey.
> >>
> >>>>
> >>>> Do we need to make it a get/put interface instead ?
> >>>>
> >>>> 	vfio_validate_and_use_iommu(file, iommu_id);
> >>>>
> >>>> 	vfio_release_iommu(file, iommu_id);
> >>>>
> >>>> To ensure that the resource remains owned by the process until KVM
> >>>> is closed as well ?
> >>>>
> >>>> Or do we want to register with VFIO with a callback so that VFIO can
> >>>> call us if it needs us to give it up ?
> >>>
> >>> Can't we just register a handler on the fd and get notified when it
> >>> closes? Can you kill VFIO access without closing the fd?
> >>
> >> That sounds actually harder :-)
> >>
> >> The question is basically: When we validate that relationship between a
> >> specific VFIO struct file with an iommu, what is the lifetime of that
> >> and how do we handle this lifetime properly.
> >>
> >> There's two ways for that sort of situation: The notification model
> >> where we get notified when the relationship is broken, and the refcount
> >> model where we become a "user" and thus delay the breaking of the
> >> relationship until we have been disposed of as well.
> >>
> >> In this specific case, it's hard to tell what is the right model from my
> >> perspective, which is why I would welcome Alex (W.) input.
> >>
> >> In the end, the solution will end up being in the form of APIs exposed
> >> by VFIO for use by KVM (via that symbol lookup mechanism) so Alex (W),
> >> as owner of VFIO at this stage, what do you want those to look
> >> like ? :-)
> > 
> > My first thought is that we should use the same reference counting as we
> > have for vfio devices (group->container_users).  An interface for that
> > might look like:
> > 
> > int vfio_group_add_external_user(struct file *filep)
> > {
> > 	struct vfio_group *group = filep->private_data;
> > 
> > 	if (filep->f_op != &vfio_group_fops)
> > 		return -EINVAL;
> > 
> > 
> > 	if (!atomic_inc_not_zero(&group->container_users))
> > 		return -EINVAL;
> > 
> > 	return 0;
> > }
> > 
> > void vfio_group_del_external_user(struct file *filep)
> > {
> > 	struct vfio_group *group = filep->private_data;
> > 
> > 	BUG_ON(filep->f_op != &vfio_group_fops);
> > 
> > 	vfio_group_try_dissolve_container(group);
> > }
> > 
> > int vfio_group_iommu_id_from_file(struct file *filep)
> > {
> > 	struct vfio_group *group = filep->private_data;
> > 
> > 	BUG_ON(filep->f_op != &vfio_group_fops);
> > 
> > 	return iommu_group_id(group->iommu_group);
> > }
> > 
> > Would that work?  Thanks,
> 
> 
> Just out of curiosity - would not get_file() and fput_atomic() on a group's
> file* do the right job instead of vfio_group_add_external_user() and
> vfio_group_del_external_user()?

I was thinking that too.  Grabbing a file reference would certainly be
the usual way of handling this sort of thing.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-20  5:28                       ` David Gibson
@ 2013-06-20  7:47                         ` Benjamin Herrenschmidt
  2013-06-20  8:48                           ` Alexey Kardashevskiy
  0 siblings, 1 reply; 69+ messages in thread
From: Benjamin Herrenschmidt @ 2013-06-20  7:47 UTC (permalink / raw)
  To: David Gibson
  Cc: Alexey Kardashevskiy, Alex Williamson, Alexander Graf,
	linuxppc-dev, Paul Mackerras, kvm@vger.kernel.org mailing list,
	open list, kvm-ppc, Rusty Russell, Joerg Roedel

On Thu, 2013-06-20 at 15:28 +1000, David Gibson wrote:
> > Just out of curiosity - would not get_file() and fput_atomic() on a
> group's
> > file* do the right job instead of vfio_group_add_external_user() and
> > vfio_group_del_external_user()?
> 
> I was thinking that too.  Grabbing a file reference would certainly be
> the usual way of handling this sort of thing.

But that wouldn't prevent the group ownership to be returned to
the kernel or another user would it ?

Ben.



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-20  7:47                         ` Benjamin Herrenschmidt
@ 2013-06-20  8:48                           ` Alexey Kardashevskiy
  2013-06-20 14:55                             ` Alex Williamson
  0 siblings, 1 reply; 69+ messages in thread
From: Alexey Kardashevskiy @ 2013-06-20  8:48 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: David Gibson, Alex Williamson, Alexander Graf, linuxppc-dev,
	Paul Mackerras, kvm@vger.kernel.org mailing list, open list,
	kvm-ppc, Rusty Russell, Joerg Roedel

On 06/20/2013 05:47 PM, Benjamin Herrenschmidt wrote:
> On Thu, 2013-06-20 at 15:28 +1000, David Gibson wrote:
>>> Just out of curiosity - would not get_file() and fput_atomic() on a
>> group's
>>> file* do the right job instead of vfio_group_add_external_user() and
>>> vfio_group_del_external_user()?
>>
>> I was thinking that too.  Grabbing a file reference would certainly be
>> the usual way of handling this sort of thing.
> 
> But that wouldn't prevent the group ownership to be returned to
> the kernel or another user would it ?


Holding the file pointer does not let the group->container_users counter go
to zero and this is exactly what vfio_group_add_external_user() and
vfio_group_del_external_user() do. The difference is only in absolute value
- 2 vs. 3.

No change in behaviour whether I use new vfio API or simply hold file* till
KVM closes fd created when IOMMU was connected to LIOBN.

And while this counter is not zero, QEMU cannot take ownership over the group.

I am definitely still missing the bigger picture...


-- 
Alexey

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-20  8:48                           ` Alexey Kardashevskiy
@ 2013-06-20 14:55                             ` Alex Williamson
  2013-06-22  8:25                               ` Alexey Kardashevskiy
  2013-06-22 12:03                               ` David Gibson
  0 siblings, 2 replies; 69+ messages in thread
From: Alex Williamson @ 2013-06-20 14:55 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Benjamin Herrenschmidt, David Gibson, Alexander Graf,
	linuxppc-dev, Paul Mackerras, kvm@vger.kernel.org mailing list,
	open list, kvm-ppc, Rusty Russell, Joerg Roedel

On Thu, 2013-06-20 at 18:48 +1000, Alexey Kardashevskiy wrote:
> On 06/20/2013 05:47 PM, Benjamin Herrenschmidt wrote:
> > On Thu, 2013-06-20 at 15:28 +1000, David Gibson wrote:
> >>> Just out of curiosity - would not get_file() and fput_atomic() on a
> >> group's
> >>> file* do the right job instead of vfio_group_add_external_user() and
> >>> vfio_group_del_external_user()?
> >>
> >> I was thinking that too.  Grabbing a file reference would certainly be
> >> the usual way of handling this sort of thing.
> > 
> > But that wouldn't prevent the group ownership to be returned to
> > the kernel or another user would it ?
> 
> 
> Holding the file pointer does not let the group->container_users counter go
> to zero

How so?  Holding the file pointer means the file won't go away, which
means the group release function won't be called.  That means the group
won't go away, but that doesn't mean it's attached to an IOMMU.  A user
could call UNSET_CONTAINER.

>  and this is exactly what vfio_group_add_external_user() and
> vfio_group_del_external_user() do. The difference is only in absolute value
> - 2 vs. 3.
> 
> No change in behaviour whether I use new vfio API or simply hold file* till
> KVM closes fd created when IOMMU was connected to LIOBN.

By that notion you could open(/dev/vfio/$GROUP) and you're safe, right?
But what about SET_CONTAINER & SET_IOMMU?  All that you guarantee
holding the file pointer is that the vfio_group exists.

> And while this counter is not zero, QEMU cannot take ownership over the group.
>
> I am definitely still missing the bigger picture...

The bigger picture is that the group needs to exist AND it needs to be
setup and maintained to have IOMMU protection.  Actually, my first stab
at add_external_user doesn't look sufficient, it needs to look more like
vfio_group_get_device_fd, checking group->container->iommu and
group_viable().  As written it would allow an external user after
SET_CONTAINER without SET_IOMMU.  It should also be part of the API that
the external user must hold the file reference between add_external_user
and del_external_user and do cleanup on any exit paths.  Thanks,

Alex
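
For illustration, a sketch of the strengthened helper described above, with
the container and viability checks folded in; container->iommu and
group_viable() are named as in the thread and are assumptions about the
vfio internals, not verified code:

int vfio_group_add_external_user(struct file *filep)
{
	struct vfio_group *group = filep->private_data;

	if (filep->f_op != &vfio_group_fops)
		return -EINVAL;

	if (!atomic_inc_not_zero(&group->container_users))
		return -EINVAL;

	/* Reject external users after SET_CONTAINER but before
	 * SET_IOMMU: a bare container grants no iommu protection. */
	if (!group->container->iommu || !group_viable(group)) {
		atomic_dec(&group->container_users);
		return -EINVAL;
	}

	return 0;
}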


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-20 14:55                             ` Alex Williamson
@ 2013-06-22  8:25                               ` Alexey Kardashevskiy
  2013-06-22 12:03                               ` David Gibson
  1 sibling, 0 replies; 69+ messages in thread
From: Alexey Kardashevskiy @ 2013-06-22  8:25 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Benjamin Herrenschmidt, David Gibson, Alexander Graf,
	linuxppc-dev, Paul Mackerras, kvm@vger.kernel.org mailing list,
	open list, kvm-ppc, Rusty Russell, Joerg Roedel

On 06/21/2013 12:55 AM, Alex Williamson wrote:
> On Thu, 2013-06-20 at 18:48 +1000, Alexey Kardashevskiy wrote:
>> On 06/20/2013 05:47 PM, Benjamin Herrenschmidt wrote:
>>> On Thu, 2013-06-20 at 15:28 +1000, David Gibson wrote:
>>>>> Just out of curiosity - would not get_file() and fput_atomic() on a
>>>> group's
>>>>> file* do the right job instead of vfio_group_add_external_user() and
>>>>> vfio_group_del_external_user()?
>>>>
>>>> I was thinking that too.  Grabbing a file reference would certainly be
>>>> the usual way of handling this sort of thing.
>>>
>>> But that wouldn't prevent the group ownership to be returned to
>>> the kernel or another user would it ?
>>
>>
>> Holding the file pointer does not let the group->container_users counter go
>> to zero
> 
> How so?  Holding the file pointer means the file won't go away, which
> means the group release function won't be called.  That means the group
> won't go away, but that doesn't mean it's attached to an IOMMU.  A user
> could call UNSET_CONTAINER.
> 
>>  and this is exactly what vfio_group_add_external_user() and
>> vfio_group_del_external_user() do. The difference is only in absolute value
>> - 2 vs. 3.
>>
>> No change in behaviour whether I use new vfio API or simply hold file* till
>> KVM closes fd created when IOMMU was connected to LIOBN.
> 
> By that notion you could open(/dev/vfio/$GROUP) and you're safe, right?
> But what about SET_CONTAINER & SET_IOMMU?  All that you guarantee
> holding the file pointer is that the vfio_group exists.
> 
>> And while this counter is not zero, QEMU cannot take ownership over the group.
>>
>> I am definitely still missing the bigger picture...
> 
> The bigger picture is that the group needs to exist AND it needs to be
> setup and maintained to have IOMMU protection.  Actually, my first stab
> at add_external_user doesn't look sufficient, it needs to look more like
> vfio_group_get_device_fd, checking group->container->iommu and
> group_viable().


This makes sense. If you did this, that would be great. Without it, I
really cannot see how the proposed inc/dec of container_users is better
than simply holding the file*. Thanks.


> As written it would allow an external user after
> SET_CONTAINER without SET_IOMMU.  It should also be part of the API that
> the external user must hold the file reference between add_external_user
> and del_external_user and do cleanup on any exit paths.  Thanks,


-- 
Alexey

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-20 14:55                             ` Alex Williamson
  2013-06-22  8:25                               ` Alexey Kardashevskiy
@ 2013-06-22 12:03                               ` David Gibson
  2013-06-22 14:28                                 ` Alex Williamson
  2013-06-22 23:28                                 ` Benjamin Herrenschmidt
  1 sibling, 2 replies; 69+ messages in thread
From: David Gibson @ 2013-06-22 12:03 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Alexander Graf,
	linuxppc-dev, Paul Mackerras, kvm@vger.kernel.org mailing list,
	open list, kvm-ppc, Rusty Russell, Joerg Roedel

[-- Attachment #1: Type: text/plain, Size: 1950 bytes --]

On Thu, Jun 20, 2013 at 08:55:13AM -0600, Alex Williamson wrote:
> On Thu, 2013-06-20 at 18:48 +1000, Alexey Kardashevskiy wrote:
> > On 06/20/2013 05:47 PM, Benjamin Herrenschmidt wrote:
> > > On Thu, 2013-06-20 at 15:28 +1000, David Gibson wrote:
> > >>> Just out of curiosity - would not get_file() and fput_atomic() on a
> > >> group's
> > >>> file* do the right job instead of vfio_group_add_external_user() and
> > >>> vfio_group_del_external_user()?
> > >>
> > >> I was thinking that too.  Grabbing a file reference would certainly be
> > >> the usual way of handling this sort of thing.
> > > 
> > > But that wouldn't prevent the group ownership to be returned to
> > > the kernel or another user would it ?
> > 
> > 
> > Holding the file pointer does not let the group->container_users counter go
> > to zero
> 
> How so?  Holding the file pointer means the file won't go away, which
> means the group release function won't be called.  That means the group
> won't go away, but that doesn't mean it's attached to an IOMMU.  A user
> could call UNSET_CONTAINER.

Uhh... *thinks*.  Ah, I see.

I think the interface should not take the group fd, but the container
fd.  Holding a reference to *that* would keep the necessary things
around.  But more to the point, it's the right thing semantically:

The container is essentially the handle on a host iommu address space,
and so that's what should be bound by the KVM call to a particular
guest iommu address space.  e.g. it would make no sense to bind two
different groups to different guest iommu address spaces, if they were
in the same container - the guest thinks they are different spaces,
but if they're in the same container they must be the same space.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-22 12:03                               ` David Gibson
@ 2013-06-22 14:28                                 ` Alex Williamson
  2013-06-24  3:52                                   ` David Gibson
  2013-06-22 23:28                                 ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 69+ messages in thread
From: Alex Williamson @ 2013-06-22 14:28 UTC (permalink / raw)
  To: David Gibson
  Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Alexander Graf,
	linuxppc-dev, Paul Mackerras, kvm@vger.kernel.org mailing list,
	open list, kvm-ppc, Rusty Russell, Joerg Roedel

On Sat, 2013-06-22 at 22:03 +1000, David Gibson wrote:
> On Thu, Jun 20, 2013 at 08:55:13AM -0600, Alex Williamson wrote:
> > On Thu, 2013-06-20 at 18:48 +1000, Alexey Kardashevskiy wrote:
> > > On 06/20/2013 05:47 PM, Benjamin Herrenschmidt wrote:
> > > > On Thu, 2013-06-20 at 15:28 +1000, David Gibson wrote:
> > > >>> Just out of curiosity - would not get_file() and fput_atomic() on a
> > > >> group's
> > > >>> file* do the right job instead of vfio_group_add_external_user() and
> > > >>> vfio_group_del_external_user()?
> > > >>
> > > >> I was thinking that too.  Grabbing a file reference would certainly be
> > > >> the usual way of handling this sort of thing.
> > > > 
> > > > But that wouldn't prevent the group ownership to be returned to
> > > > the kernel or another user would it ?
> > > 
> > > 
> > > Holding the file pointer does not let the group->container_users counter go
> > > to zero
> > 
> > How so?  Holding the file pointer means the file won't go away, which
> > means the group release function won't be called.  That means the group
> > won't go away, but that doesn't mean it's attached to an IOMMU.  A user
> > could call UNSET_CONTAINER.
> 
> Uhh... *thinks*.  Ah, I see.
> 
> I think the interface should not take the group fd, but the container
> fd.  Holding a reference to *that* would keep the necessary things
> around.  But more to the point, it's the right thing semantically:
> 
> The container is essentially the handle on a host iommu address space,
> and so that's what should be bound by the KVM call to a particular
> guest iommu address space.  e.g. it would make no sense to bind two
> different groups to different guest iommu address spaces, if they were
> in the same container - the guest thinks they are different spaces,
> but if they're in the same container they must be the same space.

While the container is the gateway to the iommu, what empowers the
container to maintain an iommu is the group.  What happens to a
container when all the groups are disconnected or closed?  Groups are
the unit that indicates hardware access, not containers.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-22 12:03                               ` David Gibson
  2013-06-22 14:28                                 ` Alex Williamson
@ 2013-06-22 23:28                                 ` Benjamin Herrenschmidt
  2013-06-24  3:54                                   ` David Gibson
  1 sibling, 1 reply; 69+ messages in thread
From: Benjamin Herrenschmidt @ 2013-06-22 23:28 UTC (permalink / raw)
  To: David Gibson
  Cc: Alex Williamson, Alexey Kardashevskiy, Alexander Graf,
	linuxppc-dev, Paul Mackerras, kvm@vger.kernel.org mailing list,
	open list, kvm-ppc, Rusty Russell, Joerg Roedel

On Sat, 2013-06-22 at 22:03 +1000, David Gibson wrote:
> I think the interface should not take the group fd, but the container
> fd.  Holding a reference to *that* would keep the necessary things
> around.  But more to the point, it's the right thing semantically:
> 
> The container is essentially the handle on a host iommu address space,
> and so that's what should be bound by the KVM call to a particular
> guest iommu address space.  e.g. it would make no sense to bind two
> different groups to different guest iommu address spaces, if they were
> in the same container - the guest thinks they are different spaces,
> but if they're in the same container they must be the same space.

Interestingly, how are we going to extend that when/if we implement
DDW ?

DDW means an API by which the guest can request the creation of
additional iommus for a given device (typically, in addition to the
default smallish 32-bit one using 4k pages, the guest can request
a larger window in 64-bit space using a larger page size).

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-22 14:28                                 ` Alex Williamson
@ 2013-06-24  3:52                                   ` David Gibson
  2013-06-24  4:41                                     ` Alex Williamson
  0 siblings, 1 reply; 69+ messages in thread
From: David Gibson @ 2013-06-24  3:52 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Alexander Graf,
	linuxppc-dev, Paul Mackerras, kvm@vger.kernel.org mailing list,
	open list, kvm-ppc, Rusty Russell, Joerg Roedel

[-- Attachment #1: Type: text/plain, Size: 2795 bytes --]

On Sat, Jun 22, 2013 at 08:28:06AM -0600, Alex Williamson wrote:
> On Sat, 2013-06-22 at 22:03 +1000, David Gibson wrote:
> > On Thu, Jun 20, 2013 at 08:55:13AM -0600, Alex Williamson wrote:
> > > On Thu, 2013-06-20 at 18:48 +1000, Alexey Kardashevskiy wrote:
> > > > On 06/20/2013 05:47 PM, Benjamin Herrenschmidt wrote:
> > > > > On Thu, 2013-06-20 at 15:28 +1000, David Gibson wrote:
> > > > >>> Just out of curiosity - would not get_file() and fput_atomic() on a
> > > > >> group's
> > > > >>> file* do the right job instead of vfio_group_add_external_user() and
> > > > >>> vfio_group_del_external_user()?
> > > > >>
> > > > >> I was thinking that too.  Grabbing a file reference would certainly be
> > > > >> the usual way of handling this sort of thing.
> > > > > 
> > > > > But that wouldn't prevent the group ownership to be returned to
> > > > > the kernel or another user would it ?
> > > > 
> > > > 
> > > > Holding the file pointer does not let the group->container_users counter go
> > > > to zero
> > > 
> > > How so?  Holding the file pointer means the file won't go away, which
> > > means the group release function won't be called.  That means the group
> > > won't go away, but that doesn't mean it's attached to an IOMMU.  A user
> > > could call UNSET_CONTAINER.
> > 
> > Uhh... *thinks*.  Ah, I see.
> > 
> > I think the interface should not take the group fd, but the container
> > fd.  Holding a reference to *that* would keep the necessary things
> > around.  But more to the point, it's the right thing semantically:
> > 
> > The container is essentially the handle on a host iommu address space,
> > and so that's what should be bound by the KVM call to a particular
> > guest iommu address space.  e.g. it would make no sense to bind two
> > different groups to different guest iommu address spaces, if they were
> > in the same container - the guest thinks they are different spaces,
> > but if they're in the same container they must be the same space.
> 
> While the container is the gateway to the iommu, what empowers the
> container to maintain an iommu is the group.  What happens to a
> container when all the groups are disconnected or closed?  Groups are
> the unit that indicates hardware access, not containers.  Thanks,

Uh... huh?  I'm really not sure what you're getting at.

The operation we're doing for KVM here is binding a guest iommu
address space to a particular host iommu address space.  Why would we
not want to use the obvious handle on the host iommu address space,
which is the container fd?

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-22 23:28                                 ` Benjamin Herrenschmidt
@ 2013-06-24  3:54                                   ` David Gibson
  2013-06-24  3:58                                     ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 69+ messages in thread
From: David Gibson @ 2013-06-24  3:54 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alex Williamson, Alexey Kardashevskiy, Alexander Graf,
	linuxppc-dev, Paul Mackerras, kvm@vger.kernel.org mailing list,
	open list, kvm-ppc, Rusty Russell, Joerg Roedel

[-- Attachment #1: Type: text/plain, Size: 1457 bytes --]

On Sun, Jun 23, 2013 at 09:28:13AM +1000, Benjamin Herrenschmidt wrote:
> On Sat, 2013-06-22 at 22:03 +1000, David Gibson wrote:
> > I think the interface should not take the group fd, but the container
> > fd.  Holding a reference to *that* would keep the necessary things
> > around.  But more to the point, it's the right thing semantically:
> > 
> > The container is essentially the handle on a host iommu address space,
> > and so that's what should be bound by the KVM call to a particular
> > guest iommu address space.  e.g. it would make no sense to bind two
> > different groups to different guest iommu address spaces, if they were
> > in the same container - the guest thinks they are different spaces,
> > but if they're in the same container they must be the same space.
> 
> Interestingly, how are we going to extend that when/if we implement
> DDW ?
> 
> DDW means an API by which the guest can request the creation of
> additional iommus for a given device (typically, in addition to the
> default smallish 32-bit one using 4k pages, the guest can request
> a larger window in 64-bit space using a larger page size).

So, would a PAPR guest requesting this expect the new window to have
a new liobn, or an existing liobn?

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-24  3:54                                   ` David Gibson
@ 2013-06-24  3:58                                     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 69+ messages in thread
From: Benjamin Herrenschmidt @ 2013-06-24  3:58 UTC (permalink / raw)
  To: David Gibson
  Cc: Alex Williamson, Alexey Kardashevskiy, Alexander Graf,
	linuxppc-dev, Paul Mackerras, kvm@vger.kernel.org mailing list,
	open list, kvm-ppc, Rusty Russell, Joerg Roedel

On Mon, 2013-06-24 at 13:54 +1000, David Gibson wrote:
> > DDW means an API by which the guest can request the creation of
> > additional iommus for a given device (typically, in addition to the
> > default smallish 32-bit one using 4k pages, the guest can request
> > a larger window in 64-bit space using a larger page size).
> 
> So, would a PAPR guest requesting this expect the new window to have
> a new liobn, or an existing liobn?

A new liobn, or there is no way to H_PUT_TCE it (it exists in addition
to the legacy window).

Cheers,
Ben.
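
To make the liobn point concrete, a guest-side sketch of the hcall (PAPR
argument order, using the pseries plpar_hcall_norets() wrapper):

/* Every TCE update names the window by liobn, so a second (DDW)
 * window is unreachable without a liobn of its own. */
static long put_tce(unsigned long liobn, unsigned long ioba,
		    unsigned long tce)
{
	return plpar_hcall_norets(H_PUT_TCE, liobn, ioba, tce);
}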



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-24  3:52                                   ` David Gibson
@ 2013-06-24  4:41                                     ` Alex Williamson
  2013-06-27 11:01                                       ` David Gibson
  0 siblings, 1 reply; 69+ messages in thread
From: Alex Williamson @ 2013-06-24  4:41 UTC (permalink / raw)
  To: David Gibson
  Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Alexander Graf,
	linuxppc-dev, Paul Mackerras, kvm@vger.kernel.org mailing list,
	open list, kvm-ppc, Rusty Russell, Joerg Roedel

On Mon, 2013-06-24 at 13:52 +1000, David Gibson wrote:
> On Sat, Jun 22, 2013 at 08:28:06AM -0600, Alex Williamson wrote:
> > On Sat, 2013-06-22 at 22:03 +1000, David Gibson wrote:
> > > On Thu, Jun 20, 2013 at 08:55:13AM -0600, Alex Williamson wrote:
> > > > On Thu, 2013-06-20 at 18:48 +1000, Alexey Kardashevskiy wrote:
> > > > > On 06/20/2013 05:47 PM, Benjamin Herrenschmidt wrote:
> > > > > > On Thu, 2013-06-20 at 15:28 +1000, David Gibson wrote:
> > > > > >>> Just out of curiosity - would not get_file() and fput_atomic() on a
> > > > > >> group's
> > > > > >>> file* do the right job instead of vfio_group_add_external_user() and
> > > > > >>> vfio_group_del_external_user()?
> > > > > >>
> > > > > >> I was thinking that too.  Grabbing a file reference would certainly be
> > > > > >> the usual way of handling this sort of thing.
> > > > > > 
> > > > > > But that wouldn't prevent the group ownership to be returned to
> > > > > > the kernel or another user would it ?
> > > > > 
> > > > > 
> > > > > Holding the file pointer does not let the group->container_users counter go
> > > > > to zero
> > > > 
> > > > How so?  Holding the file pointer means the file won't go away, which
> > > > means the group release function won't be called.  That means the group
> > > > won't go away, but that doesn't mean it's attached to an IOMMU.  A user
> > > > could call UNSET_CONTAINER.
> > > 
> > > Uhh... *thinks*.  Ah, I see.
> > > 
> > > I think the interface should not take the group fd, but the container
> > > fd.  Holding a reference to *that* would keep the necessary things
> > > around.  But more to the point, it's the right thing semantically:
> > > 
> > > The container is essentially the handle on a host iommu address space,
> > > and so that's what should be bound by the KVM call to a particular
> > > guest iommu address space.  e.g. it would make no sense to bind two
> > > different groups to different guest iommu address spaces, if they were
> > > in the same container - the guest thinks they are different spaces,
> > > but if they're in the same container they must be the same space.
> > 
> > While the container is the gateway to the iommu, what empowers the
> > container to maintain an iommu is the group.  What happens to a
> > container when all the groups are disconnected or closed?  Groups are
> > the unit that indicates hardware access, not containers.  Thanks,
> 
> Uh... huh?  I'm really not sure what you're getting at.
> 
> The operation we're doing for KVM here is binding a guest iommu
> address space to a particular host iommu address space.  Why would we
> not want to use the obvious handle on the host iommu address space,
> which is the container fd?

AIUI, the request isn't for an interface through which to do iommu
mappings.  The request is for an interface to show that the user has
sufficient privileges to do mappings.  Groups are what gives the user
that ability.  The iommu is also possibly associated with multiple iommu
groups and I believe what is being asked for here is a way to hold and
lock a single iommu group with iommu protection.

From a practical point of view, the iommu interface is de-privileged
once the groups are disconnected or closed.  Holding a reference count
on the iommu fd won't prevent that.  That means we'd have to use a
notifier to have KVM stop the side-channel iommu access.  Meanwhile
holding the file descriptor for the group and adding an interface that
bumps use counter allows KVM to lock itself in, just as if it had a
device opened itself.  Thanks,

Alex
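
For illustration, the KVM-side pattern this suggests: pin the group's
struct file and take the external-user reference through the symbol-lookup
mechanism mentioned earlier in the thread. A sketch under those assumptions,
not the posted code:

#include <linux/file.h>
#include <linux/module.h>

extern int vfio_group_add_external_user(struct file *filep);

static int kvm_spapr_lock_group(int group_fd, struct file **filepp)
{
	int (*fn)(struct file *);
	struct file *filep;
	int ret;

	filep = fget(group_fd);		/* pin the struct file */
	if (!filep)
		return -EBADF;

	fn = symbol_get(vfio_group_add_external_user);
	if (!fn) {
		fput(filep);
		return -ENOENT;		/* vfio not available */
	}
	ret = fn(filep);		/* become an external user */
	symbol_put(vfio_group_add_external_user);
	if (ret) {
		fput(filep);
		return ret;
	}

	*filepp = filep;	/* hold both references until KVM is done */
	return 0;
}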


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-06-24  4:41                                     ` Alex Williamson
@ 2013-06-27 11:01                                       ` David Gibson
  0 siblings, 0 replies; 69+ messages in thread
From: David Gibson @ 2013-06-27 11:01 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Alexander Graf,
	linuxppc-dev, Paul Mackerras, kvm@vger.kernel.org mailing list,
	open list, kvm-ppc, Rusty Russell, Joerg Roedel

[-- Attachment #1: Type: text/plain, Size: 4030 bytes --]

On Sun, Jun 23, 2013 at 10:41:24PM -0600, Alex Williamson wrote:
> On Mon, 2013-06-24 at 13:52 +1000, David Gibson wrote:
> > On Sat, Jun 22, 2013 at 08:28:06AM -0600, Alex Williamson wrote:
> > > On Sat, 2013-06-22 at 22:03 +1000, David Gibson wrote:
> > > > On Thu, Jun 20, 2013 at 08:55:13AM -0600, Alex Williamson wrote:
> > > > > On Thu, 2013-06-20 at 18:48 +1000, Alexey Kardashevskiy wrote:
> > > > > > On 06/20/2013 05:47 PM, Benjamin Herrenschmidt wrote:
> > > > > > > On Thu, 2013-06-20 at 15:28 +1000, David Gibson wrote:
> > > > > > >>> Just out of curiosity - would not get_file() and fput_atomic() on a
> > > > > > >> group's
> > > > > > >>> file* do the right job instead of vfio_group_add_external_user() and
> > > > > > >>> vfio_group_del_external_user()?
> > > > > > >>
> > > > > > >> I was thinking that too.  Grabbing a file reference would certainly be
> > > > > > >> the usual way of handling this sort of thing.
> > > > > > > 
> > > > > > > But that wouldn't prevent the group ownership to be returned to
> > > > > > > the kernel or another user would it ?
> > > > > > 
> > > > > > 
> > > > > > Holding the file pointer does not let the group->container_users counter go
> > > > > > to zero
> > > > > 
> > > > > How so?  Holding the file pointer means the file won't go away, which
> > > > > means the group release function won't be called.  That means the group
> > > > > won't go away, but that doesn't mean it's attached to an IOMMU.  A user
> > > > > could call UNSET_CONTAINER.
> > > > 
> > > > Uhh... *thinks*.  Ah, I see.
> > > > 
> > > > I think the interface should not take the group fd, but the container
> > > > fd.  Holding a reference to *that* would keep the necessary things
> > > > around.  But more to the point, it's the right thing semantically:
> > > > 
> > > > The container is essentially the handle on a host iommu address space,
> > > > and so that's what should be bound by the KVM call to a particular
> > > > guest iommu address space.  e.g. it would make no sense to bind two
> > > > different groups to different guest iommu address spaces, if they were
> > > > in the same container - the guest thinks they are different spaces,
> > > > but if they're in the same container they must be the same space.
> > > 
> > > While the container is the gateway to the iommu, what empowers the
> > > container to maintain an iommu is the group.  What happens to a
> > > container when all the groups are disconnected or closed?  Groups are
> > > the unit that indicates hardware access, not containers.  Thanks,
> > 
> > Uh... huh?  I'm really not sure what you're getting at.
> > 
> > The operation we're doing for KVM here is binding a guest iommu
> > address space to a particular host iommu address space.  Why would we
> > not want to use the obvious handle on the host iommu address space,
> > which is the container fd?
> 
> AIUI, the request isn't for an interface through which to do iommu
> mappings.  The request is for an interface to show that the user has
> sufficient privileges to do mappings.  Groups are what gives the user
> that ability.  The iommu is also possibly associated with multiple iommu
> groups and I believe what is being asked for here is a way to hold and
> lock a single iommu group with iommu protection.
> 
> From a practical point of view, the iommu interface is de-privileged
> once the groups are disconnected or closed.  Holding a reference count
> on the iommu fd won't prevent that.  That means we'd have to use a
> notifier to have KVM stop the side-channel iommu access.  Meanwhile
> holding the file descriptor for the group and adding an interface that
> bumps use counter allows KVM to lock itself in, just as if it had a
> device opened itself.  Thanks,

Ah, good point.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-05-29 23:29                       ` Alexey Kardashevskiy
@ 2013-05-29 23:32                         ` Scott Wood
  0 siblings, 0 replies; 69+ messages in thread
From: Scott Wood @ 2013-05-29 23:32 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: David Gibson, kvm, linux-kernel, kvm-ppc, Alexander Graf,
	Paul Mackerras, linuxppc-dev

On 05/29/2013 06:29:13 PM, Alexey Kardashevskiy wrote:
> On 05/30/2013 09:14 AM, Scott Wood wrote:
> > On 05/29/2013 06:10:33 PM, Alexey Kardashevskiy wrote:
> >> On 05/30/2013 06:05 AM, Scott Wood wrote:
> >> > But you didn't put it in the same section as  
> KVM_CREATE_SPAPR_TCE.  0xe0
> >> > begins a different section.
> >>
> >> It is not really obvious that there are sections as no comment  
> defines
> >> those :)
> >
> > There is a comment /* ioctls for fds returned by KVM_CREATE_DEVICE  
> */
> >
> > Putting KVM_CREATE_DEVICE in there was mainly to avoid dealing with  
> the
> > ioctl number conflict mess in the vm-ioctl section, but at least  
> that one
> > is related to the device control API. :-)
> >
> >> But yes, makes sense to move it up a bit and change the code to  
> 0xad.
> >
> > 0xad is KVM_KVMCLOCK_CTRL
> 
> That's it. I am _completely_ confused now. No system whatsoever :(
> What rule should I use in order to choose the number for my new  
> ioctl? :)

Yeah, it's a mess.  0xaf seems to be free. :-)

-Scott
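
Purely for illustration, the define moved into the vm-ioctl section with
the number suggested here; whether 0xaf is what finally went in is not
settled in this thread:

/* ioctl for SPAPR TCE IOMMU (a vm ioctl) */
#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO, 0xaf, \
					struct kvm_create_spapr_tce_iommu)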

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-05-29 23:14                     ` Scott Wood
@ 2013-05-29 23:29                       ` Alexey Kardashevskiy
  2013-05-29 23:32                         ` Scott Wood
  0 siblings, 1 reply; 69+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-29 23:29 UTC (permalink / raw)
  To: Scott Wood
  Cc: David Gibson, kvm, linux-kernel, kvm-ppc, Alexander Graf,
	Paul Mackerras, linuxppc-dev

On 05/30/2013 09:14 AM, Scott Wood wrote:
> On 05/29/2013 06:10:33 PM, Alexey Kardashevskiy wrote:
>> On 05/30/2013 06:05 AM, Scott Wood wrote:
>> > On 05/28/2013 07:12:32 PM, Alexey Kardashevskiy wrote:
>> >> On 05/29/2013 09:35 AM, Scott Wood wrote:
>> >> > On 05/28/2013 06:30:40 PM, Alexey Kardashevskiy wrote:
>> >> >> >> >>> @@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
>> >> >> >> >>> #define KVM_GET_DEVICE_ATTR      _IOW(KVMIO,  0xe2, struct
>> >> >> >> >>> kvm_device_attr)
>> >> >> >> >>> #define KVM_HAS_DEVICE_ATTR      _IOW(KVMIO,  0xe3, struct
>> >> >> >> >>> kvm_device_attr)
>> >> >> >> >>>
>> >> >> >> >>> +/* ioctl for SPAPR TCE IOMMU */
>> >> >> >> >>> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xe4, struct
>> >> >> >> >>> kvm_create_spapr_tce_iommu)
>> >> >> >> >>
>> >> >> >> >> Shouldn't this go under the vm ioctl section?
>> >> >> >>
>> >> >> >>
>> >> >> >> The KVM_CREATE_SPAPR_TCE_IOMMU ioctl (the version for emulated
>> >> >> devices) is
>> >> >> >> in this section so I decided to keep them together. Wrong?
>> >> >> >
>> >> >> > You decided to keep KVM_CREATE_SPAPR_TCE_IOMMU together with
>> >> >> > KVM_CREATE_SPAPR_TCE_IOMMU?
>> >> >>
>> >> >> Yes.
>> >> >
>> >> > Sigh.  That's the same thing repeated.  There's only one IOCTL.
>> >> Nothing is
>> >> > being "kept together".
>> >>
>> >> Sorry, I meant this ioctl - KVM_CREATE_SPAPR_TCE.
>> >
>> > But you didn't put it in the same section as KVM_CREATE_SPAPR_TCE.  0xe0
>> > begins a different section.
>>
>> It is not really obvious that there are sections as no comment defines
>> those :)
> 
> There is a comment /* ioctls for fds returned by KVM_CREATE_DEVICE */
> 
> Putting KVM_CREATE_DEVICE in there was mainly to avoid dealing with the
> ioctl number conflict mess in the vm-ioctl section, but at least that one
> is related to the device control API. :-)
> 
>> But yes, makes sense to move it up a bit and change the code to 0xad.
> 
> 0xad is KVM_KVMCLOCK_CTRL

That's it. I am _completely_ confused now. No system whatsoever :(
What rule should I use in order to choose the number for my new ioctl? :)



-- 
Alexey

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-05-29 23:10                   ` Alexey Kardashevskiy
@ 2013-05-29 23:14                     ` Scott Wood
  2013-05-29 23:29                       ` Alexey Kardashevskiy
  0 siblings, 1 reply; 69+ messages in thread
From: Scott Wood @ 2013-05-29 23:14 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: David Gibson, kvm, linux-kernel, kvm-ppc, Alexander Graf,
	Paul Mackerras, linuxppc-dev

On 05/29/2013 06:10:33 PM, Alexey Kardashevskiy wrote:
> On 05/30/2013 06:05 AM, Scott Wood wrote:
> > On 05/28/2013 07:12:32 PM, Alexey Kardashevskiy wrote:
> >> On 05/29/2013 09:35 AM, Scott Wood wrote:
> >> > On 05/28/2013 06:30:40 PM, Alexey Kardashevskiy wrote:
> >> >> >> >>> @@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
> >> >> >> >>> #define KVM_GET_DEVICE_ATTR      _IOW(KVMIO,  0xe2,  
> struct
> >> >> >> >>> kvm_device_attr)
> >> >> >> >>> #define KVM_HAS_DEVICE_ATTR      _IOW(KVMIO,  0xe3,  
> struct
> >> >> >> >>> kvm_device_attr)
> >> >> >> >>>
> >> >> >> >>> +/* ioctl for SPAPR TCE IOMMU */
> >> >> >> >>> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xe4,  
> struct
> >> >> >> >>> kvm_create_spapr_tce_iommu)
> >> >> >> >>
> >> >> >> >> Shouldn't this go under the vm ioctl section?
> >> >> >>
> >> >> >>
> >> >> >> The KVM_CREATE_SPAPR_TCE_IOMMU ioctl (the version for  
> emulated
> >> >> devices) is
> >> >> >> in this section so I decided to keep them together. Wrong?
> >> >> >
> >> >> > You decided to keep KVM_CREATE_SPAPR_TCE_IOMMU together with
> >> >> > KVM_CREATE_SPAPR_TCE_IOMMU?
> >> >>
> >> >> Yes.
> >> >
> >> > Sigh.  That's the same thing repeated.  There's only one IOCTL.
> >> Nothing is
> >> > being "kept together".
> >>
> >> Sorry, I meant this ioctl - KVM_CREATE_SPAPR_TCE.
> >
> > But you didn't put it in the same section as KVM_CREATE_SPAPR_TCE.   
> 0xe0
> > begins a different section.
> 
> It is not really obvious that there are sections as no comment defines
> those :)

There is a comment /* ioctls for fds returned by KVM_CREATE_DEVICE */

Putting KVM_CREATE_DEVICE in there was mainly to avoid dealing with the  
ioctl number conflict mess in the vm-ioctl section, but at least that  
one is related to the device control API. :-)

> But yes, makes sense to move it up a bit and change the code to 0xad.

0xad is KVM_KVMCLOCK_CTRL

-Scott

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-05-29 20:05                 ` Scott Wood
@ 2013-05-29 23:10                   ` Alexey Kardashevskiy
  2013-05-29 23:14                     ` Scott Wood
  0 siblings, 1 reply; 69+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-29 23:10 UTC (permalink / raw)
  To: Scott Wood
  Cc: David Gibson, kvm, linux-kernel, kvm-ppc, Alexander Graf,
	Paul Mackerras, linuxppc-dev

On 05/30/2013 06:05 AM, Scott Wood wrote:
> On 05/28/2013 07:12:32 PM, Alexey Kardashevskiy wrote:
>> On 05/29/2013 09:35 AM, Scott Wood wrote:
>> > On 05/28/2013 06:30:40 PM, Alexey Kardashevskiy wrote:
>> >> >> >>> @@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
>> >> >> >>> #define KVM_GET_DEVICE_ATTR      _IOW(KVMIO,  0xe2, struct
>> >> >> >>> kvm_device_attr)
>> >> >> >>> #define KVM_HAS_DEVICE_ATTR      _IOW(KVMIO,  0xe3, struct
>> >> >> >>> kvm_device_attr)
>> >> >> >>>
>> >> >> >>> +/* ioctl for SPAPR TCE IOMMU */
>> >> >> >>> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xe4, struct
>> >> >> >>> kvm_create_spapr_tce_iommu)
>> >> >> >>
>> >> >> >> Shouldn't this go under the vm ioctl section?
>> >> >>
>> >> >>
>> >> >> The KVM_CREATE_SPAPR_TCE_IOMMU ioctl (the version for emulated
>> >> devices) is
>> >> >> in this section so I decided to keep them together. Wrong?
>> >> >
>> >> > You decided to keep KVM_CREATE_SPAPR_TCE_IOMMU together with
>> >> > KVM_CREATE_SPAPR_TCE_IOMMU?
>> >>
>> >> Yes.
>> >
>> > Sigh.  That's the same thing repeated.  There's only one IOCTL. 
>> Nothing is
>> > being "kept together".
>>
>> Sorry, I meant this ioctl - KVM_CREATE_SPAPR_TCE.
> 
> But you didn't put it in the same section as KVM_CREATE_SPAPR_TCE.  0xe0
> begins a different section.

It is not really obvious that there are sections as no comment defines
those :) But yes, makes sense to move it up a bit and change the code to 0xad.



-- 
Alexey

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-05-29  0:12               ` Alexey Kardashevskiy
@ 2013-05-29 20:05                 ` Scott Wood
  2013-05-29 23:10                   ` Alexey Kardashevskiy
  0 siblings, 1 reply; 69+ messages in thread
From: Scott Wood @ 2013-05-29 20:05 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: David Gibson, kvm, linux-kernel, kvm-ppc, Alexander Graf,
	Paul Mackerras, linuxppc-dev

On 05/28/2013 07:12:32 PM, Alexey Kardashevskiy wrote:
> On 05/29/2013 09:35 AM, Scott Wood wrote:
> > On 05/28/2013 06:30:40 PM, Alexey Kardashevskiy wrote:
> >> >> >>> @@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
> >> >> >>> #define KVM_GET_DEVICE_ATTR      _IOW(KVMIO,  0xe2, struct
> >> >> >>> kvm_device_attr)
> >> >> >>> #define KVM_HAS_DEVICE_ATTR      _IOW(KVMIO,  0xe3, struct
> >> >> >>> kvm_device_attr)
> >> >> >>>
> >> >> >>> +/* ioctl for SPAPR TCE IOMMU */
> >> >> >>> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xe4,  
> struct
> >> >> >>> kvm_create_spapr_tce_iommu)
> >> >> >>
> >> >> >> Shouldn't this go under the vm ioctl section?
> >> >>
> >> >>
> >> >> The KVM_CREATE_SPAPR_TCE_IOMMU ioctl (the version for emulated
> >> devices) is
> >> >> in this section so I decided to keep them together. Wrong?
> >> >
> >> > You decided to keep KVM_CREATE_SPAPR_TCE_IOMMU together with
> >> > KVM_CREATE_SPAPR_TCE_IOMMU?
> >>
> >> Yes.
> >
> > Sigh.  That's the same thing repeated.  There's only one IOCTL.   
> Nothing is
> > being "kept together".
> 
> Sorry, I meant this ioctl - KVM_CREATE_SPAPR_TCE.

But you didn't put it in the same section as KVM_CREATE_SPAPR_TCE.   
0xe0 begins a different section.

-Scott

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-05-28 16:32       ` Scott Wood
@ 2013-05-29  0:20         ` Alexey Kardashevskiy
  0 siblings, 0 replies; 69+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-29  0:20 UTC (permalink / raw)
  To: Scott Wood
  Cc: David Gibson, kvm, linux-kernel, kvm-ppc, Alexander Graf,
	Paul Mackerras, linuxppc-dev

On 05/29/2013 02:32 AM, Scott Wood wrote:
> On 05/24/2013 09:45:24 PM, David Gibson wrote:
>> On Wed, May 22, 2013 at 04:06:57PM -0500, Scott Wood wrote:
>> > On 05/20/2013 10:06:46 PM, Alexey Kardashevskiy wrote:
>> > >diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>> > >index 8465c2a..da6bf61 100644
>> > >--- a/arch/powerpc/kvm/powerpc.c
>> > >+++ b/arch/powerpc/kvm/powerpc.c
>> > >@@ -396,6 +396,7 @@ int kvm_dev_ioctl_check_extension(long ext)
>> > >         break;
>> > > #endif
>> > >     case KVM_CAP_SPAPR_MULTITCE:
>> > >+    case KVM_CAP_SPAPR_TCE_IOMMU:
>> > >         r = 1;
>> > >         break;
>> > >     default:
>> >
>> > Don't advertise SPAPR capabilities if it's not book3s -- and
>> > probably there's some additional limitation that would be
>> > appropriate.
>>
>> So, in the case of MULTITCE, that's not quite right.  PR KVM can
>> emulate a PAPR system on a BookE machine, and there's no reason not to
>> allow TCE acceleration as well.
> 
> That might (or might not; consider things like Altivec versus SPE opcode
> conflict, whether unimplemented SPRs trap, behavior of unprivileged
> SPRs/instructions, etc) be true in theory, but it's not currently a
> supported configuration.  BookE KVM does not support emulating a different
> CPU than the host.  In the unlikely case that ever changes to the point of
> allowing PAPR guests on a BookE host, then we can revisit how to properly
> determine whether the capability is supported, but for now the capability
> will never be valid in the CONFIG_BOOKE case (though I'd rather see it
> depend on an appropriate book3s symbol than depend on !BOOKE).
> 
> Or we could just leave it as is, and let it indicate whether the host
> kernel supports the feature in general, with the user needing to understand
> when it's applicable...  I'm a bit confused by the documentation, however
> -- the MULTITCE capability was documented in the "capabilities that can be
> enabled" section, but I don't see where it can be enabled.


True, it cannot be enabled (though it could be, a long time ago); it is
either supported or not. I'll fix the documentation. Thanks!


-- 
Alexey
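
For illustration, the userspace side of "knowing in advance": a minimal
sketch of querying the capability via KVM_CHECK_EXTENSION before using the
hcalls, assuming the KVM_CAP_SPAPR_MULTITCE name from the patch:

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int spapr_multitce_supported(void)
{
	int ret, fd = open("/dev/kvm", O_RDWR);

	if (fd < 0)
		return 0;
	ret = ioctl(fd, KVM_CHECK_EXTENSION, KVM_CAP_SPAPR_MULTITCE);
	close(fd);
	return ret > 0;	/* nonzero iff the capability is advertised */
}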

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-05-28 23:35             ` Scott Wood
@ 2013-05-29  0:12               ` Alexey Kardashevskiy
  2013-05-29 20:05                 ` Scott Wood
  0 siblings, 1 reply; 69+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-29  0:12 UTC (permalink / raw)
  To: Scott Wood
  Cc: David Gibson, kvm, linux-kernel, kvm-ppc, Alexander Graf,
	Paul Mackerras, linuxppc-dev

On 05/29/2013 09:35 AM, Scott Wood wrote:
> On 05/28/2013 06:30:40 PM, Alexey Kardashevskiy wrote:
>> >> >>> @@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
>> >> >>> #define KVM_GET_DEVICE_ATTR      _IOW(KVMIO,  0xe2, struct
>> >> >>> kvm_device_attr)
>> >> >>> #define KVM_HAS_DEVICE_ATTR      _IOW(KVMIO,  0xe3, struct
>> >> >>> kvm_device_attr)
>> >> >>>
>> >> >>> +/* ioctl for SPAPR TCE IOMMU */
>> >> >>> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xe4, struct
>> >> >>> kvm_create_spapr_tce_iommu)
>> >> >>
>> >> >> Shouldn't this go under the vm ioctl section?
>> >>
>> >>
>> >> The KVM_CREATE_SPAPR_TCE_IOMMU ioctl (the version for emulated
>> devices) is
>> >> in this section so I decided to keep them together. Wrong?
>> >
>> > You decided to keep KVM_CREATE_SPAPR_TCE_IOMMU together with
>> > KVM_CREATE_SPAPR_TCE_IOMMU?
>>
>> Yes.
> 
> Sigh.  That's the same thing repeated.  There's only one IOCTL.  Nothing is
> being "kept together".

Sorry, I meant this ioctl - KVM_CREATE_SPAPR_TCE.


-- 
Alexey

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-05-28 23:30           ` Alexey Kardashevskiy
@ 2013-05-28 23:35             ` Scott Wood
  2013-05-29  0:12               ` Alexey Kardashevskiy
  0 siblings, 1 reply; 69+ messages in thread
From: Scott Wood @ 2013-05-28 23:35 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: David Gibson, kvm, linux-kernel, kvm-ppc, Alexander Graf,
	Paul Mackerras, linuxppc-dev

On 05/28/2013 06:30:40 PM, Alexey Kardashevskiy wrote:
> >> >>> @@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
> >> >>> #define KVM_GET_DEVICE_ATTR      _IOW(KVMIO,  0xe2, struct
> >> >>> kvm_device_attr)
> >> >>> #define KVM_HAS_DEVICE_ATTR      _IOW(KVMIO,  0xe3, struct
> >> >>> kvm_device_attr)
> >> >>>
> >> >>> +/* ioctl for SPAPR TCE IOMMU */
> >> >>> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xe4, struct
> >> >>> kvm_create_spapr_tce_iommu)
> >> >>
> >> >> Shouldn't this go under the vm ioctl section?
> >>
> >>
> >> The KVM_CREATE_SPAPR_TCE_IOMMU ioctl (the version for emulated  
> devices) is
> >> in this section so I decided to keep them together. Wrong?
> >
> > You decided to keep KVM_CREATE_SPAPR_TCE_IOMMU together with
> > KVM_CREATE_SPAPR_TCE_IOMMU?
> 
> Yes.

Sigh.  That's the same thing repeated.  There's only one IOCTL.   
Nothing is being "kept together".

-Scott

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-05-28 17:45         ` Scott Wood
@ 2013-05-28 23:30           ` Alexey Kardashevskiy
  2013-05-28 23:35             ` Scott Wood
  0 siblings, 1 reply; 69+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-28 23:30 UTC (permalink / raw)
  To: Scott Wood
  Cc: David Gibson, kvm, linux-kernel, kvm-ppc, Alexander Graf,
	Paul Mackerras, linuxppc-dev

On 05/29/2013 03:45 AM, Scott Wood wrote:
> On 05/26/2013 09:44:24 PM, Alexey Kardashevskiy wrote:
>> On 05/25/2013 12:45 PM, David Gibson wrote:
>> > On Wed, May 22, 2013 at 04:06:57PM -0500, Scott Wood wrote:
>> >> On 05/20/2013 10:06:46 PM, Alexey Kardashevskiy wrote:
>> >>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>> >>> index 8465c2a..da6bf61 100644
>> >>> --- a/arch/powerpc/kvm/powerpc.c
>> >>> +++ b/arch/powerpc/kvm/powerpc.c
>> >>> @@ -396,6 +396,7 @@ int kvm_dev_ioctl_check_extension(long ext)
>> >>>         break;
>> >>> #endif
>> >>>     case KVM_CAP_SPAPR_MULTITCE:
>> >>> +    case KVM_CAP_SPAPR_TCE_IOMMU:
>> >>>         r = 1;
>> >>>         break;
>> >>>     default:
>> >>
>> >> Don't advertise SPAPR capabilities if it's not book3s -- and
>> >> probably there's some additional limitation that would be
>> >> appropriate.
>> >
>> > So, in the case of MULTITCE, that's not quite right.  PR KVM can
>> > emulate a PAPR system on a BookE machine, and there's no reason not to
>> > allow TCE acceleration as well.  We can't make it dependent on PAPR
>> > mode being selected, because that's enabled per-vcpu, whereas these
>> > capabilities are queried on the VM before the vcpus are created.
>> >
>> > CAP_SPAPR_TCE_IOMMU should be dependent on the presence of suitable
>> > host side hardware (i.e. a PAPR style IOMMU), though.
>>
>>
>> The capability says that the ioctl is supported. If there is no IOMMU group
>> registered, then it will fail with a reasonable error and nobody gets hurt.
>> What is the problem?
> 
> You could say that about a lot of the capabilities that just advertise the
> existence of new ioctls. :-)
> 
> Sometimes it's nice to know in advance whether it's supported, before
> actually requesting that something happen.

Yes, that would be nice. There is just no quick way to know whether the real
system supports IOMMU groups. I could add another helper to the generic IOMMU
code which would return the number of registered IOMMU groups, but that is a
bit too much :)


>> >>> @@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
>> >>> #define KVM_GET_DEVICE_ATTR      _IOW(KVMIO,  0xe2, struct
>> >>> kvm_device_attr)
>> >>> #define KVM_HAS_DEVICE_ATTR      _IOW(KVMIO,  0xe3, struct
>> >>> kvm_device_attr)
>> >>>
>> >>> +/* ioctl for SPAPR TCE IOMMU */
>> >>> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xe4, struct
>> >>> kvm_create_spapr_tce_iommu)
>> >>
>> >> Shouldn't this go under the vm ioctl section?
>>
>>
>> The KVM_CREATE_SPAPR_TCE_IOMMU ioctl (the version for emulated devices) is
>> in this section so I decided to keep them together. Wrong?
> 
> You decided to keep KVM_CREATE_SPAPR_TCE_IOMMU together with
> KVM_CREATE_SPAPR_TCE_IOMMU?

Yes.


-- 
Alexey

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-05-27  2:44       ` Alexey Kardashevskiy
@ 2013-05-28 17:45         ` Scott Wood
  2013-05-28 23:30           ` Alexey Kardashevskiy
  0 siblings, 1 reply; 69+ messages in thread
From: Scott Wood @ 2013-05-28 17:45 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: David Gibson, kvm, linux-kernel, kvm-ppc, Alexander Graf,
	Paul Mackerras, linuxppc-dev

On 05/26/2013 09:44:24 PM, Alexey Kardashevskiy wrote:
> On 05/25/2013 12:45 PM, David Gibson wrote:
> > On Wed, May 22, 2013 at 04:06:57PM -0500, Scott Wood wrote:
> >> On 05/20/2013 10:06:46 PM, Alexey Kardashevskiy wrote:
> >>> diff --git a/arch/powerpc/kvm/powerpc.c  
> b/arch/powerpc/kvm/powerpc.c
> >>> index 8465c2a..da6bf61 100644
> >>> --- a/arch/powerpc/kvm/powerpc.c
> >>> +++ b/arch/powerpc/kvm/powerpc.c
> >>> @@ -396,6 +396,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> >>> 		break;
> >>> #endif
> >>> 	case KVM_CAP_SPAPR_MULTITCE:
> >>> +	case KVM_CAP_SPAPR_TCE_IOMMU:
> >>> 		r = 1;
> >>> 		break;
> >>> 	default:
> >>
> >> Don't advertise SPAPR capabilities if it's not book3s -- and
> >> probably there's some additional limitation that would be
> >> appropriate.
> >
> > So, in the case of MULTITCE, that's not quite right.  PR KVM can
> > emulate a PAPR system on a BookE machine, and there's no reason not  
> to
> > allow TCE acceleration as well.  We can't make it dependent on PAPR
> > mode being selected, because that's enabled per-vcpu, whereas these
> > capabilities are queried on the VM before the vcpus are created.
> >
> > CAP_SPAPR_TCE_IOMMU should be dependent on the presence of suitable
> > host side hardware (i.e. a PAPR style IOMMU), though.
> 
> 
> The capability says that the ioctl is supported. If there is no IOMMU  
> group
> registered, then it will fail with a reasonable error and nobody gets
> hurt.
> What is the problem?

You could say that about a lot of the capabilities that just advertise  
the existence of new ioctls. :-)

Sometimes it's nice to know in advance whether it's supported, before  
actually requesting that something happen.
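
For illustration, the advance probe is a single KVM_CHECK_EXTENSION call
on the /dev/kvm fd. A sketch, assuming headers that carry the capability
number from this series:

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Probe the capability before ever issuing the new ioctl. */
static int have_spapr_tce_iommu(void)
{
	int r, fd = open("/dev/kvm", O_RDWR);

	if (fd < 0)
		return 0;
	r = ioctl(fd, KVM_CHECK_EXTENSION, KVM_CAP_SPAPR_TCE_IOMMU);
	close(fd);
	return r > 0;
}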

> >>> @@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
> >>> #define KVM_GET_DEVICE_ATTR	  _IOW(KVMIO,  0xe2, struct
> >>> kvm_device_attr)
> >>> #define KVM_HAS_DEVICE_ATTR	  _IOW(KVMIO,  0xe3, struct
> >>> kvm_device_attr)
> >>>
> >>> +/* ioctl for SPAPR TCE IOMMU */
> >>> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xe4, struct
> >>> kvm_create_spapr_tce_iommu)
> >>
> >> Shouldn't this go under the vm ioctl section?
> 
> 
> The KVM_CREATE_SPAPR_TCE_IOMMU ioctl (the version for emulated  
> devices) is
> in this section so I decided to keep them together. Wrong?

You decided to keep KVM_CREATE_SPAPR_TCE_IOMMU together with  
KVM_CREATE_SPAPR_TCE_IOMMU?

-Scott

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-05-25  2:45     ` David Gibson
  2013-05-27  2:44       ` Alexey Kardashevskiy
  2013-05-27 10:23       ` Paolo Bonzini
@ 2013-05-28 16:32       ` Scott Wood
  2013-05-29  0:20         ` Alexey Kardashevskiy
  2 siblings, 1 reply; 69+ messages in thread
From: Scott Wood @ 2013-05-28 16:32 UTC (permalink / raw)
  To: David Gibson
  Cc: Alexey Kardashevskiy, kvm, linux-kernel, kvm-ppc, Alexander Graf,
	Paul Mackerras, linuxppc-dev

On 05/24/2013 09:45:24 PM, David Gibson wrote:
> On Wed, May 22, 2013 at 04:06:57PM -0500, Scott Wood wrote:
> > On 05/20/2013 10:06:46 PM, Alexey Kardashevskiy wrote:
> > >diff --git a/arch/powerpc/kvm/powerpc.c  
> b/arch/powerpc/kvm/powerpc.c
> > >index 8465c2a..da6bf61 100644
> > >--- a/arch/powerpc/kvm/powerpc.c
> > >+++ b/arch/powerpc/kvm/powerpc.c
> > >@@ -396,6 +396,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> > > 		break;
> > > #endif
> > > 	case KVM_CAP_SPAPR_MULTITCE:
> > >+	case KVM_CAP_SPAPR_TCE_IOMMU:
> > > 		r = 1;
> > > 		break;
> > > 	default:
> >
> > Don't advertise SPAPR capabilities if it's not book3s -- and
> > probably there's some additional limitation that would be
> > appropriate.
> 
> So, in the case of MULTITCE, that's not quite right.  PR KVM can
> emulate a PAPR system on a BookE machine, and there's no reason not to
> allow TCE acceleration as well.

That might (or might not; consider things like Altivec versus SPE  
opcode conflict, whether unimplemented SPRs trap, behavior of  
unprivileged SPRs/instructions, etc) be true in theory, but it's not  
currently a supported configuration.  BookE KVM does not support  
emulating a different CPU than the host.  In the unlikely case that  
ever changes to the point of allowing PAPR guests on a BookE host, then  
we can revisit how to properly determine whether the capability is  
supported, but for now the capability will never be valid in the  
CONFIG_BOOKE case (though I'd rather see it depend on an appropriate  
book3s symbol than depend on !BOOKE).

Or we could just leave it as is, and let it indicate whether the host  
kernel supports the feature in general, with the user needing to  
understand when it's applicable...  I'm a bit confused by the  
documentation, however -- the MULTITCE capability was documented in the  
"capabilities that can be enabled" section, but I don't see where it  
can be enabled.
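
For reference, "capabilities that can be enabled" are the ones userspace
switches on through KVM_ENABLE_CAP, roughly as sketched below; with the
patch as posted there is no in-kernel case accepting
KVM_CAP_SPAPR_MULTITCE, which is exactly the gap noted above:

#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* How an enable-able capability is normally turned on per vcpu. */
static int enable_multitce(int vcpu_fd)
{
	struct kvm_enable_cap cap;

	memset(&cap, 0, sizeof(cap));
	cap.cap = KVM_CAP_SPAPR_MULTITCE;

	return ioctl(vcpu_fd, KVM_ENABLE_CAP, &cap);	/* would fail here */
}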

-Scott

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-05-27 14:26         ` Alexey Kardashevskiy
@ 2013-05-27 14:41           ` Paolo Bonzini
  0 siblings, 0 replies; 69+ messages in thread
From: Paolo Bonzini @ 2013-05-27 14:41 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: David Gibson, Scott Wood, kvm, linux-kernel, kvm-ppc,
	Alexander Graf, Paul Mackerras, linuxppc-dev

Il 27/05/2013 16:26, Alexey Kardashevskiy ha scritto:
> On 05/27/2013 08:23 PM, Paolo Bonzini wrote:
>> Il 25/05/2013 04:45, David Gibson ha scritto:
>>>>> +	case KVM_CREATE_SPAPR_TCE_IOMMU: {
>>>>> +		struct kvm_create_spapr_tce_iommu create_tce_iommu;
>>>>> +		struct kvm *kvm = filp->private_data;
>>>>> +
>>>>> +		r = -EFAULT;
>>>>> +		if (copy_from_user(&create_tce_iommu, argp,
>>>>> +				sizeof(create_tce_iommu)))
>>>>> +			goto out;
>>>>> +		r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm,
>>>>> &create_tce_iommu);
>>>>> +		goto out;
>>>>> +	}
>>
>> Would it make sense to make this the only interface for creating TCEs?
>> That is, pass both a window_size and an IOMMU group id (or e.g. -1 for
>> no hardware IOMMU usage), and have a single ioctl for both cases?
>> There's some duplicated code between kvm_vm_ioctl_create_spapr_tce and
>> kvm_vm_ioctl_create_spapr_tce_iommu.
> 
> Just a few bits. Is there really much sense in making one function out of
> those two? I tried; it looked a bit messy.

Cannot really tell without the userspace bits.  But ioctl proliferation
is what the device and one_reg APIs were supposed to avoid...

>> KVM_CREATE_SPAPR_TCE could stay for backwards-compatibility, or you
>> could just use a new capability and drop the old ioctl.
> 
> The old capability+ioctl have existed for quite a while and a few QEMU
> versions supporting them have been released, so we do not want to just
> drop them. So what is the benefit of a new interface supporting both types?
> 
>>  I'm not sure
>> whether you're already considering the ABI to be stable for kvmppc.
> 
> Is any bit of KVM using it? Cannot see from Documentation/ABI.

I mean the userspace ABI (ioctls).

Paolo


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-05-27 10:23       ` Paolo Bonzini
@ 2013-05-27 14:26         ` Alexey Kardashevskiy
  2013-05-27 14:41           ` Paolo Bonzini
  0 siblings, 1 reply; 69+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-27 14:26 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: David Gibson, Scott Wood, kvm, linux-kernel, kvm-ppc,
	Alexander Graf, Paul Mackerras, linuxppc-dev

On 05/27/2013 08:23 PM, Paolo Bonzini wrote:
> Il 25/05/2013 04:45, David Gibson ha scritto:
>>>> +	case KVM_CREATE_SPAPR_TCE_IOMMU: {
>>>> +		struct kvm_create_spapr_tce_iommu create_tce_iommu;
>>>> +		struct kvm *kvm = filp->private_data;
>>>> +
>>>> +		r = -EFAULT;
>>>> +		if (copy_from_user(&create_tce_iommu, argp,
>>>> +				sizeof(create_tce_iommu)))
>>>> +			goto out;
>>>> +		r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm,
>>>> &create_tce_iommu);
>>>> +		goto out;
>>>> +	}
> 
> Would it make sense to make this the only interface for creating TCEs?
> That is, pass both a window_size and an IOMMU group id (or e.g. -1 for
> no hardware IOMMU usage), and have a single ioctl for both cases?
> There's some duplicated code between kvm_vm_ioctl_create_spapr_tce and
> kvm_vm_ioctl_create_spapr_tce_iommu.

Just a few bits. Is there really much sense in making one function out of
those two? I tried; it looked a bit messy.

> KVM_CREATE_SPAPR_TCE could stay for backwards-compatibility, or you
> could just use a new capability and drop the old ioctl.

The old capability+ioctl have existed for quite a while and a few QEMU
versions supporting them have been released, so we do not want to just
drop them. So what is the benefit of a new interface supporting both types?

>  I'm not sure
> whether you're already considering the ABI to be stable for kvmppc.

Is any bit of KVM using it? Cannot see from Documentation/ABI.


-- 
Alexey

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-05-25  2:45     ` David Gibson
  2013-05-27  2:44       ` Alexey Kardashevskiy
@ 2013-05-27 10:23       ` Paolo Bonzini
  2013-05-27 14:26         ` Alexey Kardashevskiy
  2013-05-28 16:32       ` Scott Wood
  2 siblings, 1 reply; 69+ messages in thread
From: Paolo Bonzini @ 2013-05-27 10:23 UTC (permalink / raw)
  To: David Gibson
  Cc: Scott Wood, Alexey Kardashevskiy, kvm, linux-kernel, kvm-ppc,
	Alexander Graf, Paul Mackerras, linuxppc-dev

Il 25/05/2013 04:45, David Gibson ha scritto:
>> >+	case KVM_CREATE_SPAPR_TCE_IOMMU: {
>> >+		struct kvm_create_spapr_tce_iommu create_tce_iommu;
>> >+		struct kvm *kvm = filp->private_data;
>> >+
>> >+		r = -EFAULT;
>> >+		if (copy_from_user(&create_tce_iommu, argp,
>> >+				sizeof(create_tce_iommu)))
>> >+			goto out;
>> >+		r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm,
>> >&create_tce_iommu);
>> >+		goto out;
>> >+	}

Would it make sense to make this the only interface for creating TCEs?
That is, pass both a window_size and an IOMMU group id (or e.g. -1 for
no hardware IOMMU usage), and have a single ioctl for both cases?
There's some duplicated code between kvm_vm_ioctl_create_spapr_tce and
kvm_vm_ioctl_create_spapr_tce_iommu.
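
For concreteness, the unified argument structure being suggested might
look roughly like this (a sketch only; neither the struct nor the -1
convention exists in the patch set):

/* Hypothetical merged ioctl argument: one entry point for both the
 * emulated case (window_size) and the hardware case (iommu_id). */
struct kvm_create_spapr_tce_unified {
	__u64 liobn;
	__u32 window_size;	/* used when iommu_id == -1 (emulated IO) */
	__s32 iommu_id;		/* IOMMU group id, or -1 for no hardware IOMMU */
	__u32 flags;
	__u32 pad;
};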

KVM_CREATE_SPAPR_TCE could stay for backwards-compatibility, or you
could just use a new capability and drop the old ioctl.  I'm not sure
whether you're already considering the ABI to be stable for kvmppc.

Paolo

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-05-25  2:45     ` David Gibson
@ 2013-05-27  2:44       ` Alexey Kardashevskiy
  2013-05-28 17:45         ` Scott Wood
  2013-05-27 10:23       ` Paolo Bonzini
  2013-05-28 16:32       ` Scott Wood
  2 siblings, 1 reply; 69+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-27  2:44 UTC (permalink / raw)
  To: Scott Wood
  Cc: David Gibson, kvm, linux-kernel, kvm-ppc, Alexander Graf,
	Paul Mackerras, linuxppc-dev

On 05/25/2013 12:45 PM, David Gibson wrote:
> On Wed, May 22, 2013 at 04:06:57PM -0500, Scott Wood wrote:
>> On 05/20/2013 10:06:46 PM, Alexey Kardashevskiy wrote:
>>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>>> index 8465c2a..da6bf61 100644
>>> --- a/arch/powerpc/kvm/powerpc.c
>>> +++ b/arch/powerpc/kvm/powerpc.c
>>> @@ -396,6 +396,7 @@ int kvm_dev_ioctl_check_extension(long ext)
>>> 		break;
>>> #endif
>>> 	case KVM_CAP_SPAPR_MULTITCE:
>>> +	case KVM_CAP_SPAPR_TCE_IOMMU:
>>> 		r = 1;
>>> 		break;
>>> 	default:
>>
>> Don't advertise SPAPR capabilities if it's not book3s -- and
>> probably there's some additional limitation that would be
>> appropriate.
> 
> So, in the case of MULTITCE, that's not quite right.  PR KVM can
> emulate a PAPR system on a BookE machine, and there's no reason not to
> allow TCE acceleration as well.  We can't make it dependent on PAPR
> mode being selected, because that's enabled per-vcpu, whereas these
> capabilities are queried on the VM before the vcpus are created.
> 
> CAP_SPAPR_TCE_IOMMU should be dependent on the presence of suitable
> host side hardware (i.e. a PAPR style IOMMU), though.


The capability says that the ioctl is supported. If there is no IOMMU group
registered, then it will fail with a reasonable error and nobody gets hurt.
What is the problem?


>>
>>> @@ -1025,6 +1026,17 @@ long kvm_arch_vm_ioctl(struct file *filp,
>>> 		r = kvm_vm_ioctl_create_spapr_tce(kvm, &create_tce);
>>> 		goto out;
>>> 	}
>>> +	case KVM_CREATE_SPAPR_TCE_IOMMU: {
>>> +		struct kvm_create_spapr_tce_iommu create_tce_iommu;
>>> +		struct kvm *kvm = filp->private_data;
>>> +
>>> +		r = -EFAULT;
>>> +		if (copy_from_user(&create_tce_iommu, argp,
>>> +				sizeof(create_tce_iommu)))
>>> +			goto out;
>>> +		r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm,
>>> &create_tce_iommu);
>>> +		goto out;
>>> +	}
>>> #endif /* CONFIG_PPC_BOOK3S_64 */
>>>
>>> #ifdef CONFIG_KVM_BOOK3S_64_HV
>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>>> index 5a2afda..450c82a 100644
>>> --- a/include/uapi/linux/kvm.h
>>> +++ b/include/uapi/linux/kvm.h
>>> @@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info {
>>> #define KVM_CAP_PPC_RTAS 91
>>> #define KVM_CAP_IRQ_XICS 92
>>> #define KVM_CAP_SPAPR_MULTITCE (0x110000 + 89)
>>> +#define KVM_CAP_SPAPR_TCE_IOMMU (0x110000 + 90)
>>
>> Hmm...
> 
> Ah, yeah, that needs to be fixed.  Those were interim numbers so that
> we didn't have to keep changing our internal trees as new upstream
> ioctls got added to the list.  We need to get a proper number for the
> merge, though.
>
>>> @@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
>>> #define KVM_GET_DEVICE_ATTR	  _IOW(KVMIO,  0xe2, struct
>>> kvm_device_attr)
>>> #define KVM_HAS_DEVICE_ATTR	  _IOW(KVMIO,  0xe3, struct
>>> kvm_device_attr)
>>>
>>> +/* ioctl for SPAPR TCE IOMMU */
>>> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xe4, struct
>>> kvm_create_spapr_tce_iommu)
>>
>> Shouldn't this go under the vm ioctl section?


The KVM_CREATE_SPAPR_TCE_IOMMU ioctl (the version for emulated devices) is
in this section so I decided to keep them together. Wrong?


-- 
Alexey

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-05-22 21:06   ` Scott Wood
@ 2013-05-25  2:45     ` David Gibson
  2013-05-27  2:44       ` Alexey Kardashevskiy
                         ` (2 more replies)
  0 siblings, 3 replies; 69+ messages in thread
From: David Gibson @ 2013-05-25  2:45 UTC (permalink / raw)
  To: Scott Wood
  Cc: Alexey Kardashevskiy, kvm, linux-kernel, kvm-ppc, Alexander Graf,
	Paul Mackerras, linuxppc-dev

On Wed, May 22, 2013 at 04:06:57PM -0500, Scott Wood wrote:
> On 05/20/2013 10:06:46 PM, Alexey Kardashevskiy wrote:
> >diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> >index 8465c2a..da6bf61 100644
> >--- a/arch/powerpc/kvm/powerpc.c
> >+++ b/arch/powerpc/kvm/powerpc.c
> >@@ -396,6 +396,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> > 		break;
> > #endif
> > 	case KVM_CAP_SPAPR_MULTITCE:
> >+	case KVM_CAP_SPAPR_TCE_IOMMU:
> > 		r = 1;
> > 		break;
> > 	default:
> 
> Don't advertise SPAPR capabilities if it's not book3s -- and
> probably there's some additional limitation that would be
> appropriate.

So, in the case of MULTITCE, that's not quite right.  PR KVM can
emulate a PAPR system on a BookE machine, and there's no reason not to
allow TCE acceleration as well.  We can't make it dependent on PAPR
mode being selected, because that's enabled per-vcpu, whereas these
capabilities are queried on the VM before the vcpus are created.

CAP_SPAPR_TCE_IOMMU should be dependent on the presence of suitable
host side hardware (i.e. a PAPR style IOMMU), though.
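
One possible shape for that dependency, as a sketch only:
spapr_tce_iommu_present() is a made-up predicate and the Kconfig test is
a plausible guess, not code from this series:

/* For kvm_dev_ioctl_check_extension(): report the capability only
 * when the host can actually back it, instead of returning 1
 * unconditionally. */
static int kvm_cap_spapr_tce_iommu_supported(void)
{
#if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_IOMMU_API)
	return spapr_tce_iommu_present() ? 1 : 0;
#else
	return 0;
#endif
}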

> 
> >@@ -1025,6 +1026,17 @@ long kvm_arch_vm_ioctl(struct file *filp,
> > 		r = kvm_vm_ioctl_create_spapr_tce(kvm, &create_tce);
> > 		goto out;
> > 	}
> >+	case KVM_CREATE_SPAPR_TCE_IOMMU: {
> >+		struct kvm_create_spapr_tce_iommu create_tce_iommu;
> >+		struct kvm *kvm = filp->private_data;
> >+
> >+		r = -EFAULT;
> >+		if (copy_from_user(&create_tce_iommu, argp,
> >+				sizeof(create_tce_iommu)))
> >+			goto out;
> >+		r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm,
> >&create_tce_iommu);
> >+		goto out;
> >+	}
> > #endif /* CONFIG_PPC_BOOK3S_64 */
> >
> > #ifdef CONFIG_KVM_BOOK3S_64_HV
> >diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >index 5a2afda..450c82a 100644
> >--- a/include/uapi/linux/kvm.h
> >+++ b/include/uapi/linux/kvm.h
> >@@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info {
> > #define KVM_CAP_PPC_RTAS 91
> > #define KVM_CAP_IRQ_XICS 92
> > #define KVM_CAP_SPAPR_MULTITCE (0x110000 + 89)
> >+#define KVM_CAP_SPAPR_TCE_IOMMU (0x110000 + 90)
> 
> Hmm...

Ah, yeah, that needs to be fixed.  Those were interim numbers so that
we didn't have to keep changing our internal trees as new upstream
ioctls got added to the list.  We need to get a proper number for the
merge, though.

> >@@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
> > #define KVM_GET_DEVICE_ATTR	  _IOW(KVMIO,  0xe2, struct
> >kvm_device_attr)
> > #define KVM_HAS_DEVICE_ATTR	  _IOW(KVMIO,  0xe3, struct
> >kvm_device_attr)
> >
> >+/* ioctl for SPAPR TCE IOMMU */
> >+#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xe4, struct
> >kvm_create_spapr_tce_iommu)
> 
> Shouldn't this go under the vm ioctl section?
> 
> -Scott

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-05-21  3:06 ` [PATCH 3/4] KVM: PPC: Add support for " Alexey Kardashevskiy
@ 2013-05-22 21:06   ` Scott Wood
  2013-05-25  2:45     ` David Gibson
  0 siblings, 1 reply; 69+ messages in thread
From: Scott Wood @ 2013-05-22 21:06 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, kvm, Alexey Kardashevskiy, Alexander Graf, kvm-ppc,
	linux-kernel, Paul Mackerras, David Gibson

On 05/20/2013 10:06:46 PM, Alexey Kardashevskiy wrote:
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index 8465c2a..da6bf61 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -396,6 +396,7 @@ int kvm_dev_ioctl_check_extension(long ext)
>  		break;
>  #endif
>  	case KVM_CAP_SPAPR_MULTITCE:
> +	case KVM_CAP_SPAPR_TCE_IOMMU:
>  		r = 1;
>  		break;
>  	default:

Don't advertise SPAPR capabilities if it's not book3s -- and probably  
there's some additional limitation that would be appropriate.

> @@ -1025,6 +1026,17 @@ long kvm_arch_vm_ioctl(struct file *filp,
>  		r = kvm_vm_ioctl_create_spapr_tce(kvm, &create_tce);
>  		goto out;
>  	}
> +	case KVM_CREATE_SPAPR_TCE_IOMMU: {
> +		struct kvm_create_spapr_tce_iommu create_tce_iommu;
> +		struct kvm *kvm = filp->private_data;
> +
> +		r = -EFAULT;
> +		if (copy_from_user(&create_tce_iommu, argp,
> +				sizeof(create_tce_iommu)))
> +			goto out;
> +		r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm,  
> &create_tce_iommu);
> +		goto out;
> +	}
>  #endif /* CONFIG_PPC_BOOK3S_64 */
> 
>  #ifdef CONFIG_KVM_BOOK3S_64_HV
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 5a2afda..450c82a 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info {
>  #define KVM_CAP_PPC_RTAS 91
>  #define KVM_CAP_IRQ_XICS 92
>  #define KVM_CAP_SPAPR_MULTITCE (0x110000 + 89)
> +#define KVM_CAP_SPAPR_TCE_IOMMU (0x110000 + 90)

Hmm...

> @@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
>  #define KVM_GET_DEVICE_ATTR	  _IOW(KVMIO,  0xe2, struct  
> kvm_device_attr)
>  #define KVM_HAS_DEVICE_ATTR	  _IOW(KVMIO,  0xe3, struct  
> kvm_device_attr)
> 
> +/* ioctl for SPAPR TCE IOMMU */
> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xe4, struct  
> kvm_create_spapr_tce_iommu)

Shouldn't this go under the vm ioctl section?

-Scott

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-05-21  3:06 [PATCH 0/4 v2] " Alexey Kardashevskiy
@ 2013-05-21  3:06 ` Alexey Kardashevskiy
  2013-05-22 21:06   ` Scott Wood
  0 siblings, 1 reply; 69+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-21  3:06 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, Benjamin Herrenschmidt,
	Alexander Graf, Paul Mackerras, linux-kernel, kvm, kvm-ppc

This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests without passing them to QEMU, which should
save time on switching to QEMU and back.

Both real and virtual modes are supported - whenever the kernel fails
to handle a TCE request in real mode, it passes it to the virtual mode
handler. If the virtual mode handler fails as well, the request is
passed to user mode, for example, to QEMU.
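
Schematically, the fallback chain reads as below. This is an editorial
sketch of the control flow, not the patch's literal dispatch: the
wrapper function is hypothetical, while kvmppc_h_put_tce() and
kvmppc_virtmode_h_put_tce() are the handlers this patch extends.

long handle_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
		      unsigned long ioba, unsigned long tce)
{
	/* 1. Real mode: fast path, runs with the MMU off */
	long ret = kvmppc_h_put_tce(vcpu, liobn, ioba, tce);

	if (ret != H_TOO_HARD)
		return ret;

	/* 2. Virtual mode: full kernel services available */
	ret = kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce);
	if (ret != H_TOO_HARD)
		return ret;

	/* 3. Still too hard: exit to user mode (QEMU) */
	return ret;
}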

This adds a new KVM_CAP_SPAPR_TCE_IOMMU capability and a
KVM_CREATE_SPAPR_TCE_IOMMU ioctl to associate a virtual PCI bus ID
(LIOBN) with an IOMMU group, which enables in-kernel handling of
IOMMU map/unmap.

Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb Ethernet card).

Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>

---

Changes:
2013-05-20:
* removed get_user() from real mode handlers
* kvm_vcpu_arch::tce_tmp usage extended. The real mode handler now puts
translated TCEs there, tries realmode_get_page() on them and, if that
fails, passes control over to the virtual mode handler, which tries to
finish handling the request
* kvmppc_lookup_pte() now does realmode_get_page() protected by the BUSY
bit on a page
* The only reason to pass the request to user mode now is when user mode
did not register a TCE table in the kernel; in all other cases the
virtual mode handler is expected to do the job
---
 Documentation/virtual/kvm/api.txt   |   28 +++++
 arch/powerpc/include/asm/kvm_host.h |    3 +
 arch/powerpc/include/asm/kvm_ppc.h  |    2 +
 arch/powerpc/include/uapi/asm/kvm.h |    7 ++
 arch/powerpc/kvm/book3s_64_vio.c    |  198 ++++++++++++++++++++++++++++++++++-
 arch/powerpc/kvm/book3s_64_vio_hv.c |  193 +++++++++++++++++++++++++++++++++-
 arch/powerpc/kvm/powerpc.c          |   12 +++
 include/uapi/linux/kvm.h            |    4 +
 8 files changed, 441 insertions(+), 6 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 3c7c7ea..3c8e9fe 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2362,6 +2362,34 @@ calls by the guest for that service will be passed to userspace to be
 handled.
 
 
+4.79 KVM_CREATE_SPAPR_TCE_IOMMU
+
+Capability: KVM_CAP_SPAPR_TCE_IOMMU
+Architectures: powerpc
+Type: vm ioctl
+Parameters: struct kvm_create_spapr_tce_iommu (in)
+Returns: 0 on success, -1 on error
+
+This creates a link between an IOMMU group and a hardware TCE (translation
+control entry) table. This link lets the host kernel know which IOMMU
+group (i.e. which TCE table) to use for the LIOBN passed with the
+H_PUT_TCE, H_PUT_TCE_INDIRECT and H_STUFF_TCE hypercalls.
+
+/* for KVM_CAP_SPAPR_TCE_IOMMU */
+struct kvm_create_spapr_tce_iommu {
+	__u64 liobn;
+	__u32 iommu_id;
+	__u32 flags;
+};
+
+No flags are supported at the moment.
+
+When the guest issues a TCE call on a liobn for which a TCE table has been
+registered, the kernel will handle it in real mode, updating the hardware
+TCE table. TCE table calls for other liobns will cause a vm exit and must
+be handled by userspace.
+
+
 5. The kvm_run structure
 ------------------------
 
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 85d8f26..ac0e2fe 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -180,6 +180,7 @@ struct kvmppc_spapr_tce_table {
 	struct kvm *kvm;
 	u64 liobn;
 	u32 window_size;
+	struct iommu_group *grp;    /* used for IOMMU groups */
 	struct page *pages[0];
 };
 
@@ -611,6 +612,8 @@ struct kvm_vcpu_arch {
 	u64 busy_preempt;
 
 	unsigned long *tce_tmp;    /* TCE cache for H_PUT_TCE_INDIRECT hcall */
+	unsigned long tce_tmp_num; /* Number of handled TCEs in the cache */
+	unsigned long tce_reason;  /* The reason for switching to virtual mode */
 #endif
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index e852921b..934e01d 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -133,6 +133,8 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
 
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce *args);
+extern long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
+				struct kvm_create_spapr_tce_iommu *args);
 extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
 		struct kvm_vcpu *vcpu, unsigned long liobn);
 extern long kvmppc_emulated_validate_tce(unsigned long tce);
diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 0fb1a6e..cf82af4 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -319,6 +319,13 @@ struct kvm_create_spapr_tce {
 	__u32 window_size;
 };
 
+/* for KVM_CAP_SPAPR_TCE_IOMMU */
+struct kvm_create_spapr_tce_iommu {
+	__u64 liobn;
+	__u32 iommu_id;
+	__u32 flags;
+};
+
 /* for KVM_ALLOCATE_RMA */
 struct kvm_allocate_rma {
 	__u64 rma_size;
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 06b7b20..ffb4698 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -27,6 +27,8 @@
 #include <linux/hugetlb.h>
 #include <linux/list.h>
 #include <linux/anon_inodes.h>
+#include <linux/pci.h>
+#include <linux/iommu.h>
 
 #include <asm/tlbflush.h>
 #include <asm/kvm_ppc.h>
@@ -56,8 +58,13 @@ static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
 
 	mutex_lock(&kvm->lock);
 	list_del(&stt->list);
-	for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
-		__free_page(stt->pages[i]);
+#ifdef CONFIG_IOMMU_API
+	if (stt->grp) {
+		iommu_group_put(stt->grp);
+	} else
+#endif
+		for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
+			__free_page(stt->pages[i]);
 	kfree(stt);
 	mutex_unlock(&kvm->lock);
 
@@ -153,6 +160,62 @@ fail:
 	return ret;
 }
 
+#ifdef CONFIG_IOMMU_API
+static const struct file_operations kvm_spapr_tce_iommu_fops = {
+	.release	= kvm_spapr_tce_release,
+};
+
+long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
+		struct kvm_create_spapr_tce_iommu *args)
+{
+	struct kvmppc_spapr_tce_table *tt = NULL;
+	struct iommu_group *grp;
+	struct iommu_table *tbl;
+
+	/* Find an IOMMU table for the given ID */
+	grp = iommu_group_get_by_id(args->iommu_id);
+	if (!grp)
+		return -ENXIO;
+
+	tbl = iommu_group_get_iommudata(grp);
+	if (!tbl)
+		return -ENXIO;
+
+	/* Check this LIOBN hasn't been previously allocated */
+	list_for_each_entry(tt, &kvm->arch.spapr_tce_tables, list) {
+		if (tt->liobn == args->liobn)
+			return -EBUSY;
+	}
+
+	tt = kzalloc(sizeof(*tt), GFP_KERNEL);
+	if (!tt)
+		return -ENOMEM;
+
+	tt->liobn = args->liobn;
+	tt->kvm = kvm;
+	tt->grp = grp;
+
+	kvm_get_kvm(kvm);
+
+	mutex_lock(&kvm->lock);
+	list_add(&tt->list, &kvm->arch.spapr_tce_tables);
+
+	mutex_unlock(&kvm->lock);
+
+	pr_debug("LIOBN=%llX hooked to IOMMU %d, flags=%u\n",
+			args->liobn, args->iommu_id, args->flags);
+
+	return anon_inode_getfd("kvm-spapr-tce-iommu",
+			&kvm_spapr_tce_iommu_fops, tt, O_RDWR);
+}
+#else
+long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
+		struct kvm_create_spapr_tce_iommu *args)
+{
+	return -ENOSYS;
+}
+#endif /* CONFIG_IOMMU_API */
+
 /* Converts guest physical address into host virtual */
 static void __user *kvmppc_virtmode_gpa_to_hva(struct kvm_vcpu *vcpu,
 		unsigned long gpa)
@@ -180,6 +243,46 @@ long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
 	if (!tt)
 		return H_TOO_HARD;
 
+#ifdef CONFIG_IOMMU_API
+	if (tt->grp) {
+		unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
+		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+		/* Return error if the group is being destroyed */
+		if (!tbl)
+			return H_RESCINDED;
+
+		if (vcpu->arch.tce_reason == H_HARDWARE) {
+			iommu_clear_tces_and_put_pages(tbl, entry, 1);
+			return H_HARDWARE;
+
+		} else if (!(tce & (TCE_PCI_READ | TCE_PCI_WRITE))) {
+			if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
+				return H_PARAMETER;
+
+			ret = iommu_clear_tces_and_put_pages(tbl, entry, 1);
+		} else {
+			void *hva;
+
+			if (iommu_tce_put_param_check(tbl, ioba, tce))
+				return H_PARAMETER;
+
+			hva = kvmppc_virtmode_gpa_to_hva(vcpu, tce);
+			if (hva == ERROR_ADDR)
+				return H_HARDWARE;
+
+			ret = iommu_put_tce_user_mode(tbl,
+					ioba >> IOMMU_PAGE_SHIFT,
+					(unsigned long) hva);
+		}
+		iommu_flush_tce(tbl);
+
+		if (ret)
+			return H_HARDWARE;
+
+		return H_SUCCESS;
+	}
+#endif
 	/* Emulated IO */
 	if (ioba >= tt->window_size)
 		return H_PARAMETER;
@@ -220,6 +323,70 @@ long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	if (tces == ERROR_ADDR)
 		return H_TOO_HARD;
 
+#ifdef CONFIG_IOMMU_API
+	if (tt->grp) {
+		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+		/* Return error if the group is being destroyed */
+		if (!tbl)
+			return H_RESCINDED;
+
+		/* Something bad happened, do cleanup and exit */
+		if (vcpu->arch.tce_reason == H_HARDWARE) {
+			i = vcpu->arch.tce_tmp_num;
+			goto fail_clear_tce;
+		} else if (vcpu->arch.tce_reason != H_TOO_HARD) {
+			/*
+			 * We get here only in PR KVM mode, otherwise
+			 * the real mode handler would have checked TCEs
+			 * already and failed on guest TCE translation.
+			 */
+			for (i = 0; i < npages; ++i) {
+				if (get_user(vcpu->arch.tce_tmp[i], tces + i))
+					return H_HARDWARE;
+
+				if (iommu_tce_put_param_check(tbl, ioba +
+						(i << IOMMU_PAGE_SHIFT),
+						vcpu->arch.tce_tmp[i]))
+					return H_PARAMETER;
+			}
+		} /* else: The real mode handler checked TCEs already */
+
+		/* Translate TCEs */
+		for (i = vcpu->arch.tce_tmp_num; i < npages; ++i) {
+			void *hva = kvmppc_virtmode_gpa_to_hva(vcpu,
+					vcpu->arch.tce_tmp[i]);
+
+			if (hva == ERROR_ADDR)
+				goto fail_clear_tce;
+
+			vcpu->arch.tce_tmp[i] = (unsigned long) hva;
+		}
+
+		/* Do get_page and put TCEs for all pages */
+		for (i = 0; i < npages; ++i) {
+			if (iommu_put_tce_user_mode(tbl,
+					(ioba >> IOMMU_PAGE_SHIFT) + i,
+					vcpu->arch.tce_tmp[i])) {
+				i = npages;
+				goto fail_clear_tce;
+			}
+		}
+
+		iommu_flush_tce(tbl);
+
+		return H_SUCCESS;
+
+fail_clear_tce:
+		/* Cannot complete the translation, clean up and exit */
+		iommu_clear_tces_and_put_pages(tbl,
+				ioba >> IOMMU_PAGE_SHIFT, i);
+
+		iommu_flush_tce(tbl);
+
+		return H_HARDWARE;
+	}
+#endif
 	/* Emulated IO */
 	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
 		return H_PARAMETER;
@@ -253,6 +420,33 @@ long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (!tt)
 		return H_TOO_HARD;
 
+#ifdef CONFIG_IOMMU_API
+	if (tt->grp) {
+		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+		unsigned long tmp, entry = ioba >> IOMMU_PAGE_SHIFT;
+
+		vcpu->arch.tce_tmp_num = 0;
+
+		/* Return error if the group is being destroyed */
+		if (!tbl)
+			return H_RESCINDED;
+
+		/* PR KVM? */
+		if (!vcpu->arch.tce_tmp_num &&
+				(vcpu->arch.tce_reason != H_TOO_HARD) &&
+				iommu_tce_clear_param_check(tbl, ioba,
+						tce_value, npages))
+			return H_PARAMETER;
+
+		/* Do actual cleanup */
+		tmp = vcpu->arch.tce_tmp_num;
+		if (iommu_clear_tces_and_put_pages(tbl, entry + tmp,
+				npages - tmp))
+			return H_PARAMETER;
+
+		return H_SUCCESS;
+	}
+#endif
 	/* Emulated IO */
 	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
 		return H_PARAMETER;
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index c68d538..dc4ae32 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -26,6 +26,7 @@
 #include <linux/slab.h>
 #include <linux/hugetlb.h>
 #include <linux/list.h>
+#include <linux/iommu.h>
 
 #include <asm/tlbflush.h>
 #include <asm/kvm_ppc.h>
@@ -118,7 +119,7 @@ EXPORT_SYMBOL_GPL(kvmppc_emulated_put_tce);
 #ifdef CONFIG_KVM_BOOK3S_64_HV
 
 static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
-			unsigned long *pte_sizep)
+			unsigned long *pte_sizep, bool do_get_page)
 {
 	pte_t *ptep;
 	unsigned int shift = 0;
@@ -135,6 +136,14 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
 	if (!pte_present(*ptep))
 		return __pte(0);
 
+	/*
+	 * Defer huge page handling to the virtual mode.
+	 * The only exception is TCE list pages, which we
+	 * do need to call get_page() for.
+	 */
+	if ((*pte_sizep > PAGE_SIZE) && do_get_page)
+		return __pte(0);
+
 	/* wait until _PAGE_BUSY is clear then set it atomically */
 	__asm__ __volatile__ (
 		"1:	ldarx	%0,0,%3\n"
@@ -148,6 +157,18 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
 		: "cc");
 
 	ret = pte;
+	if (do_get_page && pte_present(pte) && (!writing || pte_write(pte))) {
+		struct page *pg = NULL;
+		pg = realmode_pfn_to_page(pte_pfn(pte));
+		if (realmode_get_page(pg)) {
+			ret = __pte(0);
+		} else {
+			pte = pte_mkyoung(pte);
+			if (writing)
+				pte = pte_mkdirty(pte);
+		}
+	}
+	*ptep = pte;	/* clears _PAGE_BUSY */
 
 	return ret;
 }
@@ -157,7 +178,7 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
  * Also returns pte and page size if the page is present in page table.
  */
 static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
-		unsigned long gpa)
+		unsigned long gpa, bool do_get_page)
 {
 	struct kvm_memory_slot *memslot;
 	pte_t pte;
@@ -175,7 +196,7 @@ static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
 
 	/* Find a PTE and determine the size */
 	pte = kvmppc_lookup_pte(vcpu->arch.pgdir, hva,
-			writing, &pg_size);
+			writing, &pg_size, do_get_page);
 	if (!pte)
 		return ERROR_ADDR;
 
@@ -188,6 +209,52 @@ static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
 	return hpa;
 }
 
+#ifdef CONFIG_IOMMU_API
+static long kvmppc_clear_tce_real_mode(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	long ret = 0, i;
+	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
+
+	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i) {
+		struct page *page;
+		unsigned long oldtce;
+
+		oldtce = iommu_clear_tce(tbl, entry + i);
+		if (!oldtce)
+			continue;
+
+		page = realmode_pfn_to_page(oldtce >> PAGE_SHIFT);
+		if (!page) {
+			ret = H_TOO_HARD;
+			break;
+		}
+
+		if (oldtce & TCE_PCI_WRITE)
+			SetPageDirty(page);
+
+		if (realmode_put_page(page)) {
+			ret = H_TOO_HARD;
+			break;
+		}
+	}
+
+	if (ret == H_TOO_HARD) {
+		vcpu->arch.tce_tmp_num = i;
+		vcpu->arch.tce_reason = H_TOO_HARD;
+	}
+	/* if (ret < 0)
+		pr_err("iommu_tce: %s failed ioba=%lx, tce_value=%lx ret=%d\n",
+				__func__, ioba, tce_value, ret); */
+
+	return ret;
+}
+#endif /* CONFIG_IOMMU_API */
+
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba, unsigned long tce)
 {
@@ -199,6 +266,52 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (!tt)
 		return H_TOO_HARD;
 
+#ifdef CONFIG_IOMMU_API
+	if (tt->grp) {
+		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+		/* Return error if the group is being destroyed */
+		if (!tbl)
+			return H_RESCINDED;
+
+		vcpu->arch.tce_reason = 0;
+
+		if (tce & (TCE_PCI_READ | TCE_PCI_WRITE)) {
+			unsigned long hpa, hva;
+
+			if (iommu_tce_put_param_check(tbl, ioba, tce))
+				return H_PARAMETER;
+
+			hpa = kvmppc_realmode_gpa_to_hpa(vcpu, tce, true);
+			if (hpa == ERROR_ADDR) {
+				vcpu->arch.tce_reason = H_TOO_HARD;
+				return H_TOO_HARD;
+			}
+
+			hva = (unsigned long) __va(hpa);
+			ret = iommu_tce_build(tbl,
+					ioba >> IOMMU_PAGE_SHIFT,
+					hva, iommu_tce_direction(hva));
+			if (unlikely(ret)) {
+				struct page *pg = realmode_pfn_to_page(hpa);
+				BUG_ON(!pg);
+				if (realmode_put_page(pg)) {
+					vcpu->arch.tce_reason = H_HARDWARE;
+					return H_TOO_HARD;
+				}
+				return H_HARDWARE;
+			}
+		} else {
+			ret = kvmppc_clear_tce_real_mode(vcpu, tbl, ioba, 0, 1);
+			if (ret)
+				return ret;
+		}
+
+		iommu_flush_tce(tbl);
+
+		return H_SUCCESS;
+	}
+#endif
 	/* Emulated IO */
 	if (ioba >= tt->window_size)
 		return H_PARAMETER;
@@ -235,10 +348,62 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	if (tce_list & ~IOMMU_PAGE_MASK)
 		return H_PARAMETER;
 
-	tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu, tce_list);
+	vcpu->arch.tce_tmp_num = 0;
+	vcpu->arch.tce_reason = 0;
+
+	tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu,
+			tce_list, false);
 	if ((unsigned long)tces == ERROR_ADDR)
 		return H_TOO_HARD;
 
+#ifdef CONFIG_IOMMU_API
+	if (tt->grp) {
+		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+		/* Return error if the group is being destroyed */
+		if (!tbl)
+			return H_RESCINDED;
+
+		/* Check all TCEs */
+		for (i = 0; i < npages; ++i) {
+			if (iommu_tce_put_param_check(tbl, ioba +
+					(i << IOMMU_PAGE_SHIFT), tces[i]))
+				return H_PARAMETER;
+			vcpu->arch.tce_tmp[i] = tces[i];
+		}
+
+		/* Translate TCEs and go get_page */
+		for (i = 0; i < npages; ++i) {
+			unsigned long hpa = kvmppc_realmode_gpa_to_hpa(vcpu,
+					vcpu->arch.tce_tmp[i], true);
+			if (hpa == ERROR_ADDR) {
+				vcpu->arch.tce_tmp_num = i;
+				vcpu->arch.tce_reason = H_TOO_HARD;
+				return H_TOO_HARD;
+			}
+			vcpu->arch.tce_tmp[i] = hpa;
+		}
+
+		/* Put TCEs to the table */
+		for (i = 0; i < npages; ++i) {
+			unsigned long hva = (unsigned long)
+					__va(vcpu->arch.tce_tmp[i]);
+
+			ret = iommu_tce_build(tbl,
+					(ioba >> IOMMU_PAGE_SHIFT) + i,
+					hva, iommu_tce_direction(hva));
+			if (ret) {
+				/* All wrong, go virtmode and do cleanup */
+				vcpu->arch.tce_reason = H_HARDWARE;
+				return H_TOO_HARD;
+			}
+		}
+
+		iommu_flush_tce(tbl);
+
+		return H_SUCCESS;
+	}
+#endif
 	/* Emulated IO */
 	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
 		return H_PARAMETER;
@@ -268,6 +433,26 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (!tt)
 		return H_TOO_HARD;
 
+#ifdef CONFIG_IOMMU_API
+	if (tt->grp) {
+		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+		/* Return error if the group is being destroyed */
+		if (!tbl)
+			return H_RESCINDED;
+
+		vcpu->arch.tce_reason = 0;
+
+		ret = kvmppc_clear_tce_real_mode(vcpu, tbl, ioba,
+				tce_value, npages);
+		if (ret)
+			return ret;
+
+		iommu_flush_tce(tbl);
+
+		return H_SUCCESS;
+	}
+#endif
 	/* Emulated IO */
 	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
 		return H_PARAMETER;
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 8465c2a..da6bf61 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -396,6 +396,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 		break;
 #endif
 	case KVM_CAP_SPAPR_MULTITCE:
+	case KVM_CAP_SPAPR_TCE_IOMMU:
 		r = 1;
 		break;
 	default:
@@ -1025,6 +1026,17 @@ long kvm_arch_vm_ioctl(struct file *filp,
 		r = kvm_vm_ioctl_create_spapr_tce(kvm, &create_tce);
 		goto out;
 	}
+	case KVM_CREATE_SPAPR_TCE_IOMMU: {
+		struct kvm_create_spapr_tce_iommu create_tce_iommu;
+		struct kvm *kvm = filp->private_data;
+
+		r = -EFAULT;
+		if (copy_from_user(&create_tce_iommu, argp,
+				sizeof(create_tce_iommu)))
+			goto out;
+		r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm, &create_tce_iommu);
+		goto out;
+	}
 #endif /* CONFIG_PPC_BOOK3S_64 */
 
 #ifdef CONFIG_KVM_BOOK3S_64_HV
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 5a2afda..450c82a 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_RTAS 91
 #define KVM_CAP_IRQ_XICS 92
 #define KVM_CAP_SPAPR_MULTITCE (0x110000 + 89)
+#define KVM_CAP_SPAPR_TCE_IOMMU (0x110000 + 90)
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
 #define KVM_GET_DEVICE_ATTR	  _IOW(KVMIO,  0xe2, struct kvm_device_attr)
 #define KVM_HAS_DEVICE_ATTR	  _IOW(KVMIO,  0xe3, struct kvm_device_attr)
 
+/* ioctl for SPAPR TCE IOMMU */
+#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xe4, struct kvm_create_spapr_tce_iommu)
+
 /*
  * ioctls for vcpu fds
  */
-- 
1.7.10.4
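
As a usage illustration (an editorial sketch, not part of the patch or
of QEMU): associating a LIOBN with a host IOMMU group from userspace
could look like this, assuming headers with the definitions added above
and with error handling trimmed:

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Returns an anon fd for the TCE table on success, < 0 on error. */
static int hook_liobn_to_group(int vm_fd, __u64 liobn, __u32 group_id)
{
	struct kvm_create_spapr_tce_iommu args = {
		.liobn = liobn,
		.iommu_id = group_id,	/* group id as seen in sysfs/VFIO */
		.flags = 0,
	};

	return ioctl(vm_fd, KVM_CREATE_SPAPR_TCE_IOMMU, &args);
}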


^ permalink raw reply related	[flat|nested] 69+ messages in thread

end of thread, other threads:[~2013-06-27 11:06 UTC | newest]

Thread overview: 69+ messages
2013-06-05  6:11 [PATCH 0/4 v3] KVM: PPC: IOMMU in-kernel handling Alexey Kardashevskiy
2013-06-05  6:11 ` [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls Alexey Kardashevskiy
2013-06-16  4:20   ` Benjamin Herrenschmidt
2013-06-16 22:06   ` Alexander Graf
2013-06-17  7:55     ` Alexey Kardashevskiy
2013-06-17  8:02       ` Alexander Graf
2013-06-17  8:34         ` Alexey Kardashevskiy
2013-06-17  8:40           ` Alexander Graf
2013-06-17  8:51             ` Alexey Kardashevskiy
2013-06-17 10:46               ` Alexander Graf
2013-06-17 10:48                 ` Alexander Graf
2013-06-17  8:37       ` Benjamin Herrenschmidt
2013-06-17  8:42         ` Alexander Graf
2013-06-05  6:11 ` [PATCH 2/4] powerpc: Prepare to support kernel handling of IOMMU map/unmap Alexey Kardashevskiy
2013-06-16  4:26   ` Benjamin Herrenschmidt
2013-06-16  4:31     ` Benjamin Herrenschmidt
2013-06-17  9:17     ` Alexey Kardashevskiy
2013-06-05  6:11 ` [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling Alexey Kardashevskiy
2013-06-16  4:39   ` Benjamin Herrenschmidt
2013-06-19  3:17     ` Alexey Kardashevskiy
2013-06-16 22:25   ` Alexander Graf
2013-06-16 22:39   ` Benjamin Herrenschmidt
2013-06-17  3:13     ` Alex Williamson
2013-06-17  3:56       ` Benjamin Herrenschmidt
2013-06-18  2:32         ` Alex Williamson
2013-06-18  4:38           ` Benjamin Herrenschmidt
2013-06-18 14:48             ` Alex Williamson
2013-06-18 21:58               ` Benjamin Herrenschmidt
2013-06-19  3:35           ` Rusty Russell
2013-06-19  4:59             ` Benjamin Herrenschmidt
2013-06-19  9:58               ` Alexander Graf
2013-06-19 14:50                 ` Benjamin Herrenschmidt
2013-06-19 15:49                   ` Alex Williamson
2013-06-20  4:58                     ` Alexey Kardashevskiy
2013-06-20  5:28                       ` David Gibson
2013-06-20  7:47                         ` Benjamin Herrenschmidt
2013-06-20  8:48                           ` Alexey Kardashevskiy
2013-06-20 14:55                             ` Alex Williamson
2013-06-22  8:25                               ` Alexey Kardashevskiy
2013-06-22 12:03                               ` David Gibson
2013-06-22 14:28                                 ` Alex Williamson
2013-06-24  3:52                                   ` David Gibson
2013-06-24  4:41                                     ` Alex Williamson
2013-06-27 11:01                                       ` David Gibson
2013-06-22 23:28                                 ` Benjamin Herrenschmidt
2013-06-24  3:54                                   ` David Gibson
2013-06-24  3:58                                     ` Benjamin Herrenschmidt
2013-06-05  6:11 ` [PATCH 4/4] KVM: PPC: Add hugepage " Alexey Kardashevskiy
2013-06-16  4:46   ` Benjamin Herrenschmidt
2013-06-17 16:35   ` Paolo Bonzini
2013-06-12  3:14 ` [PATCH 0/4 v3] KVM: PPC: " Benjamin Herrenschmidt
  -- strict thread matches above, loose matches on Subject: below --
2013-05-21  3:06 [PATCH 0/4 v2] " Alexey Kardashevskiy
2013-05-21  3:06 ` [PATCH 3/4] KVM: PPC: Add support for " Alexey Kardashevskiy
2013-05-22 21:06   ` Scott Wood
2013-05-25  2:45     ` David Gibson
2013-05-27  2:44       ` Alexey Kardashevskiy
2013-05-28 17:45         ` Scott Wood
2013-05-28 23:30           ` Alexey Kardashevskiy
2013-05-28 23:35             ` Scott Wood
2013-05-29  0:12               ` Alexey Kardashevskiy
2013-05-29 20:05                 ` Scott Wood
2013-05-29 23:10                   ` Alexey Kardashevskiy
2013-05-29 23:14                     ` Scott Wood
2013-05-29 23:29                       ` Alexey Kardashevskiy
2013-05-29 23:32                         ` Scott Wood
2013-05-27 10:23       ` Paolo Bonzini
2013-05-27 14:26         ` Alexey Kardashevskiy
2013-05-27 14:41           ` Paolo Bonzini
2013-05-28 16:32       ` Scott Wood
2013-05-29  0:20         ` Alexey Kardashevskiy
