linuxppc-dev.lists.ozlabs.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/4 v2] KVM: PPC: IOMMU in-kernel handling
@ 2013-05-21  3:06 Alexey Kardashevskiy
  2013-05-21  3:06 ` [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls Alexey Kardashevskiy
                   ` (3 more replies)
  0 siblings, 4 replies; 36+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-21  3:06 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kvm, Alexey Kardashevskiy, Alexander Graf, kvm-ppc, linux-kernel,
	Paul Mackerras, David Gibson

This accelerates IOMMU operations in real and virtual
mode in the host kernel for the KVM guest.

The first patch with multitce support is useful for emulated devices as is.

The other patches are designed for VFIO although this series
does not contain any VFIO related code as the connection between
VFIO and the new handlers is to be made in QEMU
via ioctl to the KVM fd.

The series was made and tested against v3.10-rc1.


Alexey Kardashevskiy (4):
  KVM: PPC: Add support for multiple-TCE hcalls
  powerpc: Prepare to support kernel handling of IOMMU map/unmap
  KVM: PPC: Add support for IOMMU in-kernel handling
  KVM: PPC: Add hugepage support for IOMMU in-kernel handling

 Documentation/virtual/kvm/api.txt        |   42 +++
 arch/powerpc/include/asm/kvm_host.h      |    7 +
 arch/powerpc/include/asm/kvm_ppc.h       |   40 ++-
 arch/powerpc/include/asm/pgtable-ppc64.h |    4 +
 arch/powerpc/include/uapi/asm/kvm.h      |    7 +
 arch/powerpc/kvm/book3s_64_vio.c         |  398 ++++++++++++++++++++++++-
 arch/powerpc/kvm/book3s_64_vio_hv.c      |  471 ++++++++++++++++++++++++++++--
 arch/powerpc/kvm/book3s_hv.c             |   39 +++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S  |    6 +
 arch/powerpc/kvm/book3s_pr_papr.c        |   37 ++-
 arch/powerpc/kvm/powerpc.c               |   15 +
 arch/powerpc/mm/init_64.c                |   77 ++++-
 include/uapi/linux/kvm.h                 |    5 +
 13 files changed, 1120 insertions(+), 28 deletions(-)

-- 
1.7.10.4

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls
  2013-05-21  3:06 [PATCH 0/4 v2] KVM: PPC: IOMMU in-kernel handling Alexey Kardashevskiy
@ 2013-05-21  3:06 ` Alexey Kardashevskiy
  2013-05-27 10:08   ` Paolo Bonzini
  2013-05-21  3:06 ` [PATCH 2/4] powerpc: Prepare to support kernel handling of IOMMU map/unmap Alexey Kardashevskiy
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 36+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-21  3:06 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kvm, Alexey Kardashevskiy, Alexander Graf, kvm-ppc, linux-kernel,
	Paul Mackerras, David Gibson

This adds real mode handlers for the H_PUT_TCE_INDIRECT and
H_STUFF_TCE hypercalls for QEMU emulated devices such as virtio
devices or emulated PCI.  These calls allow adding multiple entries
(up to 512) into the TCE table in one call which saves time on
transition to/from real mode.

This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
(copied from user and verified) before writing the whole list into
the TCE table. This cache will be utilized more in the upcoming
VFIO/IOMMU support to continue TCE list processing in the virtual
mode in the case if the real mode handler failed for some reason.

This adds a guest physical to host real address converter
and calls the existing H_PUT_TCE handler. The converting function
is going to be fully utilized by upcoming VFIO supporting patches.

This also implements the KVM_CAP_PPC_MULTITCE capability,
so in order to support the functionality of this patch, QEMU
needs to query for this capability and set the "hcall-multi-tce"
hypertas property only if the capability is present, otherwise
there will be serious performance degradation.

Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>

---
Changelog:
* added kvm_vcpu_arch::tce_tmp
* removed cleanup if put_indirect failed, instead we do not even start
writing to TCE table if we cannot get TCEs from the user and they are
invalid
* kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
and kvmppc_emulated_validate_tce (for the previous item)
* fixed bug with failthrough for H_IPI
* removed all get_user() from real mode handlers
* kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
---
 Documentation/virtual/kvm/api.txt       |   14 ++
 arch/powerpc/include/asm/kvm_host.h     |    2 +
 arch/powerpc/include/asm/kvm_ppc.h      |   16 +-
 arch/powerpc/kvm/book3s_64_vio.c        |  118 ++++++++++++++
 arch/powerpc/kvm/book3s_64_vio_hv.c     |  266 +++++++++++++++++++++++++++----
 arch/powerpc/kvm/book3s_hv.c            |   39 +++++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |    6 +
 arch/powerpc/kvm/book3s_pr_papr.c       |   37 ++++-
 arch/powerpc/kvm/powerpc.c              |    3 +
 include/uapi/linux/kvm.h                |    1 +
 10 files changed, 470 insertions(+), 32 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 5f91eda..3c7c7ea 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2780,3 +2780,17 @@ Parameters: args[0] is the XICS device fd
             args[1] is the XICS CPU number (server ID) for this vcpu
 
 This capability connects the vcpu to an in-kernel XICS device.
+
+6.8 KVM_CAP_PPC_MULTITCE
+
+Architectures: ppc
+Parameters: none
+Returns: 0 on success; -1 on error
+
+This capability enables the guest to put/remove multiple TCE entries
+per hypercall which significanly accelerates DMA operations for PPC KVM
+guests.
+
+When this capability is enabled, H_PUT_TCE_INDIRECT and H_STUFF_TCE are
+expected to occur rather than H_PUT_TCE which supports only one TCE entry
+per call.
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index af326cd..85d8f26 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -609,6 +609,8 @@ struct kvm_vcpu_arch {
 	spinlock_t tbacct_lock;
 	u64 busy_stolen;
 	u64 busy_preempt;
+
+	unsigned long *tce_tmp;    /* TCE cache for TCE_PUT_INDIRECT hall */
 #endif
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index a5287fe..e852921b 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -133,8 +133,20 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
 
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce *args);
-extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
-			     unsigned long ioba, unsigned long tce);
+extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
+		struct kvm_vcpu *vcpu, unsigned long liobn);
+extern long kvmppc_emulated_validate_tce(unsigned long tce);
+extern void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
+		unsigned long ioba, unsigned long tce);
+extern long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce);
+extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_list, unsigned long npages);
+extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages);
 extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
 				struct kvm_allocate_rma *rma);
 extern struct kvmppc_linear_info *kvm_alloc_rma(void);
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index b2d3f3b..06b7b20 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -14,6 +14,7 @@
  *
  * Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
  * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com>
+ * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com>
  */
 
 #include <linux/types.h>
@@ -36,8 +37,11 @@
 #include <asm/ppc-opcode.h>
 #include <asm/kvm_host.h>
 #include <asm/udbg.h>
+#include <asm/iommu.h>
+#include <asm/tce.h>
 
 #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
+#define ERROR_ADDR      ((void *)~(unsigned long)0x0)
 
 static long kvmppc_stt_npages(unsigned long window_size)
 {
@@ -148,3 +152,117 @@ fail:
 	}
 	return ret;
 }
+
+/* Converts guest physical address into host virtual */
+static void __user *kvmppc_virtmode_gpa_to_hva(struct kvm_vcpu *vcpu,
+		unsigned long gpa)
+{
+	unsigned long hva, gfn = gpa >> PAGE_SHIFT;
+	struct kvm_memory_slot *memslot;
+
+	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
+	if (!memslot)
+		return ERROR_ADDR;
+
+	hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa & ~PAGE_MASK);
+	return (void *) hva;
+}
+
+long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce)
+{
+	long ret;
+	struct kvmppc_spapr_tce_table *tt;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, put it to userspace */
+	if (!tt)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	if (ioba >= tt->window_size)
+		return H_PARAMETER;
+
+	ret = kvmppc_emulated_validate_tce(tce);
+	if (ret)
+		return ret;
+
+	kvmppc_emulated_put_tce(tt, ioba, tce);
+
+	return H_SUCCESS;
+}
+
+long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_list, unsigned long npages)
+{
+	struct kvmppc_spapr_tce_table *tt;
+	long i, ret;
+	unsigned long __user *tces;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, put it to userspace */
+	if (!tt)
+		return H_TOO_HARD;
+
+	/*
+	 * The spec says that the maximum size of the list is 512 TCEs so
+	 * so the whole table addressed resides in 4K page
+	 */
+	if (npages > 512)
+		return H_PARAMETER;
+
+	if (tce_list & ~IOMMU_PAGE_MASK)
+		return H_PARAMETER;
+
+	tces = kvmppc_virtmode_gpa_to_hva(vcpu, tce_list);
+	if (tces == ERROR_ADDR)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i) {
+		if (get_user(vcpu->arch.tce_tmp[i], tces + i))
+			return H_PARAMETER;
+
+		ret = kvmppc_emulated_validate_tce(vcpu->arch.tce_tmp[i]);
+		if (ret)
+			return ret;
+	}
+
+	for (i = 0; i < npages; ++i)
+		kvmppc_emulated_put_tce(tt,
+				ioba + (i << IOMMU_PAGE_SHIFT),
+				vcpu->arch.tce_tmp[i]);
+
+	return H_SUCCESS;
+}
+
+long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	struct kvmppc_spapr_tce_table *tt;
+	long i, ret;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, put it to userspace */
+	if (!tt)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+		return H_PARAMETER;
+
+	ret = kvmppc_emulated_validate_tce(tce_value);
+	if (ret || (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ)))
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
+		kvmppc_emulated_put_tce(tt, ioba, tce_value);
+
+	return H_SUCCESS;
+}
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 30c2f3b..c68d538 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -14,6 +14,7 @@
  *
  * Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
  * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com>
+ * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com>
  */
 
 #include <linux/types.h>
@@ -35,42 +36,249 @@
 #include <asm/ppc-opcode.h>
 #include <asm/kvm_host.h>
 #include <asm/udbg.h>
+#include <asm/iommu.h>
+#include <asm/tce.h>
 
 #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
+#define ERROR_ADDR      (~(unsigned long)0x0)
 
-/* WARNING: This will be called in real-mode on HV KVM and virtual
- *          mode on PR KVM
+/* Finds a TCE table descriptor by LIOBN */
+struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm_vcpu *vcpu,
+		unsigned long liobn)
+{
+	struct kvmppc_spapr_tce_table *tt;
+
+	list_for_each_entry(tt, &vcpu->kvm->arch.spapr_tce_tables, list) {
+		if (tt->liobn == liobn)
+			return tt;
+	}
+
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(kvmppc_find_tce_table);
+
+/*
+ * Validate TCE address.
+ * At the moment only flags are validated
+ * as other check will significantly slow down
+ * or can make it even impossible to handle TCE requests
+ * in real mode.
+ */
+long kvmppc_emulated_validate_tce(unsigned long tce)
+{
+	if (tce & ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ))
+		return H_PARAMETER;
+
+	return H_SUCCESS;
+}
+EXPORT_SYMBOL_GPL(kvmppc_emulated_validate_tce);
+
+/*
+ * kvmppc_emulated_put_tce() handles TCE requests for devices emulated
+ * by QEMU. It puts guest TCE values into the table and expects
+ * the QEMU to convert them later in the QEMU device implementation.
+ * Wiorks in both real and virtual modes.
+ * It cannot fail so kvmppc_emulated_validate_tce must be called before it.
  */
+void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
+		unsigned long ioba, unsigned long tce)
+{
+	unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
+	struct page *page;
+	u64 *tbl;
+
+	/*
+	 * Note on the use of page_address() in real mode,
+	 *
+	 * It is safe to use page_address() in real mode on ppc64 because
+	 * page_address() is always defined as lowmem_page_address()
+	 * which returns __va(PFN_PHYS(page_to_pfn(page))) which is arithmetial
+	 * operation and does not access page struct.
+	 *
+	 * Theoretically page_address() could be defined different
+	 * but either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL
+	 * should be enabled.
+	 * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64,
+	 * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only
+	 * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP
+	 * is not expected to be enabled on ppc32, page_address()
+	 * is safe for ppc32 as well.
+	 */
+#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
+#error TODO: fix to avoid page_address() here
+#endif
+	page = tt->pages[idx / TCES_PER_PAGE];
+	tbl = (u64 *)page_address(page);
+
+	/* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
+	tbl[idx % TCES_PER_PAGE] = tce;
+}
+EXPORT_SYMBOL_GPL(kvmppc_emulated_put_tce);
+
+#ifdef CONFIG_KVM_BOOK3S_64_HV
+
+static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
+			unsigned long *pte_sizep)
+{
+	pte_t *ptep;
+	unsigned int shift = 0;
+	pte_t pte, tmp, ret;
+
+	ptep = find_linux_pte_or_hugepte(pgdir, hva, &shift);
+	if (!ptep)
+		return __pte(0);
+	if (shift)
+		*pte_sizep = 1ul << shift;
+	else
+		*pte_sizep = PAGE_SIZE;
+
+	if (!pte_present(*ptep))
+		return __pte(0);
+
+	/* wait until _PAGE_BUSY is clear then set it atomically */
+	__asm__ __volatile__ (
+		"1:	ldarx	%0,0,%3\n"
+		"	andi.	%1,%0,%4\n"
+		"	bne-	1b\n"
+		"	ori	%1,%0,%4\n"
+		"	stdcx.	%1,0,%3\n"
+		"	bne-	1b"
+		: "=&r" (pte), "=&r" (tmp), "=m" (*ptep)
+		: "r" (ptep), "i" (_PAGE_BUSY)
+		: "cc");
+
+	ret = pte;
+
+	return ret;
+}
+
+/*
+ * Converts guest physical address into host physical address.
+ * Also returns pte and page size if the page is present in page table.
+ */
+static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
+		unsigned long gpa)
+{
+	struct kvm_memory_slot *memslot;
+	pte_t pte;
+	unsigned long hva, hpa, pg_size = 0, offset;
+	unsigned long gfn = gpa >> PAGE_SHIFT;
+	bool writing = gpa & TCE_PCI_WRITE;
+
+	/* Find a KVM memslot */
+	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
+	if (!memslot)
+		return ERROR_ADDR;
+
+	/* Convert guest physical address to host virtual */
+	hva = __gfn_to_hva_memslot(memslot, gfn);
+
+	/* Find a PTE and determine the size */
+	pte = kvmppc_lookup_pte(vcpu->arch.pgdir, hva,
+			writing, &pg_size);
+	if (!pte)
+		return ERROR_ADDR;
+
+	/* Calculate host phys address keeping flags and offset in the page */
+	offset = gpa & (pg_size - 1);
+
+	/* pte_pfn(pte) should return an address aligned to pg_size */
+	hpa = (pte_pfn(pte) << PAGE_SHIFT) + offset;
+
+	return hpa;
+}
+
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba, unsigned long tce)
 {
-	struct kvm *kvm = vcpu->kvm;
-	struct kvmppc_spapr_tce_table *stt;
-
-	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
-	/* 	    liobn, ioba, tce); */
-
-	list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
-		if (stt->liobn == liobn) {
-			unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
-			struct page *page;
-			u64 *tbl;
-
-			/* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p  window_size=0x%x\n", */
-			/* 	    liobn, stt, stt->window_size); */
-			if (ioba >= stt->window_size)
-				return H_PARAMETER;
-
-			page = stt->pages[idx / TCES_PER_PAGE];
-			tbl = (u64 *)page_address(page);
-
-			/* FIXME: Need to validate the TCE itself */
-			/* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
-			tbl[idx % TCES_PER_PAGE] = tce;
-			return H_SUCCESS;
-		}
+	long ret;
+	struct kvmppc_spapr_tce_table *tt;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, put it to virtual space */
+	if (!tt)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	if (ioba >= tt->window_size)
+		return H_PARAMETER;
+
+	ret = kvmppc_emulated_validate_tce(tce);
+	if (ret)
+		return ret;
+
+	kvmppc_emulated_put_tce(tt, ioba, tce);
+
+	return H_SUCCESS;
+}
+
+long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_list,	unsigned long npages)
+{
+	struct kvmppc_spapr_tce_table *tt;
+	long i, ret;
+	unsigned long *tces;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, put it to virtual space */
+	if (!tt)
+		return H_TOO_HARD;
+
+	/*
+	 * The spec says that the maximum size of the list is 512 TCEs so
+	 * so the whole table addressed resides in 4K page
+	 */
+	if (npages > 512)
+		return H_PARAMETER;
+
+	if (tce_list & ~IOMMU_PAGE_MASK)
+		return H_PARAMETER;
+
+	tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu, tce_list);
+	if ((unsigned long)tces == ERROR_ADDR)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i) {
+		ret = kvmppc_emulated_validate_tce(tces[i]);
+		if (ret)
+			return ret;
 	}
 
-	/* Didn't find the liobn, punt it to userspace */
-	return H_TOO_HARD;
+	for (i = 0; i < npages; ++i)
+		kvmppc_emulated_put_tce(tt, ioba + (i << IOMMU_PAGE_SHIFT),
+				tces[i]);
+
+	return H_SUCCESS;
+}
+
+long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	struct kvmppc_spapr_tce_table *tt;
+	long i, ret;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, put it to virtual space */
+	if (!tt)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+		return H_PARAMETER;
+
+	ret = kvmppc_emulated_validate_tce(tce_value);
+	if (ret || (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ)))
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
+		kvmppc_emulated_put_tce(tt, ioba, tce_value);
+
+	return H_SUCCESS;
 }
+#endif /* CONFIG_KVM_BOOK3S_64_HV */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 9de24f8..9d70fd8 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -566,6 +566,30 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
 			ret = kvmppc_xics_hcall(vcpu, req);
 			break;
 		} /* fallthrough */
+		return RESUME_HOST;
+	case H_PUT_TCE:
+		ret = kvmppc_virtmode_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
+						kvmppc_get_gpr(vcpu, 5),
+						kvmppc_get_gpr(vcpu, 6));
+		if (ret == H_TOO_HARD)
+			return RESUME_HOST;
+		break;
+	case H_PUT_TCE_INDIRECT:
+		ret = kvmppc_virtmode_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4),
+						kvmppc_get_gpr(vcpu, 5),
+						kvmppc_get_gpr(vcpu, 6),
+						kvmppc_get_gpr(vcpu, 7));
+		if (ret == H_TOO_HARD)
+			return RESUME_HOST;
+		break;
+	case H_STUFF_TCE:
+		ret = kvmppc_virtmode_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
+						kvmppc_get_gpr(vcpu, 5),
+						kvmppc_get_gpr(vcpu, 6),
+						kvmppc_get_gpr(vcpu, 7));
+		if (ret == H_TOO_HARD)
+			return RESUME_HOST;
+		break;
 	default:
 		return RESUME_HOST;
 	}
@@ -956,6 +980,20 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
 	vcpu->arch.cpu_type = KVM_CPU_3S_64;
 	kvmppc_sanity_check(vcpu);
 
+	/*
+	 * As we want to minimize the chance of having H_PUT_TCE_INDIRECT
+	 * half executed, we first read TCEs from the user, check them and
+	 * return error if something went wrong and only then put TCEs into
+	 * the TCE table.
+	 *
+	 * tce_tmp is a cache for TCEs to avoid stack allocation or
+	 * kmalloc as the whole TCE list can take up to 512 items 8 bytes
+	 * each (4096 bytes).
+	 */
+	vcpu->arch.tce_tmp = kmalloc(4096, GFP_KERNEL);
+	if (!vcpu->arch.tce_tmp)
+		goto free_vcpu;
+
 	return vcpu;
 
 free_vcpu:
@@ -978,6 +1016,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
 	unpin_vpa(vcpu->kvm, &vcpu->arch.slb_shadow);
 	unpin_vpa(vcpu->kvm, &vcpu->arch.vpa);
 	spin_unlock(&vcpu->arch.vpa_update_lock);
+	kfree(vcpu->arch.tce_tmp);
 	kvm_vcpu_uninit(vcpu);
 	kmem_cache_free(kvm_vcpu_cache, vcpu);
 }
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index b02f91e..d35554e 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1490,6 +1490,12 @@ hcall_real_table:
 	.long	0		/* 0x11c */
 	.long	0		/* 0x120 */
 	.long	.kvmppc_h_bulk_remove - hcall_real_table
+	.long	0		/* 0x128 */
+	.long	0		/* 0x12c */
+	.long	0		/* 0x130 */
+	.long	0		/* 0x134 */
+	.long	.kvmppc_h_stuff_tce - hcall_real_table
+	.long	.kvmppc_h_put_tce_indirect - hcall_real_table
 hcall_real_table_end:
 
 ignore_hdec:
diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c
index b24309c..3750e3c 100644
--- a/arch/powerpc/kvm/book3s_pr_papr.c
+++ b/arch/powerpc/kvm/book3s_pr_papr.c
@@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu)
 	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
 	long rc;
 
-	rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
+	rc = kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce);
+	if (rc == H_TOO_HARD)
+		return EMULATE_FAIL;
+	kvmppc_set_gpr(vcpu, 3, rc);
+	return EMULATE_DONE;
+}
+
+static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu)
+{
+	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
+	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
+	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
+	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
+	long rc;
+
+	rc = kvmppc_virtmode_h_put_tce_indirect(vcpu, liobn, ioba,
+			tce, npages);
+	if (rc == H_TOO_HARD)
+		return EMULATE_FAIL;
+	kvmppc_set_gpr(vcpu, 3, rc);
+	return EMULATE_DONE;
+}
+
+static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu)
+{
+	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
+	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
+	unsigned long tce_value = kvmppc_get_gpr(vcpu, 6);
+	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
+	long rc;
+
+	rc = kvmppc_virtmode_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages);
 	if (rc == H_TOO_HARD)
 		return EMULATE_FAIL;
 	kvmppc_set_gpr(vcpu, 3, rc);
@@ -247,6 +278,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd)
 		return kvmppc_h_pr_bulk_remove(vcpu);
 	case H_PUT_TCE:
 		return kvmppc_h_pr_put_tce(vcpu);
+	case H_PUT_TCE_INDIRECT:
+		return kvmppc_h_pr_put_tce_indirect(vcpu);
+	case H_STUFF_TCE:
+		return kvmppc_h_pr_stuff_tce(vcpu);
 	case H_CEDE:
 		vcpu->arch.shared->msr |= MSR_EE;
 		kvm_vcpu_block(vcpu);
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 6316ee3..8465c2a 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -395,6 +395,9 @@ int kvm_dev_ioctl_check_extension(long ext)
 		r = 1;
 		break;
 #endif
+	case KVM_CAP_SPAPR_MULTITCE:
+		r = 1;
+		break;
 	default:
 		r = 0;
 		break;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index a5c86fc..5a2afda 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -666,6 +666,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_IRQ_MPIC 90
 #define KVM_CAP_PPC_RTAS 91
 #define KVM_CAP_IRQ_XICS 92
+#define KVM_CAP_SPAPR_MULTITCE (0x110000 + 89)
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 2/4] powerpc: Prepare to support kernel handling of IOMMU map/unmap
  2013-05-21  3:06 [PATCH 0/4 v2] KVM: PPC: IOMMU in-kernel handling Alexey Kardashevskiy
  2013-05-21  3:06 ` [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls Alexey Kardashevskiy
@ 2013-05-21  3:06 ` Alexey Kardashevskiy
  2013-05-21  3:06 ` [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling Alexey Kardashevskiy
  2013-05-21  3:06 ` [PATCH 4/4] KVM: PPC: Add hugepage " Alexey Kardashevskiy
  3 siblings, 0 replies; 36+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-21  3:06 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kvm, Alexey Kardashevskiy, Alexander Graf, kvm-ppc, linux-kernel,
	Paul Mackerras, David Gibson

The current VFIO-on-POWER implementation supports only user mode
driven mapping, i.e. QEMU is sending requests to map/unmap pages.
However this approach is really slow, so we want to move that to KVM.
Since H_PUT_TCE can be extremely performance sensitive (especially with
network adapters where each packet needs to be mapped/unmapped) we chose
to implement that as a "fast" hypercall directly in "real
mode" (processor still in the guest context but MMU off).

To be able to do that, we need to provide some facilities to
access the struct page count within that real mode environment as things
like the sparsemem vmemmap mappings aren't accessible.

This adds an API to increment/decrement page counter as
get_user_pages API used for user mode mapping does not work
in the real mode.

CONFIG_SPARSEMEM_VMEMMAP and CONFIG_FLATMEM are supported.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Paul Mackerras <paulus@samba.org>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>

---

Changes:
2013-05-20:
* PageTail() is replaced by PageCompound() in order to have the same checks
for whether the page is huge in realmode_get_page() and realmode_put_page()
---
 arch/powerpc/include/asm/pgtable-ppc64.h |    4 ++
 arch/powerpc/mm/init_64.c                |   77 +++++++++++++++++++++++++++++-
 2 files changed, 80 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index e3d55f6f..7b46e5f 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -376,6 +376,10 @@ static inline pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
 }
 #endif /* !CONFIG_HUGETLB_PAGE */
 
+struct page *realmode_pfn_to_page(unsigned long pfn);
+int realmode_get_page(struct page *page);
+int realmode_put_page(struct page *page);
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* _ASM_POWERPC_PGTABLE_PPC64_H_ */
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index c2787bf..ba6cf9b 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -296,5 +296,80 @@ void vmemmap_free(unsigned long start, unsigned long end)
 {
 }
 
-#endif /* CONFIG_SPARSEMEM_VMEMMAP */
+/*
+ * We do not have access to the sparsemem vmemmap, so we fallback to
+ * walking the list of sparsemem blocks which we already maintain for
+ * the sake of crashdump. In the long run, we might want to maintain
+ * a tree if performance of that linear walk becomes a problem.
+ *
+ * Any of realmode_XXXX functions can fail due to:
+ * 1) As real sparsemem blocks do not lay in RAM continously (they
+ * are in virtual address space which is not available in the real mode),
+ * the requested page struct can be split between blocks so get_page/put_page
+ * may fail.
+ * 2) When huge pages are used, the get_page/put_page API will fail
+ * in real mode as the linked addresses in the page struct are virtual
+ * too.
+ * When 1) or 2) takes place, the API returns an error code to cause
+ * an exit to kernel virtual mode where the operation will be completed.
+ */
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+	struct vmemmap_backing *vmem_back;
+	struct page *page;
+	unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;
+	unsigned long pg_va = (unsigned long) pfn_to_page(pfn);
+
+	for (vmem_back = vmemmap_list; vmem_back; vmem_back = vmem_back->list) {
+		if (pg_va < vmem_back->virt_addr)
+			continue;
 
+		/* Check that page struct is not split between real pages */
+		if ((pg_va + sizeof(struct page)) >
+				(vmem_back->virt_addr + page_size))
+			return NULL;
+
+		page = (struct page *) (vmem_back->phys + pg_va -
+				vmem_back->virt_addr);
+		return page;
+	}
+
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#elif defined(CONFIG_FLATMEM)
+
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+	struct page *page = pfn_to_page(pfn);
+	return page;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#endif /* CONFIG_SPARSEMEM_VMEMMAP/CONFIG_FLATMEM */
+
+#if defined(CONFIG_SPARSEMEM_VMEMMAP) || defined(CONFIG_FLATMEM)
+int realmode_get_page(struct page *page)
+{
+	if (PageCompound(page))
+		return -EAGAIN;
+
+	get_page(page);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(realmode_get_page);
+
+int realmode_put_page(struct page *page)
+{
+	if (PageCompound(page))
+		return -EAGAIN;
+
+	if (!atomic_add_unless(&page->_count, -1, 1))
+		return -EAGAIN;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(realmode_put_page);
+#endif
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-05-21  3:06 [PATCH 0/4 v2] KVM: PPC: IOMMU in-kernel handling Alexey Kardashevskiy
  2013-05-21  3:06 ` [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls Alexey Kardashevskiy
  2013-05-21  3:06 ` [PATCH 2/4] powerpc: Prepare to support kernel handling of IOMMU map/unmap Alexey Kardashevskiy
@ 2013-05-21  3:06 ` Alexey Kardashevskiy
  2013-05-22 21:06   ` Scott Wood
  2013-05-21  3:06 ` [PATCH 4/4] KVM: PPC: Add hugepage " Alexey Kardashevskiy
  3 siblings, 1 reply; 36+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-21  3:06 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kvm, Alexey Kardashevskiy, Alexander Graf, kvm-ppc, linux-kernel,
	Paul Mackerras, David Gibson

This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests without passing them to QEMU, which should
save time on switching to QEMU and back.

Both real and virtual modes are supported - whenever the kernel
fails to handle TCE request, it passes it to the virtual mode.
If it the virtual mode handlers fail, then the request is passed
to the user mode, for example, to QEMU.

This adds a new KVM_CAP_SPAPR_TCE_IOMMU ioctl to asssociate
a virtual PCI bus ID (LIOBN) with an IOMMU group, which enables
in-kernel handling of IOMMU map/unmap.

Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).

Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>

---

Changes:
2013-05-20:
* removed get_user() from real mode handlers
* kvm_vcpu_arch::tce_tmp usage extended. Now real mode handler puts there
translated TCEs, tries realmode_get_page() on those and if it fails, it
passes control over the virtual mode handler which tries to finish
the request handling
* kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit
on a page
* The only reason to pass the request to user mode now is when the user mode
did not register TCE table in the kernel, in all other cases the virtual mode
handler is expected to do the job
---
 Documentation/virtual/kvm/api.txt   |   28 +++++
 arch/powerpc/include/asm/kvm_host.h |    3 +
 arch/powerpc/include/asm/kvm_ppc.h  |    2 +
 arch/powerpc/include/uapi/asm/kvm.h |    7 ++
 arch/powerpc/kvm/book3s_64_vio.c    |  198 ++++++++++++++++++++++++++++++++++-
 arch/powerpc/kvm/book3s_64_vio_hv.c |  193 +++++++++++++++++++++++++++++++++-
 arch/powerpc/kvm/powerpc.c          |   12 +++
 include/uapi/linux/kvm.h            |    4 +
 8 files changed, 441 insertions(+), 6 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 3c7c7ea..3c8e9fe 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2362,6 +2362,34 @@ calls by the guest for that service will be passed to userspace to be
 handled.
 
 
+4.79 KVM_CREATE_SPAPR_TCE_IOMMU
+
+Capability: KVM_CAP_SPAPR_TCE_IOMMU
+Architectures: powerpc
+Type: vm ioctl
+Parameters: struct kvm_create_spapr_tce_iommu (in)
+Returns: 0 on success, -1 on error
+
+This creates a link between IOMMU group and a hardware TCE (translation
+control entry) table. This link lets the host kernel know what IOMMU
+group (i.e. TCE table) to use for the LIOBN number passed with
+H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls.
+
+/* for KVM_CAP_SPAPR_TCE_IOMMU */
+struct kvm_create_spapr_tce_iommu {
+	__u64 liobn;
+	__u32 iommu_id;
+	__u32 flags;
+};
+
+No flag is supported at the moment.
+
+When the guest issues TCE call on a liobn for which a TCE table has been
+registered, the kernel will handle it in real mode, updating the hardware
+TCE table. TCE table calls for other liobns will cause a vm exit and must
+be handled by userspace.
+
+
 5. The kvm_run structure
 ------------------------
 
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 85d8f26..ac0e2fe 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -180,6 +180,7 @@ struct kvmppc_spapr_tce_table {
 	struct kvm *kvm;
 	u64 liobn;
 	u32 window_size;
+	struct iommu_group *grp;    /* used for IOMMU groups */
 	struct page *pages[0];
 };
 
@@ -611,6 +612,8 @@ struct kvm_vcpu_arch {
 	u64 busy_preempt;
 
 	unsigned long *tce_tmp;    /* TCE cache for TCE_PUT_INDIRECT hall */
+	unsigned long tce_tmp_num; /* Number of handled TCEs in the cache */
+	unsigned long tce_reason;  /* The reason of switching to the virtmode */
 #endif
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index e852921b..934e01d 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -133,6 +133,8 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
 
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce *args);
+extern long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
+				struct kvm_create_spapr_tce_iommu *args);
 extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
 		struct kvm_vcpu *vcpu, unsigned long liobn);
 extern long kvmppc_emulated_validate_tce(unsigned long tce);
diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 0fb1a6e..cf82af4 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -319,6 +319,13 @@ struct kvm_create_spapr_tce {
 	__u32 window_size;
 };
 
+/* for KVM_CAP_SPAPR_TCE_IOMMU */
+struct kvm_create_spapr_tce_iommu {
+	__u64 liobn;
+	__u32 iommu_id;
+	__u32 flags;
+};
+
 /* for KVM_ALLOCATE_RMA */
 struct kvm_allocate_rma {
 	__u64 rma_size;
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 06b7b20..ffb4698 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -27,6 +27,8 @@
 #include <linux/hugetlb.h>
 #include <linux/list.h>
 #include <linux/anon_inodes.h>
+#include <linux/pci.h>
+#include <linux/iommu.h>
 
 #include <asm/tlbflush.h>
 #include <asm/kvm_ppc.h>
@@ -56,8 +58,13 @@ static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
 
 	mutex_lock(&kvm->lock);
 	list_del(&stt->list);
-	for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
-		__free_page(stt->pages[i]);
+#ifdef CONFIG_IOMMU_API
+	if (stt->grp) {
+		iommu_group_put(stt->grp);
+	} else
+#endif
+		for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
+			__free_page(stt->pages[i]);
 	kfree(stt);
 	mutex_unlock(&kvm->lock);
 
@@ -153,6 +160,62 @@ fail:
 	return ret;
 }
 
+#ifdef CONFIG_IOMMU_API
+static const struct file_operations kvm_spapr_tce_iommu_fops = {
+	.release	= kvm_spapr_tce_release,
+};
+
+long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
+		struct kvm_create_spapr_tce_iommu *args)
+{
+	struct kvmppc_spapr_tce_table *tt = NULL;
+	struct iommu_group *grp;
+	struct iommu_table *tbl;
+
+	/* Find an IOMMU table for the given ID */
+	grp = iommu_group_get_by_id(args->iommu_id);
+	if (!grp)
+		return -ENXIO;
+
+	tbl = iommu_group_get_iommudata(grp);
+	if (!tbl)
+		return -ENXIO;
+
+	/* Check this LIOBN hasn't been previously allocated */
+	list_for_each_entry(tt, &kvm->arch.spapr_tce_tables, list) {
+		if (tt->liobn == args->liobn)
+			return -EBUSY;
+	}
+
+	tt = kzalloc(sizeof(*tt), GFP_KERNEL);
+	if (!tt)
+		return -ENOMEM;
+
+	tt->liobn = args->liobn;
+	tt->kvm = kvm;
+	tt->grp = grp;
+
+	kvm_get_kvm(kvm);
+
+	mutex_lock(&kvm->lock);
+	list_add(&tt->list, &kvm->arch.spapr_tce_tables);
+
+	mutex_unlock(&kvm->lock);
+
+	pr_debug("LIOBN=%llX hooked to IOMMU %d, flags=%u\n",
+			args->liobn, args->iommu_id, args->flags);
+
+	return anon_inode_getfd("kvm-spapr-tce-iommu",
+			&kvm_spapr_tce_iommu_fops, tt, O_RDWR);
+}
+#else
+long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
+		struct kvm_create_spapr_tce_iommu *args)
+{
+	return -ENOSYS;
+}
+#endif /* CONFIG_IOMMU_API */
+
 /* Converts guest physical address into host virtual */
 static void __user *kvmppc_virtmode_gpa_to_hva(struct kvm_vcpu *vcpu,
 		unsigned long gpa)
@@ -180,6 +243,46 @@ long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
 	if (!tt)
 		return H_TOO_HARD;
 
+#ifdef CONFIG_IOMMU_API
+	if (tt->grp) {
+		unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
+		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+		/* Return error if the group is being destroyed */
+		if (!tbl)
+			return H_RESCINDED;
+
+		if (vcpu->arch.tce_reason == H_HARDWARE) {
+			iommu_clear_tces_and_put_pages(tbl, entry, 1);
+			return H_HARDWARE;
+
+		} else if (!(tce & (TCE_PCI_READ | TCE_PCI_WRITE))) {
+			if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
+				return H_PARAMETER;
+
+			ret = iommu_clear_tces_and_put_pages(tbl, entry, 1);
+		} else {
+			void *hva;
+
+			if (iommu_tce_put_param_check(tbl, ioba, tce))
+				return H_PARAMETER;
+
+			hva = kvmppc_virtmode_gpa_to_hva(vcpu, tce);
+			if (hva == ERROR_ADDR)
+				return H_HARDWARE;
+
+			ret = iommu_put_tce_user_mode(tbl,
+					ioba >> IOMMU_PAGE_SHIFT,
+					(unsigned long) hva);
+		}
+		iommu_flush_tce(tbl);
+
+		if (ret)
+			return H_HARDWARE;
+
+		return H_SUCCESS;
+	}
+#endif
 	/* Emulated IO */
 	if (ioba >= tt->window_size)
 		return H_PARAMETER;
@@ -220,6 +323,70 @@ long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	if (tces == ERROR_ADDR)
 		return H_TOO_HARD;
 
+#ifdef CONFIG_IOMMU_API
+	if (tt->grp) {
+		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+		/* Return error if the group is being destroyed */
+		if (!tbl)
+			return H_RESCINDED;
+
+		/* Something bad happened, do cleanup and exit */
+		if (vcpu->arch.tce_reason == H_HARDWARE) {
+			i = vcpu->arch.tce_tmp_num;
+			goto fail_clear_tce;
+		} else if (vcpu->arch.tce_reason != H_TOO_HARD) {
+			/*
+			 * We get here only in PR KVM mode, otherwise
+			 * the real mode handler would have checked TCEs
+			 * already and failed on guest TCE translation.
+			 */
+			for (i = 0; i < npages; ++i) {
+				if (get_user(vcpu->arch.tce_tmp[i], tces + i))
+					return H_HARDWARE;
+
+				if (iommu_tce_put_param_check(tbl, ioba +
+						(i << IOMMU_PAGE_SHIFT),
+						vcpu->arch.tce_tmp[i]))
+					return H_PARAMETER;
+			}
+		} /* else: The real mode handler checked TCEs already */
+
+		/* Translate TCEs */
+		for (i = vcpu->arch.tce_tmp_num; i < npages; ++i) {
+			void *hva = kvmppc_virtmode_gpa_to_hva(vcpu,
+					vcpu->arch.tce_tmp[i]);
+
+			if (hva == ERROR_ADDR)
+				goto fail_clear_tce;
+
+			vcpu->arch.tce_tmp[i] = (unsigned long) hva;
+		}
+
+		/* Do get_page and put TCEs for all pages */
+		for (i = 0; i < npages; ++i) {
+			if (iommu_put_tce_user_mode(tbl,
+					(ioba >> IOMMU_PAGE_SHIFT) + i,
+					vcpu->arch.tce_tmp[i])) {
+				i = npages;
+				goto fail_clear_tce;
+			}
+		}
+
+		iommu_flush_tce(tbl);
+
+		return H_SUCCESS;
+
+fail_clear_tce:
+		/* Cannot complete the translation, clean up and exit */
+		iommu_clear_tces_and_put_pages(tbl,
+				ioba >> IOMMU_PAGE_SHIFT, i);
+
+		iommu_flush_tce(tbl);
+
+		return H_HARDWARE;
+	}
+#endif
 	/* Emulated IO */
 	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
 		return H_PARAMETER;
@@ -253,6 +420,33 @@ long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (!tt)
 		return H_TOO_HARD;
 
+#ifdef CONFIG_IOMMU_API
+	if (tt->grp) {
+		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+		unsigned long tmp, entry = ioba >> IOMMU_PAGE_SHIFT;
+
+		vcpu->arch.tce_tmp_num = 0;
+
+		/* Return error if the group is being destroyed */
+		if (!tbl)
+			return H_RESCINDED;
+
+		/* PR KVM? */
+		if (!vcpu->arch.tce_tmp_num &&
+				(vcpu->arch.tce_reason != H_TOO_HARD) &&
+				iommu_tce_clear_param_check(tbl, ioba,
+						tce_value, npages))
+			return H_PARAMETER;
+
+		/* Do actual cleanup */
+		tmp = vcpu->arch.tce_tmp_num;
+		if (iommu_clear_tces_and_put_pages(tbl, entry + tmp,
+				npages - tmp))
+			return H_PARAMETER;
+
+		return H_SUCCESS;
+	}
+#endif
 	/* Emulated IO */
 	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
 		return H_PARAMETER;
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index c68d538..dc4ae32 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -26,6 +26,7 @@
 #include <linux/slab.h>
 #include <linux/hugetlb.h>
 #include <linux/list.h>
+#include <linux/iommu.h>
 
 #include <asm/tlbflush.h>
 #include <asm/kvm_ppc.h>
@@ -118,7 +119,7 @@ EXPORT_SYMBOL_GPL(kvmppc_emulated_put_tce);
 #ifdef CONFIG_KVM_BOOK3S_64_HV
 
 static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
-			unsigned long *pte_sizep)
+			unsigned long *pte_sizep, bool do_get_page)
 {
 	pte_t *ptep;
 	unsigned int shift = 0;
@@ -135,6 +136,14 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
 	if (!pte_present(*ptep))
 		return __pte(0);
 
+	/*
+	 * Put huge pages handling to the virtual mode.
+	 * The only exception is for TCE list pages which we
+	 * do need to call get_page() for.
+	 */
+	if ((*pte_sizep > PAGE_SIZE) && do_get_page)
+		return __pte(0);
+
 	/* wait until _PAGE_BUSY is clear then set it atomically */
 	__asm__ __volatile__ (
 		"1:	ldarx	%0,0,%3\n"
@@ -148,6 +157,18 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
 		: "cc");
 
 	ret = pte;
+	if (do_get_page && pte_present(pte) && (!writing || pte_write(pte))) {
+		struct page *pg = NULL;
+		pg = realmode_pfn_to_page(pte_pfn(pte));
+		if (realmode_get_page(pg)) {
+			ret = __pte(0);
+		} else {
+			pte = pte_mkyoung(pte);
+			if (writing)
+				pte = pte_mkdirty(pte);
+		}
+	}
+	*ptep = pte;	/* clears _PAGE_BUSY */
 
 	return ret;
 }
@@ -157,7 +178,7 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
  * Also returns pte and page size if the page is present in page table.
  */
 static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
-		unsigned long gpa)
+		unsigned long gpa, bool do_get_page)
 {
 	struct kvm_memory_slot *memslot;
 	pte_t pte;
@@ -175,7 +196,7 @@ static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
 
 	/* Find a PTE and determine the size */
 	pte = kvmppc_lookup_pte(vcpu->arch.pgdir, hva,
-			writing, &pg_size);
+			writing, &pg_size, do_get_page);
 	if (!pte)
 		return ERROR_ADDR;
 
@@ -188,6 +209,52 @@ static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
 	return hpa;
 }
 
+#ifdef CONFIG_IOMMU_API
+static long kvmppc_clear_tce_real_mode(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	long ret = 0, i;
+	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
+
+	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i) {
+		struct page *page;
+		unsigned long oldtce;
+
+		oldtce = iommu_clear_tce(tbl, entry + i);
+		if (!oldtce)
+			continue;
+
+		page = realmode_pfn_to_page(oldtce >> PAGE_SHIFT);
+		if (!page) {
+			ret = H_TOO_HARD;
+			break;
+		}
+
+		if (oldtce & TCE_PCI_WRITE)
+			SetPageDirty(page);
+
+		if (realmode_put_page(page)) {
+			ret = H_TOO_HARD;
+			break;
+		}
+	}
+
+	if (ret == H_TOO_HARD) {
+		vcpu->arch.tce_tmp_num = i;
+		vcpu->arch.tce_reason = H_TOO_HARD;
+	}
+	/* if (ret < 0)
+		pr_err("iommu_tce: %s failed ioba=%lx, tce_value=%lx ret=%d\n",
+				__func__, ioba, tce_value, ret); */
+
+	return ret;
+}
+#endif /* CONFIG_IOMMU_API */
+
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba, unsigned long tce)
 {
@@ -199,6 +266,52 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (!tt)
 		return H_TOO_HARD;
 
+#ifdef CONFIG_IOMMU_API
+	if (tt->grp) {
+		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+		/* Return error if the group is being destroyed */
+		if (!tbl)
+			return H_RESCINDED;
+
+		vcpu->arch.tce_reason = 0;
+
+		if (tce & (TCE_PCI_READ | TCE_PCI_WRITE)) {
+			unsigned long hpa, hva;
+
+			if (iommu_tce_put_param_check(tbl, ioba, tce))
+				return H_PARAMETER;
+
+			hpa = kvmppc_realmode_gpa_to_hpa(vcpu, tce, true);
+			if (hpa == ERROR_ADDR) {
+				vcpu->arch.tce_reason = H_TOO_HARD;
+				return H_TOO_HARD;
+			}
+
+			hva = (unsigned long) __va(hpa);
+			ret = iommu_tce_build(tbl,
+					ioba >> IOMMU_PAGE_SHIFT,
+					hva, iommu_tce_direction(hva));
+			if (unlikely(ret)) {
+				struct page *pg = realmode_pfn_to_page(hpa);
+				BUG_ON(!pg);
+				if (realmode_put_page(pg)) {
+					vcpu->arch.tce_reason = H_HARDWARE;
+					return H_TOO_HARD;
+				}
+				return H_HARDWARE;
+			}
+		} else {
+			ret = kvmppc_clear_tce_real_mode(vcpu, tbl, ioba, 0, 1);
+			if (ret)
+				return ret;
+		}
+
+		iommu_flush_tce(tbl);
+
+		return H_SUCCESS;
+	}
+#endif
 	/* Emulated IO */
 	if (ioba >= tt->window_size)
 		return H_PARAMETER;
@@ -235,10 +348,62 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	if (tce_list & ~IOMMU_PAGE_MASK)
 		return H_PARAMETER;
 
-	tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu, tce_list);
+	vcpu->arch.tce_tmp_num = 0;
+	vcpu->arch.tce_reason = 0;
+
+	tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu,
+			tce_list, false);
 	if ((unsigned long)tces == ERROR_ADDR)
 		return H_TOO_HARD;
 
+#ifdef CONFIG_IOMMU_API
+	if (tt->grp) {
+		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+		/* Return error if the group is being destroyed */
+		if (!tbl)
+			return H_RESCINDED;
+
+		/* Check all TCEs */
+		for (i = 0; i < npages; ++i) {
+			if (iommu_tce_put_param_check(tbl, ioba +
+					(i << IOMMU_PAGE_SHIFT), tces[i]))
+				return H_PARAMETER;
+			vcpu->arch.tce_tmp[i] = tces[i];
+		}
+
+		/* Translate TCEs and go get_page */
+		for (i = 0; i < npages; ++i) {
+			unsigned long hpa = kvmppc_realmode_gpa_to_hpa(vcpu,
+					vcpu->arch.tce_tmp[i], true);
+			if (hpa == ERROR_ADDR) {
+				vcpu->arch.tce_tmp_num = i;
+				vcpu->arch.tce_reason = H_TOO_HARD;
+				return H_TOO_HARD;
+			}
+			vcpu->arch.tce_tmp[i] = hpa;
+		}
+
+		/* Put TCEs to the table */
+		for (i = 0; i < npages; ++i) {
+			unsigned long hva = (unsigned long)
+					__va(vcpu->arch.tce_tmp[i]);
+
+			ret = iommu_tce_build(tbl,
+					(ioba >> IOMMU_PAGE_SHIFT) + i,
+					hva, iommu_tce_direction(hva));
+			if (ret) {
+				/* All wrong, go virtmode and do cleanup */
+				vcpu->arch.tce_reason = H_HARDWARE;
+				return H_TOO_HARD;
+			}
+		}
+
+		iommu_flush_tce(tbl);
+
+		return H_SUCCESS;
+	}
+#endif
 	/* Emulated IO */
 	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
 		return H_PARAMETER;
@@ -268,6 +433,26 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (!tt)
 		return H_TOO_HARD;
 
+#ifdef CONFIG_IOMMU_API
+	if (tt->grp) {
+		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+		/* Return error if the group is being destroyed */
+		if (!tbl)
+			return H_RESCINDED;
+
+		vcpu->arch.tce_reason = 0;
+
+		ret = kvmppc_clear_tce_real_mode(vcpu, tbl, ioba,
+				tce_value, npages);
+		if (ret)
+			return ret;
+
+		iommu_flush_tce(tbl);
+
+		return H_SUCCESS;
+	}
+#endif
 	/* Emulated IO */
 	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
 		return H_PARAMETER;
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 8465c2a..da6bf61 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -396,6 +396,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 		break;
 #endif
 	case KVM_CAP_SPAPR_MULTITCE:
+	case KVM_CAP_SPAPR_TCE_IOMMU:
 		r = 1;
 		break;
 	default:
@@ -1025,6 +1026,17 @@ long kvm_arch_vm_ioctl(struct file *filp,
 		r = kvm_vm_ioctl_create_spapr_tce(kvm, &create_tce);
 		goto out;
 	}
+	case KVM_CREATE_SPAPR_TCE_IOMMU: {
+		struct kvm_create_spapr_tce_iommu create_tce_iommu;
+		struct kvm *kvm = filp->private_data;
+
+		r = -EFAULT;
+		if (copy_from_user(&create_tce_iommu, argp,
+				sizeof(create_tce_iommu)))
+			goto out;
+		r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm, &create_tce_iommu);
+		goto out;
+	}
 #endif /* CONFIG_PPC_BOOK3S_64 */
 
 #ifdef CONFIG_KVM_BOOK3S_64_HV
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 5a2afda..450c82a 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_RTAS 91
 #define KVM_CAP_IRQ_XICS 92
 #define KVM_CAP_SPAPR_MULTITCE (0x110000 + 89)
+#define KVM_CAP_SPAPR_TCE_IOMMU (0x110000 + 90)
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
 #define KVM_GET_DEVICE_ATTR	  _IOW(KVMIO,  0xe2, struct kvm_device_attr)
 #define KVM_HAS_DEVICE_ATTR	  _IOW(KVMIO,  0xe3, struct kvm_device_attr)
 
+/* ioctl for SPAPR TCE IOMMU */
+#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xe4, struct kvm_create_spapr_tce_iommu)
+
 /*
  * ioctls for vcpu fds
  */
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 4/4] KVM: PPC: Add hugepage support for IOMMU in-kernel handling
  2013-05-21  3:06 [PATCH 0/4 v2] KVM: PPC: IOMMU in-kernel handling Alexey Kardashevskiy
                   ` (2 preceding siblings ...)
  2013-05-21  3:06 ` [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling Alexey Kardashevskiy
@ 2013-05-21  3:06 ` Alexey Kardashevskiy
  3 siblings, 0 replies; 36+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-21  3:06 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kvm, Alexey Kardashevskiy, Alexander Graf, kvm-ppc, linux-kernel,
	Paul Mackerras, David Gibson

This adds special support for huge pages (16MB).  The reference
counting cannot be easily done for such pages in real mode (when
MMU is off) so we added a list of huge pages.  It is populated in
virtual mode and get_page is called just once per a huge page.
Real mode handlers check if the requested page is huge and in the list,
then no reference counting is done, otherwise an exit to virtual mode
happens.  The list is released at KVM exit.  At the moment the fastest
card available for tests uses up to 9 huge pages so walking through this
list is not very expensive.  However this can change and we may want
to optimize this.

Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>

---

Changes:
* the real mode handler now searches for a huge page by gpa (used to be pte)
* the virtual mode handler prints warning if it is called twice for the same
huge page as the real mode handler is expected to fail just once - when a huge
page is not in the list yet.
* the huge page is refcounted twice - when added to the hugepage list and
when used in the virtual mode hcall handler (can be optimized but it will
make the patch less nice).
---
 arch/powerpc/include/asm/kvm_host.h |    2 +
 arch/powerpc/include/asm/kvm_ppc.h  |   22 +++++++++
 arch/powerpc/kvm/book3s_64_vio.c    |   88 +++++++++++++++++++++++++++++++++--
 arch/powerpc/kvm/book3s_64_vio_hv.c |   40 ++++++++++++++--
 4 files changed, 146 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index ac0e2fe..4fc0865 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -181,6 +181,8 @@ struct kvmppc_spapr_tce_table {
 	u64 liobn;
 	u32 window_size;
 	struct iommu_group *grp;    /* used for IOMMU groups */
+	struct list_head hugepages; /* used for IOMMU groups */
+	spinlock_t hugepages_lock;  /* used for IOMMU groups */
 	struct page *pages[0];
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 934e01d..9054df0 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -149,6 +149,28 @@ extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
 		unsigned long liobn, unsigned long ioba,
 		unsigned long tce_value, unsigned long npages);
+
+/*
+ * The KVM guest can be backed with 16MB pages (qemu switch
+ * -mem-path /var/lib/hugetlbfs/global/pagesize-16MB/).
+ * In this case, we cannot do page counting from the real mode
+ * as the compound pages are used - they are linked in a list
+ * with pointers as virtual addresses which are inaccessible
+ * in real mode.
+ *
+ * The code below keeps a 16MB pages list and uses page struct
+ * in real mode if it is already locked in RAM and inserted into
+ * the list or switches to the virtual mode where it can be
+ * handled in a usual manner.
+ */
+struct kvmppc_iommu_hugepage {
+	struct list_head list;
+	pte_t pte;		/* Huge page PTE */
+	unsigned long gpa;	/* Guest physical address */
+	struct page *page;	/* page struct of the very first subpage */
+	unsigned long size;	/* Huge page size (always 16MB at the moment) */
+};
+
 extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
 				struct kvm_allocate_rma *rma);
 extern struct kvmppc_linear_info *kvm_alloc_rma(void);
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index ffb4698..c34d63a 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -45,6 +45,71 @@
 #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
 #define ERROR_ADDR      ((void *)~(unsigned long)0x0)
 
+#ifdef CONFIG_IOMMU_API
+/* Adds a new huge page descriptor to the list  */
+static long kvmppc_iommu_hugepage_try_add(
+		struct kvmppc_spapr_tce_table *tt,
+		pte_t pte, unsigned long hva, unsigned long gpa,
+		unsigned long pg_size)
+{
+	long ret = 0;
+	struct kvmppc_iommu_hugepage *hp;
+	struct page *p;
+
+	spin_lock(&tt->hugepages_lock);
+	list_for_each_entry(hp, &tt->hugepages, list) {
+		if (hp->pte == pte)
+			goto unlock_exit;
+	}
+
+	hva = hva & ~(pg_size - 1);
+	ret = get_user_pages_fast(hva, 1, true/*write*/, &p);
+	if ((ret != 1) || !p) {
+		ret = -EFAULT;
+		goto unlock_exit;
+	}
+	ret = 0;
+
+	hp = kzalloc(sizeof(*hp), GFP_KERNEL);
+	if (!hp) {
+		ret = -ENOMEM;
+		goto unlock_exit;
+	}
+
+	hp->page = p;
+	hp->pte = pte;
+	hp->gpa = gpa & ~(pg_size - 1);
+	hp->size = pg_size;
+
+	list_add(&hp->list, &tt->hugepages);
+
+unlock_exit:
+	spin_unlock(&tt->hugepages_lock);
+
+	return ret;
+}
+
+static void kvmppc_iommu_hugepages_init(struct kvmppc_spapr_tce_table *tt)
+{
+	INIT_LIST_HEAD(&tt->hugepages);
+	spin_lock_init(&tt->hugepages_lock);
+}
+
+static void kvmppc_iommu_hugepages_cleanup(struct kvmppc_spapr_tce_table *tt)
+{
+	struct kvmppc_iommu_hugepage *hp, *tmp;
+
+	spin_lock(&tt->hugepages_lock);
+	list_for_each_entry_safe(hp, tmp, &tt->hugepages, list) {
+		list_del(&hp->list);
+		put_page(hp->page); /* one for iommu_put_tce_user_mode */
+		put_page(hp->page); /* one for kvmppc_iommu_hugepage_try_add */
+		kfree(hp);
+	}
+	spin_unlock(&tt->hugepages_lock);
+}
+#endif /* CONFIG_IOMMU_API */
+
 static long kvmppc_stt_npages(unsigned long window_size)
 {
 	return ALIGN((window_size >> SPAPR_TCE_SHIFT)
@@ -61,6 +126,7 @@ static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
 #ifdef CONFIG_IOMMU_API
 	if (stt->grp) {
 		iommu_group_put(stt->grp);
+		kvmppc_iommu_hugepages_cleanup(stt);
 	} else
 #endif
 		for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
@@ -198,6 +264,7 @@ long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
 	kvm_get_kvm(kvm);
 
 	mutex_lock(&kvm->lock);
+	kvmppc_iommu_hugepages_init(tt);
 	list_add(&tt->list, &kvm->arch.spapr_tce_tables);
 
 	mutex_unlock(&kvm->lock);
@@ -218,16 +285,31 @@ long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
 
 /* Converts guest physical address into host virtual */
 static void __user *kvmppc_virtmode_gpa_to_hva(struct kvm_vcpu *vcpu,
+		struct kvmppc_spapr_tce_table *tt,
 		unsigned long gpa)
 {
 	unsigned long hva, gfn = gpa >> PAGE_SHIFT;
 	struct kvm_memory_slot *memslot;
+	pte_t *ptep;
+	unsigned int shift = 0;
 
 	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
 	if (!memslot)
 		return ERROR_ADDR;
 
 	hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa & ~PAGE_MASK);
+
+	ptep = find_linux_pte_or_hugepte(vcpu->arch.pgdir, hva, &shift);
+	WARN_ON(!ptep);
+	if (!ptep)
+		return ERROR_ADDR;
+
+	if (tt && (shift > PAGE_SHIFT)) {
+		if (kvmppc_iommu_hugepage_try_add(tt, *ptep,
+				hva, gpa, 1 << shift))
+			return ERROR_ADDR;
+	}
+
 	return (void *) hva;
 }
 
@@ -267,7 +349,7 @@ long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
 			if (iommu_tce_put_param_check(tbl, ioba, tce))
 				return H_PARAMETER;
 
-			hva = kvmppc_virtmode_gpa_to_hva(vcpu, tce);
+			hva = kvmppc_virtmode_gpa_to_hva(vcpu, tt, tce);
 			if (hva == ERROR_ADDR)
 				return H_HARDWARE;
 
@@ -319,7 +401,7 @@ long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	if (tce_list & ~IOMMU_PAGE_MASK)
 		return H_PARAMETER;
 
-	tces = kvmppc_virtmode_gpa_to_hva(vcpu, tce_list);
+	tces = kvmppc_virtmode_gpa_to_hva(vcpu, NULL, tce_list);
 	if (tces == ERROR_ADDR)
 		return H_TOO_HARD;
 
@@ -354,7 +436,7 @@ long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 
 		/* Translate TCEs */
 		for (i = vcpu->arch.tce_tmp_num; i < npages; ++i) {
-			void *hva = kvmppc_virtmode_gpa_to_hva(vcpu,
+			void *hva = kvmppc_virtmode_gpa_to_hva(vcpu, tt,
 					vcpu->arch.tce_tmp[i]);
 
 			if (hva == ERROR_ADDR)
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index dc4ae32..6245365 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -178,6 +178,7 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
  * Also returns pte and page size if the page is present in page table.
  */
 static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
+		struct kvmppc_spapr_tce_table *tt,
 		unsigned long gpa, bool do_get_page)
 {
 	struct kvm_memory_slot *memslot;
@@ -185,7 +186,31 @@ static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
 	unsigned long hva, hpa, pg_size = 0, offset;
 	unsigned long gfn = gpa >> PAGE_SHIFT;
 	bool writing = gpa & TCE_PCI_WRITE;
+	struct kvmppc_iommu_hugepage *hp;
 
+	/*
+	 * Try to find an already used hugepage.
+	 * If it is not there, the kvmppc_lookup_pte() will return zero
+	 * as it won't do get_page() on a huge page in real mode
+	 * and therefore the request will be passed to the virtual mode.
+	 */
+	if (tt) {
+		spin_lock(&tt->hugepages_lock);
+		list_for_each_entry(hp, &tt->hugepages, list) {
+			if ((gpa < hp->gpa) || (gpa >= hp->gpa + hp->size))
+				continue;
+
+			/* Calculate host phys address keeping flags and offset in the page */
+			offset = gpa & (hp->size - 1);
+
+			/* pte_pfn(pte) should return an address aligned to pg_size */
+			hpa = (pte_pfn(hp->pte) << PAGE_SHIFT) + offset;
+			spin_unlock(&tt->hugepages_lock);
+
+			return hpa;
+		}
+		spin_unlock(&tt->hugepages_lock);
+	}
 	/* Find a KVM memslot */
 	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
 	if (!memslot)
@@ -237,6 +262,10 @@ static long kvmppc_clear_tce_real_mode(struct kvm_vcpu *vcpu,
 		if (oldtce & TCE_PCI_WRITE)
 			SetPageDirty(page);
 
+		/* Do not put a huge page and continue without error */
+		if (PageCompound(page))
+			continue;
+
 		if (realmode_put_page(page)) {
 			ret = H_TOO_HARD;
 			break;
@@ -282,7 +311,7 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 			if (iommu_tce_put_param_check(tbl, ioba, tce))
 				return H_PARAMETER;
 
-			hpa = kvmppc_realmode_gpa_to_hpa(vcpu, tce, true);
+			hpa = kvmppc_realmode_gpa_to_hpa(vcpu, tt, tce, true);
 			if (hpa == ERROR_ADDR) {
 				vcpu->arch.tce_reason = H_TOO_HARD;
 				return H_TOO_HARD;
@@ -295,6 +324,11 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 			if (unlikely(ret)) {
 				struct page *pg = realmode_pfn_to_page(hpa);
 				BUG_ON(!pg);
+
+				/* Do not put a huge page and return an error */
+				if (!PageCompound(pg))
+					return H_HARDWARE;
+
 				if (realmode_put_page(pg)) {
 					vcpu->arch.tce_reason = H_HARDWARE;
 					return H_TOO_HARD;
@@ -351,7 +385,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	vcpu->arch.tce_tmp_num = 0;
 	vcpu->arch.tce_reason = 0;
 
-	tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu,
+	tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu, NULL,
 			tce_list, false);
 	if ((unsigned long)tces == ERROR_ADDR)
 		return H_TOO_HARD;
@@ -374,7 +408,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 
 		/* Translate TCEs and go get_page */
 		for (i = 0; i < npages; ++i) {
-			unsigned long hpa = kvmppc_realmode_gpa_to_hpa(vcpu,
+			unsigned long hpa = kvmppc_realmode_gpa_to_hpa(vcpu, tt,
 					vcpu->arch.tce_tmp[i], true);
 			if (hpa == ERROR_ADDR) {
 				vcpu->arch.tce_tmp_num = i;
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-05-21  3:06 ` [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling Alexey Kardashevskiy
@ 2013-05-22 21:06   ` Scott Wood
  2013-05-25  2:45     ` David Gibson
  0 siblings, 1 reply; 36+ messages in thread
From: Scott Wood @ 2013-05-22 21:06 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Alexey Kardashevskiy, linux-kernel, kvm-ppc, Alexander Graf,
	Paul Mackerras, linuxppc-dev, David Gibson

On 05/20/2013 10:06:46 PM, Alexey Kardashevskiy wrote:
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index 8465c2a..da6bf61 100644
> --- a/arch/powerpc/kvm/powerpc.c
> @@ -396,6 +396,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> +++ b/arch/powerpc/kvm/powerpc.c
>  		break;
>  #endif
>  	case KVM_CAP_SPAPR_MULTITCE:
> +	case KVM_CAP_SPAPR_TCE_IOMMU:
>  		r =3D 1;
>  		break;
>  	default:

Don't advertise SPAPR capabilities if it's not book3s -- and probably =20
there's some additional limitation that would be appropriate.

> @@ -1025,6 +1026,17 @@ long kvm_arch_vm_ioctl(struct file *filp,
>  		r =3D kvm_vm_ioctl_create_spapr_tce(kvm, &create_tce);
>  		goto out;
>  	}
> +	case KVM_CREATE_SPAPR_TCE_IOMMU: {
> +		struct kvm_create_spapr_tce_iommu create_tce_iommu;
> +		struct kvm *kvm =3D filp->private_data;
> +
> +		r =3D -EFAULT;
> +		if (copy_from_user(&create_tce_iommu, argp,
> +				sizeof(create_tce_iommu)))
> +			goto out;
> +		r =3D kvm_vm_ioctl_create_spapr_tce_iommu(kvm, =20
> &create_tce_iommu);
> +		goto out;
> +	}
>  #endif /* CONFIG_PPC_BOOK3S_64 */
>=20
>  #ifdef CONFIG_KVM_BOOK3S_64_HV
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 5a2afda..450c82a 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info {
>  #define KVM_CAP_PPC_RTAS 91
>  #define KVM_CAP_IRQ_XICS 92
>  #define KVM_CAP_SPAPR_MULTITCE (0x110000 + 89)
> +#define KVM_CAP_SPAPR_TCE_IOMMU (0x110000 + 90)

Hmm...

> @@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
>  #define KVM_GET_DEVICE_ATTR	  _IOW(KVMIO,  0xe2, struct =20
> kvm_device_attr)
>  #define KVM_HAS_DEVICE_ATTR	  _IOW(KVMIO,  0xe3, struct =20
> kvm_device_attr)
>=20
> +/* ioctl for SPAPR TCE IOMMU */
> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xe4, struct =20
> kvm_create_spapr_tce_iommu)

Shouldn't this go under the vm ioctl section?

-Scott=

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-05-22 21:06   ` Scott Wood
@ 2013-05-25  2:45     ` David Gibson
  2013-05-27  2:44       ` Alexey Kardashevskiy
                         ` (2 more replies)
  0 siblings, 3 replies; 36+ messages in thread
From: David Gibson @ 2013-05-25  2:45 UTC (permalink / raw)
  To: Scott Wood
  Cc: kvm, Alexey Kardashevskiy, linux-kernel, kvm-ppc, Alexander Graf,
	Paul Mackerras, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 3168 bytes --]

On Wed, May 22, 2013 at 04:06:57PM -0500, Scott Wood wrote:
> On 05/20/2013 10:06:46 PM, Alexey Kardashevskiy wrote:
> >diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> >index 8465c2a..da6bf61 100644
> >--- a/arch/powerpc/kvm/powerpc.c
> >@@ -396,6 +396,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> >+++ b/arch/powerpc/kvm/powerpc.c
> > 		break;
> > #endif
> > 	case KVM_CAP_SPAPR_MULTITCE:
> >+	case KVM_CAP_SPAPR_TCE_IOMMU:
> > 		r = 1;
> > 		break;
> > 	default:
> 
> Don't advertise SPAPR capabilities if it's not book3s -- and
> probably there's some additional limitation that would be
> appropriate.

So, in the case of MULTITCE, that's not quite right.  PR KVM can
emulate a PAPR system on a BookE machine, and there's no reason not to
allow TCE acceleration as well.  We can't make it dependent on PAPR
mode being selected, because that's enabled per-vcpu, whereas these
capabilities are queried on the VM before the vcpus are created.

CAP_SPAPR_TCE_IOMMU should be dependent on the presence of suitable
host side hardware (i.e. a PAPR style IOMMU), though.

> 
> >@@ -1025,6 +1026,17 @@ long kvm_arch_vm_ioctl(struct file *filp,
> > 		r = kvm_vm_ioctl_create_spapr_tce(kvm, &create_tce);
> > 		goto out;
> > 	}
> >+	case KVM_CREATE_SPAPR_TCE_IOMMU: {
> >+		struct kvm_create_spapr_tce_iommu create_tce_iommu;
> >+		struct kvm *kvm = filp->private_data;
> >+
> >+		r = -EFAULT;
> >+		if (copy_from_user(&create_tce_iommu, argp,
> >+				sizeof(create_tce_iommu)))
> >+			goto out;
> >+		r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm,
> >&create_tce_iommu);
> >+		goto out;
> >+	}
> > #endif /* CONFIG_PPC_BOOK3S_64 */
> >
> > #ifdef CONFIG_KVM_BOOK3S_64_HV
> >diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >index 5a2afda..450c82a 100644
> >--- a/include/uapi/linux/kvm.h
> >+++ b/include/uapi/linux/kvm.h
> >@@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info {
> > #define KVM_CAP_PPC_RTAS 91
> > #define KVM_CAP_IRQ_XICS 92
> > #define KVM_CAP_SPAPR_MULTITCE (0x110000 + 89)
> >+#define KVM_CAP_SPAPR_TCE_IOMMU (0x110000 + 90)
> 
> Hmm...

Ah, yeah, that needs to be fixed.  Those were interim numbers so that
we didn't have to keep changing our internal trees as new upstream
ioctls got added to the list.  We need to get a proper number for the
merge, though.

> >@@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
> > #define KVM_GET_DEVICE_ATTR	  _IOW(KVMIO,  0xe2, struct
> >kvm_device_attr)
> > #define KVM_HAS_DEVICE_ATTR	  _IOW(KVMIO,  0xe3, struct
> >kvm_device_attr)
> >
> >+/* ioctl for SPAPR TCE IOMMU */
> >+#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xe4, struct
> >kvm_create_spapr_tce_iommu)
> 
> Shouldn't this go under the vm ioctl section?
> 
> -Scott
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-05-25  2:45     ` David Gibson
@ 2013-05-27  2:44       ` Alexey Kardashevskiy
  2013-05-28 17:45         ` Scott Wood
  2013-05-27 10:23       ` Paolo Bonzini
  2013-05-28 16:32       ` Scott Wood
  2 siblings, 1 reply; 36+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-27  2:44 UTC (permalink / raw)
  To: Scott Wood
  Cc: kvm, Alexander Graf, kvm-ppc, linux-kernel, Paul Mackerras,
	linuxppc-dev, David Gibson

On 05/25/2013 12:45 PM, David Gibson wrote:
> On Wed, May 22, 2013 at 04:06:57PM -0500, Scott Wood wrote:
>> On 05/20/2013 10:06:46 PM, Alexey Kardashevskiy wrote:
>>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>>> index 8465c2a..da6bf61 100644
>>> --- a/arch/powerpc/kvm/powerpc.c
>>> @@ -396,6 +396,7 @@ int kvm_dev_ioctl_check_extension(long ext)
>>> +++ b/arch/powerpc/kvm/powerpc.c
>>> 		break;
>>> #endif
>>> 	case KVM_CAP_SPAPR_MULTITCE:
>>> +	case KVM_CAP_SPAPR_TCE_IOMMU:
>>> 		r = 1;
>>> 		break;
>>> 	default:
>>
>> Don't advertise SPAPR capabilities if it's not book3s -- and
>> probably there's some additional limitation that would be
>> appropriate.
> 
> So, in the case of MULTITCE, that's not quite right.  PR KVM can
> emulate a PAPR system on a BookE machine, and there's no reason not to
> allow TCE acceleration as well.  We can't make it dependent on PAPR
> mode being selected, because that's enabled per-vcpu, whereas these
> capabilities are queried on the VM before the vcpus are created.
> 
> CAP_SPAPR_TCE_IOMMU should be dependent on the presence of suitable
> host side hardware (i.e. a PAPR style IOMMU), though.


The capability says that the ioctl is supported. If there is no IOMMU group
registered, than it will fail with a reasonable error and nobody gets hurt.
What is the problem?


>>
>>> @@ -1025,6 +1026,17 @@ long kvm_arch_vm_ioctl(struct file *filp,
>>> 		r = kvm_vm_ioctl_create_spapr_tce(kvm, &create_tce);
>>> 		goto out;
>>> 	}
>>> +	case KVM_CREATE_SPAPR_TCE_IOMMU: {
>>> +		struct kvm_create_spapr_tce_iommu create_tce_iommu;
>>> +		struct kvm *kvm = filp->private_data;
>>> +
>>> +		r = -EFAULT;
>>> +		if (copy_from_user(&create_tce_iommu, argp,
>>> +				sizeof(create_tce_iommu)))
>>> +			goto out;
>>> +		r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm,
>>> &create_tce_iommu);
>>> +		goto out;
>>> +	}
>>> #endif /* CONFIG_PPC_BOOK3S_64 */
>>>
>>> #ifdef CONFIG_KVM_BOOK3S_64_HV
>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>>> index 5a2afda..450c82a 100644
>>> --- a/include/uapi/linux/kvm.h
>>> +++ b/include/uapi/linux/kvm.h
>>> @@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info {
>>> #define KVM_CAP_PPC_RTAS 91
>>> #define KVM_CAP_IRQ_XICS 92
>>> #define KVM_CAP_SPAPR_MULTITCE (0x110000 + 89)
>>> +#define KVM_CAP_SPAPR_TCE_IOMMU (0x110000 + 90)
>>
>> Hmm...
> 
> Ah, yeah, that needs to be fixed.  Those were interim numbers so that
> we didn't have to keep changing our internal trees as new upstream
> ioctls got added to the list.  We need to get a proper number for the
> merge, though.
>
>>> @@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
>>> #define KVM_GET_DEVICE_ATTR	  _IOW(KVMIO,  0xe2, struct
>>> kvm_device_attr)
>>> #define KVM_HAS_DEVICE_ATTR	  _IOW(KVMIO,  0xe3, struct
>>> kvm_device_attr)
>>>
>>> +/* ioctl for SPAPR TCE IOMMU */
>>> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xe4, struct
>>> kvm_create_spapr_tce_iommu)
>>
>> Shouldn't this go under the vm ioctl section?


The KVM_CREATE_SPAPR_TCE_IOMMU ioctl (the version for emulated devices) is
in this section so I decided to keep them together. Wrong?


-- 
Alexey

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls
  2013-05-21  3:06 ` [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls Alexey Kardashevskiy
@ 2013-05-27 10:08   ` Paolo Bonzini
  2013-05-27 14:33     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 36+ messages in thread
From: Paolo Bonzini @ 2013-05-27 10:08 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, linux-kernel, kvm-ppc, Alexander Graf, Paul Mackerras,
	linuxppc-dev, David Gibson

Il 21/05/2013 05:06, Alexey Kardashevskiy ha scritto:
> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
> H_STUFF_TCE hypercalls for QEMU emulated devices such as virtio
> devices or emulated PCI.

Do you mean vio?  virtio (without getting into whether that's good or
bad) will bypass the iommu.

Paolo

> These calls allow adding multiple entries
> (up to 512) into the TCE table in one call which saves time on
> transition to/from real mode.
> 
> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
> (copied from user and verified) before writing the whole list into
> the TCE table. This cache will be utilized more in the upcoming
> VFIO/IOMMU support to continue TCE list processing in the virtual
> mode in the case if the real mode handler failed for some reason.
> 
> This adds a guest physical to host real address converter
> and calls the existing H_PUT_TCE handler. The converting function
> is going to be fully utilized by upcoming VFIO supporting patches.
> 
> This also implements the KVM_CAP_PPC_MULTITCE capability,
> so in order to support the functionality of this patch, QEMU
> needs to query for this capability and set the "hcall-multi-tce"
> hypertas property only if the capability is present, otherwise
> there will be serious performance degradation.
> 
> Cc: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Signed-off-by: Paul Mackerras <paulus@samba.org>
> 
> ---
> Changelog:
> * added kvm_vcpu_arch::tce_tmp
> * removed cleanup if put_indirect failed, instead we do not even start
> writing to TCE table if we cannot get TCEs from the user and they are
> invalid
> * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
> and kvmppc_emulated_validate_tce (for the previous item)
> * fixed bug with failthrough for H_IPI
> * removed all get_user() from real mode handlers
> * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
> ---
>  Documentation/virtual/kvm/api.txt       |   14 ++
>  arch/powerpc/include/asm/kvm_host.h     |    2 +
>  arch/powerpc/include/asm/kvm_ppc.h      |   16 +-
>  arch/powerpc/kvm/book3s_64_vio.c        |  118 ++++++++++++++
>  arch/powerpc/kvm/book3s_64_vio_hv.c     |  266 +++++++++++++++++++++++++++----
>  arch/powerpc/kvm/book3s_hv.c            |   39 +++++
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S |    6 +
>  arch/powerpc/kvm/book3s_pr_papr.c       |   37 ++++-
>  arch/powerpc/kvm/powerpc.c              |    3 +
>  include/uapi/linux/kvm.h                |    1 +
>  10 files changed, 470 insertions(+), 32 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index 5f91eda..3c7c7ea 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -2780,3 +2780,17 @@ Parameters: args[0] is the XICS device fd
>              args[1] is the XICS CPU number (server ID) for this vcpu
>  
>  This capability connects the vcpu to an in-kernel XICS device.
> +
> +6.8 KVM_CAP_PPC_MULTITCE
> +
> +Architectures: ppc
> +Parameters: none
> +Returns: 0 on success; -1 on error
> +
> +This capability enables the guest to put/remove multiple TCE entries
> +per hypercall which significanly accelerates DMA operations for PPC KVM
> +guests.
> +
> +When this capability is enabled, H_PUT_TCE_INDIRECT and H_STUFF_TCE are
> +expected to occur rather than H_PUT_TCE which supports only one TCE entry
> +per call.
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index af326cd..85d8f26 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -609,6 +609,8 @@ struct kvm_vcpu_arch {
>  	spinlock_t tbacct_lock;
>  	u64 busy_stolen;
>  	u64 busy_preempt;
> +
> +	unsigned long *tce_tmp;    /* TCE cache for TCE_PUT_INDIRECT hall */
>  #endif
>  };
>  
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> index a5287fe..e852921b 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -133,8 +133,20 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
>  
>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  				struct kvm_create_spapr_tce *args);
> -extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> -			     unsigned long ioba, unsigned long tce);
> +extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
> +		struct kvm_vcpu *vcpu, unsigned long liobn);
> +extern long kvmppc_emulated_validate_tce(unsigned long tce);
> +extern void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
> +		unsigned long ioba, unsigned long tce);
> +extern long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce);
> +extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_list, unsigned long npages);
> +extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_value, unsigned long npages);
>  extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
>  				struct kvm_allocate_rma *rma);
>  extern struct kvmppc_linear_info *kvm_alloc_rma(void);
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> index b2d3f3b..06b7b20 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -14,6 +14,7 @@
>   *
>   * Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
>   * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com>
> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com>
>   */
>  
>  #include <linux/types.h>
> @@ -36,8 +37,11 @@
>  #include <asm/ppc-opcode.h>
>  #include <asm/kvm_host.h>
>  #include <asm/udbg.h>
> +#include <asm/iommu.h>
> +#include <asm/tce.h>
>  
>  #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
> +#define ERROR_ADDR      ((void *)~(unsigned long)0x0)
>  
>  static long kvmppc_stt_npages(unsigned long window_size)
>  {
> @@ -148,3 +152,117 @@ fail:
>  	}
>  	return ret;
>  }
> +
> +/* Converts guest physical address into host virtual */
> +static void __user *kvmppc_virtmode_gpa_to_hva(struct kvm_vcpu *vcpu,
> +		unsigned long gpa)
> +{
> +	unsigned long hva, gfn = gpa >> PAGE_SHIFT;
> +	struct kvm_memory_slot *memslot;
> +
> +	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
> +	if (!memslot)
> +		return ERROR_ADDR;
> +
> +	hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa & ~PAGE_MASK);
> +	return (void *) hva;
> +}
> +
> +long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce)
> +{
> +	long ret;
> +	struct kvmppc_spapr_tce_table *tt;
> +
> +	tt = kvmppc_find_tce_table(vcpu, liobn);
> +	/* Didn't find the liobn, put it to userspace */
> +	if (!tt)
> +		return H_TOO_HARD;
> +
> +	/* Emulated IO */
> +	if (ioba >= tt->window_size)
> +		return H_PARAMETER;
> +
> +	ret = kvmppc_emulated_validate_tce(tce);
> +	if (ret)
> +		return ret;
> +
> +	kvmppc_emulated_put_tce(tt, ioba, tce);
> +
> +	return H_SUCCESS;
> +}
> +
> +long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_list, unsigned long npages)
> +{
> +	struct kvmppc_spapr_tce_table *tt;
> +	long i, ret;
> +	unsigned long __user *tces;
> +
> +	tt = kvmppc_find_tce_table(vcpu, liobn);
> +	/* Didn't find the liobn, put it to userspace */
> +	if (!tt)
> +		return H_TOO_HARD;
> +
> +	/*
> +	 * The spec says that the maximum size of the list is 512 TCEs so
> +	 * so the whole table addressed resides in 4K page
> +	 */
> +	if (npages > 512)
> +		return H_PARAMETER;
> +
> +	if (tce_list & ~IOMMU_PAGE_MASK)
> +		return H_PARAMETER;
> +
> +	tces = kvmppc_virtmode_gpa_to_hva(vcpu, tce_list);
> +	if (tces == ERROR_ADDR)
> +		return H_TOO_HARD;
> +
> +	/* Emulated IO */
> +	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
> +		return H_PARAMETER;
> +
> +	for (i = 0; i < npages; ++i) {
> +		if (get_user(vcpu->arch.tce_tmp[i], tces + i))
> +			return H_PARAMETER;
> +
> +		ret = kvmppc_emulated_validate_tce(vcpu->arch.tce_tmp[i]);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	for (i = 0; i < npages; ++i)
> +		kvmppc_emulated_put_tce(tt,
> +				ioba + (i << IOMMU_PAGE_SHIFT),
> +				vcpu->arch.tce_tmp[i]);
> +
> +	return H_SUCCESS;
> +}
> +
> +long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_value, unsigned long npages)
> +{
> +	struct kvmppc_spapr_tce_table *tt;
> +	long i, ret;
> +
> +	tt = kvmppc_find_tce_table(vcpu, liobn);
> +	/* Didn't find the liobn, put it to userspace */
> +	if (!tt)
> +		return H_TOO_HARD;
> +
> +	/* Emulated IO */
> +	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
> +		return H_PARAMETER;
> +
> +	ret = kvmppc_emulated_validate_tce(tce_value);
> +	if (ret || (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ)))
> +		return H_PARAMETER;
> +
> +	for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
> +		kvmppc_emulated_put_tce(tt, ioba, tce_value);
> +
> +	return H_SUCCESS;
> +}
> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> index 30c2f3b..c68d538 100644
> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -14,6 +14,7 @@
>   *
>   * Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
>   * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com>
> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com>
>   */
>  
>  #include <linux/types.h>
> @@ -35,42 +36,249 @@
>  #include <asm/ppc-opcode.h>
>  #include <asm/kvm_host.h>
>  #include <asm/udbg.h>
> +#include <asm/iommu.h>
> +#include <asm/tce.h>
>  
>  #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
> +#define ERROR_ADDR      (~(unsigned long)0x0)
>  
> -/* WARNING: This will be called in real-mode on HV KVM and virtual
> - *          mode on PR KVM
> +/* Finds a TCE table descriptor by LIOBN */
> +struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm_vcpu *vcpu,
> +		unsigned long liobn)
> +{
> +	struct kvmppc_spapr_tce_table *tt;
> +
> +	list_for_each_entry(tt, &vcpu->kvm->arch.spapr_tce_tables, list) {
> +		if (tt->liobn == liobn)
> +			return tt;
> +	}
> +
> +	return NULL;
> +}
> +EXPORT_SYMBOL_GPL(kvmppc_find_tce_table);
> +
> +/*
> + * Validate TCE address.
> + * At the moment only flags are validated
> + * as other check will significantly slow down
> + * or can make it even impossible to handle TCE requests
> + * in real mode.
> + */
> +long kvmppc_emulated_validate_tce(unsigned long tce)
> +{
> +	if (tce & ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ))
> +		return H_PARAMETER;
> +
> +	return H_SUCCESS;
> +}
> +EXPORT_SYMBOL_GPL(kvmppc_emulated_validate_tce);
> +
> +/*
> + * kvmppc_emulated_put_tce() handles TCE requests for devices emulated
> + * by QEMU. It puts guest TCE values into the table and expects
> + * the QEMU to convert them later in the QEMU device implementation.
> + * Wiorks in both real and virtual modes.
> + * It cannot fail so kvmppc_emulated_validate_tce must be called before it.
>   */
> +void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
> +		unsigned long ioba, unsigned long tce)
> +{
> +	unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
> +	struct page *page;
> +	u64 *tbl;
> +
> +	/*
> +	 * Note on the use of page_address() in real mode,
> +	 *
> +	 * It is safe to use page_address() in real mode on ppc64 because
> +	 * page_address() is always defined as lowmem_page_address()
> +	 * which returns __va(PFN_PHYS(page_to_pfn(page))) which is arithmetial
> +	 * operation and does not access page struct.
> +	 *
> +	 * Theoretically page_address() could be defined different
> +	 * but either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL
> +	 * should be enabled.
> +	 * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64,
> +	 * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only
> +	 * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP
> +	 * is not expected to be enabled on ppc32, page_address()
> +	 * is safe for ppc32 as well.
> +	 */
> +#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
> +#error TODO: fix to avoid page_address() here
> +#endif
> +	page = tt->pages[idx / TCES_PER_PAGE];
> +	tbl = (u64 *)page_address(page);
> +
> +	/* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
> +	tbl[idx % TCES_PER_PAGE] = tce;
> +}
> +EXPORT_SYMBOL_GPL(kvmppc_emulated_put_tce);
> +
> +#ifdef CONFIG_KVM_BOOK3S_64_HV
> +
> +static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
> +			unsigned long *pte_sizep)
> +{
> +	pte_t *ptep;
> +	unsigned int shift = 0;
> +	pte_t pte, tmp, ret;
> +
> +	ptep = find_linux_pte_or_hugepte(pgdir, hva, &shift);
> +	if (!ptep)
> +		return __pte(0);
> +	if (shift)
> +		*pte_sizep = 1ul << shift;
> +	else
> +		*pte_sizep = PAGE_SIZE;
> +
> +	if (!pte_present(*ptep))
> +		return __pte(0);
> +
> +	/* wait until _PAGE_BUSY is clear then set it atomically */
> +	__asm__ __volatile__ (
> +		"1:	ldarx	%0,0,%3\n"
> +		"	andi.	%1,%0,%4\n"
> +		"	bne-	1b\n"
> +		"	ori	%1,%0,%4\n"
> +		"	stdcx.	%1,0,%3\n"
> +		"	bne-	1b"
> +		: "=&r" (pte), "=&r" (tmp), "=m" (*ptep)
> +		: "r" (ptep), "i" (_PAGE_BUSY)
> +		: "cc");
> +
> +	ret = pte;
> +
> +	return ret;
> +}
> +
> +/*
> + * Converts guest physical address into host physical address.
> + * Also returns pte and page size if the page is present in page table.
> + */
> +static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
> +		unsigned long gpa)
> +{
> +	struct kvm_memory_slot *memslot;
> +	pte_t pte;
> +	unsigned long hva, hpa, pg_size = 0, offset;
> +	unsigned long gfn = gpa >> PAGE_SHIFT;
> +	bool writing = gpa & TCE_PCI_WRITE;
> +
> +	/* Find a KVM memslot */
> +	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
> +	if (!memslot)
> +		return ERROR_ADDR;
> +
> +	/* Convert guest physical address to host virtual */
> +	hva = __gfn_to_hva_memslot(memslot, gfn);
> +
> +	/* Find a PTE and determine the size */
> +	pte = kvmppc_lookup_pte(vcpu->arch.pgdir, hva,
> +			writing, &pg_size);
> +	if (!pte)
> +		return ERROR_ADDR;
> +
> +	/* Calculate host phys address keeping flags and offset in the page */
> +	offset = gpa & (pg_size - 1);
> +
> +	/* pte_pfn(pte) should return an address aligned to pg_size */
> +	hpa = (pte_pfn(pte) << PAGE_SHIFT) + offset;
> +
> +	return hpa;
> +}
> +
>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		      unsigned long ioba, unsigned long tce)
>  {
> -	struct kvm *kvm = vcpu->kvm;
> -	struct kvmppc_spapr_tce_table *stt;
> -
> -	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
> -	/* 	    liobn, ioba, tce); */
> -
> -	list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
> -		if (stt->liobn == liobn) {
> -			unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
> -			struct page *page;
> -			u64 *tbl;
> -
> -			/* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p  window_size=0x%x\n", */
> -			/* 	    liobn, stt, stt->window_size); */
> -			if (ioba >= stt->window_size)
> -				return H_PARAMETER;
> -
> -			page = stt->pages[idx / TCES_PER_PAGE];
> -			tbl = (u64 *)page_address(page);
> -
> -			/* FIXME: Need to validate the TCE itself */
> -			/* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
> -			tbl[idx % TCES_PER_PAGE] = tce;
> -			return H_SUCCESS;
> -		}
> +	long ret;
> +	struct kvmppc_spapr_tce_table *tt;
> +
> +	tt = kvmppc_find_tce_table(vcpu, liobn);
> +	/* Didn't find the liobn, put it to virtual space */
> +	if (!tt)
> +		return H_TOO_HARD;
> +
> +	/* Emulated IO */
> +	if (ioba >= tt->window_size)
> +		return H_PARAMETER;
> +
> +	ret = kvmppc_emulated_validate_tce(tce);
> +	if (ret)
> +		return ret;
> +
> +	kvmppc_emulated_put_tce(tt, ioba, tce);
> +
> +	return H_SUCCESS;
> +}
> +
> +long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_list,	unsigned long npages)
> +{
> +	struct kvmppc_spapr_tce_table *tt;
> +	long i, ret;
> +	unsigned long *tces;
> +
> +	tt = kvmppc_find_tce_table(vcpu, liobn);
> +	/* Didn't find the liobn, put it to virtual space */
> +	if (!tt)
> +		return H_TOO_HARD;
> +
> +	/*
> +	 * The spec says that the maximum size of the list is 512 TCEs so
> +	 * so the whole table addressed resides in 4K page
> +	 */
> +	if (npages > 512)
> +		return H_PARAMETER;
> +
> +	if (tce_list & ~IOMMU_PAGE_MASK)
> +		return H_PARAMETER;
> +
> +	tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu, tce_list);
> +	if ((unsigned long)tces == ERROR_ADDR)
> +		return H_TOO_HARD;
> +
> +	/* Emulated IO */
> +	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
> +		return H_PARAMETER;
> +
> +	for (i = 0; i < npages; ++i) {
> +		ret = kvmppc_emulated_validate_tce(tces[i]);
> +		if (ret)
> +			return ret;
>  	}
>  
> -	/* Didn't find the liobn, punt it to userspace */
> -	return H_TOO_HARD;
> +	for (i = 0; i < npages; ++i)
> +		kvmppc_emulated_put_tce(tt, ioba + (i << IOMMU_PAGE_SHIFT),
> +				tces[i]);
> +
> +	return H_SUCCESS;
> +}
> +
> +long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_value, unsigned long npages)
> +{
> +	struct kvmppc_spapr_tce_table *tt;
> +	long i, ret;
> +
> +	tt = kvmppc_find_tce_table(vcpu, liobn);
> +	/* Didn't find the liobn, put it to virtual space */
> +	if (!tt)
> +		return H_TOO_HARD;
> +
> +	/* Emulated IO */
> +	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
> +		return H_PARAMETER;
> +
> +	ret = kvmppc_emulated_validate_tce(tce_value);
> +	if (ret || (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ)))
> +		return H_PARAMETER;
> +
> +	for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
> +		kvmppc_emulated_put_tce(tt, ioba, tce_value);
> +
> +	return H_SUCCESS;
>  }
> +#endif /* CONFIG_KVM_BOOK3S_64_HV */
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 9de24f8..9d70fd8 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -566,6 +566,30 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
>  			ret = kvmppc_xics_hcall(vcpu, req);
>  			break;
>  		} /* fallthrough */
> +		return RESUME_HOST;
> +	case H_PUT_TCE:
> +		ret = kvmppc_virtmode_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
> +						kvmppc_get_gpr(vcpu, 5),
> +						kvmppc_get_gpr(vcpu, 6));
> +		if (ret == H_TOO_HARD)
> +			return RESUME_HOST;
> +		break;
> +	case H_PUT_TCE_INDIRECT:
> +		ret = kvmppc_virtmode_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4),
> +						kvmppc_get_gpr(vcpu, 5),
> +						kvmppc_get_gpr(vcpu, 6),
> +						kvmppc_get_gpr(vcpu, 7));
> +		if (ret == H_TOO_HARD)
> +			return RESUME_HOST;
> +		break;
> +	case H_STUFF_TCE:
> +		ret = kvmppc_virtmode_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
> +						kvmppc_get_gpr(vcpu, 5),
> +						kvmppc_get_gpr(vcpu, 6),
> +						kvmppc_get_gpr(vcpu, 7));
> +		if (ret == H_TOO_HARD)
> +			return RESUME_HOST;
> +		break;
>  	default:
>  		return RESUME_HOST;
>  	}
> @@ -956,6 +980,20 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
>  	vcpu->arch.cpu_type = KVM_CPU_3S_64;
>  	kvmppc_sanity_check(vcpu);
>  
> +	/*
> +	 * As we want to minimize the chance of having H_PUT_TCE_INDIRECT
> +	 * half executed, we first read TCEs from the user, check them and
> +	 * return error if something went wrong and only then put TCEs into
> +	 * the TCE table.
> +	 *
> +	 * tce_tmp is a cache for TCEs to avoid stack allocation or
> +	 * kmalloc as the whole TCE list can take up to 512 items 8 bytes
> +	 * each (4096 bytes).
> +	 */
> +	vcpu->arch.tce_tmp = kmalloc(4096, GFP_KERNEL);
> +	if (!vcpu->arch.tce_tmp)
> +		goto free_vcpu;
> +
>  	return vcpu;
>  
>  free_vcpu:
> @@ -978,6 +1016,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
>  	unpin_vpa(vcpu->kvm, &vcpu->arch.slb_shadow);
>  	unpin_vpa(vcpu->kvm, &vcpu->arch.vpa);
>  	spin_unlock(&vcpu->arch.vpa_update_lock);
> +	kfree(vcpu->arch.tce_tmp);
>  	kvm_vcpu_uninit(vcpu);
>  	kmem_cache_free(kvm_vcpu_cache, vcpu);
>  }
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index b02f91e..d35554e 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -1490,6 +1490,12 @@ hcall_real_table:
>  	.long	0		/* 0x11c */
>  	.long	0		/* 0x120 */
>  	.long	.kvmppc_h_bulk_remove - hcall_real_table
> +	.long	0		/* 0x128 */
> +	.long	0		/* 0x12c */
> +	.long	0		/* 0x130 */
> +	.long	0		/* 0x134 */
> +	.long	.kvmppc_h_stuff_tce - hcall_real_table
> +	.long	.kvmppc_h_put_tce_indirect - hcall_real_table
>  hcall_real_table_end:
>  
>  ignore_hdec:
> diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c
> index b24309c..3750e3c 100644
> --- a/arch/powerpc/kvm/book3s_pr_papr.c
> +++ b/arch/powerpc/kvm/book3s_pr_papr.c
> @@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu)
>  	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
>  	long rc;
>  
> -	rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
> +	rc = kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce);
> +	if (rc == H_TOO_HARD)
> +		return EMULATE_FAIL;
> +	kvmppc_set_gpr(vcpu, 3, rc);
> +	return EMULATE_DONE;
> +}
> +
> +static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
> +	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
> +	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
> +	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
> +	long rc;
> +
> +	rc = kvmppc_virtmode_h_put_tce_indirect(vcpu, liobn, ioba,
> +			tce, npages);
> +	if (rc == H_TOO_HARD)
> +		return EMULATE_FAIL;
> +	kvmppc_set_gpr(vcpu, 3, rc);
> +	return EMULATE_DONE;
> +}
> +
> +static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
> +	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
> +	unsigned long tce_value = kvmppc_get_gpr(vcpu, 6);
> +	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
> +	long rc;
> +
> +	rc = kvmppc_virtmode_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages);
>  	if (rc == H_TOO_HARD)
>  		return EMULATE_FAIL;
>  	kvmppc_set_gpr(vcpu, 3, rc);
> @@ -247,6 +278,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd)
>  		return kvmppc_h_pr_bulk_remove(vcpu);
>  	case H_PUT_TCE:
>  		return kvmppc_h_pr_put_tce(vcpu);
> +	case H_PUT_TCE_INDIRECT:
> +		return kvmppc_h_pr_put_tce_indirect(vcpu);
> +	case H_STUFF_TCE:
> +		return kvmppc_h_pr_stuff_tce(vcpu);
>  	case H_CEDE:
>  		vcpu->arch.shared->msr |= MSR_EE;
>  		kvm_vcpu_block(vcpu);
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index 6316ee3..8465c2a 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -395,6 +395,9 @@ int kvm_dev_ioctl_check_extension(long ext)
>  		r = 1;
>  		break;
>  #endif
> +	case KVM_CAP_SPAPR_MULTITCE:
> +		r = 1;
> +		break;
>  	default:
>  		r = 0;
>  		break;
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index a5c86fc..5a2afda 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -666,6 +666,7 @@ struct kvm_ppc_smmu_info {
>  #define KVM_CAP_IRQ_MPIC 90
>  #define KVM_CAP_PPC_RTAS 91
>  #define KVM_CAP_IRQ_XICS 92
> +#define KVM_CAP_SPAPR_MULTITCE (0x110000 + 89)
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-05-25  2:45     ` David Gibson
  2013-05-27  2:44       ` Alexey Kardashevskiy
@ 2013-05-27 10:23       ` Paolo Bonzini
  2013-05-27 14:26         ` Alexey Kardashevskiy
  2013-05-28 16:32       ` Scott Wood
  2 siblings, 1 reply; 36+ messages in thread
From: Paolo Bonzini @ 2013-05-27 10:23 UTC (permalink / raw)
  To: David Gibson
  Cc: kvm, Alexey Kardashevskiy, linux-kernel, kvm-ppc, Alexander Graf,
	Paul Mackerras, Scott Wood, linuxppc-dev

Il 25/05/2013 04:45, David Gibson ha scritto:
>> >+	case KVM_CREATE_SPAPR_TCE_IOMMU: {
>> >+		struct kvm_create_spapr_tce_iommu create_tce_iommu;
>> >+		struct kvm *kvm = filp->private_data;
>> >+
>> >+		r = -EFAULT;
>> >+		if (copy_from_user(&create_tce_iommu, argp,
>> >+				sizeof(create_tce_iommu)))
>> >+			goto out;
>> >+		r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm,
>> >&create_tce_iommu);
>> >+		goto out;
>> >+	}

Would it make sense to make this the only interface for creating TCEs?
That is, pass both a window_size and an IOMMU group id (or e.g. -1 for
no hardware IOMMU usage), and have a single ioctl for both cases?
There's some duplicated code between kvm_vm_ioctl_create_spapr_tce and
kvm_vm_ioctl_create_spapr_tce_iommu.

KVM_CREATE_SPAPR_TCE could stay for backwards-compatibility, or you
could just use a new capability and drop the old ioctl.  I'm not sure
whether you're already considering the ABI to be stable for kvmppc.

Paolo

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-05-27 10:23       ` Paolo Bonzini
@ 2013-05-27 14:26         ` Alexey Kardashevskiy
  2013-05-27 14:41           ` Paolo Bonzini
  0 siblings, 1 reply; 36+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-27 14:26 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Alexander Graf, kvm-ppc, linux-kernel, Paul Mackerras,
	Scott Wood, linuxppc-dev, David Gibson

On 05/27/2013 08:23 PM, Paolo Bonzini wrote:
> Il 25/05/2013 04:45, David Gibson ha scritto:
>>>> +	case KVM_CREATE_SPAPR_TCE_IOMMU: {
>>>> +		struct kvm_create_spapr_tce_iommu create_tce_iommu;
>>>> +		struct kvm *kvm = filp->private_data;
>>>> +
>>>> +		r = -EFAULT;
>>>> +		if (copy_from_user(&create_tce_iommu, argp,
>>>> +				sizeof(create_tce_iommu)))
>>>> +			goto out;
>>>> +		r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm,
>>>> &create_tce_iommu);
>>>> +		goto out;
>>>> +	}
> 
> Would it make sense to make this the only interface for creating TCEs?
> That is, pass both a window_size and an IOMMU group id (or e.g. -1 for
> no hardware IOMMU usage), and have a single ioctl for both cases?
> There's some duplicated code between kvm_vm_ioctl_create_spapr_tce and
> kvm_vm_ioctl_create_spapr_tce_iommu.

Just few bits. Is there really much sense in making one function from those
two? I tried, looked a bit messy.

> KVM_CREATE_SPAPR_TCE could stay for backwards-compatibility, or you
> could just use a new capability and drop the old ioctl.

The old capability+ioctl already exist for quite a while and few QEMU
versions supporting it were released so we do not want just drop it. So
then what is the benefit of having a new interface with support of both types?

>  I'm not sure
> whether you're already considering the ABI to be stable for kvmppc.

Is any bit of KVM using it? Cannot see from Documentation/ABI.


-- 
Alexey

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls
  2013-05-27 10:08   ` Paolo Bonzini
@ 2013-05-27 14:33     ` Alexey Kardashevskiy
  0 siblings, 0 replies; 36+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-27 14:33 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, linux-kernel, kvm-ppc, Alexander Graf, Paul Mackerras,
	linuxppc-dev, David Gibson

On 05/27/2013 08:08 PM, Paolo Bonzini wrote:
> Il 21/05/2013 05:06, Alexey Kardashevskiy ha scritto:
>> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
>> H_STUFF_TCE hypercalls for QEMU emulated devices such as virtio
>> devices or emulated PCI.
> 
> Do you mean vio?  virtio (without getting into whether that's good or
> bad) will bypass the iommu.

yes, mistake, should be VIO. thanks!


-- 
Alexey

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-05-27 14:26         ` Alexey Kardashevskiy
@ 2013-05-27 14:41           ` Paolo Bonzini
  0 siblings, 0 replies; 36+ messages in thread
From: Paolo Bonzini @ 2013-05-27 14:41 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Alexander Graf, kvm-ppc, linux-kernel, Paul Mackerras,
	Scott Wood, linuxppc-dev, David Gibson

Il 27/05/2013 16:26, Alexey Kardashevskiy ha scritto:
> On 05/27/2013 08:23 PM, Paolo Bonzini wrote:
>> Il 25/05/2013 04:45, David Gibson ha scritto:
>>>>> +	case KVM_CREATE_SPAPR_TCE_IOMMU: {
>>>>> +		struct kvm_create_spapr_tce_iommu create_tce_iommu;
>>>>> +		struct kvm *kvm = filp->private_data;
>>>>> +
>>>>> +		r = -EFAULT;
>>>>> +		if (copy_from_user(&create_tce_iommu, argp,
>>>>> +				sizeof(create_tce_iommu)))
>>>>> +			goto out;
>>>>> +		r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm,
>>>>> &create_tce_iommu);
>>>>> +		goto out;
>>>>> +	}
>>
>> Would it make sense to make this the only interface for creating TCEs?
>> That is, pass both a window_size and an IOMMU group id (or e.g. -1 for
>> no hardware IOMMU usage), and have a single ioctl for both cases?
>> There's some duplicated code between kvm_vm_ioctl_create_spapr_tce and
>> kvm_vm_ioctl_create_spapr_tce_iommu.
> 
> Just few bits. Is there really much sense in making one function from those
> two? I tried, looked a bit messy.

Cannot really tell without the userspace bits.  But ioctl proliferation
is what the device and one_reg APIs were supposed to avoid...

>> KVM_CREATE_SPAPR_TCE could stay for backwards-compatibility, or you
>> could just use a new capability and drop the old ioctl.
> 
> The old capability+ioctl already exist for quite a while and few QEMU
> versions supporting it were released so we do not want just drop it. So
> then what is the benefit of having a new interface with support of both types?
> 
>>  I'm not sure
>> whether you're already considering the ABI to be stable for kvmppc.
> 
> Is any bit of KVM using it? Cannot see from Documentation/ABI.

I mean the userspace ABI (ioctls).

Paolo

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-05-25  2:45     ` David Gibson
  2013-05-27  2:44       ` Alexey Kardashevskiy
  2013-05-27 10:23       ` Paolo Bonzini
@ 2013-05-28 16:32       ` Scott Wood
  2013-05-29  0:20         ` Alexey Kardashevskiy
  2 siblings, 1 reply; 36+ messages in thread
From: Scott Wood @ 2013-05-28 16:32 UTC (permalink / raw)
  To: David Gibson
  Cc: kvm, Alexey Kardashevskiy, linux-kernel, kvm-ppc, Alexander Graf,
	Paul Mackerras, linuxppc-dev

On 05/24/2013 09:45:24 PM, David Gibson wrote:
> On Wed, May 22, 2013 at 04:06:57PM -0500, Scott Wood wrote:
> > On 05/20/2013 10:06:46 PM, Alexey Kardashevskiy wrote:
> > >diff --git a/arch/powerpc/kvm/powerpc.c =20
> b/arch/powerpc/kvm/powerpc.c
> > >index 8465c2a..da6bf61 100644
> > >--- a/arch/powerpc/kvm/powerpc.c
> > >@@ -396,6 +396,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> > >+++ b/arch/powerpc/kvm/powerpc.c
> > > 		break;
> > > #endif
> > > 	case KVM_CAP_SPAPR_MULTITCE:
> > >+	case KVM_CAP_SPAPR_TCE_IOMMU:
> > > 		r =3D 1;
> > > 		break;
> > > 	default:
> >
> > Don't advertise SPAPR capabilities if it's not book3s -- and
> > probably there's some additional limitation that would be
> > appropriate.
>=20
> So, in the case of MULTITCE, that's not quite right.  PR KVM can
> emulate a PAPR system on a BookE machine, and there's no reason not to
> allow TCE acceleration as well.

That might (or might not; consider things like Altivec versus SPE =20
opcode conflict, whether unimplemented SPRs trap, behavior of =20
unprivileged SPRs/instructions, etc) be true in theory, but it's not =20
currently a supported configuration.  BookE KVM does not support =20
emulating a different CPU than the host.  In the unlikely case that =20
ever changes to the point of allowing PAPR guests on a BookE host, then =20
we can revisit how to properly determine whether the capability is =20
supported, but for now the capability will never be valid in the =20
CONFIG_BOOKE case (though I'd rather see it depend on an appropriate =20
book3s symbol than depend on !BOOKE).

Or we could just leave it as is, and let it indicate whether the host =20
kernel supports the feature in general, with the user needing to =20
understand when it's applicable...  I'm a bit confused by the =20
documentation, however -- the MULTITCE capability was documented in the =20
"capabilities that can be enabled" section, but I don't see where it =20
can be enabled.

-Scott=

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-05-27  2:44       ` Alexey Kardashevskiy
@ 2013-05-28 17:45         ` Scott Wood
  2013-05-28 23:30           ` Alexey Kardashevskiy
  0 siblings, 1 reply; 36+ messages in thread
From: Scott Wood @ 2013-05-28 17:45 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Alexander Graf, kvm-ppc, linux-kernel, Paul Mackerras,
	linuxppc-dev, David Gibson

On 05/26/2013 09:44:24 PM, Alexey Kardashevskiy wrote:
> On 05/25/2013 12:45 PM, David Gibson wrote:
> > On Wed, May 22, 2013 at 04:06:57PM -0500, Scott Wood wrote:
> >> On 05/20/2013 10:06:46 PM, Alexey Kardashevskiy wrote:
> >>> diff --git a/arch/powerpc/kvm/powerpc.c =20
> b/arch/powerpc/kvm/powerpc.c
> >>> index 8465c2a..da6bf61 100644
> >>> --- a/arch/powerpc/kvm/powerpc.c
> >>> @@ -396,6 +396,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> >>> +++ b/arch/powerpc/kvm/powerpc.c
> >>> 		break;
> >>> #endif
> >>> 	case KVM_CAP_SPAPR_MULTITCE:
> >>> +	case KVM_CAP_SPAPR_TCE_IOMMU:
> >>> 		r =3D 1;
> >>> 		break;
> >>> 	default:
> >>
> >> Don't advertise SPAPR capabilities if it's not book3s -- and
> >> probably there's some additional limitation that would be
> >> appropriate.
> >
> > So, in the case of MULTITCE, that's not quite right.  PR KVM can
> > emulate a PAPR system on a BookE machine, and there's no reason not =20
> to
> > allow TCE acceleration as well.  We can't make it dependent on PAPR
> > mode being selected, because that's enabled per-vcpu, whereas these
> > capabilities are queried on the VM before the vcpus are created.
> >
> > CAP_SPAPR_TCE_IOMMU should be dependent on the presence of suitable
> > host side hardware (i.e. a PAPR style IOMMU), though.
>=20
>=20
> The capability says that the ioctl is supported. If there is no IOMMU =20
> group
> registered, than it will fail with a reasonable error and nobody gets =20
> hurt.
> What is the problem?

You could say that about a lot of the capabilities that just advertise =20
the existence of new ioctls. :-)

Sometimes it's nice to know in advance whether it's supported, before =20
actually requesting that something happen.

> >>> @@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
> >>> #define KVM_GET_DEVICE_ATTR	  _IOW(KVMIO,  0xe2, struct
> >>> kvm_device_attr)
> >>> #define KVM_HAS_DEVICE_ATTR	  _IOW(KVMIO,  0xe3, struct
> >>> kvm_device_attr)
> >>>
> >>> +/* ioctl for SPAPR TCE IOMMU */
> >>> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xe4, struct
> >>> kvm_create_spapr_tce_iommu)
> >>
> >> Shouldn't this go under the vm ioctl section?
>=20
>=20
> The KVM_CREATE_SPAPR_TCE_IOMMU ioctl (the version for emulated =20
> devices) is
> in this section so I decided to keep them together. Wrong?

You decided to keep KVM_CREATE_SPAPR_TCE_IOMMU together with =20
KVM_CREATE_SPAPR_TCE_IOMMU?

-Scott=

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-05-28 17:45         ` Scott Wood
@ 2013-05-28 23:30           ` Alexey Kardashevskiy
  2013-05-28 23:35             ` Scott Wood
  0 siblings, 1 reply; 36+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-28 23:30 UTC (permalink / raw)
  To: Scott Wood
  Cc: kvm, Alexander Graf, kvm-ppc, linux-kernel, Paul Mackerras,
	linuxppc-dev, David Gibson

On 05/29/2013 03:45 AM, Scott Wood wrote:
> On 05/26/2013 09:44:24 PM, Alexey Kardashevskiy wrote:
>> On 05/25/2013 12:45 PM, David Gibson wrote:
>> > On Wed, May 22, 2013 at 04:06:57PM -0500, Scott Wood wrote:
>> >> On 05/20/2013 10:06:46 PM, Alexey Kardashevskiy wrote:
>> >>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>> >>> index 8465c2a..da6bf61 100644
>> >>> --- a/arch/powerpc/kvm/powerpc.c
>> >>> @@ -396,6 +396,7 @@ int kvm_dev_ioctl_check_extension(long ext)
>> >>> +++ b/arch/powerpc/kvm/powerpc.c
>> >>>         break;
>> >>> #endif
>> >>>     case KVM_CAP_SPAPR_MULTITCE:
>> >>> +    case KVM_CAP_SPAPR_TCE_IOMMU:
>> >>>         r = 1;
>> >>>         break;
>> >>>     default:
>> >>
>> >> Don't advertise SPAPR capabilities if it's not book3s -- and
>> >> probably there's some additional limitation that would be
>> >> appropriate.
>> >
>> > So, in the case of MULTITCE, that's not quite right.  PR KVM can
>> > emulate a PAPR system on a BookE machine, and there's no reason not to
>> > allow TCE acceleration as well.  We can't make it dependent on PAPR
>> > mode being selected, because that's enabled per-vcpu, whereas these
>> > capabilities are queried on the VM before the vcpus are created.
>> >
>> > CAP_SPAPR_TCE_IOMMU should be dependent on the presence of suitable
>> > host side hardware (i.e. a PAPR style IOMMU), though.
>>
>>
>> The capability says that the ioctl is supported. If there is no IOMMU group
>> registered, than it will fail with a reasonable error and nobody gets hurt.
>> What is the problem?
> 
> You could say that about a lot of the capabilities that just advertise the
> existence of new ioctls. :-)
> 
> Sometimes it's nice to know in advance whether it's supported, before
> actually requesting that something happen.

Yes, would be nice. There is just no quick way to know if this real system
supports IOMMU groups. I could add another helper to generic IOMMU code
which would return the number of registered IOMMU groups but it is a bit
too much :)


>> >>> @@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
>> >>> #define KVM_GET_DEVICE_ATTR      _IOW(KVMIO,  0xe2, struct
>> >>> kvm_device_attr)
>> >>> #define KVM_HAS_DEVICE_ATTR      _IOW(KVMIO,  0xe3, struct
>> >>> kvm_device_attr)
>> >>>
>> >>> +/* ioctl for SPAPR TCE IOMMU */
>> >>> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xe4, struct
>> >>> kvm_create_spapr_tce_iommu)
>> >>
>> >> Shouldn't this go under the vm ioctl section?
>>
>>
>> The KVM_CREATE_SPAPR_TCE_IOMMU ioctl (the version for emulated devices) is
>> in this section so I decided to keep them together. Wrong?
> 
> You decided to keep KVM_CREATE_SPAPR_TCE_IOMMU together with
> KVM_CREATE_SPAPR_TCE_IOMMU?

Yes.


-- 
Alexey

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-05-28 23:30           ` Alexey Kardashevskiy
@ 2013-05-28 23:35             ` Scott Wood
  2013-05-29  0:12               ` Alexey Kardashevskiy
  0 siblings, 1 reply; 36+ messages in thread
From: Scott Wood @ 2013-05-28 23:35 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Alexander Graf, kvm-ppc, linux-kernel, Paul Mackerras,
	linuxppc-dev, David Gibson

On 05/28/2013 06:30:40 PM, Alexey Kardashevskiy wrote:
> >> >>> @@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
> >> >>> #define KVM_GET_DEVICE_ATTR      _IOW(KVMIO,  0xe2, struct
> >> >>> kvm_device_attr)
> >> >>> #define KVM_HAS_DEVICE_ATTR      _IOW(KVMIO,  0xe3, struct
> >> >>> kvm_device_attr)
> >> >>>
> >> >>> +/* ioctl for SPAPR TCE IOMMU */
> >> >>> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xe4, struct
> >> >>> kvm_create_spapr_tce_iommu)
> >> >>
> >> >> Shouldn't this go under the vm ioctl section?
> >>
> >>
> >> The KVM_CREATE_SPAPR_TCE_IOMMU ioctl (the version for emulated =20
> devices) is
> >> in this section so I decided to keep them together. Wrong?
> >
> > You decided to keep KVM_CREATE_SPAPR_TCE_IOMMU together with
> > KVM_CREATE_SPAPR_TCE_IOMMU?
>=20
> Yes.

Sigh.  That's the same thing repeated.  There's only one IOCTL.  =20
Nothing is being "kept together".

-Scott=

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-05-28 23:35             ` Scott Wood
@ 2013-05-29  0:12               ` Alexey Kardashevskiy
  2013-05-29 20:05                 ` Scott Wood
  0 siblings, 1 reply; 36+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-29  0:12 UTC (permalink / raw)
  To: Scott Wood
  Cc: kvm, Alexander Graf, kvm-ppc, linux-kernel, Paul Mackerras,
	linuxppc-dev, David Gibson

On 05/29/2013 09:35 AM, Scott Wood wrote:
> On 05/28/2013 06:30:40 PM, Alexey Kardashevskiy wrote:
>> >> >>> @@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
>> >> >>> #define KVM_GET_DEVICE_ATTR      _IOW(KVMIO,  0xe2, struct
>> >> >>> kvm_device_attr)
>> >> >>> #define KVM_HAS_DEVICE_ATTR      _IOW(KVMIO,  0xe3, struct
>> >> >>> kvm_device_attr)
>> >> >>>
>> >> >>> +/* ioctl for SPAPR TCE IOMMU */
>> >> >>> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xe4, struct
>> >> >>> kvm_create_spapr_tce_iommu)
>> >> >>
>> >> >> Shouldn't this go under the vm ioctl section?
>> >>
>> >>
>> >> The KVM_CREATE_SPAPR_TCE_IOMMU ioctl (the version for emulated
>> devices) is
>> >> in this section so I decided to keep them together. Wrong?
>> >
>> > You decided to keep KVM_CREATE_SPAPR_TCE_IOMMU together with
>> > KVM_CREATE_SPAPR_TCE_IOMMU?
>>
>> Yes.
> 
> Sigh.  That's the same thing repeated.  There's only one IOCTL.  Nothing is
> being "kept together".

Sorry, I meant this ioctl - KVM_CREATE_SPAPR_TCE.


-- 
Alexey

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-05-28 16:32       ` Scott Wood
@ 2013-05-29  0:20         ` Alexey Kardashevskiy
  0 siblings, 0 replies; 36+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-29  0:20 UTC (permalink / raw)
  To: Scott Wood
  Cc: kvm, Alexander Graf, kvm-ppc, linux-kernel, Paul Mackerras,
	linuxppc-dev, David Gibson

On 05/29/2013 02:32 AM, Scott Wood wrote:
> On 05/24/2013 09:45:24 PM, David Gibson wrote:
>> On Wed, May 22, 2013 at 04:06:57PM -0500, Scott Wood wrote:
>> > On 05/20/2013 10:06:46 PM, Alexey Kardashevskiy wrote:
>> > >diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>> > >index 8465c2a..da6bf61 100644
>> > >--- a/arch/powerpc/kvm/powerpc.c
>> > >@@ -396,6 +396,7 @@ int kvm_dev_ioctl_check_extension(long ext)
>> > >+++ b/arch/powerpc/kvm/powerpc.c
>> > >         break;
>> > > #endif
>> > >     case KVM_CAP_SPAPR_MULTITCE:
>> > >+    case KVM_CAP_SPAPR_TCE_IOMMU:
>> > >         r = 1;
>> > >         break;
>> > >     default:
>> >
>> > Don't advertise SPAPR capabilities if it's not book3s -- and
>> > probably there's some additional limitation that would be
>> > appropriate.
>>
>> So, in the case of MULTITCE, that's not quite right.  PR KVM can
>> emulate a PAPR system on a BookE machine, and there's no reason not to
>> allow TCE acceleration as well.
> 
> That might (or might not; consider things like Altivec versus SPE opcode
> conflict, whether unimplemented SPRs trap, behavior of unprivileged
> SPRs/instructions, etc) be true in theory, but it's not currently a
> supported configuration.  BookE KVM does not support emulating a different
> CPU than the host.  In the unlikely case that ever changes to the point of
> allowing PAPR guests on a BookE host, then we can revisit how to properly
> determine whether the capability is supported, but for now the capability
> will never be valid in the CONFIG_BOOKE case (though I'd rather see it
> depend on an appropriate book3s symbol than depend on !BOOKE).
> 
> Or we could just leave it as is, and let it indicate whether the host
> kernel supports the feature in general, with the user needing to understand
> when it's applicable...  I'm a bit confused by the documentation, however
> -- the MULTITCE capability was documented in the "capabilities that can be
> enabled" section, but I don't see where it can be enabled.


True, it cannot be enabled (but it could be enabled a long time ago), it is
either supported or not, I'll fix the documentation. Thanks!


-- 
Alexey

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-05-29  0:12               ` Alexey Kardashevskiy
@ 2013-05-29 20:05                 ` Scott Wood
  2013-05-29 23:10                   ` Alexey Kardashevskiy
  0 siblings, 1 reply; 36+ messages in thread
From: Scott Wood @ 2013-05-29 20:05 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Alexander Graf, kvm-ppc, linux-kernel, Paul Mackerras,
	linuxppc-dev, David Gibson

On 05/28/2013 07:12:32 PM, Alexey Kardashevskiy wrote:
> On 05/29/2013 09:35 AM, Scott Wood wrote:
> > On 05/28/2013 06:30:40 PM, Alexey Kardashevskiy wrote:
> >> >> >>> @@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
> >> >> >>> #define KVM_GET_DEVICE_ATTR      _IOW(KVMIO,  0xe2, struct
> >> >> >>> kvm_device_attr)
> >> >> >>> #define KVM_HAS_DEVICE_ATTR      _IOW(KVMIO,  0xe3, struct
> >> >> >>> kvm_device_attr)
> >> >> >>>
> >> >> >>> +/* ioctl for SPAPR TCE IOMMU */
> >> >> >>> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xe4, =20
> struct
> >> >> >>> kvm_create_spapr_tce_iommu)
> >> >> >>
> >> >> >> Shouldn't this go under the vm ioctl section?
> >> >>
> >> >>
> >> >> The KVM_CREATE_SPAPR_TCE_IOMMU ioctl (the version for emulated
> >> devices) is
> >> >> in this section so I decided to keep them together. Wrong?
> >> >
> >> > You decided to keep KVM_CREATE_SPAPR_TCE_IOMMU together with
> >> > KVM_CREATE_SPAPR_TCE_IOMMU?
> >>
> >> Yes.
> >
> > Sigh.  That's the same thing repeated.  There's only one IOCTL.  =20
> Nothing is
> > being "kept together".
>=20
> Sorry, I meant this ioctl - KVM_CREATE_SPAPR_TCE.

But you didn't put it in the same section as KVM_CREATE_SPAPR_TCE.  =20
0xe0 begins a different section.

-Scott=

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-05-29 20:05                 ` Scott Wood
@ 2013-05-29 23:10                   ` Alexey Kardashevskiy
  2013-05-29 23:14                     ` Scott Wood
  0 siblings, 1 reply; 36+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-29 23:10 UTC (permalink / raw)
  To: Scott Wood
  Cc: kvm, Alexander Graf, kvm-ppc, linux-kernel, Paul Mackerras,
	linuxppc-dev, David Gibson

On 05/30/2013 06:05 AM, Scott Wood wrote:
> On 05/28/2013 07:12:32 PM, Alexey Kardashevskiy wrote:
>> On 05/29/2013 09:35 AM, Scott Wood wrote:
>> > On 05/28/2013 06:30:40 PM, Alexey Kardashevskiy wrote:
>> >> >> >>> @@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
>> >> >> >>> #define KVM_GET_DEVICE_ATTR      _IOW(KVMIO,  0xe2, struct
>> >> >> >>> kvm_device_attr)
>> >> >> >>> #define KVM_HAS_DEVICE_ATTR      _IOW(KVMIO,  0xe3, struct
>> >> >> >>> kvm_device_attr)
>> >> >> >>>
>> >> >> >>> +/* ioctl for SPAPR TCE IOMMU */
>> >> >> >>> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xe4, struct
>> >> >> >>> kvm_create_spapr_tce_iommu)
>> >> >> >>
>> >> >> >> Shouldn't this go under the vm ioctl section?
>> >> >>
>> >> >>
>> >> >> The KVM_CREATE_SPAPR_TCE_IOMMU ioctl (the version for emulated
>> >> devices) is
>> >> >> in this section so I decided to keep them together. Wrong?
>> >> >
>> >> > You decided to keep KVM_CREATE_SPAPR_TCE_IOMMU together with
>> >> > KVM_CREATE_SPAPR_TCE_IOMMU?
>> >>
>> >> Yes.
>> >
>> > Sigh.  That's the same thing repeated.  There's only one IOCTL. 
>> Nothing is
>> > being "kept together".
>>
>> Sorry, I meant this ioctl - KVM_CREATE_SPAPR_TCE.
> 
> But you didn't put it in the same section as KVM_CREATE_SPAPR_TCE.  0xe0
> begins a different section.

It is not really obvious that there are sections as no comment defines
those :) But yes, makes sense to move it up a bit and change the code to 0xad.



-- 
Alexey

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-05-29 23:10                   ` Alexey Kardashevskiy
@ 2013-05-29 23:14                     ` Scott Wood
  2013-05-29 23:29                       ` Alexey Kardashevskiy
  0 siblings, 1 reply; 36+ messages in thread
From: Scott Wood @ 2013-05-29 23:14 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Alexander Graf, kvm-ppc, linux-kernel, Paul Mackerras,
	linuxppc-dev, David Gibson

On 05/29/2013 06:10:33 PM, Alexey Kardashevskiy wrote:
> On 05/30/2013 06:05 AM, Scott Wood wrote:
> > On 05/28/2013 07:12:32 PM, Alexey Kardashevskiy wrote:
> >> On 05/29/2013 09:35 AM, Scott Wood wrote:
> >> > On 05/28/2013 06:30:40 PM, Alexey Kardashevskiy wrote:
> >> >> >> >>> @@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
> >> >> >> >>> #define KVM_GET_DEVICE_ATTR      _IOW(KVMIO,  0xe2, =20
> struct
> >> >> >> >>> kvm_device_attr)
> >> >> >> >>> #define KVM_HAS_DEVICE_ATTR      _IOW(KVMIO,  0xe3, =20
> struct
> >> >> >> >>> kvm_device_attr)
> >> >> >> >>>
> >> >> >> >>> +/* ioctl for SPAPR TCE IOMMU */
> >> >> >> >>> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xe4, =20
> struct
> >> >> >> >>> kvm_create_spapr_tce_iommu)
> >> >> >> >>
> >> >> >> >> Shouldn't this go under the vm ioctl section?
> >> >> >>
> >> >> >>
> >> >> >> The KVM_CREATE_SPAPR_TCE_IOMMU ioctl (the version for =20
> emulated
> >> >> devices) is
> >> >> >> in this section so I decided to keep them together. Wrong?
> >> >> >
> >> >> > You decided to keep KVM_CREATE_SPAPR_TCE_IOMMU together with
> >> >> > KVM_CREATE_SPAPR_TCE_IOMMU?
> >> >>
> >> >> Yes.
> >> >
> >> > Sigh.  That's the same thing repeated.  There's only one IOCTL.
> >> Nothing is
> >> > being "kept together".
> >>
> >> Sorry, I meant this ioctl - KVM_CREATE_SPAPR_TCE.
> >
> > But you didn't put it in the same section as KVM_CREATE_SPAPR_TCE.  =20
> 0xe0
> > begins a different section.
>=20
> It is not really obvious that there are sections as no comment defines
> those :)

There is a comment /* ioctls for fds returned by KVM_CREATE_DEVICE */

Putting KVM_CREATE_DEVICE in there was mainly to avoid dealing with the =20
ioctl number conflict mess in the vm-ioctl section, but at least that =20
one is related to the device control API. :-)

> But yes, makes sense to move it up a bit and change the code to 0xad.

0xad is KVM_KVMCLOCK_CTRL

-Scott=

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-05-29 23:14                     ` Scott Wood
@ 2013-05-29 23:29                       ` Alexey Kardashevskiy
  2013-05-29 23:32                         ` Scott Wood
  0 siblings, 1 reply; 36+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-29 23:29 UTC (permalink / raw)
  To: Scott Wood
  Cc: kvm, Alexander Graf, kvm-ppc, linux-kernel, Paul Mackerras,
	linuxppc-dev, David Gibson

On 05/30/2013 09:14 AM, Scott Wood wrote:
> On 05/29/2013 06:10:33 PM, Alexey Kardashevskiy wrote:
>> On 05/30/2013 06:05 AM, Scott Wood wrote:
>> > On 05/28/2013 07:12:32 PM, Alexey Kardashevskiy wrote:
>> >> On 05/29/2013 09:35 AM, Scott Wood wrote:
>> >> > On 05/28/2013 06:30:40 PM, Alexey Kardashevskiy wrote:
>> >> >> >> >>> @@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
>> >> >> >> >>> #define KVM_GET_DEVICE_ATTR      _IOW(KVMIO,  0xe2, struct
>> >> >> >> >>> kvm_device_attr)
>> >> >> >> >>> #define KVM_HAS_DEVICE_ATTR      _IOW(KVMIO,  0xe3, struct
>> >> >> >> >>> kvm_device_attr)
>> >> >> >> >>>
>> >> >> >> >>> +/* ioctl for SPAPR TCE IOMMU */
>> >> >> >> >>> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xe4, struct
>> >> >> >> >>> kvm_create_spapr_tce_iommu)
>> >> >> >> >>
>> >> >> >> >> Shouldn't this go under the vm ioctl section?
>> >> >> >>
>> >> >> >>
>> >> >> >> The KVM_CREATE_SPAPR_TCE_IOMMU ioctl (the version for emulated
>> >> >> devices) is
>> >> >> >> in this section so I decided to keep them together. Wrong?
>> >> >> >
>> >> >> > You decided to keep KVM_CREATE_SPAPR_TCE_IOMMU together with
>> >> >> > KVM_CREATE_SPAPR_TCE_IOMMU?
>> >> >>
>> >> >> Yes.
>> >> >
>> >> > Sigh.  That's the same thing repeated.  There's only one IOCTL.
>> >> Nothing is
>> >> > being "kept together".
>> >>
>> >> Sorry, I meant this ioctl - KVM_CREATE_SPAPR_TCE.
>> >
>> > But you didn't put it in the same section as KVM_CREATE_SPAPR_TCE.  0xe0
>> > begins a different section.
>>
>> It is not really obvious that there are sections as no comment defines
>> those :)
> 
> There is a comment /* ioctls for fds returned by KVM_CREATE_DEVICE */
> 
> Putting KVM_CREATE_DEVICE in there was mainly to avoid dealing with the
> ioctl number conflict mess in the vm-ioctl section, but at least that one
> is related to the device control API. :-)
> 
>> But yes, makes sense to move it up a bit and change the code to 0xad.
> 
> 0xad is KVM_KVMCLOCK_CTRL

That's it. I am _completely_ confused now. No system whatsoever :(
What rule should I use in order to choose the number for my new ioctl? :)



-- 
Alexey

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-05-29 23:29                       ` Alexey Kardashevskiy
@ 2013-05-29 23:32                         ` Scott Wood
  0 siblings, 0 replies; 36+ messages in thread
From: Scott Wood @ 2013-05-29 23:32 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Alexander Graf, kvm-ppc, linux-kernel, Paul Mackerras,
	linuxppc-dev, David Gibson

On 05/29/2013 06:29:13 PM, Alexey Kardashevskiy wrote:
> On 05/30/2013 09:14 AM, Scott Wood wrote:
> > On 05/29/2013 06:10:33 PM, Alexey Kardashevskiy wrote:
> >> On 05/30/2013 06:05 AM, Scott Wood wrote:
> >> > But you didn't put it in the same section as =20
> KVM_CREATE_SPAPR_TCE.  0xe0
> >> > begins a different section.
> >>
> >> It is not really obvious that there are sections as no comment =20
> defines
> >> those :)
> >
> > There is a comment /* ioctls for fds returned by KVM_CREATE_DEVICE =20
> */
> >
> > Putting KVM_CREATE_DEVICE in there was mainly to avoid dealing with =20
> the
> > ioctl number conflict mess in the vm-ioctl section, but at least =20
> that one
> > is related to the device control API. :-)
> >
> >> But yes, makes sense to move it up a bit and change the code to =20
> 0xad.
> >
> > 0xad is KVM_KVMCLOCK_CTRL
>=20
> That's it. I am _completely_ confused now. No system whatsoever :(
> What rule should I use in order to choose the number for my new =20
> ioctl? :)

Yeah, it's a mess.  0xaf seems to be free. :-)

-Scott=

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls
  2013-06-17 10:46               ` Alexander Graf
@ 2013-06-17 10:48                 ` Alexander Graf
  0 siblings, 0 replies; 36+ messages in thread
From: Alexander Graf @ 2013-06-17 10:48 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, linux-kernel, kvm-ppc, Paul Mackerras, linuxppc-dev, David Gibson

On 06/17/2013 12:46 PM, Alexander Graf wrote:
> On 06/17/2013 10:51 AM, Alexey Kardashevskiy wrote:
>> On 06/17/2013 06:40 PM, Alexander Graf wrote:
>>> On 17.06.2013, at 10:34, Alexey Kardashevskiy wrote:
>>>
>>>> On 06/17/2013 06:02 PM, Alexander Graf wrote:
>>>>> On 17.06.2013, at 09:55, Alexey Kardashevskiy wrote:
>>>>>
>>>>>> On 06/17/2013 08:06 AM, Alexander Graf wrote:
>>>>>>> On 05.06.2013, at 08:11, Alexey Kardashevskiy wrote:
>>>>>>>
>>>>>>>> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
>>>>>>>> H_STUFF_TCE hypercalls for QEMU emulated devices such as
>>>>>>>> IBMVIO devices or emulated PCI.  These calls allow adding
>>>>>>>> multiple entries (up to 512) into the TCE table in one call
>>>>>>>> which saves time on transition to/from real mode.
>>>>>>>>
>>>>>>>> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
>>>>>>>> (copied from user and verified) before writing the whole list
>>>>>>>> into the TCE table. This cache will be utilized more in the
>>>>>>>> upcoming VFIO/IOMMU support to continue TCE list processing in
>>>>>>>> the virtual mode in the case if the real mode handler failed
>>>>>>>> for some reason.
>>>>>>>>
>>>>>>>> This adds a guest physical to host real address converter and
>>>>>>>> calls the existing H_PUT_TCE handler. The converting function
>>>>>>>> is going to be fully utilized by upcoming VFIO supporting
>>>>>>>> patches.
>>>>>>>>
>>>>>>>> This also implements the KVM_CAP_PPC_MULTITCE capability, so
>>>>>>>> in order to support the functionality of this patch, QEMU
>>>>>>>> needs to query for this capability and set the
>>>>>>>> "hcall-multi-tce" hypertas property only if the capability is
>>>>>>>> present, otherwise there will be serious performance
>>>>>>>> degradation.
>>>>>>>>
>>>>>>>> Cc: David Gibson<david@gibson.dropbear.id.au>  Signed-off-by:
>>>>>>>> Alexey Kardashevskiy<aik@ozlabs.ru>  Signed-off-by: Paul
>>>>>>>> Mackerras<paulus@samba.org>
>>>>>>> Only a few minor nits. Ben already commented on implementation
>>>>>>> details.
>>>>>>>
>>>>>>>> --- Changelog: 2013/06/05: * fixed mistype about IBMVIO in the
>>>>>>>> commit message * updated doc and moved it to another section *
>>>>>>>> changed capability number
>>>>>>>>
>>>>>>>> 2013/05/21: * added kvm_vcpu_arch::tce_tmp * removed cleanup
>>>>>>>> if put_indirect failed, instead we do not even start writing
>>>>>>>> to TCE table if we cannot get TCEs from the user and they are
>>>>>>>> invalid * kvmppc_emulated_h_put_tce is split to
>>>>>>>> kvmppc_emulated_put_tce and kvmppc_emulated_validate_tce (for
>>>>>>>> the previous item) * fixed bug with failthrough for H_IPI *
>>>>>>>> removed all get_user() from real mode handlers *
>>>>>>>> kvmppc_lookup_pte() added (instead of making lookup_linux_pte
>>>>>>>> public) --- Documentation/virtual/kvm/api.txt       |   17 ++
>>>>>>>> arch/powerpc/include/asm/kvm_host.h     |    2 +
>>>>>>>> arch/powerpc/include/asm/kvm_ppc.h      |   16 +-
>>>>>>>> arch/powerpc/kvm/book3s_64_vio.c        |  118 ++++++++++++++
>>>>>>>> arch/powerpc/kvm/book3s_64_vio_hv.c     |  266
>>>>>>>> +++++++++++++++++++++++++++---- arch/powerpc/kvm/book3s_hv.c
>>>>>>>> |   39 +++++ arch/powerpc/kvm/book3s_hv_rmhandlers.S |    6 +
>>>>>>>> arch/powerpc/kvm/book3s_pr_papr.c       |   37 ++++-
>>>>>>>> arch/powerpc/kvm/powerpc.c              |    3 +
>>>>>>>> include/uapi/linux/kvm.h                |    1 + 10 files
>>>>>>>> changed, 473 insertions(+), 32 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/Documentation/virtual/kvm/api.txt
>>>>>>>> b/Documentation/virtual/kvm/api.txt index 5f91eda..6c082ff
>>>>>>>> 100644 --- a/Documentation/virtual/kvm/api.txt +++
>>>>>>>> b/Documentation/virtual/kvm/api.txt @@ -2362,6 +2362,23 @@
>>>>>>>> calls by the guest for that service will be passed to
>>>>>>>> userspace to be handled.
>>>>>>>>
>>>>>>>>
>>>>>>>> +4.83 KVM_CAP_PPC_MULTITCE + +Capability:
>>>>>>>> KVM_CAP_PPC_MULTITCE +Architectures: ppc +Type: vm + +This
>>>>>>>> capability tells the guest that multiple TCE entry add/remove
>>>>>>>> hypercalls +handling is supported by the kernel. This
>>>>>>>> significanly accelerates DMA +operations for PPC KVM guests.
>>>>>>>> + +Unlike other capabilities in this section, this one does
>>>>>>>> not have an ioctl. +Instead, when the capability is present,
>>>>>>>> the H_PUT_TCE_INDIRECT and +H_STUFF_TCE hypercalls are to be
>>>>>>>> handled in the host kernel and not passed to +the guest.
>>>>>>>> Othwerwise it might be better for the guest to continue using
>>>>>>>> H_PUT_TCE +hypercall (if KVM_CAP_SPAPR_TCE or
>>>>>>>> KVM_CAP_SPAPR_TCE_IOMMU are present).
>>>>>>> While this describes perfectly well what the consequences are of
>>>>>>> the patches, it does not describe properly what the CAP actually
>>>>>>> expresses. The CAP only says "this kernel is able to handle
>>>>>>> H_PUT_TCE_INDIRECT and H_STUFF_TCE hypercalls directly". All
>>>>>>> other consequences are nice to document, but the semantics of
>>>>>>> the CAP are missing.
>>>>>>
>>>>>> ? It expresses ability to handle 2 hcalls. What is missing?
>>>>> You don't describe the kvm<->  qemu interface. You describe some
>>>>> decisions qemu can take from this cap.
>>>>
>>>> This file does not mention qemu at all. And the interface is - qemu
>>>> (or kvmtool could do that) just adds "hcall-multi-tce" to
>>>> "ibm,hypertas-functions" but this is for pseries linux and AIX could
>>>> always do it (no idea about it). Does it really have to be in this
>>>> file?
>>> Ok, let's go back a step. What does this CAP describe? Don't look at 
>>> the
>>> description you wrote above. Just write a new one.
>> The CAP means the kernel is capable of handling hcalls A and B without
>> passing those into the user space. That accelerates DMA.
>>
>>
>>> What exactly can user space expect when it finds this CAP?
>> The user space can expect that its handlers for A and B are not going 
>> to be
>> called if it configures the guest appropriately.

Actually a nitpick here too. User space can expect that its handlers for 
A and B are going to already be processed by KVM. Regardless of how user 
space configures the guest.


Alex

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls
  2013-06-17  8:51             ` Alexey Kardashevskiy
@ 2013-06-17 10:46               ` Alexander Graf
  2013-06-17 10:48                 ` Alexander Graf
  0 siblings, 1 reply; 36+ messages in thread
From: Alexander Graf @ 2013-06-17 10:46 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, linux-kernel, kvm-ppc, Paul Mackerras, linuxppc-dev, David Gibson

On 06/17/2013 10:51 AM, Alexey Kardashevskiy wrote:
> On 06/17/2013 06:40 PM, Alexander Graf wrote:
>> On 17.06.2013, at 10:34, Alexey Kardashevskiy wrote:
>>
>>> On 06/17/2013 06:02 PM, Alexander Graf wrote:
>>>> On 17.06.2013, at 09:55, Alexey Kardashevskiy wrote:
>>>>
>>>>> On 06/17/2013 08:06 AM, Alexander Graf wrote:
>>>>>> On 05.06.2013, at 08:11, Alexey Kardashevskiy wrote:
>>>>>>
>>>>>>> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
>>>>>>> H_STUFF_TCE hypercalls for QEMU emulated devices such as
>>>>>>> IBMVIO devices or emulated PCI.  These calls allow adding
>>>>>>> multiple entries (up to 512) into the TCE table in one call
>>>>>>> which saves time on transition to/from real mode.
>>>>>>>
>>>>>>> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
>>>>>>> (copied from user and verified) before writing the whole list
>>>>>>> into the TCE table. This cache will be utilized more in the
>>>>>>> upcoming VFIO/IOMMU support to continue TCE list processing in
>>>>>>> the virtual mode in the case if the real mode handler failed
>>>>>>> for some reason.
>>>>>>>
>>>>>>> This adds a guest physical to host real address converter and
>>>>>>> calls the existing H_PUT_TCE handler. The converting function
>>>>>>> is going to be fully utilized by upcoming VFIO supporting
>>>>>>> patches.
>>>>>>>
>>>>>>> This also implements the KVM_CAP_PPC_MULTITCE capability, so
>>>>>>> in order to support the functionality of this patch, QEMU
>>>>>>> needs to query for this capability and set the
>>>>>>> "hcall-multi-tce" hypertas property only if the capability is
>>>>>>> present, otherwise there will be serious performance
>>>>>>> degradation.
>>>>>>>
>>>>>>> Cc: David Gibson<david@gibson.dropbear.id.au>  Signed-off-by:
>>>>>>> Alexey Kardashevskiy<aik@ozlabs.ru>  Signed-off-by: Paul
>>>>>>> Mackerras<paulus@samba.org>
>>>>>> Only a few minor nits. Ben already commented on implementation
>>>>>> details.
>>>>>>
>>>>>>> --- Changelog: 2013/06/05: * fixed mistype about IBMVIO in the
>>>>>>> commit message * updated doc and moved it to another section *
>>>>>>> changed capability number
>>>>>>>
>>>>>>> 2013/05/21: * added kvm_vcpu_arch::tce_tmp * removed cleanup
>>>>>>> if put_indirect failed, instead we do not even start writing
>>>>>>> to TCE table if we cannot get TCEs from the user and they are
>>>>>>> invalid * kvmppc_emulated_h_put_tce is split to
>>>>>>> kvmppc_emulated_put_tce and kvmppc_emulated_validate_tce (for
>>>>>>> the previous item) * fixed bug with failthrough for H_IPI *
>>>>>>> removed all get_user() from real mode handlers *
>>>>>>> kvmppc_lookup_pte() added (instead of making lookup_linux_pte
>>>>>>> public) --- Documentation/virtual/kvm/api.txt       |   17 ++
>>>>>>> arch/powerpc/include/asm/kvm_host.h     |    2 +
>>>>>>> arch/powerpc/include/asm/kvm_ppc.h      |   16 +-
>>>>>>> arch/powerpc/kvm/book3s_64_vio.c        |  118 ++++++++++++++
>>>>>>> arch/powerpc/kvm/book3s_64_vio_hv.c     |  266
>>>>>>> +++++++++++++++++++++++++++---- arch/powerpc/kvm/book3s_hv.c
>>>>>>> |   39 +++++ arch/powerpc/kvm/book3s_hv_rmhandlers.S |    6 +
>>>>>>> arch/powerpc/kvm/book3s_pr_papr.c       |   37 ++++-
>>>>>>> arch/powerpc/kvm/powerpc.c              |    3 +
>>>>>>> include/uapi/linux/kvm.h                |    1 + 10 files
>>>>>>> changed, 473 insertions(+), 32 deletions(-)
>>>>>>>
>>>>>>> diff --git a/Documentation/virtual/kvm/api.txt
>>>>>>> b/Documentation/virtual/kvm/api.txt index 5f91eda..6c082ff
>>>>>>> 100644 --- a/Documentation/virtual/kvm/api.txt +++
>>>>>>> b/Documentation/virtual/kvm/api.txt @@ -2362,6 +2362,23 @@
>>>>>>> calls by the guest for that service will be passed to
>>>>>>> userspace to be handled.
>>>>>>>
>>>>>>>
>>>>>>> +4.83 KVM_CAP_PPC_MULTITCE + +Capability:
>>>>>>> KVM_CAP_PPC_MULTITCE +Architectures: ppc +Type: vm + +This
>>>>>>> capability tells the guest that multiple TCE entry add/remove
>>>>>>> hypercalls +handling is supported by the kernel. This
>>>>>>> significanly accelerates DMA +operations for PPC KVM guests.
>>>>>>> + +Unlike other capabilities in this section, this one does
>>>>>>> not have an ioctl. +Instead, when the capability is present,
>>>>>>> the H_PUT_TCE_INDIRECT and +H_STUFF_TCE hypercalls are to be
>>>>>>> handled in the host kernel and not passed to +the guest.
>>>>>>> Othwerwise it might be better for the guest to continue using
>>>>>>> H_PUT_TCE +hypercall (if KVM_CAP_SPAPR_TCE or
>>>>>>> KVM_CAP_SPAPR_TCE_IOMMU are present).
>>>>>> While this describes perfectly well what the consequences are of
>>>>>> the patches, it does not describe properly what the CAP actually
>>>>>> expresses. The CAP only says "this kernel is able to handle
>>>>>> H_PUT_TCE_INDIRECT and H_STUFF_TCE hypercalls directly". All
>>>>>> other consequences are nice to document, but the semantics of
>>>>>> the CAP are missing.
>>>>>
>>>>> ? It expresses ability to handle 2 hcalls. What is missing?
>>>> You don't describe the kvm<->  qemu interface. You describe some
>>>> decisions qemu can take from this cap.
>>>
>>> This file does not mention qemu at all. And the interface is - qemu
>>> (or kvmtool could do that) just adds "hcall-multi-tce" to
>>> "ibm,hypertas-functions" but this is for pseries linux and AIX could
>>> always do it (no idea about it). Does it really have to be in this
>>> file?
>> Ok, let's go back a step. What does this CAP describe? Don't look at the
>> description you wrote above. Just write a new one.
> The CAP means the kernel is capable of handling hcalls A and B without
> passing those into the user space. That accelerates DMA.
>
>
>> What exactly can user space expect when it finds this CAP?
> The user space can expect that its handlers for A and B are not going to be
> called if it configures the guest appropriately.
>
> Any better? :)

A lot, yes. This is what the CAP actually means.

It's nice to give some guidance in the documentation of implications 
(should expose "ibm,hypertas-functions" to enable the guest to actually 
use these for example) but the first paragraph should only indicate what 
the CAP changes.


Alex

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls
  2013-06-17  8:40           ` Alexander Graf
@ 2013-06-17  8:51             ` Alexey Kardashevskiy
  2013-06-17 10:46               ` Alexander Graf
  0 siblings, 1 reply; 36+ messages in thread
From: Alexey Kardashevskiy @ 2013-06-17  8:51 UTC (permalink / raw)
  To: Alexander Graf
  Cc: kvm, linux-kernel, kvm-ppc, Paul Mackerras, linuxppc-dev, David Gibson

On 06/17/2013 06:40 PM, Alexander Graf wrote:
> 
> On 17.06.2013, at 10:34, Alexey Kardashevskiy wrote:
> 
>> On 06/17/2013 06:02 PM, Alexander Graf wrote:
>>> 
>>> On 17.06.2013, at 09:55, Alexey Kardashevskiy wrote:
>>> 
>>>> On 06/17/2013 08:06 AM, Alexander Graf wrote:
>>>>> 
>>>>> On 05.06.2013, at 08:11, Alexey Kardashevskiy wrote:
>>>>> 
>>>>>> This adds real mode handlers for the H_PUT_TCE_INDIRECT and 
>>>>>> H_STUFF_TCE hypercalls for QEMU emulated devices such as
>>>>>> IBMVIO devices or emulated PCI.  These calls allow adding
>>>>>> multiple entries (up to 512) into the TCE table in one call
>>>>>> which saves time on transition to/from real mode.
>>>>>> 
>>>>>> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs 
>>>>>> (copied from user and verified) before writing the whole list
>>>>>> into the TCE table. This cache will be utilized more in the
>>>>>> upcoming VFIO/IOMMU support to continue TCE list processing in
>>>>>> the virtual mode in the case if the real mode handler failed
>>>>>> for some reason.
>>>>>> 
>>>>>> This adds a guest physical to host real address converter and
>>>>>> calls the existing H_PUT_TCE handler. The converting function 
>>>>>> is going to be fully utilized by upcoming VFIO supporting
>>>>>> patches.
>>>>>> 
>>>>>> This also implements the KVM_CAP_PPC_MULTITCE capability, so
>>>>>> in order to support the functionality of this patch, QEMU 
>>>>>> needs to query for this capability and set the
>>>>>> "hcall-multi-tce" hypertas property only if the capability is
>>>>>> present, otherwise there will be serious performance
>>>>>> degradation.
>>>>>> 
>>>>>> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by:
>>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Paul
>>>>>> Mackerras <paulus@samba.org>
>>>>> 
>>>>> Only a few minor nits. Ben already commented on implementation
>>>>> details.
>>>>> 
>>>>>> 
>>>>>> --- Changelog: 2013/06/05: * fixed mistype about IBMVIO in the
>>>>>> commit message * updated doc and moved it to another section *
>>>>>> changed capability number
>>>>>> 
>>>>>> 2013/05/21: * added kvm_vcpu_arch::tce_tmp * removed cleanup
>>>>>> if put_indirect failed, instead we do not even start writing
>>>>>> to TCE table if we cannot get TCEs from the user and they are 
>>>>>> invalid * kvmppc_emulated_h_put_tce is split to
>>>>>> kvmppc_emulated_put_tce and kvmppc_emulated_validate_tce (for
>>>>>> the previous item) * fixed bug with failthrough for H_IPI *
>>>>>> removed all get_user() from real mode handlers *
>>>>>> kvmppc_lookup_pte() added (instead of making lookup_linux_pte
>>>>>> public) --- Documentation/virtual/kvm/api.txt       |   17 ++ 
>>>>>> arch/powerpc/include/asm/kvm_host.h     |    2 + 
>>>>>> arch/powerpc/include/asm/kvm_ppc.h      |   16 +- 
>>>>>> arch/powerpc/kvm/book3s_64_vio.c        |  118 ++++++++++++++ 
>>>>>> arch/powerpc/kvm/book3s_64_vio_hv.c     |  266
>>>>>> +++++++++++++++++++++++++++---- arch/powerpc/kvm/book3s_hv.c
>>>>>> |   39 +++++ arch/powerpc/kvm/book3s_hv_rmhandlers.S |    6 + 
>>>>>> arch/powerpc/kvm/book3s_pr_papr.c       |   37 ++++- 
>>>>>> arch/powerpc/kvm/powerpc.c              |    3 + 
>>>>>> include/uapi/linux/kvm.h                |    1 + 10 files
>>>>>> changed, 473 insertions(+), 32 deletions(-)
>>>>>> 
>>>>>> diff --git a/Documentation/virtual/kvm/api.txt
>>>>>> b/Documentation/virtual/kvm/api.txt index 5f91eda..6c082ff
>>>>>> 100644 --- a/Documentation/virtual/kvm/api.txt +++
>>>>>> b/Documentation/virtual/kvm/api.txt @@ -2362,6 +2362,23 @@
>>>>>> calls by the guest for that service will be passed to
>>>>>> userspace to be handled.
>>>>>> 
>>>>>> 
>>>>>> +4.83 KVM_CAP_PPC_MULTITCE + +Capability:
>>>>>> KVM_CAP_PPC_MULTITCE +Architectures: ppc +Type: vm + +This
>>>>>> capability tells the guest that multiple TCE entry add/remove
>>>>>> hypercalls +handling is supported by the kernel. This
>>>>>> significanly accelerates DMA +operations for PPC KVM guests. 
>>>>>> + +Unlike other capabilities in this section, this one does
>>>>>> not have an ioctl. +Instead, when the capability is present,
>>>>>> the H_PUT_TCE_INDIRECT and +H_STUFF_TCE hypercalls are to be
>>>>>> handled in the host kernel and not passed to +the guest.
>>>>>> Othwerwise it might be better for the guest to continue using
>>>>>> H_PUT_TCE +hypercall (if KVM_CAP_SPAPR_TCE or
>>>>>> KVM_CAP_SPAPR_TCE_IOMMU are present).
>>>>> 
>>>> 
>>>>> While this describes perfectly well what the consequences are of
>>>>> the patches, it does not describe properly what the CAP actually
>>>>> expresses. The CAP only says "this kernel is able to handle
>>>>> H_PUT_TCE_INDIRECT and H_STUFF_TCE hypercalls directly". All
>>>>> other consequences are nice to document, but the semantics of
>>>>> the CAP are missing.
>>>> 
>>>> 
>>>> ? It expresses ability to handle 2 hcalls. What is missing?
>>> 
>>> You don't describe the kvm <-> qemu interface. You describe some
>>> decisions qemu can take from this cap.
>> 
>> 
>> This file does not mention qemu at all. And the interface is - qemu
>> (or kvmtool could do that) just adds "hcall-multi-tce" to 
>> "ibm,hypertas-functions" but this is for pseries linux and AIX could
>> always do it (no idea about it). Does it really have to be in this
>> file?
> 

> Ok, let's go back a step. What does this CAP describe? Don't look at the
> description you wrote above. Just write a new one.

The CAP means the kernel is capable of handling hcalls A and B without
passing those into the user space. That accelerates DMA.


> What exactly can user space expect when it finds this CAP?

The user space can expect that its handlers for A and B are not going to be
called if it configures the guest appropriately.

Any better? :)


-- 
Alexey

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls
  2013-06-17  8:37       ` Benjamin Herrenschmidt
@ 2013-06-17  8:42         ` Alexander Graf
  0 siblings, 0 replies; 36+ messages in thread
From: Alexander Graf @ 2013-06-17  8:42 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: kvm, Alexey Kardashevskiy, linux-kernel, kvm-ppc, Paul Mackerras,
	linuxppc-dev, David Gibson


On 17.06.2013, at 10:37, Benjamin Herrenschmidt wrote:

> On Mon, 2013-06-17 at 17:55 +1000, Alexey Kardashevskiy wrote:
>> David:
>> =3D=3D=3D
>> So, in the case of MULTITCE, that's not quite right.  PR KVM can
>> emulate a PAPR system on a BookE machine, and there's no reason not =
to
>> allow TCE acceleration as well.  We can't make it dependent on PAPR
>> mode being selected, because that's enabled per-vcpu, whereas these
>> capabilities are queried on the VM before the vcpus are created.
>> =3D=3D=3D
>>=20
>> Wrong?
>=20
> The capability just tells qemu the kernel supports it, it doesn't have
> to depend on PAPR mode, qemu can sort things out no ?

Yes, this goes hand-in-hand with the documentation bit I'm trying to get =
through to Alexey atm. The CAP merely says that if in PAPR mode the =
kernel can handle hypercalls X and Y itself.

This is true for all book3s implementations as the patches stand. It is =
not true for BookE as the patches stand. Hence the CAP should be limited =
to book3s, regardless of its mode :).


Alex

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls
  2013-06-17  8:34         ` Alexey Kardashevskiy
@ 2013-06-17  8:40           ` Alexander Graf
  2013-06-17  8:51             ` Alexey Kardashevskiy
  0 siblings, 1 reply; 36+ messages in thread
From: Alexander Graf @ 2013-06-17  8:40 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, linux-kernel, kvm-ppc, Paul Mackerras, linuxppc-dev, David Gibson


On 17.06.2013, at 10:34, Alexey Kardashevskiy wrote:

> On 06/17/2013 06:02 PM, Alexander Graf wrote:
>>=20
>> On 17.06.2013, at 09:55, Alexey Kardashevskiy wrote:
>>=20
>>> On 06/17/2013 08:06 AM, Alexander Graf wrote:
>>>>=20
>>>> On 05.06.2013, at 08:11, Alexey Kardashevskiy wrote:
>>>>=20
>>>>> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
>>>>> H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO
>>>>> devices or emulated PCI.  These calls allow adding multiple =
entries
>>>>> (up to 512) into the TCE table in one call which saves time on
>>>>> transition to/from real mode.
>>>>>=20
>>>>> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
>>>>> (copied from user and verified) before writing the whole list into
>>>>> the TCE table. This cache will be utilized more in the upcoming
>>>>> VFIO/IOMMU support to continue TCE list processing in the virtual
>>>>> mode in the case if the real mode handler failed for some reason.
>>>>>=20
>>>>> This adds a guest physical to host real address converter
>>>>> and calls the existing H_PUT_TCE handler. The converting function
>>>>> is going to be fully utilized by upcoming VFIO supporting patches.
>>>>>=20
>>>>> This also implements the KVM_CAP_PPC_MULTITCE capability,
>>>>> so in order to support the functionality of this patch, QEMU
>>>>> needs to query for this capability and set the "hcall-multi-tce"
>>>>> hypertas property only if the capability is present, otherwise
>>>>> there will be serious performance degradation.
>>>>>=20
>>>>> Cc: David Gibson <david@gibson.dropbear.id.au>
>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>> Signed-off-by: Paul Mackerras <paulus@samba.org>
>>>>=20
>>>> Only a few minor nits. Ben already commented on implementation =
details.
>>>>=20
>>>>>=20
>>>>> ---
>>>>> Changelog:
>>>>> 2013/06/05:
>>>>> * fixed mistype about IBMVIO in the commit message
>>>>> * updated doc and moved it to another section
>>>>> * changed capability number
>>>>>=20
>>>>> 2013/05/21:
>>>>> * added kvm_vcpu_arch::tce_tmp
>>>>> * removed cleanup if put_indirect failed, instead we do not even =
start
>>>>> writing to TCE table if we cannot get TCEs from the user and they =
are
>>>>> invalid
>>>>> * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
>>>>> and kvmppc_emulated_validate_tce (for the previous item)
>>>>> * fixed bug with failthrough for H_IPI
>>>>> * removed all get_user() from real mode handlers
>>>>> * kvmppc_lookup_pte() added (instead of making lookup_linux_pte =
public)
>>>>> ---
>>>>> Documentation/virtual/kvm/api.txt       |   17 ++
>>>>> arch/powerpc/include/asm/kvm_host.h     |    2 +
>>>>> arch/powerpc/include/asm/kvm_ppc.h      |   16 +-
>>>>> arch/powerpc/kvm/book3s_64_vio.c        |  118 ++++++++++++++
>>>>> arch/powerpc/kvm/book3s_64_vio_hv.c     |  266 =
+++++++++++++++++++++++++++----
>>>>> arch/powerpc/kvm/book3s_hv.c            |   39 +++++
>>>>> arch/powerpc/kvm/book3s_hv_rmhandlers.S |    6 +
>>>>> arch/powerpc/kvm/book3s_pr_papr.c       |   37 ++++-
>>>>> arch/powerpc/kvm/powerpc.c              |    3 +
>>>>> include/uapi/linux/kvm.h                |    1 +
>>>>> 10 files changed, 473 insertions(+), 32 deletions(-)
>>>>>=20
>>>>> diff --git a/Documentation/virtual/kvm/api.txt =
b/Documentation/virtual/kvm/api.txt
>>>>> index 5f91eda..6c082ff 100644
>>>>> --- a/Documentation/virtual/kvm/api.txt
>>>>> +++ b/Documentation/virtual/kvm/api.txt
>>>>> @@ -2362,6 +2362,23 @@ calls by the guest for that service will be =
passed to userspace to be
>>>>> handled.
>>>>>=20
>>>>>=20
>>>>> +4.83 KVM_CAP_PPC_MULTITCE
>>>>> +
>>>>> +Capability: KVM_CAP_PPC_MULTITCE
>>>>> +Architectures: ppc
>>>>> +Type: vm
>>>>> +
>>>>> +This capability tells the guest that multiple TCE entry =
add/remove hypercalls
>>>>> +handling is supported by the kernel. This significanly =
accelerates DMA
>>>>> +operations for PPC KVM guests.
>>>>> +
>>>>> +Unlike other capabilities in this section, this one does not have =
an ioctl.
>>>>> +Instead, when the capability is present, the H_PUT_TCE_INDIRECT =
and
>>>>> +H_STUFF_TCE hypercalls are to be handled in the host kernel and =
not passed to
>>>>> +the guest. Othwerwise it might be better for the guest to =
continue using H_PUT_TCE
>>>>> +hypercall (if KVM_CAP_SPAPR_TCE or KVM_CAP_SPAPR_TCE_IOMMU are =
present).
>>>>=20
>>>=20
>>>> While this describes perfectly well what the consequences are of =
the
>>>> patches, it does not describe properly what the CAP actually =
expresses.
>>>> The CAP only says "this kernel is able to handle H_PUT_TCE_INDIRECT =
and
>>>> H_STUFF_TCE hypercalls directly". All other consequences are nice =
to
>>>> document, but the semantics of the CAP are missing.
>>>=20
>>>=20
>>> ? It expresses ability to handle 2 hcalls. What is missing?
>>=20
>> You don't describe the kvm <-> qemu interface. You describe some =
decisions qemu can take from this cap.
>=20
>=20
> This file does not mention qemu at all. And the interface is - qemu =
(or
> kvmtool could do that) just adds "hcall-multi-tce" to
> "ibm,hypertas-functions" but this is for pseries linux and AIX could =
always
> do it (no idea about it). Does it really have to be in this file?

Ok, let's go back a step. What does this CAP describe? Don't look at the =
description you wrote above. Just write a new one. What exactly can user =
space expect when it finds this CAP?

>=20
>=20
>=20
>>>> We also usually try to keep KVM behavior unchanged with regards to =
older
>>>> versions until a CAP is enabled. In this case I don't think it =
matters
>>>> all that much, so I'm fine with declaring it as enabled by default.
>>>> Please document that this is a change in behavior versus older KVM
>>>> versions though.
>>>=20
>>>=20
>>> Ok!
>>>=20
>>>=20
>>>>> +
>>>>> +
>>>>> 5. The kvm_run structure
>>>>> ------------------------
>>>>>=20
>>>>> diff --git a/arch/powerpc/include/asm/kvm_host.h =
b/arch/powerpc/include/asm/kvm_host.h
>>>>> index af326cd..85d8f26 100644
>>>>> --- a/arch/powerpc/include/asm/kvm_host.h
>>>>> +++ b/arch/powerpc/include/asm/kvm_host.h
>>>>> @@ -609,6 +609,8 @@ struct kvm_vcpu_arch {
>>>>> 	spinlock_t tbacct_lock;
>>>>> 	u64 busy_stolen;
>>>>> 	u64 busy_preempt;
>>>>> +
>>>>> +	unsigned long *tce_tmp;    /* TCE cache for TCE_PUT_INDIRECT =
hall */
>>>>> #endif
>>>>> };
>>>>=20
>>>> [...]
>>>>>=20
>>>>>=20
>>>>=20
>>>> [...]
>>>>=20
>>>>> diff --git a/arch/powerpc/kvm/book3s_hv.c =
b/arch/powerpc/kvm/book3s_hv.c
>>>>> index 550f592..a39039a 100644
>>>>> --- a/arch/powerpc/kvm/book3s_hv.c
>>>>> +++ b/arch/powerpc/kvm/book3s_hv.c
>>>>> @@ -568,6 +568,30 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu =
*vcpu)
>>>>> 			ret =3D kvmppc_xics_hcall(vcpu, req);
>>>>> 			break;
>>>>> 		} /* fallthrough */
>>>>=20
>>>> The fallthrough comment isn't accurate anymore.
>>>>=20
>>>>> +		return RESUME_HOST;
>>>>> +	case H_PUT_TCE:
>>>>> +		ret =3D kvmppc_virtmode_h_put_tce(vcpu, =
kvmppc_get_gpr(vcpu, 4),
>>>>> +						kvmppc_get_gpr(vcpu, 5),
>>>>> +						kvmppc_get_gpr(vcpu, =
6));
>>>>> +		if (ret =3D=3D H_TOO_HARD)
>>>>> +			return RESUME_HOST;
>>>>> +		break;
>>>>> +	case H_PUT_TCE_INDIRECT:
>>>>> +		ret =3D kvmppc_virtmode_h_put_tce_indirect(vcpu, =
kvmppc_get_gpr(vcpu, 4),
>>>>> +						kvmppc_get_gpr(vcpu, 5),
>>>>> +						kvmppc_get_gpr(vcpu, 6),
>>>>> +						kvmppc_get_gpr(vcpu, =
7));
>>>>> +		if (ret =3D=3D H_TOO_HARD)
>>>>> +			return RESUME_HOST;
>>>>> +		break;
>>>>> +	case H_STUFF_TCE:
>>>>> +		ret =3D kvmppc_virtmode_h_stuff_tce(vcpu, =
kvmppc_get_gpr(vcpu, 4),
>>>>> +						kvmppc_get_gpr(vcpu, 5),
>>>>> +						kvmppc_get_gpr(vcpu, 6),
>>>>> +						kvmppc_get_gpr(vcpu, =
7));
>>>>> +		if (ret =3D=3D H_TOO_HARD)
>>>>> +			return RESUME_HOST;
>>>>> +		break;
>>>>> 	default:
>>>>> 		return RESUME_HOST;
>>>>> 	}
>>>>> @@ -958,6 +982,20 @@ struct kvm_vcpu =
*kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
>>>>> 	vcpu->arch.cpu_type =3D KVM_CPU_3S_64;
>>>>> 	kvmppc_sanity_check(vcpu);
>>>>>=20
>>>>> +	/*
>>>>> +	 * As we want to minimize the chance of having =
H_PUT_TCE_INDIRECT
>>>>> +	 * half executed, we first read TCEs from the user, check them =
and
>>>>> +	 * return error if something went wrong and only then put TCEs =
into
>>>>> +	 * the TCE table.
>>>>> +	 *
>>>>> +	 * tce_tmp is a cache for TCEs to avoid stack allocation or
>>>>> +	 * kmalloc as the whole TCE list can take up to 512 items 8 =
bytes
>>>>> +	 * each (4096 bytes).
>>>>> +	 */
>>>>> +	vcpu->arch.tce_tmp =3D kmalloc(4096, GFP_KERNEL);
>>>>> +	if (!vcpu->arch.tce_tmp)
>>>>> +		goto free_vcpu;
>>>>> +
>>>>> 	return vcpu;
>>>>>=20
>>>>> free_vcpu:
>>>>> @@ -980,6 +1018,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu =
*vcpu)
>>>>> 	unpin_vpa(vcpu->kvm, &vcpu->arch.slb_shadow);
>>>>> 	unpin_vpa(vcpu->kvm, &vcpu->arch.vpa);
>>>>> 	spin_unlock(&vcpu->arch.vpa_update_lock);
>>>>> +	kfree(vcpu->arch.tce_tmp);
>>>>> 	kvm_vcpu_uninit(vcpu);
>>>>> 	kmem_cache_free(kvm_vcpu_cache, vcpu);
>>>>> }
>>>>> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S =
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>>>>> index b02f91e..d35554e 100644
>>>>> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>>>>> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>>>>> @@ -1490,6 +1490,12 @@ hcall_real_table:
>>>>> 	.long	0		/* 0x11c */
>>>>> 	.long	0		/* 0x120 */
>>>>> 	.long	.kvmppc_h_bulk_remove - hcall_real_table
>>>>> +	.long	0		/* 0x128 */
>>>>> +	.long	0		/* 0x12c */
>>>>> +	.long	0		/* 0x130 */
>>>>> +	.long	0		/* 0x134 */
>>>>> +	.long	.kvmppc_h_stuff_tce - hcall_real_table
>>>>> +	.long	.kvmppc_h_put_tce_indirect - hcall_real_table
>>>>> hcall_real_table_end:
>>>>>=20
>>>>> ignore_hdec:
>>>>> diff --git a/arch/powerpc/kvm/book3s_pr_papr.c =
b/arch/powerpc/kvm/book3s_pr_papr.c
>>>>> index da0e0bc..91d4b45 100644
>>>>> --- a/arch/powerpc/kvm/book3s_pr_papr.c
>>>>> +++ b/arch/powerpc/kvm/book3s_pr_papr.c
>>>>> @@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct =
kvm_vcpu *vcpu)
>>>>> 	unsigned long tce =3D kvmppc_get_gpr(vcpu, 6);
>>>>> 	long rc;
>>>>>=20
>>>>> -	rc =3D kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
>>>>> +	rc =3D kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce);
>>>>> +	if (rc =3D=3D H_TOO_HARD)
>>>>> +		return EMULATE_FAIL;
>>>>> +	kvmppc_set_gpr(vcpu, 3, rc);
>>>>> +	return EMULATE_DONE;
>>>>> +}
>>>>> +
>>>>> +static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu)
>>>>> +{
>>>>> +	unsigned long liobn =3D kvmppc_get_gpr(vcpu, 4);
>>>>> +	unsigned long ioba =3D kvmppc_get_gpr(vcpu, 5);
>>>>> +	unsigned long tce =3D kvmppc_get_gpr(vcpu, 6);
>>>>> +	unsigned long npages =3D kvmppc_get_gpr(vcpu, 7);
>>>>> +	long rc;
>>>>> +
>>>>> +	rc =3D kvmppc_virtmode_h_put_tce_indirect(vcpu, liobn, ioba,
>>>>> +			tce, npages);
>>>>> +	if (rc =3D=3D H_TOO_HARD)
>>>>> +		return EMULATE_FAIL;
>>>>> +	kvmppc_set_gpr(vcpu, 3, rc);
>>>>> +	return EMULATE_DONE;
>>>>> +}
>>>>> +
>>>>> +static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu)
>>>>> +{
>>>>> +	unsigned long liobn =3D kvmppc_get_gpr(vcpu, 4);
>>>>> +	unsigned long ioba =3D kvmppc_get_gpr(vcpu, 5);
>>>>> +	unsigned long tce_value =3D kvmppc_get_gpr(vcpu, 6);
>>>>> +	unsigned long npages =3D kvmppc_get_gpr(vcpu, 7);
>>>>> +	long rc;
>>>>> +
>>>>> +	rc =3D kvmppc_virtmode_h_stuff_tce(vcpu, liobn, ioba, tce_value, =
npages);
>>>>> 	if (rc =3D=3D H_TOO_HARD)
>>>>> 		return EMULATE_FAIL;
>>>>> 	kvmppc_set_gpr(vcpu, 3, rc);
>>>>> @@ -247,6 +278,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, =
unsigned long cmd)
>>>>> 		return kvmppc_h_pr_bulk_remove(vcpu);
>>>>> 	case H_PUT_TCE:
>>>>> 		return kvmppc_h_pr_put_tce(vcpu);
>>>>> +	case H_PUT_TCE_INDIRECT:
>>>>> +		return kvmppc_h_pr_put_tce_indirect(vcpu);
>>>>> +	case H_STUFF_TCE:
>>>>> +		return kvmppc_h_pr_stuff_tce(vcpu);
>>>>> 	case H_CEDE:
>>>>> 		vcpu->arch.shared->msr |=3D MSR_EE;
>>>>> 		kvm_vcpu_block(vcpu);
>>>>> diff --git a/arch/powerpc/kvm/powerpc.c =
b/arch/powerpc/kvm/powerpc.c
>>>>> index 6316ee3..8465c2a 100644
>>>>> --- a/arch/powerpc/kvm/powerpc.c
>>>>> +++ b/arch/powerpc/kvm/powerpc.c
>>>>> @@ -395,6 +395,9 @@ int kvm_dev_ioctl_check_extension(long ext)
>>>>> 		r =3D 1;
>>>>> 		break;
>>>>> #endif
>>>>> +	case KVM_CAP_SPAPR_MULTITCE:
>>>>> +		r =3D 1;
>>>>=20
>>>> This should only be true for book3s.
>>>=20
>>>=20
>>> We had this discussion with v2.
>>>=20
>>> David:
>>> =3D=3D=3D
>>> So, in the case of MULTITCE, that's not quite right.  PR KVM can
>>> emulate a PAPR system on a BookE machine, and there's no reason not =
to
>>> allow TCE acceleration as well.  We can't make it dependent on PAPR
>>> mode being selected, because that's enabled per-vcpu, whereas these
>>> capabilities are queried on the VM before the vcpus are created.
>>> =3D=3D=3D
>>>=20
>>> Wrong?
>=20
>> Partially. BookE can not emulate a PAPR system as it stands today.
>=20
> Oh.
> Ok.
> So - #ifdef CONFIG_PPC_BOOK3S_64 ? Or run-time check for book3s =
(how...)?

The former.


Alex

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls
  2013-06-17  7:55     ` Alexey Kardashevskiy
  2013-06-17  8:02       ` Alexander Graf
@ 2013-06-17  8:37       ` Benjamin Herrenschmidt
  2013-06-17  8:42         ` Alexander Graf
  1 sibling, 1 reply; 36+ messages in thread
From: Benjamin Herrenschmidt @ 2013-06-17  8:37 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, linux-kernel, kvm-ppc, Alexander Graf, Paul Mackerras,
	linuxppc-dev, David Gibson

On Mon, 2013-06-17 at 17:55 +1000, Alexey Kardashevskiy wrote:
> David:
> ===
> So, in the case of MULTITCE, that's not quite right.  PR KVM can
> emulate a PAPR system on a BookE machine, and there's no reason not to
> allow TCE acceleration as well.  We can't make it dependent on PAPR
> mode being selected, because that's enabled per-vcpu, whereas these
> capabilities are queried on the VM before the vcpus are created.
> ===
> 
> Wrong?

The capability just tells qemu the kernel supports it, it doesn't have
to depend on PAPR mode, qemu can sort things out no ?

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls
  2013-06-17  8:02       ` Alexander Graf
@ 2013-06-17  8:34         ` Alexey Kardashevskiy
  2013-06-17  8:40           ` Alexander Graf
  0 siblings, 1 reply; 36+ messages in thread
From: Alexey Kardashevskiy @ 2013-06-17  8:34 UTC (permalink / raw)
  To: Alexander Graf
  Cc: kvm, linux-kernel, kvm-ppc, Paul Mackerras, linuxppc-dev, David Gibson

On 06/17/2013 06:02 PM, Alexander Graf wrote:
> 
> On 17.06.2013, at 09:55, Alexey Kardashevskiy wrote:
> 
>> On 06/17/2013 08:06 AM, Alexander Graf wrote:
>>>
>>> On 05.06.2013, at 08:11, Alexey Kardashevskiy wrote:
>>>
>>>> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
>>>> H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO
>>>> devices or emulated PCI.  These calls allow adding multiple entries
>>>> (up to 512) into the TCE table in one call which saves time on
>>>> transition to/from real mode.
>>>>
>>>> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
>>>> (copied from user and verified) before writing the whole list into
>>>> the TCE table. This cache will be utilized more in the upcoming
>>>> VFIO/IOMMU support to continue TCE list processing in the virtual
>>>> mode in the case if the real mode handler failed for some reason.
>>>>
>>>> This adds a guest physical to host real address converter
>>>> and calls the existing H_PUT_TCE handler. The converting function
>>>> is going to be fully utilized by upcoming VFIO supporting patches.
>>>>
>>>> This also implements the KVM_CAP_PPC_MULTITCE capability,
>>>> so in order to support the functionality of this patch, QEMU
>>>> needs to query for this capability and set the "hcall-multi-tce"
>>>> hypertas property only if the capability is present, otherwise
>>>> there will be serious performance degradation.
>>>>
>>>> Cc: David Gibson <david@gibson.dropbear.id.au>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> Signed-off-by: Paul Mackerras <paulus@samba.org>
>>>
>>> Only a few minor nits. Ben already commented on implementation details.
>>>
>>>>
>>>> ---
>>>> Changelog:
>>>> 2013/06/05:
>>>> * fixed mistype about IBMVIO in the commit message
>>>> * updated doc and moved it to another section
>>>> * changed capability number
>>>>
>>>> 2013/05/21:
>>>> * added kvm_vcpu_arch::tce_tmp
>>>> * removed cleanup if put_indirect failed, instead we do not even start
>>>> writing to TCE table if we cannot get TCEs from the user and they are
>>>> invalid
>>>> * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
>>>> and kvmppc_emulated_validate_tce (for the previous item)
>>>> * fixed bug with failthrough for H_IPI
>>>> * removed all get_user() from real mode handlers
>>>> * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
>>>> ---
>>>> Documentation/virtual/kvm/api.txt       |   17 ++
>>>> arch/powerpc/include/asm/kvm_host.h     |    2 +
>>>> arch/powerpc/include/asm/kvm_ppc.h      |   16 +-
>>>> arch/powerpc/kvm/book3s_64_vio.c        |  118 ++++++++++++++
>>>> arch/powerpc/kvm/book3s_64_vio_hv.c     |  266 +++++++++++++++++++++++++++----
>>>> arch/powerpc/kvm/book3s_hv.c            |   39 +++++
>>>> arch/powerpc/kvm/book3s_hv_rmhandlers.S |    6 +
>>>> arch/powerpc/kvm/book3s_pr_papr.c       |   37 ++++-
>>>> arch/powerpc/kvm/powerpc.c              |    3 +
>>>> include/uapi/linux/kvm.h                |    1 +
>>>> 10 files changed, 473 insertions(+), 32 deletions(-)
>>>>
>>>> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
>>>> index 5f91eda..6c082ff 100644
>>>> --- a/Documentation/virtual/kvm/api.txt
>>>> +++ b/Documentation/virtual/kvm/api.txt
>>>> @@ -2362,6 +2362,23 @@ calls by the guest for that service will be passed to userspace to be
>>>> handled.
>>>>
>>>>
>>>> +4.83 KVM_CAP_PPC_MULTITCE
>>>> +
>>>> +Capability: KVM_CAP_PPC_MULTITCE
>>>> +Architectures: ppc
>>>> +Type: vm
>>>> +
>>>> +This capability tells the guest that multiple TCE entry add/remove hypercalls
>>>> +handling is supported by the kernel. This significanly accelerates DMA
>>>> +operations for PPC KVM guests.
>>>> +
>>>> +Unlike other capabilities in this section, this one does not have an ioctl.
>>>> +Instead, when the capability is present, the H_PUT_TCE_INDIRECT and
>>>> +H_STUFF_TCE hypercalls are to be handled in the host kernel and not passed to
>>>> +the guest. Othwerwise it might be better for the guest to continue using H_PUT_TCE
>>>> +hypercall (if KVM_CAP_SPAPR_TCE or KVM_CAP_SPAPR_TCE_IOMMU are present).
>>>
>>
>>> While this describes perfectly well what the consequences are of the
>>> patches, it does not describe properly what the CAP actually expresses.
>>> The CAP only says "this kernel is able to handle H_PUT_TCE_INDIRECT and
>>> H_STUFF_TCE hypercalls directly". All other consequences are nice to
>>> document, but the semantics of the CAP are missing.
>>
>>
>> ? It expresses ability to handle 2 hcalls. What is missing?
> 
> You don't describe the kvm <-> qemu interface. You describe some decisions qemu can take from this cap.


This file does not mention qemu at all. And the interface is - qemu (or
kvmtool could do that) just adds "hcall-multi-tce" to
"ibm,hypertas-functions" but this is for pseries linux and AIX could always
do it (no idea about it). Does it really have to be in this file?



>>> We also usually try to keep KVM behavior unchanged with regards to older
>>> versions until a CAP is enabled. In this case I don't think it matters
>>> all that much, so I'm fine with declaring it as enabled by default.
>>> Please document that this is a change in behavior versus older KVM
>>> versions though.
>>
>>
>> Ok!
>>
>>
>>>> +
>>>> +
>>>> 5. The kvm_run structure
>>>> ------------------------
>>>>
>>>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
>>>> index af326cd..85d8f26 100644
>>>> --- a/arch/powerpc/include/asm/kvm_host.h
>>>> +++ b/arch/powerpc/include/asm/kvm_host.h
>>>> @@ -609,6 +609,8 @@ struct kvm_vcpu_arch {
>>>> 	spinlock_t tbacct_lock;
>>>> 	u64 busy_stolen;
>>>> 	u64 busy_preempt;
>>>> +
>>>> +	unsigned long *tce_tmp;    /* TCE cache for TCE_PUT_INDIRECT hall */
>>>> #endif
>>>> };
>>>
>>> [...]
>>>>
>>>>
>>>
>>> [...]
>>>
>>>> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
>>>> index 550f592..a39039a 100644
>>>> --- a/arch/powerpc/kvm/book3s_hv.c
>>>> +++ b/arch/powerpc/kvm/book3s_hv.c
>>>> @@ -568,6 +568,30 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
>>>> 			ret = kvmppc_xics_hcall(vcpu, req);
>>>> 			break;
>>>> 		} /* fallthrough */
>>>
>>> The fallthrough comment isn't accurate anymore.
>>>
>>>> +		return RESUME_HOST;
>>>> +	case H_PUT_TCE:
>>>> +		ret = kvmppc_virtmode_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
>>>> +						kvmppc_get_gpr(vcpu, 5),
>>>> +						kvmppc_get_gpr(vcpu, 6));
>>>> +		if (ret == H_TOO_HARD)
>>>> +			return RESUME_HOST;
>>>> +		break;
>>>> +	case H_PUT_TCE_INDIRECT:
>>>> +		ret = kvmppc_virtmode_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4),
>>>> +						kvmppc_get_gpr(vcpu, 5),
>>>> +						kvmppc_get_gpr(vcpu, 6),
>>>> +						kvmppc_get_gpr(vcpu, 7));
>>>> +		if (ret == H_TOO_HARD)
>>>> +			return RESUME_HOST;
>>>> +		break;
>>>> +	case H_STUFF_TCE:
>>>> +		ret = kvmppc_virtmode_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
>>>> +						kvmppc_get_gpr(vcpu, 5),
>>>> +						kvmppc_get_gpr(vcpu, 6),
>>>> +						kvmppc_get_gpr(vcpu, 7));
>>>> +		if (ret == H_TOO_HARD)
>>>> +			return RESUME_HOST;
>>>> +		break;
>>>> 	default:
>>>> 		return RESUME_HOST;
>>>> 	}
>>>> @@ -958,6 +982,20 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
>>>> 	vcpu->arch.cpu_type = KVM_CPU_3S_64;
>>>> 	kvmppc_sanity_check(vcpu);
>>>>
>>>> +	/*
>>>> +	 * As we want to minimize the chance of having H_PUT_TCE_INDIRECT
>>>> +	 * half executed, we first read TCEs from the user, check them and
>>>> +	 * return error if something went wrong and only then put TCEs into
>>>> +	 * the TCE table.
>>>> +	 *
>>>> +	 * tce_tmp is a cache for TCEs to avoid stack allocation or
>>>> +	 * kmalloc as the whole TCE list can take up to 512 items 8 bytes
>>>> +	 * each (4096 bytes).
>>>> +	 */
>>>> +	vcpu->arch.tce_tmp = kmalloc(4096, GFP_KERNEL);
>>>> +	if (!vcpu->arch.tce_tmp)
>>>> +		goto free_vcpu;
>>>> +
>>>> 	return vcpu;
>>>>
>>>> free_vcpu:
>>>> @@ -980,6 +1018,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
>>>> 	unpin_vpa(vcpu->kvm, &vcpu->arch.slb_shadow);
>>>> 	unpin_vpa(vcpu->kvm, &vcpu->arch.vpa);
>>>> 	spin_unlock(&vcpu->arch.vpa_update_lock);
>>>> +	kfree(vcpu->arch.tce_tmp);
>>>> 	kvm_vcpu_uninit(vcpu);
>>>> 	kmem_cache_free(kvm_vcpu_cache, vcpu);
>>>> }
>>>> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>>>> index b02f91e..d35554e 100644
>>>> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>>>> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>>>> @@ -1490,6 +1490,12 @@ hcall_real_table:
>>>> 	.long	0		/* 0x11c */
>>>> 	.long	0		/* 0x120 */
>>>> 	.long	.kvmppc_h_bulk_remove - hcall_real_table
>>>> +	.long	0		/* 0x128 */
>>>> +	.long	0		/* 0x12c */
>>>> +	.long	0		/* 0x130 */
>>>> +	.long	0		/* 0x134 */
>>>> +	.long	.kvmppc_h_stuff_tce - hcall_real_table
>>>> +	.long	.kvmppc_h_put_tce_indirect - hcall_real_table
>>>> hcall_real_table_end:
>>>>
>>>> ignore_hdec:
>>>> diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c
>>>> index da0e0bc..91d4b45 100644
>>>> --- a/arch/powerpc/kvm/book3s_pr_papr.c
>>>> +++ b/arch/powerpc/kvm/book3s_pr_papr.c
>>>> @@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu)
>>>> 	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
>>>> 	long rc;
>>>>
>>>> -	rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
>>>> +	rc = kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce);
>>>> +	if (rc == H_TOO_HARD)
>>>> +		return EMULATE_FAIL;
>>>> +	kvmppc_set_gpr(vcpu, 3, rc);
>>>> +	return EMULATE_DONE;
>>>> +}
>>>> +
>>>> +static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu)
>>>> +{
>>>> +	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
>>>> +	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
>>>> +	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
>>>> +	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
>>>> +	long rc;
>>>> +
>>>> +	rc = kvmppc_virtmode_h_put_tce_indirect(vcpu, liobn, ioba,
>>>> +			tce, npages);
>>>> +	if (rc == H_TOO_HARD)
>>>> +		return EMULATE_FAIL;
>>>> +	kvmppc_set_gpr(vcpu, 3, rc);
>>>> +	return EMULATE_DONE;
>>>> +}
>>>> +
>>>> +static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu)
>>>> +{
>>>> +	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
>>>> +	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
>>>> +	unsigned long tce_value = kvmppc_get_gpr(vcpu, 6);
>>>> +	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
>>>> +	long rc;
>>>> +
>>>> +	rc = kvmppc_virtmode_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages);
>>>> 	if (rc == H_TOO_HARD)
>>>> 		return EMULATE_FAIL;
>>>> 	kvmppc_set_gpr(vcpu, 3, rc);
>>>> @@ -247,6 +278,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd)
>>>> 		return kvmppc_h_pr_bulk_remove(vcpu);
>>>> 	case H_PUT_TCE:
>>>> 		return kvmppc_h_pr_put_tce(vcpu);
>>>> +	case H_PUT_TCE_INDIRECT:
>>>> +		return kvmppc_h_pr_put_tce_indirect(vcpu);
>>>> +	case H_STUFF_TCE:
>>>> +		return kvmppc_h_pr_stuff_tce(vcpu);
>>>> 	case H_CEDE:
>>>> 		vcpu->arch.shared->msr |= MSR_EE;
>>>> 		kvm_vcpu_block(vcpu);
>>>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>>>> index 6316ee3..8465c2a 100644
>>>> --- a/arch/powerpc/kvm/powerpc.c
>>>> +++ b/arch/powerpc/kvm/powerpc.c
>>>> @@ -395,6 +395,9 @@ int kvm_dev_ioctl_check_extension(long ext)
>>>> 		r = 1;
>>>> 		break;
>>>> #endif
>>>> +	case KVM_CAP_SPAPR_MULTITCE:
>>>> +		r = 1;
>>>
>>> This should only be true for book3s.
>>
>>
>> We had this discussion with v2.
>>
>> David:
>> ===
>> So, in the case of MULTITCE, that's not quite right.  PR KVM can
>> emulate a PAPR system on a BookE machine, and there's no reason not to
>> allow TCE acceleration as well.  We can't make it dependent on PAPR
>> mode being selected, because that's enabled per-vcpu, whereas these
>> capabilities are queried on the VM before the vcpus are created.
>> ===
>>
>> Wrong?

> Partially. BookE can not emulate a PAPR system as it stands today.

Oh.
Ok.
So - #ifdef CONFIG_PPC_BOOK3S_64 ? Or run-time check for book3s (how...)?




-- 
Alexey

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls
  2013-06-17  7:55     ` Alexey Kardashevskiy
@ 2013-06-17  8:02       ` Alexander Graf
  2013-06-17  8:34         ` Alexey Kardashevskiy
  2013-06-17  8:37       ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 36+ messages in thread
From: Alexander Graf @ 2013-06-17  8:02 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, linux-kernel, kvm-ppc, Paul Mackerras, linuxppc-dev, David Gibson


On 17.06.2013, at 09:55, Alexey Kardashevskiy wrote:

> On 06/17/2013 08:06 AM, Alexander Graf wrote:
>>=20
>> On 05.06.2013, at 08:11, Alexey Kardashevskiy wrote:
>>=20
>>> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
>>> H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO
>>> devices or emulated PCI.  These calls allow adding multiple entries
>>> (up to 512) into the TCE table in one call which saves time on
>>> transition to/from real mode.
>>>=20
>>> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
>>> (copied from user and verified) before writing the whole list into
>>> the TCE table. This cache will be utilized more in the upcoming
>>> VFIO/IOMMU support to continue TCE list processing in the virtual
>>> mode in the case if the real mode handler failed for some reason.
>>>=20
>>> This adds a guest physical to host real address converter
>>> and calls the existing H_PUT_TCE handler. The converting function
>>> is going to be fully utilized by upcoming VFIO supporting patches.
>>>=20
>>> This also implements the KVM_CAP_PPC_MULTITCE capability,
>>> so in order to support the functionality of this patch, QEMU
>>> needs to query for this capability and set the "hcall-multi-tce"
>>> hypertas property only if the capability is present, otherwise
>>> there will be serious performance degradation.
>>>=20
>>> Cc: David Gibson <david@gibson.dropbear.id.au>
>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>> Signed-off-by: Paul Mackerras <paulus@samba.org>
>>=20
>> Only a few minor nits. Ben already commented on implementation =
details.
>>=20
>>>=20
>>> ---
>>> Changelog:
>>> 2013/06/05:
>>> * fixed mistype about IBMVIO in the commit message
>>> * updated doc and moved it to another section
>>> * changed capability number
>>>=20
>>> 2013/05/21:
>>> * added kvm_vcpu_arch::tce_tmp
>>> * removed cleanup if put_indirect failed, instead we do not even =
start
>>> writing to TCE table if we cannot get TCEs from the user and they =
are
>>> invalid
>>> * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
>>> and kvmppc_emulated_validate_tce (for the previous item)
>>> * fixed bug with failthrough for H_IPI
>>> * removed all get_user() from real mode handlers
>>> * kvmppc_lookup_pte() added (instead of making lookup_linux_pte =
public)
>>> ---
>>> Documentation/virtual/kvm/api.txt       |   17 ++
>>> arch/powerpc/include/asm/kvm_host.h     |    2 +
>>> arch/powerpc/include/asm/kvm_ppc.h      |   16 +-
>>> arch/powerpc/kvm/book3s_64_vio.c        |  118 ++++++++++++++
>>> arch/powerpc/kvm/book3s_64_vio_hv.c     |  266 =
+++++++++++++++++++++++++++----
>>> arch/powerpc/kvm/book3s_hv.c            |   39 +++++
>>> arch/powerpc/kvm/book3s_hv_rmhandlers.S |    6 +
>>> arch/powerpc/kvm/book3s_pr_papr.c       |   37 ++++-
>>> arch/powerpc/kvm/powerpc.c              |    3 +
>>> include/uapi/linux/kvm.h                |    1 +
>>> 10 files changed, 473 insertions(+), 32 deletions(-)
>>>=20
>>> diff --git a/Documentation/virtual/kvm/api.txt =
b/Documentation/virtual/kvm/api.txt
>>> index 5f91eda..6c082ff 100644
>>> --- a/Documentation/virtual/kvm/api.txt
>>> +++ b/Documentation/virtual/kvm/api.txt
>>> @@ -2362,6 +2362,23 @@ calls by the guest for that service will be =
passed to userspace to be
>>> handled.
>>>=20
>>>=20
>>> +4.83 KVM_CAP_PPC_MULTITCE
>>> +
>>> +Capability: KVM_CAP_PPC_MULTITCE
>>> +Architectures: ppc
>>> +Type: vm
>>> +
>>> +This capability tells the guest that multiple TCE entry add/remove =
hypercalls
>>> +handling is supported by the kernel. This significanly accelerates =
DMA
>>> +operations for PPC KVM guests.
>>> +
>>> +Unlike other capabilities in this section, this one does not have =
an ioctl.
>>> +Instead, when the capability is present, the H_PUT_TCE_INDIRECT and
>>> +H_STUFF_TCE hypercalls are to be handled in the host kernel and not =
passed to
>>> +the guest. Othwerwise it might be better for the guest to continue =
using H_PUT_TCE
>>> +hypercall (if KVM_CAP_SPAPR_TCE or KVM_CAP_SPAPR_TCE_IOMMU are =
present).
>>=20
>=20
>> While this describes perfectly well what the consequences are of the
>> patches, it does not describe properly what the CAP actually =
expresses.
>> The CAP only says "this kernel is able to handle H_PUT_TCE_INDIRECT =
and
>> H_STUFF_TCE hypercalls directly". All other consequences are nice to
>> document, but the semantics of the CAP are missing.
>=20
>=20
> ? It expresses ability to handle 2 hcalls. What is missing?

You don't describe the kvm <-> qemu interface. You describe some =
decisions qemu can take from this cap.

>=20
>=20
>> We also usually try to keep KVM behavior unchanged with regards to =
older
>> versions until a CAP is enabled. In this case I don't think it =
matters
>> all that much, so I'm fine with declaring it as enabled by default.
>> Please document that this is a change in behavior versus older KVM
>> versions though.
>=20
>=20
> Ok!
>=20
>=20
>>> +
>>> +
>>> 5. The kvm_run structure
>>> ------------------------
>>>=20
>>> diff --git a/arch/powerpc/include/asm/kvm_host.h =
b/arch/powerpc/include/asm/kvm_host.h
>>> index af326cd..85d8f26 100644
>>> --- a/arch/powerpc/include/asm/kvm_host.h
>>> +++ b/arch/powerpc/include/asm/kvm_host.h
>>> @@ -609,6 +609,8 @@ struct kvm_vcpu_arch {
>>> 	spinlock_t tbacct_lock;
>>> 	u64 busy_stolen;
>>> 	u64 busy_preempt;
>>> +
>>> +	unsigned long *tce_tmp;    /* TCE cache for TCE_PUT_INDIRECT =
hall */
>>> #endif
>>> };
>>=20
>> [...]
>>>=20
>>>=20
>>=20
>> [...]
>>=20
>>> diff --git a/arch/powerpc/kvm/book3s_hv.c =
b/arch/powerpc/kvm/book3s_hv.c
>>> index 550f592..a39039a 100644
>>> --- a/arch/powerpc/kvm/book3s_hv.c
>>> +++ b/arch/powerpc/kvm/book3s_hv.c
>>> @@ -568,6 +568,30 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu =
*vcpu)
>>> 			ret =3D kvmppc_xics_hcall(vcpu, req);
>>> 			break;
>>> 		} /* fallthrough */
>>=20
>> The fallthrough comment isn't accurate anymore.
>>=20
>>> +		return RESUME_HOST;
>>> +	case H_PUT_TCE:
>>> +		ret =3D kvmppc_virtmode_h_put_tce(vcpu, =
kvmppc_get_gpr(vcpu, 4),
>>> +						kvmppc_get_gpr(vcpu, 5),
>>> +						kvmppc_get_gpr(vcpu, =
6));
>>> +		if (ret =3D=3D H_TOO_HARD)
>>> +			return RESUME_HOST;
>>> +		break;
>>> +	case H_PUT_TCE_INDIRECT:
>>> +		ret =3D kvmppc_virtmode_h_put_tce_indirect(vcpu, =
kvmppc_get_gpr(vcpu, 4),
>>> +						kvmppc_get_gpr(vcpu, 5),
>>> +						kvmppc_get_gpr(vcpu, 6),
>>> +						kvmppc_get_gpr(vcpu, =
7));
>>> +		if (ret =3D=3D H_TOO_HARD)
>>> +			return RESUME_HOST;
>>> +		break;
>>> +	case H_STUFF_TCE:
>>> +		ret =3D kvmppc_virtmode_h_stuff_tce(vcpu, =
kvmppc_get_gpr(vcpu, 4),
>>> +						kvmppc_get_gpr(vcpu, 5),
>>> +						kvmppc_get_gpr(vcpu, 6),
>>> +						kvmppc_get_gpr(vcpu, =
7));
>>> +		if (ret =3D=3D H_TOO_HARD)
>>> +			return RESUME_HOST;
>>> +		break;
>>> 	default:
>>> 		return RESUME_HOST;
>>> 	}
>>> @@ -958,6 +982,20 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct =
kvm *kvm, unsigned int id)
>>> 	vcpu->arch.cpu_type =3D KVM_CPU_3S_64;
>>> 	kvmppc_sanity_check(vcpu);
>>>=20
>>> +	/*
>>> +	 * As we want to minimize the chance of having =
H_PUT_TCE_INDIRECT
>>> +	 * half executed, we first read TCEs from the user, check them =
and
>>> +	 * return error if something went wrong and only then put TCEs =
into
>>> +	 * the TCE table.
>>> +	 *
>>> +	 * tce_tmp is a cache for TCEs to avoid stack allocation or
>>> +	 * kmalloc as the whole TCE list can take up to 512 items 8 =
bytes
>>> +	 * each (4096 bytes).
>>> +	 */
>>> +	vcpu->arch.tce_tmp =3D kmalloc(4096, GFP_KERNEL);
>>> +	if (!vcpu->arch.tce_tmp)
>>> +		goto free_vcpu;
>>> +
>>> 	return vcpu;
>>>=20
>>> free_vcpu:
>>> @@ -980,6 +1018,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu =
*vcpu)
>>> 	unpin_vpa(vcpu->kvm, &vcpu->arch.slb_shadow);
>>> 	unpin_vpa(vcpu->kvm, &vcpu->arch.vpa);
>>> 	spin_unlock(&vcpu->arch.vpa_update_lock);
>>> +	kfree(vcpu->arch.tce_tmp);
>>> 	kvm_vcpu_uninit(vcpu);
>>> 	kmem_cache_free(kvm_vcpu_cache, vcpu);
>>> }
>>> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S =
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>>> index b02f91e..d35554e 100644
>>> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>>> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>>> @@ -1490,6 +1490,12 @@ hcall_real_table:
>>> 	.long	0		/* 0x11c */
>>> 	.long	0		/* 0x120 */
>>> 	.long	.kvmppc_h_bulk_remove - hcall_real_table
>>> +	.long	0		/* 0x128 */
>>> +	.long	0		/* 0x12c */
>>> +	.long	0		/* 0x130 */
>>> +	.long	0		/* 0x134 */
>>> +	.long	.kvmppc_h_stuff_tce - hcall_real_table
>>> +	.long	.kvmppc_h_put_tce_indirect - hcall_real_table
>>> hcall_real_table_end:
>>>=20
>>> ignore_hdec:
>>> diff --git a/arch/powerpc/kvm/book3s_pr_papr.c =
b/arch/powerpc/kvm/book3s_pr_papr.c
>>> index da0e0bc..91d4b45 100644
>>> --- a/arch/powerpc/kvm/book3s_pr_papr.c
>>> +++ b/arch/powerpc/kvm/book3s_pr_papr.c
>>> @@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu =
*vcpu)
>>> 	unsigned long tce =3D kvmppc_get_gpr(vcpu, 6);
>>> 	long rc;
>>>=20
>>> -	rc =3D kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
>>> +	rc =3D kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce);
>>> +	if (rc =3D=3D H_TOO_HARD)
>>> +		return EMULATE_FAIL;
>>> +	kvmppc_set_gpr(vcpu, 3, rc);
>>> +	return EMULATE_DONE;
>>> +}
>>> +
>>> +static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu)
>>> +{
>>> +	unsigned long liobn =3D kvmppc_get_gpr(vcpu, 4);
>>> +	unsigned long ioba =3D kvmppc_get_gpr(vcpu, 5);
>>> +	unsigned long tce =3D kvmppc_get_gpr(vcpu, 6);
>>> +	unsigned long npages =3D kvmppc_get_gpr(vcpu, 7);
>>> +	long rc;
>>> +
>>> +	rc =3D kvmppc_virtmode_h_put_tce_indirect(vcpu, liobn, ioba,
>>> +			tce, npages);
>>> +	if (rc =3D=3D H_TOO_HARD)
>>> +		return EMULATE_FAIL;
>>> +	kvmppc_set_gpr(vcpu, 3, rc);
>>> +	return EMULATE_DONE;
>>> +}
>>> +
>>> +static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu)
>>> +{
>>> +	unsigned long liobn =3D kvmppc_get_gpr(vcpu, 4);
>>> +	unsigned long ioba =3D kvmppc_get_gpr(vcpu, 5);
>>> +	unsigned long tce_value =3D kvmppc_get_gpr(vcpu, 6);
>>> +	unsigned long npages =3D kvmppc_get_gpr(vcpu, 7);
>>> +	long rc;
>>> +
>>> +	rc =3D kvmppc_virtmode_h_stuff_tce(vcpu, liobn, ioba, tce_value, =
npages);
>>> 	if (rc =3D=3D H_TOO_HARD)
>>> 		return EMULATE_FAIL;
>>> 	kvmppc_set_gpr(vcpu, 3, rc);
>>> @@ -247,6 +278,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned =
long cmd)
>>> 		return kvmppc_h_pr_bulk_remove(vcpu);
>>> 	case H_PUT_TCE:
>>> 		return kvmppc_h_pr_put_tce(vcpu);
>>> +	case H_PUT_TCE_INDIRECT:
>>> +		return kvmppc_h_pr_put_tce_indirect(vcpu);
>>> +	case H_STUFF_TCE:
>>> +		return kvmppc_h_pr_stuff_tce(vcpu);
>>> 	case H_CEDE:
>>> 		vcpu->arch.shared->msr |=3D MSR_EE;
>>> 		kvm_vcpu_block(vcpu);
>>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>>> index 6316ee3..8465c2a 100644
>>> --- a/arch/powerpc/kvm/powerpc.c
>>> +++ b/arch/powerpc/kvm/powerpc.c
>>> @@ -395,6 +395,9 @@ int kvm_dev_ioctl_check_extension(long ext)
>>> 		r =3D 1;
>>> 		break;
>>> #endif
>>> +	case KVM_CAP_SPAPR_MULTITCE:
>>> +		r =3D 1;
>>=20
>> This should only be true for book3s.
>=20
>=20
> We had this discussion with v2.
>=20
> David:
> =3D=3D=3D
> So, in the case of MULTITCE, that's not quite right.  PR KVM can
> emulate a PAPR system on a BookE machine, and there's no reason not to
> allow TCE acceleration as well.  We can't make it dependent on PAPR
> mode being selected, because that's enabled per-vcpu, whereas these
> capabilities are queried on the VM before the vcpus are created.
> =3D=3D=3D
>=20
> Wrong?

Partially. BookE can not emulate a PAPR system as it stands today.

The code should of course be generic and be available generically. But =
your code only patches the hypercall's availability in the book3s =
hypercall handlers, so that specific kernel version can only handle =
these hypercalls on book3s.

Whether a later version of the kernel will be able to handle them is a =
different question.


Alex

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls
  2013-06-16 22:06   ` Alexander Graf
@ 2013-06-17  7:55     ` Alexey Kardashevskiy
  2013-06-17  8:02       ` Alexander Graf
  2013-06-17  8:37       ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 36+ messages in thread
From: Alexey Kardashevskiy @ 2013-06-17  7:55 UTC (permalink / raw)
  To: Alexander Graf
  Cc: kvm, linux-kernel, kvm-ppc, Paul Mackerras, linuxppc-dev, David Gibson

On 06/17/2013 08:06 AM, Alexander Graf wrote:
> 
> On 05.06.2013, at 08:11, Alexey Kardashevskiy wrote:
> 
>> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
>> H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO
>> devices or emulated PCI.  These calls allow adding multiple entries
>> (up to 512) into the TCE table in one call which saves time on
>> transition to/from real mode.
>>
>> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
>> (copied from user and verified) before writing the whole list into
>> the TCE table. This cache will be utilized more in the upcoming
>> VFIO/IOMMU support to continue TCE list processing in the virtual
>> mode in the case if the real mode handler failed for some reason.
>>
>> This adds a guest physical to host real address converter
>> and calls the existing H_PUT_TCE handler. The converting function
>> is going to be fully utilized by upcoming VFIO supporting patches.
>>
>> This also implements the KVM_CAP_PPC_MULTITCE capability,
>> so in order to support the functionality of this patch, QEMU
>> needs to query for this capability and set the "hcall-multi-tce"
>> hypertas property only if the capability is present, otherwise
>> there will be serious performance degradation.
>>
>> Cc: David Gibson <david@gibson.dropbear.id.au>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> Signed-off-by: Paul Mackerras <paulus@samba.org>
> 
> Only a few minor nits. Ben already commented on implementation details.
> 
>>
>> ---
>> Changelog:
>> 2013/06/05:
>> * fixed mistype about IBMVIO in the commit message
>> * updated doc and moved it to another section
>> * changed capability number
>>
>> 2013/05/21:
>> * added kvm_vcpu_arch::tce_tmp
>> * removed cleanup if put_indirect failed, instead we do not even start
>> writing to TCE table if we cannot get TCEs from the user and they are
>> invalid
>> * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
>> and kvmppc_emulated_validate_tce (for the previous item)
>> * fixed bug with failthrough for H_IPI
>> * removed all get_user() from real mode handlers
>> * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
>> ---
>> Documentation/virtual/kvm/api.txt       |   17 ++
>> arch/powerpc/include/asm/kvm_host.h     |    2 +
>> arch/powerpc/include/asm/kvm_ppc.h      |   16 +-
>> arch/powerpc/kvm/book3s_64_vio.c        |  118 ++++++++++++++
>> arch/powerpc/kvm/book3s_64_vio_hv.c     |  266 +++++++++++++++++++++++++++----
>> arch/powerpc/kvm/book3s_hv.c            |   39 +++++
>> arch/powerpc/kvm/book3s_hv_rmhandlers.S |    6 +
>> arch/powerpc/kvm/book3s_pr_papr.c       |   37 ++++-
>> arch/powerpc/kvm/powerpc.c              |    3 +
>> include/uapi/linux/kvm.h                |    1 +
>> 10 files changed, 473 insertions(+), 32 deletions(-)
>>
>> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
>> index 5f91eda..6c082ff 100644
>> --- a/Documentation/virtual/kvm/api.txt
>> +++ b/Documentation/virtual/kvm/api.txt
>> @@ -2362,6 +2362,23 @@ calls by the guest for that service will be passed to userspace to be
>> handled.
>>
>>
>> +4.83 KVM_CAP_PPC_MULTITCE
>> +
>> +Capability: KVM_CAP_PPC_MULTITCE
>> +Architectures: ppc
>> +Type: vm
>> +
>> +This capability tells the guest that multiple TCE entry add/remove hypercalls
>> +handling is supported by the kernel. This significanly accelerates DMA
>> +operations for PPC KVM guests.
>> +
>> +Unlike other capabilities in this section, this one does not have an ioctl.
>> +Instead, when the capability is present, the H_PUT_TCE_INDIRECT and
>> +H_STUFF_TCE hypercalls are to be handled in the host kernel and not passed to
>> +the guest. Othwerwise it might be better for the guest to continue using H_PUT_TCE
>> +hypercall (if KVM_CAP_SPAPR_TCE or KVM_CAP_SPAPR_TCE_IOMMU are present).
> 

> While this describes perfectly well what the consequences are of the
> patches, it does not describe properly what the CAP actually expresses.
> The CAP only says "this kernel is able to handle H_PUT_TCE_INDIRECT and
> H_STUFF_TCE hypercalls directly". All other consequences are nice to
> document, but the semantics of the CAP are missing.


? It expresses ability to handle 2 hcalls. What is missing?


> We also usually try to keep KVM behavior unchanged with regards to older
> versions until a CAP is enabled. In this case I don't think it matters
> all that much, so I'm fine with declaring it as enabled by default.
> Please document that this is a change in behavior versus older KVM
> versions though.


Ok!


>> +
>> +
>> 5. The kvm_run structure
>> ------------------------
>>
>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
>> index af326cd..85d8f26 100644
>> --- a/arch/powerpc/include/asm/kvm_host.h
>> +++ b/arch/powerpc/include/asm/kvm_host.h
>> @@ -609,6 +609,8 @@ struct kvm_vcpu_arch {
>> 	spinlock_t tbacct_lock;
>> 	u64 busy_stolen;
>> 	u64 busy_preempt;
>> +
>> +	unsigned long *tce_tmp;    /* TCE cache for TCE_PUT_INDIRECT hall */
>> #endif
>> };
> 
> [...]
>>
>>
> 
> [...]
> 
>> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
>> index 550f592..a39039a 100644
>> --- a/arch/powerpc/kvm/book3s_hv.c
>> +++ b/arch/powerpc/kvm/book3s_hv.c
>> @@ -568,6 +568,30 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
>> 			ret = kvmppc_xics_hcall(vcpu, req);
>> 			break;
>> 		} /* fallthrough */
> 
> The fallthrough comment isn't accurate anymore.
> 
>> +		return RESUME_HOST;
>> +	case H_PUT_TCE:
>> +		ret = kvmppc_virtmode_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
>> +						kvmppc_get_gpr(vcpu, 5),
>> +						kvmppc_get_gpr(vcpu, 6));
>> +		if (ret == H_TOO_HARD)
>> +			return RESUME_HOST;
>> +		break;
>> +	case H_PUT_TCE_INDIRECT:
>> +		ret = kvmppc_virtmode_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4),
>> +						kvmppc_get_gpr(vcpu, 5),
>> +						kvmppc_get_gpr(vcpu, 6),
>> +						kvmppc_get_gpr(vcpu, 7));
>> +		if (ret == H_TOO_HARD)
>> +			return RESUME_HOST;
>> +		break;
>> +	case H_STUFF_TCE:
>> +		ret = kvmppc_virtmode_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
>> +						kvmppc_get_gpr(vcpu, 5),
>> +						kvmppc_get_gpr(vcpu, 6),
>> +						kvmppc_get_gpr(vcpu, 7));
>> +		if (ret == H_TOO_HARD)
>> +			return RESUME_HOST;
>> +		break;
>> 	default:
>> 		return RESUME_HOST;
>> 	}
>> @@ -958,6 +982,20 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
>> 	vcpu->arch.cpu_type = KVM_CPU_3S_64;
>> 	kvmppc_sanity_check(vcpu);
>>
>> +	/*
>> +	 * As we want to minimize the chance of having H_PUT_TCE_INDIRECT
>> +	 * half executed, we first read TCEs from the user, check them and
>> +	 * return error if something went wrong and only then put TCEs into
>> +	 * the TCE table.
>> +	 *
>> +	 * tce_tmp is a cache for TCEs to avoid stack allocation or
>> +	 * kmalloc as the whole TCE list can take up to 512 items 8 bytes
>> +	 * each (4096 bytes).
>> +	 */
>> +	vcpu->arch.tce_tmp = kmalloc(4096, GFP_KERNEL);
>> +	if (!vcpu->arch.tce_tmp)
>> +		goto free_vcpu;
>> +
>> 	return vcpu;
>>
>> free_vcpu:
>> @@ -980,6 +1018,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
>> 	unpin_vpa(vcpu->kvm, &vcpu->arch.slb_shadow);
>> 	unpin_vpa(vcpu->kvm, &vcpu->arch.vpa);
>> 	spin_unlock(&vcpu->arch.vpa_update_lock);
>> +	kfree(vcpu->arch.tce_tmp);
>> 	kvm_vcpu_uninit(vcpu);
>> 	kmem_cache_free(kvm_vcpu_cache, vcpu);
>> }
>> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>> index b02f91e..d35554e 100644
>> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>> @@ -1490,6 +1490,12 @@ hcall_real_table:
>> 	.long	0		/* 0x11c */
>> 	.long	0		/* 0x120 */
>> 	.long	.kvmppc_h_bulk_remove - hcall_real_table
>> +	.long	0		/* 0x128 */
>> +	.long	0		/* 0x12c */
>> +	.long	0		/* 0x130 */
>> +	.long	0		/* 0x134 */
>> +	.long	.kvmppc_h_stuff_tce - hcall_real_table
>> +	.long	.kvmppc_h_put_tce_indirect - hcall_real_table
>> hcall_real_table_end:
>>
>> ignore_hdec:
>> diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c
>> index da0e0bc..91d4b45 100644
>> --- a/arch/powerpc/kvm/book3s_pr_papr.c
>> +++ b/arch/powerpc/kvm/book3s_pr_papr.c
>> @@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu)
>> 	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
>> 	long rc;
>>
>> -	rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
>> +	rc = kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce);
>> +	if (rc == H_TOO_HARD)
>> +		return EMULATE_FAIL;
>> +	kvmppc_set_gpr(vcpu, 3, rc);
>> +	return EMULATE_DONE;
>> +}
>> +
>> +static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu)
>> +{
>> +	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
>> +	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
>> +	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
>> +	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
>> +	long rc;
>> +
>> +	rc = kvmppc_virtmode_h_put_tce_indirect(vcpu, liobn, ioba,
>> +			tce, npages);
>> +	if (rc == H_TOO_HARD)
>> +		return EMULATE_FAIL;
>> +	kvmppc_set_gpr(vcpu, 3, rc);
>> +	return EMULATE_DONE;
>> +}
>> +
>> +static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu)
>> +{
>> +	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
>> +	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
>> +	unsigned long tce_value = kvmppc_get_gpr(vcpu, 6);
>> +	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
>> +	long rc;
>> +
>> +	rc = kvmppc_virtmode_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages);
>> 	if (rc == H_TOO_HARD)
>> 		return EMULATE_FAIL;
>> 	kvmppc_set_gpr(vcpu, 3, rc);
>> @@ -247,6 +278,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd)
>> 		return kvmppc_h_pr_bulk_remove(vcpu);
>> 	case H_PUT_TCE:
>> 		return kvmppc_h_pr_put_tce(vcpu);
>> +	case H_PUT_TCE_INDIRECT:
>> +		return kvmppc_h_pr_put_tce_indirect(vcpu);
>> +	case H_STUFF_TCE:
>> +		return kvmppc_h_pr_stuff_tce(vcpu);
>> 	case H_CEDE:
>> 		vcpu->arch.shared->msr |= MSR_EE;
>> 		kvm_vcpu_block(vcpu);
>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>> index 6316ee3..8465c2a 100644
>> --- a/arch/powerpc/kvm/powerpc.c
>> +++ b/arch/powerpc/kvm/powerpc.c
>> @@ -395,6 +395,9 @@ int kvm_dev_ioctl_check_extension(long ext)
>> 		r = 1;
>> 		break;
>> #endif
>> +	case KVM_CAP_SPAPR_MULTITCE:
>> +		r = 1;
> 
> This should only be true for book3s.


We had this discussion with v2.

David:
===
So, in the case of MULTITCE, that's not quite right.  PR KVM can
emulate a PAPR system on a BookE machine, and there's no reason not to
allow TCE acceleration as well.  We can't make it dependent on PAPR
mode being selected, because that's enabled per-vcpu, whereas these
capabilities are queried on the VM before the vcpus are created.
===

Wrong?


-- 
Alexey

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls
  2013-06-05  6:11 ` [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls Alexey Kardashevskiy
  2013-06-16  4:20   ` Benjamin Herrenschmidt
@ 2013-06-16 22:06   ` Alexander Graf
  2013-06-17  7:55     ` Alexey Kardashevskiy
  1 sibling, 1 reply; 36+ messages in thread
From: Alexander Graf @ 2013-06-16 22:06 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, linux-kernel, kvm-ppc, Paul Mackerras, linuxppc-dev, David Gibson


On 05.06.2013, at 08:11, Alexey Kardashevskiy wrote:

> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
> H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO
> devices or emulated PCI.  These calls allow adding multiple entries
> (up to 512) into the TCE table in one call which saves time on
> transition to/from real mode.
>=20
> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
> (copied from user and verified) before writing the whole list into
> the TCE table. This cache will be utilized more in the upcoming
> VFIO/IOMMU support to continue TCE list processing in the virtual
> mode in the case if the real mode handler failed for some reason.
>=20
> This adds a guest physical to host real address converter
> and calls the existing H_PUT_TCE handler. The converting function
> is going to be fully utilized by upcoming VFIO supporting patches.
>=20
> This also implements the KVM_CAP_PPC_MULTITCE capability,
> so in order to support the functionality of this patch, QEMU
> needs to query for this capability and set the "hcall-multi-tce"
> hypertas property only if the capability is present, otherwise
> there will be serious performance degradation.
>=20
> Cc: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Signed-off-by: Paul Mackerras <paulus@samba.org>

Only a few minor nits. Ben already commented on implementation details.

>=20
> ---
> Changelog:
> 2013/06/05:
> * fixed mistype about IBMVIO in the commit message
> * updated doc and moved it to another section
> * changed capability number
>=20
> 2013/05/21:
> * added kvm_vcpu_arch::tce_tmp
> * removed cleanup if put_indirect failed, instead we do not even start
> writing to TCE table if we cannot get TCEs from the user and they are
> invalid
> * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
> and kvmppc_emulated_validate_tce (for the previous item)
> * fixed bug with failthrough for H_IPI
> * removed all get_user() from real mode handlers
> * kvmppc_lookup_pte() added (instead of making lookup_linux_pte =
public)
> ---
> Documentation/virtual/kvm/api.txt       |   17 ++
> arch/powerpc/include/asm/kvm_host.h     |    2 +
> arch/powerpc/include/asm/kvm_ppc.h      |   16 +-
> arch/powerpc/kvm/book3s_64_vio.c        |  118 ++++++++++++++
> arch/powerpc/kvm/book3s_64_vio_hv.c     |  266 =
+++++++++++++++++++++++++++----
> arch/powerpc/kvm/book3s_hv.c            |   39 +++++
> arch/powerpc/kvm/book3s_hv_rmhandlers.S |    6 +
> arch/powerpc/kvm/book3s_pr_papr.c       |   37 ++++-
> arch/powerpc/kvm/powerpc.c              |    3 +
> include/uapi/linux/kvm.h                |    1 +
> 10 files changed, 473 insertions(+), 32 deletions(-)
>=20
> diff --git a/Documentation/virtual/kvm/api.txt =
b/Documentation/virtual/kvm/api.txt
> index 5f91eda..6c082ff 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -2362,6 +2362,23 @@ calls by the guest for that service will be =
passed to userspace to be
> handled.
>=20
>=20
> +4.83 KVM_CAP_PPC_MULTITCE
> +
> +Capability: KVM_CAP_PPC_MULTITCE
> +Architectures: ppc
> +Type: vm
> +
> +This capability tells the guest that multiple TCE entry add/remove =
hypercalls
> +handling is supported by the kernel. This significanly accelerates =
DMA
> +operations for PPC KVM guests.
> +
> +Unlike other capabilities in this section, this one does not have an =
ioctl.
> +Instead, when the capability is present, the H_PUT_TCE_INDIRECT and
> +H_STUFF_TCE hypercalls are to be handled in the host kernel and not =
passed to
> +the guest. Othwerwise it might be better for the guest to continue =
using H_PUT_TCE
> +hypercall (if KVM_CAP_SPAPR_TCE or KVM_CAP_SPAPR_TCE_IOMMU are =
present).

While this describes perfectly well what the consequences are of the =
patches, it does not describe properly what the CAP actually expresses. =
The CAP only says "this kernel is able to handle H_PUT_TCE_INDIRECT and =
H_STUFF_TCE hypercalls directly". All other consequences are nice to =
document, but the semantics of the CAP are missing.

We also usually try to keep KVM behavior unchanged with regards to older =
versions until a CAP is enabled. In this case I don't think it matters =
all that much, so I'm fine with declaring it as enabled by default. =
Please document that this is a change in behavior versus older KVM =
versions though.

> +
> +
> 5. The kvm_run structure
> ------------------------
>=20
> diff --git a/arch/powerpc/include/asm/kvm_host.h =
b/arch/powerpc/include/asm/kvm_host.h
> index af326cd..85d8f26 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -609,6 +609,8 @@ struct kvm_vcpu_arch {
> 	spinlock_t tbacct_lock;
> 	u64 busy_stolen;
> 	u64 busy_preempt;
> +
> +	unsigned long *tce_tmp;    /* TCE cache for TCE_PUT_INDIRECT =
hall */
> #endif
> };

[...]
>=20
>=20

[...]

> diff --git a/arch/powerpc/kvm/book3s_hv.c =
b/arch/powerpc/kvm/book3s_hv.c
> index 550f592..a39039a 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -568,6 +568,30 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu =
*vcpu)
> 			ret =3D kvmppc_xics_hcall(vcpu, req);
> 			break;
> 		} /* fallthrough */

The fallthrough comment isn't accurate anymore.

> +		return RESUME_HOST;
> +	case H_PUT_TCE:
> +		ret =3D kvmppc_virtmode_h_put_tce(vcpu, =
kvmppc_get_gpr(vcpu, 4),
> +						kvmppc_get_gpr(vcpu, 5),
> +						kvmppc_get_gpr(vcpu, =
6));
> +		if (ret =3D=3D H_TOO_HARD)
> +			return RESUME_HOST;
> +		break;
> +	case H_PUT_TCE_INDIRECT:
> +		ret =3D kvmppc_virtmode_h_put_tce_indirect(vcpu, =
kvmppc_get_gpr(vcpu, 4),
> +						kvmppc_get_gpr(vcpu, 5),
> +						kvmppc_get_gpr(vcpu, 6),
> +						kvmppc_get_gpr(vcpu, =
7));
> +		if (ret =3D=3D H_TOO_HARD)
> +			return RESUME_HOST;
> +		break;
> +	case H_STUFF_TCE:
> +		ret =3D kvmppc_virtmode_h_stuff_tce(vcpu, =
kvmppc_get_gpr(vcpu, 4),
> +						kvmppc_get_gpr(vcpu, 5),
> +						kvmppc_get_gpr(vcpu, 6),
> +						kvmppc_get_gpr(vcpu, =
7));
> +		if (ret =3D=3D H_TOO_HARD)
> +			return RESUME_HOST;
> +		break;
> 	default:
> 		return RESUME_HOST;
> 	}
> @@ -958,6 +982,20 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct =
kvm *kvm, unsigned int id)
> 	vcpu->arch.cpu_type =3D KVM_CPU_3S_64;
> 	kvmppc_sanity_check(vcpu);
>=20
> +	/*
> +	 * As we want to minimize the chance of having =
H_PUT_TCE_INDIRECT
> +	 * half executed, we first read TCEs from the user, check them =
and
> +	 * return error if something went wrong and only then put TCEs =
into
> +	 * the TCE table.
> +	 *
> +	 * tce_tmp is a cache for TCEs to avoid stack allocation or
> +	 * kmalloc as the whole TCE list can take up to 512 items 8 =
bytes
> +	 * each (4096 bytes).
> +	 */
> +	vcpu->arch.tce_tmp =3D kmalloc(4096, GFP_KERNEL);
> +	if (!vcpu->arch.tce_tmp)
> +		goto free_vcpu;
> +
> 	return vcpu;
>=20
> free_vcpu:
> @@ -980,6 +1018,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
> 	unpin_vpa(vcpu->kvm, &vcpu->arch.slb_shadow);
> 	unpin_vpa(vcpu->kvm, &vcpu->arch.vpa);
> 	spin_unlock(&vcpu->arch.vpa_update_lock);
> +	kfree(vcpu->arch.tce_tmp);
> 	kvm_vcpu_uninit(vcpu);
> 	kmem_cache_free(kvm_vcpu_cache, vcpu);
> }
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S =
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index b02f91e..d35554e 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -1490,6 +1490,12 @@ hcall_real_table:
> 	.long	0		/* 0x11c */
> 	.long	0		/* 0x120 */
> 	.long	.kvmppc_h_bulk_remove - hcall_real_table
> +	.long	0		/* 0x128 */
> +	.long	0		/* 0x12c */
> +	.long	0		/* 0x130 */
> +	.long	0		/* 0x134 */
> +	.long	.kvmppc_h_stuff_tce - hcall_real_table
> +	.long	.kvmppc_h_put_tce_indirect - hcall_real_table
> hcall_real_table_end:
>=20
> ignore_hdec:
> diff --git a/arch/powerpc/kvm/book3s_pr_papr.c =
b/arch/powerpc/kvm/book3s_pr_papr.c
> index da0e0bc..91d4b45 100644
> --- a/arch/powerpc/kvm/book3s_pr_papr.c
> +++ b/arch/powerpc/kvm/book3s_pr_papr.c
> @@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu =
*vcpu)
> 	unsigned long tce =3D kvmppc_get_gpr(vcpu, 6);
> 	long rc;
>=20
> -	rc =3D kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
> +	rc =3D kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce);
> +	if (rc =3D=3D H_TOO_HARD)
> +		return EMULATE_FAIL;
> +	kvmppc_set_gpr(vcpu, 3, rc);
> +	return EMULATE_DONE;
> +}
> +
> +static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long liobn =3D kvmppc_get_gpr(vcpu, 4);
> +	unsigned long ioba =3D kvmppc_get_gpr(vcpu, 5);
> +	unsigned long tce =3D kvmppc_get_gpr(vcpu, 6);
> +	unsigned long npages =3D kvmppc_get_gpr(vcpu, 7);
> +	long rc;
> +
> +	rc =3D kvmppc_virtmode_h_put_tce_indirect(vcpu, liobn, ioba,
> +			tce, npages);
> +	if (rc =3D=3D H_TOO_HARD)
> +		return EMULATE_FAIL;
> +	kvmppc_set_gpr(vcpu, 3, rc);
> +	return EMULATE_DONE;
> +}
> +
> +static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long liobn =3D kvmppc_get_gpr(vcpu, 4);
> +	unsigned long ioba =3D kvmppc_get_gpr(vcpu, 5);
> +	unsigned long tce_value =3D kvmppc_get_gpr(vcpu, 6);
> +	unsigned long npages =3D kvmppc_get_gpr(vcpu, 7);
> +	long rc;
> +
> +	rc =3D kvmppc_virtmode_h_stuff_tce(vcpu, liobn, ioba, tce_value, =
npages);
> 	if (rc =3D=3D H_TOO_HARD)
> 		return EMULATE_FAIL;
> 	kvmppc_set_gpr(vcpu, 3, rc);
> @@ -247,6 +278,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned =
long cmd)
> 		return kvmppc_h_pr_bulk_remove(vcpu);
> 	case H_PUT_TCE:
> 		return kvmppc_h_pr_put_tce(vcpu);
> +	case H_PUT_TCE_INDIRECT:
> +		return kvmppc_h_pr_put_tce_indirect(vcpu);
> +	case H_STUFF_TCE:
> +		return kvmppc_h_pr_stuff_tce(vcpu);
> 	case H_CEDE:
> 		vcpu->arch.shared->msr |=3D MSR_EE;
> 		kvm_vcpu_block(vcpu);
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index 6316ee3..8465c2a 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -395,6 +395,9 @@ int kvm_dev_ioctl_check_extension(long ext)
> 		r =3D 1;
> 		break;
> #endif
> +	case KVM_CAP_SPAPR_MULTITCE:
> +		r =3D 1;

This should only be true for book3s.



Alex

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls
  2013-06-05  6:11 ` [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls Alexey Kardashevskiy
@ 2013-06-16  4:20   ` Benjamin Herrenschmidt
  2013-06-16 22:06   ` Alexander Graf
  1 sibling, 0 replies; 36+ messages in thread
From: Benjamin Herrenschmidt @ 2013-06-16  4:20 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, linux-kernel, kvm-ppc, Alexander Graf, Paul Mackerras,
	linuxppc-dev, David Gibson

On Wed, 2013-06-05 at 16:11 +1000, Alexey Kardashevskiy wrote:
> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
> H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO
> devices or emulated PCI.  These calls allow adding multiple entries
> (up to 512) into the TCE table in one call which saves time on
> transition to/from real mode.
> 
> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
> (copied from user and verified) before writing the whole list into
> the TCE table. This cache will be utilized more in the upcoming
> VFIO/IOMMU support to continue TCE list processing in the virtual
> mode in the case if the real mode handler failed for some reason.
> 
> This adds a guest physical to host real address converter
> and calls the existing H_PUT_TCE handler. The converting function
> is going to be fully utilized by upcoming VFIO supporting patches.
> 
> This also implements the KVM_CAP_PPC_MULTITCE capability,
> so in order to support the functionality of this patch, QEMU
> needs to query for this capability and set the "hcall-multi-tce"
> hypertas property only if the capability is present, otherwise
> there will be serious performance degradation.
> 
> Cc: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Signed-off-by: Paul Mackerras <paulus@samba.org>
> 
> ---
> Changelog:
> 2013/06/05:
> * fixed mistype about IBMVIO in the commit message
> * updated doc and moved it to another section
> * changed capability number
> 
> 2013/05/21:
> * added kvm_vcpu_arch::tce_tmp
> * removed cleanup if put_indirect failed, instead we do not even start
> writing to TCE table if we cannot get TCEs from the user and they are
> invalid
> * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
> and kvmppc_emulated_validate_tce (for the previous item)
> * fixed bug with failthrough for H_IPI
> * removed all get_user() from real mode handlers
> * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
> ---
>  Documentation/virtual/kvm/api.txt       |   17 ++
>  arch/powerpc/include/asm/kvm_host.h     |    2 +
>  arch/powerpc/include/asm/kvm_ppc.h      |   16 +-
>  arch/powerpc/kvm/book3s_64_vio.c        |  118 ++++++++++++++
>  arch/powerpc/kvm/book3s_64_vio_hv.c     |  266 +++++++++++++++++++++++++++----
>  arch/powerpc/kvm/book3s_hv.c            |   39 +++++
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S |    6 +
>  arch/powerpc/kvm/book3s_pr_papr.c       |   37 ++++-
>  arch/powerpc/kvm/powerpc.c              |    3 +
>  include/uapi/linux/kvm.h                |    1 +
>  10 files changed, 473 insertions(+), 32 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index 5f91eda..6c082ff 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -2362,6 +2362,23 @@ calls by the guest for that service will be passed to userspace to be
>  handled.
>  
> 
> +4.83 KVM_CAP_PPC_MULTITCE
> +
> +Capability: KVM_CAP_PPC_MULTITCE
> +Architectures: ppc
> +Type: vm
> +
> +This capability tells the guest that multiple TCE entry add/remove hypercalls
> +handling is supported by the kernel. This significanly accelerates DMA
> +operations for PPC KVM guests.
> +
> +Unlike other capabilities in this section, this one does not have an ioctl.
> +Instead, when the capability is present, the H_PUT_TCE_INDIRECT and
> +H_STUFF_TCE hypercalls are to be handled in the host kernel and not passed to
> +the guest. Othwerwise it might be better for the guest to continue using H_PUT_TCE
> +hypercall (if KVM_CAP_SPAPR_TCE or KVM_CAP_SPAPR_TCE_IOMMU are present).
> +
> +
>  5. The kvm_run structure
>  ------------------------
>  
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index af326cd..85d8f26 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -609,6 +609,8 @@ struct kvm_vcpu_arch {
>  	spinlock_t tbacct_lock;
>  	u64 busy_stolen;
>  	u64 busy_preempt;
> +
> +	unsigned long *tce_tmp;    /* TCE cache for TCE_PUT_INDIRECT hall */
>  #endif
>  };
>  
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> index a5287fe..e852921b 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -133,8 +133,20 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
>  
>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  				struct kvm_create_spapr_tce *args);
> -extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> -			     unsigned long ioba, unsigned long tce);
> +extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
> +		struct kvm_vcpu *vcpu, unsigned long liobn);
> +extern long kvmppc_emulated_validate_tce(unsigned long tce);
> +extern void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
> +		unsigned long ioba, unsigned long tce);
> +extern long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce);
> +extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_list, unsigned long npages);
> +extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_value, unsigned long npages);
>  extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
>  				struct kvm_allocate_rma *rma);
>  extern struct kvmppc_linear_info *kvm_alloc_rma(void);
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> index b2d3f3b..06b7b20 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -14,6 +14,7 @@
>   *
>   * Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
>   * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com>
> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com>
>   */
>  
>  #include <linux/types.h>
> @@ -36,8 +37,11 @@
>  #include <asm/ppc-opcode.h>
>  #include <asm/kvm_host.h>
>  #include <asm/udbg.h>
> +#include <asm/iommu.h>
> +#include <asm/tce.h>
>  
>  #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
> +#define ERROR_ADDR      ((void *)~(unsigned long)0x0)
>  
>  static long kvmppc_stt_npages(unsigned long window_size)
>  {
> @@ -148,3 +152,117 @@ fail:
>  	}
>  	return ret;
>  }
> +
> +/* Converts guest physical address into host virtual */
> +static void __user *kvmppc_virtmode_gpa_to_hva(struct kvm_vcpu *vcpu,
> +		unsigned long gpa)
> +{
> +	unsigned long hva, gfn = gpa >> PAGE_SHIFT;
> +	struct kvm_memory_slot *memslot;
> +
> +	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
> +	if (!memslot)
> +		return ERROR_ADDR;
> +
> +	hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa & ~PAGE_MASK);
> +	return (void *) hva;
> +}
> +
> +long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce)
> +{
> +	long ret;
> +	struct kvmppc_spapr_tce_table *tt;
> +
> +	tt = kvmppc_find_tce_table(vcpu, liobn);
> +	/* Didn't find the liobn, put it to userspace */
> +	if (!tt)
> +		return H_TOO_HARD;
> +
> +	/* Emulated IO */
> +	if (ioba >= tt->window_size)
> +		return H_PARAMETER;
> +
> +	ret = kvmppc_emulated_validate_tce(tce);
> +	if (ret)
> +		return ret;
> +
> +	kvmppc_emulated_put_tce(tt, ioba, tce);
> +
> +	return H_SUCCESS;
> +}
> +
> +long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_list, unsigned long npages)
> +{
> +	struct kvmppc_spapr_tce_table *tt;
> +	long i, ret;
> +	unsigned long __user *tces;
> +
> +	tt = kvmppc_find_tce_table(vcpu, liobn);
> +	/* Didn't find the liobn, put it to userspace */
> +	if (!tt)
> +		return H_TOO_HARD;
> +
> +	/*
> +	 * The spec says that the maximum size of the list is 512 TCEs so
> +	 * so the whole table addressed resides in 4K page
> +	 */
> +	if (npages > 512)
> +		return H_PARAMETER;
> +
> +	if (tce_list & ~IOMMU_PAGE_MASK)
> +		return H_PARAMETER;
> +
> +	tces = kvmppc_virtmode_gpa_to_hva(vcpu, tce_list);
> +	if (tces == ERROR_ADDR)
> +		return H_TOO_HARD;
> +
> +	/* Emulated IO */
> +	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
> +		return H_PARAMETER;
> +
> +	for (i = 0; i < npages; ++i) {
> +		if (get_user(vcpu->arch.tce_tmp[i], tces + i))
> +			return H_PARAMETER;
> +
> +		ret = kvmppc_emulated_validate_tce(vcpu->arch.tce_tmp[i]);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	for (i = 0; i < npages; ++i)
> +		kvmppc_emulated_put_tce(tt,
> +				ioba + (i << IOMMU_PAGE_SHIFT),
> +				vcpu->arch.tce_tmp[i]);
> +
> +	return H_SUCCESS;
> +}
> +
> +long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_value, unsigned long npages)
> +{
> +	struct kvmppc_spapr_tce_table *tt;
> +	long i, ret;
> +
> +	tt = kvmppc_find_tce_table(vcpu, liobn);
> +	/* Didn't find the liobn, put it to userspace */
> +	if (!tt)
> +		return H_TOO_HARD;
> +
> +	/* Emulated IO */
> +	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
> +		return H_PARAMETER;
> +
> +	ret = kvmppc_emulated_validate_tce(tce_value);
> +	if (ret || (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ)))
> +		return H_PARAMETER;
> +
> +	for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
> +		kvmppc_emulated_put_tce(tt, ioba, tce_value);
> +
> +	return H_SUCCESS;
> +}
> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> index 30c2f3b..c68d538 100644
> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -14,6 +14,7 @@
>   *
>   * Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
>   * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com>
> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com>
>   */
>  
>  #include <linux/types.h>
> @@ -35,42 +36,249 @@
>  #include <asm/ppc-opcode.h>
>  #include <asm/kvm_host.h>
>  #include <asm/udbg.h>
> +#include <asm/iommu.h>
> +#include <asm/tce.h>
>  
>  #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
> +#define ERROR_ADDR      (~(unsigned long)0x0)
>  
> -/* WARNING: This will be called in real-mode on HV KVM and virtual
> - *          mode on PR KVM
> +/* Finds a TCE table descriptor by LIOBN */
> +struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm_vcpu *vcpu,
> +		unsigned long liobn)
> +{
> +	struct kvmppc_spapr_tce_table *tt;
> +
> +	list_for_each_entry(tt, &vcpu->kvm->arch.spapr_tce_tables, list) {
> +		if (tt->liobn == liobn)
> +			return tt;
> +	}
> +
> +	return NULL;
> +}
> +EXPORT_SYMBOL_GPL(kvmppc_find_tce_table);
> +
> +/*
> + * Validate TCE address.
> + * At the moment only flags are validated
> + * as other check will significantly slow down
> + * or can make it even impossible to handle TCE requests
> + * in real mode.
> + */
> +long kvmppc_emulated_validate_tce(unsigned long tce)
> +{
> +	if (tce & ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ))
> +		return H_PARAMETER;
> +
> +	return H_SUCCESS;
> +}
> +EXPORT_SYMBOL_GPL(kvmppc_emulated_validate_tce);
> +
> +/*
> + * kvmppc_emulated_put_tce() handles TCE requests for devices emulated
> + * by QEMU. It puts guest TCE values into the table and expects
> + * the QEMU to convert them later in the QEMU device implementation.
> + * Wiorks in both real and virtual modes.
> + * It cannot fail so kvmppc_emulated_validate_tce must be called before it.
>   */
> +void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
> +		unsigned long ioba, unsigned long tce)
> +{
> +	unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
> +	struct page *page;
> +	u64 *tbl;
> +
> +	/*
> +	 * Note on the use of page_address() in real mode,
> +	 *
> +	 * It is safe to use page_address() in real mode on ppc64 because
> +	 * page_address() is always defined as lowmem_page_address()
> +	 * which returns __va(PFN_PHYS(page_to_pfn(page))) which is arithmetial
> +	 * operation and does not access page struct.
> +	 *
> +	 * Theoretically page_address() could be defined different
> +	 * but either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL
> +	 * should be enabled.
> +	 * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64,
> +	 * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only
> +	 * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP
> +	 * is not expected to be enabled on ppc32, page_address()
> +	 * is safe for ppc32 as well.
> +	 */
> +#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
> +#error TODO: fix to avoid page_address() here
> +#endif
> +	page = tt->pages[idx / TCES_PER_PAGE];
> +	tbl = (u64 *)page_address(page);
> +
> +	/* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
> +	tbl[idx % TCES_PER_PAGE] = tce;
> +}
> +EXPORT_SYMBOL_GPL(kvmppc_emulated_put_tce);
> +
> +#ifdef CONFIG_KVM_BOOK3S_64_HV
> +
> +static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
> +			unsigned long *pte_sizep)
> +{
> +	pte_t *ptep;
> +	unsigned int shift = 0;
> +	pte_t pte, tmp, ret;
> +
> +	ptep = find_linux_pte_or_hugepte(pgdir, hva, &shift);
> +	if (!ptep)
> +		return __pte(0);
> +	if (shift)
> +		*pte_sizep = 1ul << shift;
> +	else
> +		*pte_sizep = PAGE_SIZE;
> +
> +	if (!pte_present(*ptep))
> +		return __pte(0);
> +
> +	/* wait until _PAGE_BUSY is clear then set it atomically */
> +	__asm__ __volatile__ (
> +		"1:	ldarx	%0,0,%3\n"
> +		"	andi.	%1,%0,%4\n"
> +		"	bne-	1b\n"
> +		"	ori	%1,%0,%4\n"
> +		"	stdcx.	%1,0,%3\n"
> +		"	bne-	1b"
> +		: "=&r" (pte), "=&r" (tmp), "=m" (*ptep)
> +		: "r" (ptep), "i" (_PAGE_BUSY)
> +		: "cc");
> +
> +	ret = pte;
> +
> +	return ret;
> +}

The test for pte_present() needs to be done again after you lock the PTE
since potentially you could have raced with an invalidation.

More worrisome: You set _PAGE_BUSY above (lock the PTE). But when do you
clear it ?

It looks like you rely on _PAGE_BUSY being set to protect yourself
against any concurrent invalidation (or other change to the PTE) while
you access the underlying page. This is *somewhat* ok, though frowned
upon since you end up locking the PTE a lot longer (in real mode) than
we normally do, but in any case, you need to ensure that you release
that lock.

Also you must *not* continue using the resulting physical address after
releasing the lock since the page might be invalidated/freed/swapped_out
etc... at any point once you clear busy.

It might be better to use the MMU notifiers here to catch concurrent
invalidations rather than locking the PTE for a long time. If a
concurrent invalidation happens, just return TOO_HARD.

> +
> +/*
> + * Converts guest physical address into host physical address.
> + * Also returns pte and page size if the page is present in page table.
> + */
> +static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
> +		unsigned long gpa)
> +{
> +	struct kvm_memory_slot *memslot;
> +	pte_t pte;
> +	unsigned long hva, hpa, pg_size = 0, offset;
> +	unsigned long gfn = gpa >> PAGE_SHIFT;
> +	bool writing = gpa & TCE_PCI_WRITE;
> +
> +	/* Find a KVM memslot */
> +	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
> +	if (!memslot)
> +		return ERROR_ADDR;
> +
> +	/* Convert guest physical address to host virtual */
> +	hva = __gfn_to_hva_memslot(memslot, gfn);
> +
> +	/* Find a PTE and determine the size */
> +	pte = kvmppc_lookup_pte(vcpu->arch.pgdir, hva,
> +			writing, &pg_size);
> +	if (!pte)
> +		return ERROR_ADDR;
> +
> +	/* Calculate host phys address keeping flags and offset in the page */
> +	offset = gpa & (pg_size - 1);
> +
> +	/* pte_pfn(pte) should return an address aligned to pg_size */
> +	hpa = (pte_pfn(pte) << PAGE_SHIFT) + offset;
> +
> +	return hpa;
> +}

Do you ever test whether the page protection on the PTE allows for
access ? At the moment you only use that to read from TCEs so chances
that this is wrong are slim (you wouldn't have PROT_NONE on TCE tables),
but it's still a worry to have code like that.

Also you do not set _PAGE_ACCESSED either, which means the VM doesn't
know the page is being accessed. Not necessarily a huge deal in this
specific case, but still.

>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		      unsigned long ioba, unsigned long tce)
>  {
> -	struct kvm *kvm = vcpu->kvm;
> -	struct kvmppc_spapr_tce_table *stt;
> -
> -	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
> -	/* 	    liobn, ioba, tce); */
> -
> -	list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
> -		if (stt->liobn == liobn) {
> -			unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
> -			struct page *page;
> -			u64 *tbl;
> -
> -			/* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p  window_size=0x%x\n", */
> -			/* 	    liobn, stt, stt->window_size); */
> -			if (ioba >= stt->window_size)
> -				return H_PARAMETER;
> -
> -			page = stt->pages[idx / TCES_PER_PAGE];
> -			tbl = (u64 *)page_address(page);
> -
> -			/* FIXME: Need to validate the TCE itself */
> -			/* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
> -			tbl[idx % TCES_PER_PAGE] = tce;
> -			return H_SUCCESS;
> -		}
> +	long ret;
> +	struct kvmppc_spapr_tce_table *tt;
> +
> +	tt = kvmppc_find_tce_table(vcpu, liobn);
> +	/* Didn't find the liobn, put it to virtual space */
> +	if (!tt)
> +		return H_TOO_HARD;
> +
> +	/* Emulated IO */
> +	if (ioba >= tt->window_size)
> +		return H_PARAMETER;
> +
> +	ret = kvmppc_emulated_validate_tce(tce);
> +	if (ret)
> +		return ret;
> +
> +	kvmppc_emulated_put_tce(tt, ioba, tce);
> +
> +	return H_SUCCESS;
> +}
> +
> +long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_list,	unsigned long npages)
> +{
> +	struct kvmppc_spapr_tce_table *tt;
> +	long i, ret;
> +	unsigned long *tces;
> +
> +	tt = kvmppc_find_tce_table(vcpu, liobn);
> +	/* Didn't find the liobn, put it to virtual space */
> +	if (!tt)
> +		return H_TOO_HARD;
> +
> +	/*
> +	 * The spec says that the maximum size of the list is 512 TCEs so
> +	 * so the whole table addressed resides in 4K page
> +	 */
> +	if (npages > 512)
> +		return H_PARAMETER;
> +
> +	if (tce_list & ~IOMMU_PAGE_MASK)
> +		return H_PARAMETER;
> +
> +	tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu, tce_list);
> +	if ((unsigned long)tces == ERROR_ADDR)
> +		return H_TOO_HARD;
> +
> +	/* Emulated IO */
> +	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
> +		return H_PARAMETER;
> +
> +	for (i = 0; i < npages; ++i) {
> +		ret = kvmppc_emulated_validate_tce(tces[i]);
> +		if (ret)
> +			return ret;
>  	}
>  
> -	/* Didn't find the liobn, punt it to userspace */
> -	return H_TOO_HARD;
> +	for (i = 0; i < npages; ++i)
> +		kvmppc_emulated_put_tce(tt, ioba + (i << IOMMU_PAGE_SHIFT),
> +				tces[i]);
> +
> +	return H_SUCCESS;
> +}
> +
> +long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_value, unsigned long npages)
> +{
> +	struct kvmppc_spapr_tce_table *tt;
> +	long i, ret;
> +
> +	tt = kvmppc_find_tce_table(vcpu, liobn);
> +	/* Didn't find the liobn, put it to virtual space */
> +	if (!tt)
> +		return H_TOO_HARD;
> +
> +	/* Emulated IO */
> +	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
> +		return H_PARAMETER;
> +
> +	ret = kvmppc_emulated_validate_tce(tce_value);
> +	if (ret || (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ)))
> +		return H_PARAMETER;
> +
> +	for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
> +		kvmppc_emulated_put_tce(tt, ioba, tce_value);
> +
> +	return H_SUCCESS;
>  }
> +#endif /* CONFIG_KVM_BOOK3S_64_HV */
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 550f592..a39039a 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -568,6 +568,30 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
>  			ret = kvmppc_xics_hcall(vcpu, req);
>  			break;
>  		} /* fallthrough */
> +		return RESUME_HOST;
> +	case H_PUT_TCE:
> +		ret = kvmppc_virtmode_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
> +						kvmppc_get_gpr(vcpu, 5),
> +						kvmppc_get_gpr(vcpu, 6));
> +		if (ret == H_TOO_HARD)
> +			return RESUME_HOST;
> +		break;
> +	case H_PUT_TCE_INDIRECT:
> +		ret = kvmppc_virtmode_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4),
> +						kvmppc_get_gpr(vcpu, 5),
> +						kvmppc_get_gpr(vcpu, 6),
> +						kvmppc_get_gpr(vcpu, 7));
> +		if (ret == H_TOO_HARD)
> +			return RESUME_HOST;
> +		break;
> +	case H_STUFF_TCE:
> +		ret = kvmppc_virtmode_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
> +						kvmppc_get_gpr(vcpu, 5),
> +						kvmppc_get_gpr(vcpu, 6),
> +						kvmppc_get_gpr(vcpu, 7));
> +		if (ret == H_TOO_HARD)
> +			return RESUME_HOST;
> +		break;
>  	default:
>  		return RESUME_HOST;
>  	}
> @@ -958,6 +982,20 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
>  	vcpu->arch.cpu_type = KVM_CPU_3S_64;
>  	kvmppc_sanity_check(vcpu);
>  
> +	/*
> +	 * As we want to minimize the chance of having H_PUT_TCE_INDIRECT
> +	 * half executed, we first read TCEs from the user, check them and
> +	 * return error if something went wrong and only then put TCEs into
> +	 * the TCE table.
> +	 *
> +	 * tce_tmp is a cache for TCEs to avoid stack allocation or
> +	 * kmalloc as the whole TCE list can take up to 512 items 8 bytes
> +	 * each (4096 bytes).
> +	 */
> +	vcpu->arch.tce_tmp = kmalloc(4096, GFP_KERNEL);
> +	if (!vcpu->arch.tce_tmp)
> +		goto free_vcpu;
> +
>  	return vcpu;
>  
>  free_vcpu:
> @@ -980,6 +1018,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
>  	unpin_vpa(vcpu->kvm, &vcpu->arch.slb_shadow);
>  	unpin_vpa(vcpu->kvm, &vcpu->arch.vpa);
>  	spin_unlock(&vcpu->arch.vpa_update_lock);
> +	kfree(vcpu->arch.tce_tmp);
>  	kvm_vcpu_uninit(vcpu);
>  	kmem_cache_free(kvm_vcpu_cache, vcpu);
>  }
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index b02f91e..d35554e 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -1490,6 +1490,12 @@ hcall_real_table:
>  	.long	0		/* 0x11c */
>  	.long	0		/* 0x120 */
>  	.long	.kvmppc_h_bulk_remove - hcall_real_table
> +	.long	0		/* 0x128 */
> +	.long	0		/* 0x12c */
> +	.long	0		/* 0x130 */
> +	.long	0		/* 0x134 */
> +	.long	.kvmppc_h_stuff_tce - hcall_real_table
> +	.long	.kvmppc_h_put_tce_indirect - hcall_real_table
>  hcall_real_table_end:
>  
>  ignore_hdec:
> diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c
> index da0e0bc..91d4b45 100644
> --- a/arch/powerpc/kvm/book3s_pr_papr.c
> +++ b/arch/powerpc/kvm/book3s_pr_papr.c
> @@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu)
>  	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
>  	long rc;
>  
> -	rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
> +	rc = kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce);
> +	if (rc == H_TOO_HARD)
> +		return EMULATE_FAIL;
> +	kvmppc_set_gpr(vcpu, 3, rc);
> +	return EMULATE_DONE;
> +}
> +
> +static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
> +	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
> +	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
> +	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
> +	long rc;
> +
> +	rc = kvmppc_virtmode_h_put_tce_indirect(vcpu, liobn, ioba,
> +			tce, npages);
> +	if (rc == H_TOO_HARD)
> +		return EMULATE_FAIL;
> +	kvmppc_set_gpr(vcpu, 3, rc);
> +	return EMULATE_DONE;
> +}
> +
> +static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
> +	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
> +	unsigned long tce_value = kvmppc_get_gpr(vcpu, 6);
> +	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
> +	long rc;
> +
> +	rc = kvmppc_virtmode_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages);
>  	if (rc == H_TOO_HARD)
>  		return EMULATE_FAIL;
>  	kvmppc_set_gpr(vcpu, 3, rc);
> @@ -247,6 +278,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd)
>  		return kvmppc_h_pr_bulk_remove(vcpu);
>  	case H_PUT_TCE:
>  		return kvmppc_h_pr_put_tce(vcpu);
> +	case H_PUT_TCE_INDIRECT:
> +		return kvmppc_h_pr_put_tce_indirect(vcpu);
> +	case H_STUFF_TCE:
> +		return kvmppc_h_pr_stuff_tce(vcpu);
>  	case H_CEDE:
>  		vcpu->arch.shared->msr |= MSR_EE;
>  		kvm_vcpu_block(vcpu);
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index 6316ee3..8465c2a 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -395,6 +395,9 @@ int kvm_dev_ioctl_check_extension(long ext)
>  		r = 1;
>  		break;
>  #endif
> +	case KVM_CAP_SPAPR_MULTITCE:
> +		r = 1;
> +		break;
>  	default:
>  		r = 0;
>  		break;
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index a5c86fc..fc0d6b9 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -666,6 +666,7 @@ struct kvm_ppc_smmu_info {
>  #define KVM_CAP_IRQ_MPIC 90
>  #define KVM_CAP_PPC_RTAS 91
>  #define KVM_CAP_IRQ_XICS 92
> +#define KVM_CAP_SPAPR_MULTITCE 93
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls
  2013-06-05  6:11 [PATCH 0/4 v3] KVM: PPC: " Alexey Kardashevskiy
@ 2013-06-05  6:11 ` Alexey Kardashevskiy
  2013-06-16  4:20   ` Benjamin Herrenschmidt
  2013-06-16 22:06   ` Alexander Graf
  0 siblings, 2 replies; 36+ messages in thread
From: Alexey Kardashevskiy @ 2013-06-05  6:11 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: kvm, Alexey Kardashevskiy, Alexander Graf, kvm-ppc, linux-kernel,
	Paul Mackerras, linuxppc-dev, David Gibson

This adds real mode handlers for the H_PUT_TCE_INDIRECT and
H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO
devices or emulated PCI.  These calls allow adding multiple entries
(up to 512) into the TCE table in one call which saves time on
transition to/from real mode.

This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
(copied from user and verified) before writing the whole list into
the TCE table. This cache will be utilized more in the upcoming
VFIO/IOMMU support to continue TCE list processing in the virtual
mode in the case if the real mode handler failed for some reason.

This adds a guest physical to host real address converter
and calls the existing H_PUT_TCE handler. The converting function
is going to be fully utilized by upcoming VFIO supporting patches.

This also implements the KVM_CAP_PPC_MULTITCE capability,
so in order to support the functionality of this patch, QEMU
needs to query for this capability and set the "hcall-multi-tce"
hypertas property only if the capability is present, otherwise
there will be serious performance degradation.

Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>

---
Changelog:
2013/06/05:
* fixed mistype about IBMVIO in the commit message
* updated doc and moved it to another section
* changed capability number

2013/05/21:
* added kvm_vcpu_arch::tce_tmp
* removed cleanup if put_indirect failed, instead we do not even start
writing to TCE table if we cannot get TCEs from the user and they are
invalid
* kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
and kvmppc_emulated_validate_tce (for the previous item)
* fixed bug with failthrough for H_IPI
* removed all get_user() from real mode handlers
* kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
---
 Documentation/virtual/kvm/api.txt       |   17 ++
 arch/powerpc/include/asm/kvm_host.h     |    2 +
 arch/powerpc/include/asm/kvm_ppc.h      |   16 +-
 arch/powerpc/kvm/book3s_64_vio.c        |  118 ++++++++++++++
 arch/powerpc/kvm/book3s_64_vio_hv.c     |  266 +++++++++++++++++++++++++++----
 arch/powerpc/kvm/book3s_hv.c            |   39 +++++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |    6 +
 arch/powerpc/kvm/book3s_pr_papr.c       |   37 ++++-
 arch/powerpc/kvm/powerpc.c              |    3 +
 include/uapi/linux/kvm.h                |    1 +
 10 files changed, 473 insertions(+), 32 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 5f91eda..6c082ff 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2362,6 +2362,23 @@ calls by the guest for that service will be passed to userspace to be
 handled.
 
 
+4.83 KVM_CAP_PPC_MULTITCE
+
+Capability: KVM_CAP_PPC_MULTITCE
+Architectures: ppc
+Type: vm
+
+This capability tells the guest that multiple TCE entry add/remove hypercalls
+handling is supported by the kernel. This significanly accelerates DMA
+operations for PPC KVM guests.
+
+Unlike other capabilities in this section, this one does not have an ioctl.
+Instead, when the capability is present, the H_PUT_TCE_INDIRECT and
+H_STUFF_TCE hypercalls are to be handled in the host kernel and not passed to
+the guest. Othwerwise it might be better for the guest to continue using H_PUT_TCE
+hypercall (if KVM_CAP_SPAPR_TCE or KVM_CAP_SPAPR_TCE_IOMMU are present).
+
+
 5. The kvm_run structure
 ------------------------
 
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index af326cd..85d8f26 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -609,6 +609,8 @@ struct kvm_vcpu_arch {
 	spinlock_t tbacct_lock;
 	u64 busy_stolen;
 	u64 busy_preempt;
+
+	unsigned long *tce_tmp;    /* TCE cache for TCE_PUT_INDIRECT hall */
 #endif
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index a5287fe..e852921b 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -133,8 +133,20 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
 
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce *args);
-extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
-			     unsigned long ioba, unsigned long tce);
+extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
+		struct kvm_vcpu *vcpu, unsigned long liobn);
+extern long kvmppc_emulated_validate_tce(unsigned long tce);
+extern void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
+		unsigned long ioba, unsigned long tce);
+extern long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce);
+extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_list, unsigned long npages);
+extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages);
 extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
 				struct kvm_allocate_rma *rma);
 extern struct kvmppc_linear_info *kvm_alloc_rma(void);
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index b2d3f3b..06b7b20 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -14,6 +14,7 @@
  *
  * Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
  * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com>
+ * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com>
  */
 
 #include <linux/types.h>
@@ -36,8 +37,11 @@
 #include <asm/ppc-opcode.h>
 #include <asm/kvm_host.h>
 #include <asm/udbg.h>
+#include <asm/iommu.h>
+#include <asm/tce.h>
 
 #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
+#define ERROR_ADDR      ((void *)~(unsigned long)0x0)
 
 static long kvmppc_stt_npages(unsigned long window_size)
 {
@@ -148,3 +152,117 @@ fail:
 	}
 	return ret;
 }
+
+/* Converts guest physical address into host virtual */
+static void __user *kvmppc_virtmode_gpa_to_hva(struct kvm_vcpu *vcpu,
+		unsigned long gpa)
+{
+	unsigned long hva, gfn = gpa >> PAGE_SHIFT;
+	struct kvm_memory_slot *memslot;
+
+	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
+	if (!memslot)
+		return ERROR_ADDR;
+
+	hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa & ~PAGE_MASK);
+	return (void *) hva;
+}
+
+long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce)
+{
+	long ret;
+	struct kvmppc_spapr_tce_table *tt;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, put it to userspace */
+	if (!tt)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	if (ioba >= tt->window_size)
+		return H_PARAMETER;
+
+	ret = kvmppc_emulated_validate_tce(tce);
+	if (ret)
+		return ret;
+
+	kvmppc_emulated_put_tce(tt, ioba, tce);
+
+	return H_SUCCESS;
+}
+
+long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_list, unsigned long npages)
+{
+	struct kvmppc_spapr_tce_table *tt;
+	long i, ret;
+	unsigned long __user *tces;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, put it to userspace */
+	if (!tt)
+		return H_TOO_HARD;
+
+	/*
+	 * The spec says that the maximum size of the list is 512 TCEs so
+	 * so the whole table addressed resides in 4K page
+	 */
+	if (npages > 512)
+		return H_PARAMETER;
+
+	if (tce_list & ~IOMMU_PAGE_MASK)
+		return H_PARAMETER;
+
+	tces = kvmppc_virtmode_gpa_to_hva(vcpu, tce_list);
+	if (tces == ERROR_ADDR)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i) {
+		if (get_user(vcpu->arch.tce_tmp[i], tces + i))
+			return H_PARAMETER;
+
+		ret = kvmppc_emulated_validate_tce(vcpu->arch.tce_tmp[i]);
+		if (ret)
+			return ret;
+	}
+
+	for (i = 0; i < npages; ++i)
+		kvmppc_emulated_put_tce(tt,
+				ioba + (i << IOMMU_PAGE_SHIFT),
+				vcpu->arch.tce_tmp[i]);
+
+	return H_SUCCESS;
+}
+
+long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	struct kvmppc_spapr_tce_table *tt;
+	long i, ret;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, put it to userspace */
+	if (!tt)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+		return H_PARAMETER;
+
+	ret = kvmppc_emulated_validate_tce(tce_value);
+	if (ret || (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ)))
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
+		kvmppc_emulated_put_tce(tt, ioba, tce_value);
+
+	return H_SUCCESS;
+}
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 30c2f3b..c68d538 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -14,6 +14,7 @@
  *
  * Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
  * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com>
+ * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com>
  */
 
 #include <linux/types.h>
@@ -35,42 +36,249 @@
 #include <asm/ppc-opcode.h>
 #include <asm/kvm_host.h>
 #include <asm/udbg.h>
+#include <asm/iommu.h>
+#include <asm/tce.h>
 
 #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
+#define ERROR_ADDR      (~(unsigned long)0x0)
 
-/* WARNING: This will be called in real-mode on HV KVM and virtual
- *          mode on PR KVM
+/* Finds a TCE table descriptor by LIOBN */
+struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm_vcpu *vcpu,
+		unsigned long liobn)
+{
+	struct kvmppc_spapr_tce_table *tt;
+
+	list_for_each_entry(tt, &vcpu->kvm->arch.spapr_tce_tables, list) {
+		if (tt->liobn == liobn)
+			return tt;
+	}
+
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(kvmppc_find_tce_table);
+
+/*
+ * Validate TCE address.
+ * At the moment only flags are validated
+ * as other check will significantly slow down
+ * or can make it even impossible to handle TCE requests
+ * in real mode.
+ */
+long kvmppc_emulated_validate_tce(unsigned long tce)
+{
+	if (tce & ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ))
+		return H_PARAMETER;
+
+	return H_SUCCESS;
+}
+EXPORT_SYMBOL_GPL(kvmppc_emulated_validate_tce);
+
+/*
+ * kvmppc_emulated_put_tce() handles TCE requests for devices emulated
+ * by QEMU. It puts guest TCE values into the table and expects
+ * the QEMU to convert them later in the QEMU device implementation.
+ * Wiorks in both real and virtual modes.
+ * It cannot fail so kvmppc_emulated_validate_tce must be called before it.
  */
+void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
+		unsigned long ioba, unsigned long tce)
+{
+	unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
+	struct page *page;
+	u64 *tbl;
+
+	/*
+	 * Note on the use of page_address() in real mode,
+	 *
+	 * It is safe to use page_address() in real mode on ppc64 because
+	 * page_address() is always defined as lowmem_page_address()
+	 * which returns __va(PFN_PHYS(page_to_pfn(page))) which is arithmetial
+	 * operation and does not access page struct.
+	 *
+	 * Theoretically page_address() could be defined different
+	 * but either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL
+	 * should be enabled.
+	 * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64,
+	 * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only
+	 * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP
+	 * is not expected to be enabled on ppc32, page_address()
+	 * is safe for ppc32 as well.
+	 */
+#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
+#error TODO: fix to avoid page_address() here
+#endif
+	page = tt->pages[idx / TCES_PER_PAGE];
+	tbl = (u64 *)page_address(page);
+
+	/* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
+	tbl[idx % TCES_PER_PAGE] = tce;
+}
+EXPORT_SYMBOL_GPL(kvmppc_emulated_put_tce);
+
+#ifdef CONFIG_KVM_BOOK3S_64_HV
+
+static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
+			unsigned long *pte_sizep)
+{
+	pte_t *ptep;
+	unsigned int shift = 0;
+	pte_t pte, tmp, ret;
+
+	ptep = find_linux_pte_or_hugepte(pgdir, hva, &shift);
+	if (!ptep)
+		return __pte(0);
+	if (shift)
+		*pte_sizep = 1ul << shift;
+	else
+		*pte_sizep = PAGE_SIZE;
+
+	if (!pte_present(*ptep))
+		return __pte(0);
+
+	/* wait until _PAGE_BUSY is clear then set it atomically */
+	__asm__ __volatile__ (
+		"1:	ldarx	%0,0,%3\n"
+		"	andi.	%1,%0,%4\n"
+		"	bne-	1b\n"
+		"	ori	%1,%0,%4\n"
+		"	stdcx.	%1,0,%3\n"
+		"	bne-	1b"
+		: "=&r" (pte), "=&r" (tmp), "=m" (*ptep)
+		: "r" (ptep), "i" (_PAGE_BUSY)
+		: "cc");
+
+	ret = pte;
+
+	return ret;
+}
+
+/*
+ * Converts guest physical address into host physical address.
+ * Also returns pte and page size if the page is present in page table.
+ */
+static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
+		unsigned long gpa)
+{
+	struct kvm_memory_slot *memslot;
+	pte_t pte;
+	unsigned long hva, hpa, pg_size = 0, offset;
+	unsigned long gfn = gpa >> PAGE_SHIFT;
+	bool writing = gpa & TCE_PCI_WRITE;
+
+	/* Find a KVM memslot */
+	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
+	if (!memslot)
+		return ERROR_ADDR;
+
+	/* Convert guest physical address to host virtual */
+	hva = __gfn_to_hva_memslot(memslot, gfn);
+
+	/* Find a PTE and determine the size */
+	pte = kvmppc_lookup_pte(vcpu->arch.pgdir, hva,
+			writing, &pg_size);
+	if (!pte)
+		return ERROR_ADDR;
+
+	/* Calculate host phys address keeping flags and offset in the page */
+	offset = gpa & (pg_size - 1);
+
+	/* pte_pfn(pte) should return an address aligned to pg_size */
+	hpa = (pte_pfn(pte) << PAGE_SHIFT) + offset;
+
+	return hpa;
+}
+
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba, unsigned long tce)
 {
-	struct kvm *kvm = vcpu->kvm;
-	struct kvmppc_spapr_tce_table *stt;
-
-	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
-	/* 	    liobn, ioba, tce); */
-
-	list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
-		if (stt->liobn == liobn) {
-			unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
-			struct page *page;
-			u64 *tbl;
-
-			/* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p  window_size=0x%x\n", */
-			/* 	    liobn, stt, stt->window_size); */
-			if (ioba >= stt->window_size)
-				return H_PARAMETER;
-
-			page = stt->pages[idx / TCES_PER_PAGE];
-			tbl = (u64 *)page_address(page);
-
-			/* FIXME: Need to validate the TCE itself */
-			/* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
-			tbl[idx % TCES_PER_PAGE] = tce;
-			return H_SUCCESS;
-		}
+	long ret;
+	struct kvmppc_spapr_tce_table *tt;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, put it to virtual space */
+	if (!tt)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	if (ioba >= tt->window_size)
+		return H_PARAMETER;
+
+	ret = kvmppc_emulated_validate_tce(tce);
+	if (ret)
+		return ret;
+
+	kvmppc_emulated_put_tce(tt, ioba, tce);
+
+	return H_SUCCESS;
+}
+
+long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_list,	unsigned long npages)
+{
+	struct kvmppc_spapr_tce_table *tt;
+	long i, ret;
+	unsigned long *tces;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, put it to virtual space */
+	if (!tt)
+		return H_TOO_HARD;
+
+	/*
+	 * The spec says that the maximum size of the list is 512 TCEs so
+	 * so the whole table addressed resides in 4K page
+	 */
+	if (npages > 512)
+		return H_PARAMETER;
+
+	if (tce_list & ~IOMMU_PAGE_MASK)
+		return H_PARAMETER;
+
+	tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu, tce_list);
+	if ((unsigned long)tces == ERROR_ADDR)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i) {
+		ret = kvmppc_emulated_validate_tce(tces[i]);
+		if (ret)
+			return ret;
 	}
 
-	/* Didn't find the liobn, punt it to userspace */
-	return H_TOO_HARD;
+	for (i = 0; i < npages; ++i)
+		kvmppc_emulated_put_tce(tt, ioba + (i << IOMMU_PAGE_SHIFT),
+				tces[i]);
+
+	return H_SUCCESS;
+}
+
+long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	struct kvmppc_spapr_tce_table *tt;
+	long i, ret;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, put it to virtual space */
+	if (!tt)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+		return H_PARAMETER;
+
+	ret = kvmppc_emulated_validate_tce(tce_value);
+	if (ret || (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ)))
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
+		kvmppc_emulated_put_tce(tt, ioba, tce_value);
+
+	return H_SUCCESS;
 }
+#endif /* CONFIG_KVM_BOOK3S_64_HV */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 550f592..a39039a 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -568,6 +568,30 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
 			ret = kvmppc_xics_hcall(vcpu, req);
 			break;
 		} /* fallthrough */
+		return RESUME_HOST;
+	case H_PUT_TCE:
+		ret = kvmppc_virtmode_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
+						kvmppc_get_gpr(vcpu, 5),
+						kvmppc_get_gpr(vcpu, 6));
+		if (ret == H_TOO_HARD)
+			return RESUME_HOST;
+		break;
+	case H_PUT_TCE_INDIRECT:
+		ret = kvmppc_virtmode_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4),
+						kvmppc_get_gpr(vcpu, 5),
+						kvmppc_get_gpr(vcpu, 6),
+						kvmppc_get_gpr(vcpu, 7));
+		if (ret == H_TOO_HARD)
+			return RESUME_HOST;
+		break;
+	case H_STUFF_TCE:
+		ret = kvmppc_virtmode_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
+						kvmppc_get_gpr(vcpu, 5),
+						kvmppc_get_gpr(vcpu, 6),
+						kvmppc_get_gpr(vcpu, 7));
+		if (ret == H_TOO_HARD)
+			return RESUME_HOST;
+		break;
 	default:
 		return RESUME_HOST;
 	}
@@ -958,6 +982,20 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
 	vcpu->arch.cpu_type = KVM_CPU_3S_64;
 	kvmppc_sanity_check(vcpu);
 
+	/*
+	 * As we want to minimize the chance of having H_PUT_TCE_INDIRECT
+	 * half executed, we first read TCEs from the user, check them and
+	 * return error if something went wrong and only then put TCEs into
+	 * the TCE table.
+	 *
+	 * tce_tmp is a cache for TCEs to avoid stack allocation or
+	 * kmalloc as the whole TCE list can take up to 512 items 8 bytes
+	 * each (4096 bytes).
+	 */
+	vcpu->arch.tce_tmp = kmalloc(4096, GFP_KERNEL);
+	if (!vcpu->arch.tce_tmp)
+		goto free_vcpu;
+
 	return vcpu;
 
 free_vcpu:
@@ -980,6 +1018,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
 	unpin_vpa(vcpu->kvm, &vcpu->arch.slb_shadow);
 	unpin_vpa(vcpu->kvm, &vcpu->arch.vpa);
 	spin_unlock(&vcpu->arch.vpa_update_lock);
+	kfree(vcpu->arch.tce_tmp);
 	kvm_vcpu_uninit(vcpu);
 	kmem_cache_free(kvm_vcpu_cache, vcpu);
 }
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index b02f91e..d35554e 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1490,6 +1490,12 @@ hcall_real_table:
 	.long	0		/* 0x11c */
 	.long	0		/* 0x120 */
 	.long	.kvmppc_h_bulk_remove - hcall_real_table
+	.long	0		/* 0x128 */
+	.long	0		/* 0x12c */
+	.long	0		/* 0x130 */
+	.long	0		/* 0x134 */
+	.long	.kvmppc_h_stuff_tce - hcall_real_table
+	.long	.kvmppc_h_put_tce_indirect - hcall_real_table
 hcall_real_table_end:
 
 ignore_hdec:
diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c
index da0e0bc..91d4b45 100644
--- a/arch/powerpc/kvm/book3s_pr_papr.c
+++ b/arch/powerpc/kvm/book3s_pr_papr.c
@@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu)
 	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
 	long rc;
 
-	rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
+	rc = kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce);
+	if (rc == H_TOO_HARD)
+		return EMULATE_FAIL;
+	kvmppc_set_gpr(vcpu, 3, rc);
+	return EMULATE_DONE;
+}
+
+static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu)
+{
+	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
+	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
+	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
+	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
+	long rc;
+
+	rc = kvmppc_virtmode_h_put_tce_indirect(vcpu, liobn, ioba,
+			tce, npages);
+	if (rc == H_TOO_HARD)
+		return EMULATE_FAIL;
+	kvmppc_set_gpr(vcpu, 3, rc);
+	return EMULATE_DONE;
+}
+
+static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu)
+{
+	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
+	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
+	unsigned long tce_value = kvmppc_get_gpr(vcpu, 6);
+	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
+	long rc;
+
+	rc = kvmppc_virtmode_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages);
 	if (rc == H_TOO_HARD)
 		return EMULATE_FAIL;
 	kvmppc_set_gpr(vcpu, 3, rc);
@@ -247,6 +278,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd)
 		return kvmppc_h_pr_bulk_remove(vcpu);
 	case H_PUT_TCE:
 		return kvmppc_h_pr_put_tce(vcpu);
+	case H_PUT_TCE_INDIRECT:
+		return kvmppc_h_pr_put_tce_indirect(vcpu);
+	case H_STUFF_TCE:
+		return kvmppc_h_pr_stuff_tce(vcpu);
 	case H_CEDE:
 		vcpu->arch.shared->msr |= MSR_EE;
 		kvm_vcpu_block(vcpu);
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 6316ee3..8465c2a 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -395,6 +395,9 @@ int kvm_dev_ioctl_check_extension(long ext)
 		r = 1;
 		break;
 #endif
+	case KVM_CAP_SPAPR_MULTITCE:
+		r = 1;
+		break;
 	default:
 		r = 0;
 		break;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index a5c86fc..fc0d6b9 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -666,6 +666,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_IRQ_MPIC 90
 #define KVM_CAP_PPC_RTAS 91
 #define KVM_CAP_IRQ_XICS 92
+#define KVM_CAP_SPAPR_MULTITCE 93
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2013-06-17 10:48 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-05-21  3:06 [PATCH 0/4 v2] KVM: PPC: IOMMU in-kernel handling Alexey Kardashevskiy
2013-05-21  3:06 ` [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls Alexey Kardashevskiy
2013-05-27 10:08   ` Paolo Bonzini
2013-05-27 14:33     ` Alexey Kardashevskiy
2013-05-21  3:06 ` [PATCH 2/4] powerpc: Prepare to support kernel handling of IOMMU map/unmap Alexey Kardashevskiy
2013-05-21  3:06 ` [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling Alexey Kardashevskiy
2013-05-22 21:06   ` Scott Wood
2013-05-25  2:45     ` David Gibson
2013-05-27  2:44       ` Alexey Kardashevskiy
2013-05-28 17:45         ` Scott Wood
2013-05-28 23:30           ` Alexey Kardashevskiy
2013-05-28 23:35             ` Scott Wood
2013-05-29  0:12               ` Alexey Kardashevskiy
2013-05-29 20:05                 ` Scott Wood
2013-05-29 23:10                   ` Alexey Kardashevskiy
2013-05-29 23:14                     ` Scott Wood
2013-05-29 23:29                       ` Alexey Kardashevskiy
2013-05-29 23:32                         ` Scott Wood
2013-05-27 10:23       ` Paolo Bonzini
2013-05-27 14:26         ` Alexey Kardashevskiy
2013-05-27 14:41           ` Paolo Bonzini
2013-05-28 16:32       ` Scott Wood
2013-05-29  0:20         ` Alexey Kardashevskiy
2013-05-21  3:06 ` [PATCH 4/4] KVM: PPC: Add hugepage " Alexey Kardashevskiy
2013-06-05  6:11 [PATCH 0/4 v3] KVM: PPC: " Alexey Kardashevskiy
2013-06-05  6:11 ` [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls Alexey Kardashevskiy
2013-06-16  4:20   ` Benjamin Herrenschmidt
2013-06-16 22:06   ` Alexander Graf
2013-06-17  7:55     ` Alexey Kardashevskiy
2013-06-17  8:02       ` Alexander Graf
2013-06-17  8:34         ` Alexey Kardashevskiy
2013-06-17  8:40           ` Alexander Graf
2013-06-17  8:51             ` Alexey Kardashevskiy
2013-06-17 10:46               ` Alexander Graf
2013-06-17 10:48                 ` Alexander Graf
2013-06-17  8:37       ` Benjamin Herrenschmidt
2013-06-17  8:42         ` Alexander Graf

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).