All of lore.kernel.org
* [PATCH 1/4] powerpc: lookup_linux_pte has been made public
       [not found] <1360584763-21988-1-git-send-email-a>
  2013-02-11 12:12   ` aik
@ 2013-02-11 12:12   ` aik
  2013-02-11 12:12   ` aik
  2013-02-11 12:12   ` aik
  3 siblings, 0 replies; 27+ messages in thread
From: aik @ 2013-02-11 12:12 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alexey Kardashevskiy, Paul Mackerras, Alexander Graf,
	Michael Ellerman, linuxppc-dev, linux-kernel, kvm-ppc, kvm,
	David Gibson

From: Alexey Kardashevskiy <aik@ozlabs.ru>

The lookup_linux_pte() function returns a Linux PTE, which is
needed to convert a KVM guest physical address into a host real
address in real mode.

This conversion will be used by the upcoming H_PUT_TCE_INDIRECT
support, as the TCE list address comes directly from the guest
and is therefore a guest physical address.
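
For illustration, a real-mode caller could use the now-exported
helper roughly as follows (a minimal sketch modelled on the
guest-physical-to-host-real converter added in patch 2/4; the
variable names are illustrative):

	unsigned long pg_size = 0;
	pte_t pte;

	/* hva was derived from the guest physical address via the memslot */
	pte = lookup_linux_pte(vcpu->arch.pgdir, hva, writing, &pg_size);
	if (!pte_present(pte))
		return H_TOO_HARD;	/* fall back to virtual mode/userspace */

	/* pte_pfn() yields the host real page; keep the in-page offset */
	hwaddr = (pte_pfn(pte) << PAGE_SHIFT) + (gpa & (pg_size - 1));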

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: David Gibson <david@gibson.dropbear.id.au>
---
 arch/powerpc/include/asm/pgtable-ppc64.h |    3 +++
 arch/powerpc/kvm/book3s_hv_rm_mmu.c      |    4 ++--
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index 0182c20..ddcc898 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -377,6 +377,9 @@ static inline pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
 }
 #endif /* !CONFIG_HUGETLB_PAGE */
 
+pte_t lookup_linux_pte(pgd_t *pgdir, unsigned long hva,
+		int writing, unsigned long *pte_sizep);
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* _ASM_POWERPC_PGTABLE_PPC64_H_ */
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 19c93ba..6a042d0 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -145,8 +145,8 @@ static void remove_revmap_chain(struct kvm *kvm, long pte_index,
 	unlock_rmap(rmap);
 }
 
-static pte_t lookup_linux_pte(pgd_t *pgdir, unsigned long hva,
-			      int writing, unsigned long *pte_sizep)
+pte_t lookup_linux_pte(pgd_t *pgdir, unsigned long hva,
+		       int writing, unsigned long *pte_sizep)
 {
 	pte_t *ptep;
 	unsigned long ps = *pte_sizep;
-- 
1.7.10.4



* [PATCH 2/4] powerpc kvm: added multiple TCEs requests support
       [not found] <1360584763-21988-1-git-send-email-a>
  2013-02-11 12:12   ` aik
@ 2013-02-11 12:12   ` aik
  2013-02-11 12:12   ` aik
  2013-02-11 12:12   ` aik
  3 siblings, 0 replies; 27+ messages in thread
From: aik @ 2013-02-11 12:12 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alexey Kardashevskiy, Paul Mackerras, Alexander Graf,
	Michael Ellerman, linuxppc-dev, linux-kernel, kvm-ppc, kvm,
	David Gibson

From: Alexey Kardashevskiy <aik@ozlabs.ru>

The patch adds real mode handlers for the H_PUT_TCE_INDIRECT and
H_STUFF_TCE hypercalls for QEMU-emulated devices such as virtio
devices or emulated PCI. These calls allow adding multiple entries
(up to 512) to the TCE table in one call, which saves time on
transitions to/from real mode.
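
As a sketch of the guest side of the interface (not part of this
patch), a pseries guest passes the LIOBN, the starting IO bus
address, the guest physical address of a page holding up to 512
64-bit TCE values, and the entry count, roughly:

	/* tcep: page-aligned buffer of u64 TCE values built by the guest */
	rc = plpar_hcall_norets(H_PUT_TCE_INDIRECT, liobn, ioba,
				(u64)__pa(tcep), npages);

H_STUFF_TCE instead takes a single TCE value to be written into
npages consecutive entries.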

The patch adds a guest-physical-to-host-real address converter
and calls the existing H_PUT_TCE handler. The conversion function
will be fully utilized by the upcoming VFIO support patches.

The patch also implements the KVM_CAP_PPC_MULTITCE capability;
to use the functionality of this patch, QEMU needs to query
for this capability and set the "hcall-multi-tce" hypertas
property if the capability is present.
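
On the QEMU side this could look roughly like the sketch below;
kvm_check_extension() is QEMU's existing capability helper, while
spapr_hypertas_add() merely stands in for whatever code assembles
the "ibm,hypertas-functions" property:

	if (kvm_check_extension(kvm_state, KVM_CAP_PPC_MULTITCE)) {
		/* advertise H_PUT_TCE_INDIRECT and H_STUFF_TCE to the guest */
		spapr_hypertas_add(hypertas, "hcall-multi-tce");
	}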

Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/kvm_ppc.h      |   15 ++
 arch/powerpc/kvm/book3s_64_vio_hv.c     |  241 ++++++++++++++++++++++++++++---
 arch/powerpc/kvm/book3s_hv.c            |   23 +++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |    6 +
 arch/powerpc/kvm/book3s_pr_papr.c       |   37 ++++-
 arch/powerpc/kvm/powerpc.c              |    3 +
 include/uapi/linux/kvm.h                |    1 +
 7 files changed, 301 insertions(+), 25 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 572aa75..76d133b 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -136,6 +136,21 @@ extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce *args);
 extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 			     unsigned long ioba, unsigned long tce);
+extern long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_list, unsigned long npages);
+extern long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages);
+extern long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce);
+extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_list, unsigned long npages);
+extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages);
 extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
 				struct kvm_allocate_rma *rma);
 extern struct kvmppc_linear_info *kvm_alloc_rma(void);
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 30c2f3b..c38edcd 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -14,6 +14,7 @@
  *
  * Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
  * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com>
+ * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com>
  */
 
 #include <linux/types.h>
@@ -25,6 +26,7 @@
 #include <linux/slab.h>
 #include <linux/hugetlb.h>
 #include <linux/list.h>
+#include <linux/kvm_host.h>
 
 #include <asm/tlbflush.h>
 #include <asm/kvm_ppc.h>
@@ -35,42 +37,233 @@
 #include <asm/ppc-opcode.h>
 #include <asm/kvm_host.h>
 #include <asm/udbg.h>
+#include <asm/iommu.h>
+#include <asm/tce.h>
 
 #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
 
-/* WARNING: This will be called in real-mode on HV KVM and virtual
- *          mode on PR KVM
+static struct kvmppc_spapr_tce_table *find_tce_table(struct kvm_vcpu *vcpu,
+		unsigned long liobn)
+{
+	struct kvmppc_spapr_tce_table *stt;
+
+	list_for_each_entry(stt, &vcpu->kvm->arch.spapr_tce_tables, list) {
+		if (stt->liobn == liobn)
+			return stt;
+	}
+
+	return NULL;
+}
+
+/*
+ * Converts guest physical address into host virtual
+ * which is to be used later in get_user_pages_fast().
+ */
+static unsigned long get_virt_address(struct kvm_vcpu *vcpu,
+		unsigned long gpa, bool writing,
+		pte_t *ptep, unsigned long *pg_sizep)
+{
+	unsigned long hva, gfn = gpa >> PAGE_SHIFT;
+	struct kvm_memory_slot *memslot;
+
+	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
+	if (!memslot)
+		return 0;
+
+	/*
+	 * Convert gfn to hva preserving flags and an offset
+	 * within a system page
+	 */
+	hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa & ~PAGE_MASK);
+
+	/* Find out the page pte and size if requested */
+	if (ptep && pg_sizep) {
+		pte_t pte;
+		unsigned long pg_size = 0;
+
+		pte = lookup_linux_pte(vcpu->arch.pgdir, hva,
+				writing, &pg_size);
+		if (!pte_present(pte))
+			return 0;
+
+		*pg_sizep = pg_size;
+		*ptep = pte;
+	}
+
+	return hva;
+}
+
+/*
+ * Converts guest physical address into host real address.
+ * Also returns pte and page size if the page is present in page table.
+ */
+static unsigned long get_real_address(struct kvm_vcpu *vcpu,
+		unsigned long gpa, bool writing,
+		pte_t *ptep, unsigned long *pg_sizep)
+{
+	struct kvm_memory_slot *memslot;
+	pte_t pte;
+	unsigned long hva, pg_size = 0, hwaddr, offset;
+	unsigned long gfn = gpa >> PAGE_SHIFT;
+
+	/* Find a KVM memslot */
+	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
+	if (!memslot)
+		return 0;
+
+	/* Convert guest physical address to host virtual */
+	hva = __gfn_to_hva_memslot(memslot, gfn);
+
+	/* Find a PTE and determine the size */
+	pte = lookup_linux_pte(vcpu->arch.pgdir, hva,
+			writing, &pg_size);
+	if (!pte_present(pte))
+		return 0;
+
+	/* Calculate host phys address keeping flags and offset in the page */
+	offset = gpa & (pg_size - 1);
+
+	/* pte_pfn(pte) should return an address aligned to pg_size */
+	hwaddr = (pte_pfn(pte) << PAGE_SHIFT) + offset;
+
+	/* Copy outer values if required */
+	if (pg_sizep)
+		*pg_sizep = pg_size;
+	if (ptep)
+		*ptep = pte;
+
+	return hwaddr;
+}
+
+/*
+ * emulated_h_put_tce() handles TCE requests for devices emulated
+ * by QEMU. It puts guest TCE values into the table and expects
+ * QEMU to convert them later in the QEMU device implementation.
+ * Works in both real and virtual modes.
+ */
+static long emulated_h_put_tce(struct kvmppc_spapr_tce_table *stt,
+		unsigned long ioba, unsigned long tce)
+{
+	unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
+	struct page *page;
+	u64 *tbl;
+
+	/* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p  window_size=0x%x\n", */
+	/* 	    liobn, stt, stt->window_size); */
+	if (ioba >= stt->window_size) {
+		pr_err("%s failed on ioba=%lx\n", __func__, ioba);
+		return H_PARAMETER;
+	}
+
+	page = stt->pages[idx / TCES_PER_PAGE];
+	tbl = (u64 *)page_address(page);
+
+	/* FIXME: Need to validate the TCE itself */
+	/* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
+	tbl[idx % TCES_PER_PAGE] = tce;
+
+	return H_SUCCESS;
+}
+
+/*
+ * Real mode handlers
  */
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba, unsigned long tce)
 {
-	struct kvm *kvm = vcpu->kvm;
 	struct kvmppc_spapr_tce_table *stt;
 
-	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
-	/* 	    liobn, ioba, tce); */
+	stt = find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, punt it to userspace */
+	if (!stt)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	return emulated_h_put_tce(stt, ioba, tce);
+}
+
+long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_list,	unsigned long npages)
+{
+	struct kvmppc_spapr_tce_table *stt;
+	long i, ret = 0;
+	unsigned long *tces;
+
+	stt = find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, punt it to userspace */
+	if (!stt)
+		return H_TOO_HARD;
 
-	list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
-		if (stt->liobn == liobn) {
-			unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
-			struct page *page;
-			u64 *tbl;
+	tces = (void *) get_real_address(vcpu, tce_list, false, NULL, NULL);
+	if (!tces)
+		return H_TOO_HARD;
 
-			/* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p  window_size=0x%x\n", */
-			/* 	    liobn, stt, stt->window_size); */
-			if (ioba >= stt->window_size)
-				return H_PARAMETER;
+	/* Emulated IO */
+	for (i = 0; (i < npages) && !ret; ++i, ioba += IOMMU_PAGE_SIZE)
+		ret = emulated_h_put_tce(stt, ioba, tces[i]);
 
-			page = stt->pages[idx / TCES_PER_PAGE];
-			tbl = (u64 *)page_address(page);
+	return ret;
+}
 
-			/* FIXME: Need to validate the TCE itself */
-			/* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
-			tbl[idx % TCES_PER_PAGE] = tce;
-			return H_SUCCESS;
-		}
-	}
+long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	struct kvmppc_spapr_tce_table *stt;
+	long i, ret = 0;
+
+	stt = find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, punt it to userspace */
+	if (!stt)
+		return H_TOO_HARD;
 
-	/* Didn't find the liobn, punt it to userspace */
-	return H_TOO_HARD;
+	/* Emulated IO */
+	for (i = 0; (i < npages) && !ret; ++i, ioba += IOMMU_PAGE_SIZE)
+		ret = emulated_h_put_tce(stt, ioba, tce_value);
+
+	return ret;
+}
+
+/*
+ * Virtual mode handlers
+ */
+extern long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce)
+{
+	/* At the moment emulated IO is handled the same way */
+	return kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
+}
+
+extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_list, unsigned long npages)
+{
+	struct kvmppc_spapr_tce_table *stt;
+	unsigned long *tces;
+	long ret = 0, i;
+
+	stt = find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, punt it to userspace */
+	if (!stt)
+		return H_TOO_HARD;
+
+	tces = (void *) get_virt_address(vcpu, tce_list, false, NULL, NULL);
+	if (!tces)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	for (i = 0; (i < npages) && !ret; ++i, ioba += IOMMU_PAGE_SIZE)
+		ret = emulated_h_put_tce(stt, ioba, tces[i]);
+
+	return ret;
+}
+
+extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	/* At the moment emulated IO is handled the same way */
+	return kvmppc_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages);
 }
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 71d0c90..13c8436 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -515,6 +515,29 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
 					kvmppc_get_gpr(vcpu, 5),
 					kvmppc_get_gpr(vcpu, 6));
 		break;
+	case H_PUT_TCE:
+		ret = kvmppc_virtmode_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
+					        kvmppc_get_gpr(vcpu, 5),
+					        kvmppc_get_gpr(vcpu, 6));
+		if (ret == H_TOO_HARD)
+			return RESUME_HOST;
+		break;
+	case H_PUT_TCE_INDIRECT:
+		ret = kvmppc_virtmode_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4),
+					        kvmppc_get_gpr(vcpu, 5),
+					        kvmppc_get_gpr(vcpu, 6),
+						kvmppc_get_gpr(vcpu, 7));
+		if (ret == H_TOO_HARD)
+			return RESUME_HOST;
+		break;
+	case H_STUFF_TCE:
+		ret = kvmppc_virtmode_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
+					        kvmppc_get_gpr(vcpu, 5),
+					        kvmppc_get_gpr(vcpu, 6),
+						kvmppc_get_gpr(vcpu, 7));
+		if (ret == H_TOO_HARD)
+			return RESUME_HOST;
+		break;
 	default:
 		return RESUME_HOST;
 	}
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 10b6c35..0826e8b 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1390,6 +1390,12 @@ hcall_real_table:
 	.long	0		/* 0x11c */
 	.long	0		/* 0x120 */
 	.long	.kvmppc_h_bulk_remove - hcall_real_table
+	.long	0		/* 0x128 */
+	.long	0		/* 0x12c */
+	.long	0		/* 0x130 */
+	.long	0		/* 0x134 */
+	.long	.kvmppc_h_stuff_tce - hcall_real_table
+	.long	.kvmppc_h_put_tce_indirect - hcall_real_table
 hcall_real_table_end:
 
 ignore_hdec:
diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c
index ee02b30..270e88e 100644
--- a/arch/powerpc/kvm/book3s_pr_papr.c
+++ b/arch/powerpc/kvm/book3s_pr_papr.c
@@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu)
 	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
 	long rc;
 
-	rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
+	rc = kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce);
+	if (rc == H_TOO_HARD)
+		return EMULATE_FAIL;
+	kvmppc_set_gpr(vcpu, 3, rc);
+	return EMULATE_DONE;
+}
+
+static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu)
+{
+	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
+	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
+	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
+	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
+	long rc;
+
+	rc = kvmppc_virtmode_h_put_tce_indirect(vcpu, liobn, ioba,
+			tce, npages);
+	if (rc == H_TOO_HARD)
+		return EMULATE_FAIL;
+	kvmppc_set_gpr(vcpu, 3, rc);
+	return EMULATE_DONE;
+}
+
+static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu)
+{
+	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
+	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
+	unsigned long tce_value = kvmppc_get_gpr(vcpu, 6);
+	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
+	long rc;
+
+	rc = kvmppc_virtmode_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages);
 	if (rc == H_TOO_HARD)
 		return EMULATE_FAIL;
 	kvmppc_set_gpr(vcpu, 3, rc);
@@ -240,6 +271,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd)
 		return kvmppc_h_pr_bulk_remove(vcpu);
 	case H_PUT_TCE:
 		return kvmppc_h_pr_put_tce(vcpu);
+	case H_PUT_TCE_INDIRECT:
+		return kvmppc_h_pr_put_tce_indirect(vcpu);
+	case H_STUFF_TCE:
+		return kvmppc_h_pr_stuff_tce(vcpu);
 	case H_CEDE:
 		vcpu->arch.shared->msr |= MSR_EE;
 		kvm_vcpu_block(vcpu);
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 70739a0..95614c7 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -383,6 +383,9 @@ int kvm_dev_ioctl_check_extension(long ext)
 		r = 1;
 		break;
 #endif
+	case KVM_CAP_PPC_MULTITCE:
+		r = 1;
+		break;
 	default:
 		r = 0;
 		break;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index e6e5d4b..26e2b271 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -635,6 +635,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_IRQFD_RESAMPLE 82
 #define KVM_CAP_PPC_BOOKE_WATCHDOG 83
 #define KVM_CAP_PPC_HTAB_FD 84
+#define KVM_CAP_PPC_MULTITCE 87
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
1.7.10.4



* [PATCH 3/4] powerpc: preparing to support real mode optimization
       [not found] <1360584763-21988-1-git-send-email-a>
  2013-02-11 12:12   ` aik
@ 2013-02-11 12:12   ` aik
  2013-02-11 12:12   ` aik
  2013-02-11 12:12   ` aik
  3 siblings, 0 replies; 27+ messages in thread
From: aik @ 2013-02-11 12:12 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alexey Kardashevskiy, Paul Mackerras, Alexander Graf,
	Michael Ellerman, linuxppc-dev, linux-kernel, kvm-ppc, kvm,
	David Gibson

From: Alexey Kardashevskiy <aik@ozlabs.ru>

The current VFIO-on-POWER implementation supports only user mode
driven mapping, i.e. QEMU sends requests to map/unmap pages.
However, this approach is too slow for fast hardware, so the
mapping is better done in real mode.

The patch adds an API to increment/decrement the page use counter,
as the get_user_pages() API used for user mode mapping does not
work in real mode.

CONFIG_SPARSEMEM_VMEMMAP and CONFIG_FLATMEM are supported.
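
A real-mode user of the new API (e.g. a TCE handler) could pin a
page roughly as follows; only vmemmap_pfn_to_page() and
vmemmap_get_page() come from this patch, the surrounding logic is
a sketch:

	struct page *pg = vmemmap_pfn_to_page(pte_pfn(pte));

	if (!pg || vmemmap_get_page(pg))
		/* fall back to virtual mode, where get_user_pages() works */
		return H_TOO_HARD;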

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: David Gibson <david@gibson.dropbear.id.au>
---
 arch/powerpc/include/asm/pgtable-ppc64.h |    3 ++
 arch/powerpc/mm/init_64.c                |   56 +++++++++++++++++++++++++++++-
 2 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index ddcc898..b7a1fb2 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -377,6 +377,9 @@ static inline pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
 }
 #endif /* !CONFIG_HUGETLB_PAGE */
 
+struct page *vmemmap_pfn_to_page(unsigned long pfn);
+long vmemmap_get_page(struct page *page);
+long vmemmap_put_page(struct page *page);
 pte_t lookup_linux_pte(pgd_t *pgdir, unsigned long hva,
 		int writing, unsigned long *pte_sizep);
 
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 95a4529..068e9e9 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -297,5 +297,59 @@ int __meminit vmemmap_populate(struct page *start_page,
 
 	return 0;
 }
-#endif /* CONFIG_SPARSEMEM_VMEMMAP */
 
+struct page *vmemmap_pfn_to_page(unsigned long pfn)
+{
+	struct vmemmap_backing *vmem_back;
+	struct page *page;
+	unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;
+	unsigned long pg_va = (unsigned long) pfn_to_page(pfn);
+
+	for (vmem_back = vmemmap_list; vmem_back; vmem_back = vmem_back->list) {
+		if (pg_va < vmem_back->virt_addr)
+			continue;
+
+		/* Check that page struct is not split between real pages */
+		if ((pg_va + sizeof(struct page)) >
+				(vmem_back->virt_addr + page_size))
+			return NULL;
+
+		page = (struct page *) (vmem_back->phys + pg_va -
+				vmem_back->virt_addr);
+		return page;
+	}
+
+	return NULL;
+}
+
+#elif defined(CONFIG_FLATMEM)
+
+struct page *vmemmap_pfn_to_page(unsigned long pfn)
+{
+	struct page *page = pfn_to_page(pfn);
+	return page;
+}
+
+#endif /* CONFIG_SPARSEMEM_VMEMMAP/CONFIG_FLATMEM */
+
+#if defined(CONFIG_SPARSEMEM_VMEMMAP) || defined(CONFIG_FLATMEM)
+long vmemmap_get_page(struct page *page)
+{
+	if (PageTail(page))
+		return -EAGAIN;
+
+	get_page(page);
+
+	return 0;
+}
+
+long vmemmap_put_page(struct page *page)
+{
+	if (PageCompound(page))
+		return -EAGAIN;
+
+	put_page(page);
+
+	return 0;
+}
+#endif
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 3/4] powerpc: preparing to support real mode optimization
@ 2013-02-11 12:12   ` aik
  0 siblings, 0 replies; 27+ messages in thread
From: aik @ 2013-02-11 12:12 UTC (permalink / raw)
  Cc: kvm, Alexey Kardashevskiy, Alexander Graf, kvm-ppc, linux-kernel,
	Paul Mackerras, linuxppc-dev, David Gibson

From: Alexey Kardashevskiy <aik@ozlabs.ru>

he current VFIO-on-POWER implementation supports only user mode
driven mapping, i.e. QEMU is sending requests to map/unmap pages.
However this approach is really slow in really fast hardware so
it is better to be moved to the real mode.

The patch adds an API to increment/decrement page counter as
get_user_pages API used for user mode mapping does not work
in the real mode.

CONFIG_SPARSEMEM_VMEMMAP and CONFIG_FLATMEN are supported.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: David Gibson <david@gibson.dropbear.id.au>
---
 arch/powerpc/include/asm/pgtable-ppc64.h |    3 ++
 arch/powerpc/mm/init_64.c                |   56 +++++++++++++++++++++++++++++-
 2 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index ddcc898..b7a1fb2 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -377,6 +377,9 @@ static inline pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
 }
 #endif /* !CONFIG_HUGETLB_PAGE */
 
+struct page *vmemmap_pfn_to_page(unsigned long pfn);
+long vmemmap_get_page(struct page *page);
+long vmemmap_put_page(struct page *page);
 pte_t lookup_linux_pte(pgd_t *pgdir, unsigned long hva,
 		int writing, unsigned long *pte_sizep);
 
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 95a4529..068e9e9 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -297,5 +297,59 @@ int __meminit vmemmap_populate(struct page *start_page,
 
 	return 0;
 }
-#endif /* CONFIG_SPARSEMEM_VMEMMAP */
 
+struct page *vmemmap_pfn_to_page(unsigned long pfn)
+{
+	struct vmemmap_backing *vmem_back;
+	struct page *page;
+	unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;
+	unsigned long pg_va = (unsigned long) pfn_to_page(pfn);
+
+	for (vmem_back = vmemmap_list; vmem_back; vmem_back = vmem_back->list) {
+		if (pg_va < vmem_back->virt_addr)
+			continue;
+
+		/* Check that page struct is not split between real pages */
+		if ((pg_va + sizeof(struct page)) >
+				(vmem_back->virt_addr + page_size))
+			return NULL;
+
+		page = (struct page *) (vmem_back->phys + pg_va -
+				vmem_back->virt_addr);
+		return page;
+	}
+
+	return NULL;
+}
+
+#elif defined(CONFIG_FLATMEM)
+
+struct page *vmemmap_pfn_to_page(unsigned long pfn)
+{
+	struct page *page = pfn_to_page(pfn);
+	return page;
+}
+
+#endif /* CONFIG_SPARSEMEM_VMEMMAP/CONFIG_FLATMEM */
+
+#if defined(CONFIG_SPARSEMEM_VMEMMAP) || defined(CONFIG_FLATMEM)
+long vmemmap_get_page(struct page *page)
+{
+	if (PageTail(page))
+		return -EAGAIN;
+
+	get_page(page);
+
+	return 0;
+}
+
+long vmemmap_put_page(struct page *page)
+{
+	if (PageCompound(page))
+		return -EAGAIN;
+
+	put_page(page);
+
+	return 0;
+}
+#endif
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 4/4] vfio powerpc: added real mode support
       [not found] <1360584763-21988-1-git-send-email-a>
  2013-02-11 12:12   ` aik
@ 2013-02-11 12:12   ` aik
  2013-02-11 12:12   ` aik
  2013-02-11 12:12   ` aik
  3 siblings, 0 replies; 27+ messages in thread
From: aik @ 2013-02-11 12:12 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alexey Kardashevskiy, Paul Mackerras, Alexander Graf,
	Michael Ellerman, linuxppc-dev, linux-kernel, kvm-ppc, kvm,
	David Gibson

From: Alexey Kardashevskiy <aik@ozlabs.ru>

The patch allows the host kernel to handle H_PUT_TCE requests without
involving QEMU, which should save the time spent switching from the
kernel to QEMU and back.

The patch adds an IOMMU ID parameter to the KVM_CAP_SPAPR_TCE ioctl;
QEMU needs to be updated to support it.

At the moment H_PUT_TCE is processed in virtual mode, as the page to
be mapped may not be present in RAM and paging may be involved, which
can only be done from virtual mode.

Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb Ethernet card).
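
As a usage note (not part of the patch), a minimal userspace sketch of
creating a TCE table bound to an IOMMU group with the new ioctl; the
liobn and iommu_id values are illustrative, and vm_fd is assumed to be
a KVM VM file descriptor obtained elsewhere:

#include <sys/ioctl.h>
#include <linux/kvm.h>

static int create_tce_iommu(int vm_fd)
{
	struct kvm_create_spapr_tce_iommu args = {
		.liobn    = 0x80000001ULL,	/* example LIOBN */
		.iommu_id = 0,			/* example IOMMU group ID */
		/* or SPAPR_TCE_PUT_TCE_VIRTMODE_ONLY for debugging */
		.flags    = 0,
	};

	return ioctl(vm_fd, KVM_CREATE_SPAPR_TCE_IOMMU, &args);
}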

Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/iommu.h    |   10 ++
 arch/powerpc/include/asm/kvm_host.h |    2 +
 arch/powerpc/include/asm/kvm_ppc.h  |    2 +
 arch/powerpc/include/uapi/asm/kvm.h |    8 ++
 arch/powerpc/kernel/iommu.c         |  253 +++++++++++++++++++++++++++++++++--
 arch/powerpc/kvm/book3s_64_vio.c    |   55 +++++++-
 arch/powerpc/kvm/book3s_64_vio_hv.c |  186 +++++++++++++++++++++++--
 arch/powerpc/kvm/powerpc.c          |   11 ++
 include/uapi/linux/kvm.h            |    1 +
 9 files changed, 503 insertions(+), 25 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 900294b..4a479e6 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -78,6 +78,7 @@ struct iommu_table {
 	unsigned long *it_map;       /* A simple allocation bitmap for now */
 #ifdef CONFIG_IOMMU_API
 	struct iommu_group *it_group;
+	struct list_head it_hugepages;
 #endif
 };
 
@@ -158,6 +159,15 @@ extern long iommu_clear_tce_user_mode(struct iommu_table *tbl,
 		unsigned long npages);
 extern long iommu_put_tce_user_mode(struct iommu_table *tbl,
 		unsigned long ioba, unsigned long tce);
+extern long iommu_put_tce_real_mode(struct iommu_table *tbl,
+		unsigned long ioba, unsigned long tce,
+		pte_t pte, unsigned long pg_size);
+extern long iommu_clear_tce_real_mode(struct iommu_table *tbl,
+		unsigned long ioba, unsigned long tce_value,
+		unsigned long npages);
+extern long iommu_put_tce_virt_mode(struct iommu_table *tbl,
+		unsigned long ioba, unsigned long tce,
+		pte_t pte, unsigned long pg_size);
 
 extern void iommu_flush_tce(struct iommu_table *tbl);
 extern long iommu_lock_table(struct iommu_table *tbl, bool lock);
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index ca9bf45..6fb22f8 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -178,6 +178,8 @@ struct kvmppc_spapr_tce_table {
 	struct kvm *kvm;
 	u64 liobn;
 	u32 window_size;
+	bool virtmode_only;
+	struct iommu_table *tbl;
 	struct page *pages[0];
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 76d133b..45c2a6c 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -134,6 +134,8 @@ extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
 extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce *args);
+extern long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
+				struct kvm_create_spapr_tce_iommu *args);
 extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 			     unsigned long ioba, unsigned long tce);
 extern long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 2fba8a6..9578696 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -291,6 +291,14 @@ struct kvm_create_spapr_tce {
 	__u32 window_size;
 };
 
+/* for KVM_CAP_SPAPR_TCE_IOMMU */
+struct kvm_create_spapr_tce_iommu {
+	__u64 liobn;
+	__u32 iommu_id;
+#define SPAPR_TCE_PUT_TCE_VIRTMODE_ONLY	1 /* for debug purposes */
+	__u32 flags;
+};
+
 /* for KVM_ALLOCATE_RMA */
 struct kvm_allocate_rma {
 	__u64 rma_size;
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index b4fdabc..acb9cdc 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -47,6 +47,8 @@
 #include <asm/fadump.h>
 #include <asm/vio.h>
 #include <asm/tce.h>
+#include <asm/kvm_book3s_64.h>
+#include <asm/page.h>
 
 #define DBG(...)
 
@@ -727,6 +729,7 @@ void iommu_register_group(struct iommu_table * tbl,
 		return;
 	}
 	tbl->it_group = grp;
+	INIT_LIST_HEAD(&tbl->it_hugepages);
 	iommu_group_set_iommudata(grp, tbl, group_release);
 	iommu_group_set_name(grp, kasprintf(GFP_KERNEL, "domain%d-pe%lx",
 			domain_number, pe_num));
@@ -906,6 +909,83 @@ void kvm_iommu_unmap_pages(struct kvm *kvm, struct kvm_memory_slot *slot)
 {
 }
 
+/*
+ * The KVM guest can be backed with 16MB pages (qemu switch
+ * -mem-path /var/lib/hugetlbfs/global/pagesize-16MB/).
+ * In this case, we cannot do page counting from the real mode
+ * as the compound pages are used - they are linked in a list
+ * with pointers as virtual addresses which are inaccessible
+ * in real mode.
+ *
+ * The code below keeps a 16MB pages list and uses page struct
+ * in real mode if it is already locked in RAM and inserted into
+ * the list or switches to the virtual mode where it can be
+ * handled in a usual manner.
+ */
+struct iommu_kvmppc_hugepages {
+	struct list_head list;
+	pte_t pte;		/* Huge page PTE */
+	unsigned long pa;	/* Base phys address used as a real TCE */
+	struct page *page;	/* page struct of the very first subpage */
+	unsigned long size;	/* Huge page size (always 16MB at the moment) */
+	bool dirty;		/* Dirty bit */
+};
+
+static struct iommu_kvmppc_hugepages *find_hp_by_pte(struct iommu_table *tbl,
+		pte_t pte)
+{
+	struct iommu_kvmppc_hugepages *hp;
+
+	list_for_each_entry(hp, &tbl->it_hugepages, list) {
+		if (hp->pte == pte)
+			return hp;
+	}
+
+	return NULL;
+}
+
+static struct iommu_kvmppc_hugepages *find_hp_by_pa(struct iommu_table *tbl,
+		unsigned long pa)
+{
+	struct iommu_kvmppc_hugepages *hp;
+
+	list_for_each_entry(hp, &tbl->it_hugepages, list) {
+		if ((hp->pa <= pa) && (pa < hp->pa + hp->size))
+			return hp;
+	}
+
+	return NULL;
+}
+
+static struct iommu_kvmppc_hugepages *add_hp(struct iommu_table *tbl,
+		pte_t pte, unsigned long va, unsigned long pg_size)
+{
+	int ret;
+	struct iommu_kvmppc_hugepages *hp;
+
+	hp = kzalloc(sizeof(*hp), GFP_KERNEL);
+	if (!hp)
+		return NULL;
+
+	hp->pte = pte;
+	va = va & ~(pg_size - 1);
+	ret = get_user_pages_fast(va, 1, true/*write*/, &hp->page);
+	if ((ret != 1) || !hp->page) {
+		kfree(hp);
+		return NULL;
+	}
+#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
+#error TODO: fix to avoid page_address() here
+#endif
+	hp->pa = __pa((unsigned long) page_address(hp->page));
+
+	hp->size = pg_size;
+
+	list_add(&hp->list, &tbl->it_hugepages);
+
+	return hp;
+}
+
 static enum dma_data_direction tce_direction(unsigned long tce)
 {
 	if ((tce & TCE_PCI_READ) && (tce & TCE_PCI_WRITE))
@@ -974,14 +1054,16 @@ static long tce_put_param_check(struct iommu_table *tbl,
 	return 0;
 }
 
-static long clear_tce(struct iommu_table *tbl,
+static long clear_tce(struct iommu_table *tbl, bool realmode,
 		unsigned long entry, unsigned long pages)
 {
+	long ret = 0;
 	unsigned long oldtce;
 	struct page *page;
 	struct iommu_pool *pool;
+	struct iommu_kvmppc_hugepages *hp;
 
-	for ( ; pages; --pages, ++entry) {
+	for ( ; pages && !ret; --pages, ++entry) {
 		pool = get_pool(tbl, entry);
 		spin_lock(&(pool->lock));
 
@@ -989,12 +1071,32 @@ static long clear_tce(struct iommu_table *tbl,
 		if (oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)) {
 			ppc_md.tce_free(tbl, entry, 1);
 
-			page = pfn_to_page(oldtce >> PAGE_SHIFT);
-			WARN_ON(!page);
-			if (page) {
+			/* Release of huge pages is postponed till KVM's exit */
+			hp = find_hp_by_pa(tbl, oldtce);
+			if (hp) {
 				if (oldtce & TCE_PCI_WRITE)
-					SetPageDirty(page);
-				put_page(page);
+					hp->dirty = true;
+			} else if (realmode) {
+				/* Release a small page in real mode */
+				page = vmemmap_pfn_to_page(
+						oldtce >> PAGE_SHIFT);
+				if (page) {
+					if (oldtce & TCE_PCI_WRITE)
+						SetPageDirty(page);
+					ret = vmemmap_put_page(page);
+				} else {
+					/* Retry in virtual mode */
+					ret = -EAGAIN;
+				}
+			} else {
+				/* Release a small page in virtual mode */
+				page = pfn_to_page(oldtce >> PAGE_SHIFT);
+				WARN_ON(!page);
+				if (page) {
+					if (oldtce & TCE_PCI_WRITE)
+						SetPageDirty(page);
+					put_page(page);
+				}
 			}
 		}
 		spin_unlock(&(pool->lock));
@@ -1011,7 +1113,7 @@ long iommu_clear_tce_user_mode(struct iommu_table *tbl, unsigned long ioba,
 
 	ret = tce_clear_param_check(tbl, ioba, tce_value, npages);
 	if (!ret)
-		ret = clear_tce(tbl, entry, npages);
+		ret = clear_tce(tbl, false, entry, npages);
 
 	if (ret < 0)
 		pr_err("iommu_tce: %s failed ioba=%lx, tce_value=%lx ret=%ld\n",
@@ -1021,6 +1123,24 @@ long iommu_clear_tce_user_mode(struct iommu_table *tbl, unsigned long ioba,
 }
 EXPORT_SYMBOL_GPL(iommu_clear_tce_user_mode);
 
+long iommu_clear_tce_real_mode(struct iommu_table *tbl, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	long ret;
+	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
+
+	ret = tce_clear_param_check(tbl, ioba, tce_value, npages);
+	if (!ret)
+		ret = clear_tce(tbl, true, entry, npages);
+
+	if (ret < 0)
+		pr_err("iommu_tce: %s failed ioba=%lx, tce_value=%lx ret=%ld\n",
+				__func__, ioba, tce_value, ret);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_clear_tce_real_mode);
+
 /* hwaddr is a virtual address here, tce_build converts it to physical */
 static long do_tce_build(struct iommu_table *tbl, unsigned long entry,
 		unsigned long hwaddr, enum dma_data_direction direction)
@@ -1088,6 +1208,112 @@ long iommu_put_tce_user_mode(struct iommu_table *tbl, unsigned long ioba,
 }
 EXPORT_SYMBOL_GPL(iommu_put_tce_user_mode);
 
+static long put_tce_virt_mode(struct iommu_table *tbl,
+		unsigned long entry, unsigned long tce,
+		pte_t pte, unsigned long pg_size)
+{
+	struct iommu_kvmppc_hugepages *hp;
+	enum dma_data_direction direction = tce_direction(tce);
+
+	/* Small page size case, easy to handle... */
+	if (pg_size <= PAGE_SIZE)
+		return put_tce_user_mode(tbl, entry, tce);
+
+	/*
+	 * Hugepages case - manage the hugepage list.
+	 * find_hp_by_pte() may find a huge page if called
+	 * from h_put_tce_indirect call.
+	 */
+	hp = find_hp_by_pte(tbl, pte);
+	if (!hp) {
+		/* This is the first time usage of this huge page */
+		hp = add_hp(tbl, pte, tce, pg_size);
+		if (!hp)
+			return -EFAULT;
+	}
+
+	tce = (unsigned long) __va(hp->pa) + (tce & (pg_size - 1));
+
+	return do_tce_build(tbl, entry, tce, direction);
+}
+
+long iommu_put_tce_virt_mode(struct iommu_table *tbl,
+		unsigned long ioba, unsigned long tce,
+		pte_t pte, unsigned long pg_size)
+{
+	long ret;
+	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
+
+	ret = tce_put_param_check(tbl, ioba, tce);
+	if (!ret)
+		ret = put_tce_virt_mode(tbl, entry, tce, pte, pg_size);
+
+	if (ret < 0)
+		pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%ld\n",
+				__func__, ioba, tce, ret);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_put_tce_virt_mode);
+
+static long put_tce_real_mode(struct iommu_table *tbl,
+		unsigned long entry, unsigned long tce,
+		pte_t pte, unsigned long pg_size)
+{
+	long ret;
+	struct page *page = NULL;
+	struct iommu_kvmppc_hugepages *hp = NULL;
+	enum dma_data_direction direction = tce_direction(tce);
+
+	/* This is a huge page; we continue only if it is already in the list */
+	if (pg_size > PAGE_SIZE) {
+		hp = find_hp_by_pte(tbl, pte);
+
+		/* Go to virt mode to add a hugepage to the list if not found */
+		if (!hp)
+			return -EAGAIN;
+
+		/* tce_build accepts virtual addresses */
+		return do_tce_build(tbl, entry, (unsigned long) __va(tce),
+				direction);
+	}
+
+	/* Small page case, find page struct to increment a counter */
+	page = vmemmap_pfn_to_page(tce >> PAGE_SHIFT);
+	if (!page)
+		return -EAGAIN;
+
+	ret = vmemmap_get_page(page);
+	if (ret)
+		return ret;
+
+	/* tce_build accepts virtual addresses */
+	ret = do_tce_build(tbl, entry, (unsigned long) __va(tce), direction);
+	if (ret)
+		vmemmap_put_page(page);
+
+	return ret;
+}
+
+long iommu_put_tce_real_mode(struct iommu_table *tbl,
+		unsigned long ioba, unsigned long tce,
+		pte_t pte, unsigned long pg_size)
+{
+	long ret;
+	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
+
+	ret = tce_put_param_check(tbl, ioba, tce);
+	if (!ret)
+		ret = put_tce_real_mode(tbl, entry, tce, pte, pg_size);
+
+	if (ret < 0)
+		pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%ld\n",
+				__func__, ioba, tce, ret);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_put_tce_real_mode);
+
 /*
  * Helpers to do locked pages accounting.
  * Called from ioctl so down_write_trylock is not necessary.
@@ -1111,6 +1337,7 @@ long iommu_lock_table(struct iommu_table *tbl, bool lock)
 {
 	unsigned long sz = (tbl->it_size + 7) >> 3;
 	unsigned long locked, lock_limit;
+	struct iommu_kvmppc_hugepages *hp, *tmp;
 
 	if (lock) {
 		/*
@@ -1139,9 +1366,17 @@ long iommu_lock_table(struct iommu_table *tbl, bool lock)
 	}
 
 	/* Clear TCE table */
-	clear_tce(tbl, tbl->it_offset, tbl->it_size);
+	clear_tce(tbl, false, tbl->it_offset, tbl->it_size);
 
 	if (!lock) {
+		list_for_each_entry_safe(hp, tmp, &tbl->it_hugepages, list) {
+			list_del(&hp->list);
+			if (hp->dirty)
+				SetPageDirty(hp->page);
+			put_page(hp->page);
+			kfree(hp);
+		}
+
 		lock_acct(-tbl->it_size);
 		memset(tbl->it_map, 0, sz);
 
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 72ffc89..c3c29a0 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -26,6 +26,8 @@
 #include <linux/hugetlb.h>
 #include <linux/list.h>
 #include <linux/anon_inodes.h>
+#include <linux/pci.h>
+#include <linux/iommu.h>
 
 #include <asm/tlbflush.h>
 #include <asm/kvm_ppc.h>
@@ -36,6 +38,7 @@
 #include <asm/ppc-opcode.h>
 #include <asm/kvm_host.h>
 #include <asm/udbg.h>
+#include <asm/iommu.h>
 
 #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
 
@@ -52,8 +55,10 @@ static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
 
 	mutex_lock(&kvm->lock);
 	list_del(&stt->list);
-	for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
-		__free_page(stt->pages[i]);
+	if (!stt->tbl) {
+		for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
+			__free_page(stt->pages[i]);
+	}
 	kfree(stt);
 	mutex_unlock(&kvm->lock);
 
@@ -148,3 +153,49 @@ fail:
 	}
 	return ret;
 }
+
+long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
+		struct kvm_create_spapr_tce_iommu *args)
+{
+	struct kvmppc_spapr_tce_table *stt = NULL;
+	struct pci_dev *pdev = NULL;
+
+	/* Check this LIOBN hasn't been previously allocated */
+	list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
+		if (stt->liobn == args->liobn)
+			return -EBUSY;
+	}
+
+	stt = kzalloc(sizeof(*stt), GFP_KERNEL);
+	if (!stt)
+		return -ENOMEM;
+
+	stt->liobn = args->liobn;
+	stt->kvm = kvm;
+	stt->virtmode_only = !!(args->flags & SPAPR_TCE_PUT_TCE_VIRTMODE_ONLY);
+
+	/* Find an IOMMU table for the given ID */
+	for_each_pci_dev(pdev) {
+		struct iommu_table *tbl;
+
+		tbl = get_iommu_table_base(&pdev->dev);
+		if (!tbl)
+			continue;
+		if (iommu_group_id(tbl->it_group) != args->iommu_id)
+			continue;
+
+		stt->tbl = tbl;
+		pr_info("LIOBN=%llX hooked to IOMMU %d, virtmode_only=%u\n",
+				stt->liobn, args->iommu_id, stt->virtmode_only);
+		break;
+	}
+
+	kvm_get_kvm(kvm);
+
+	mutex_lock(&kvm->lock);
+	list_add(&stt->list, &kvm->arch.spapr_tce_tables);
+
+	mutex_unlock(&kvm->lock);
+
+	return 0;
+}
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index c38edcd..b2aa957 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -171,6 +171,7 @@ static long emulated_h_put_tce(struct kvmppc_spapr_tce_table *stt,
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba, unsigned long tce)
 {
+	long ret;
 	struct kvmppc_spapr_tce_table *stt;
 
 	stt = find_tce_table(vcpu, liobn);
@@ -178,8 +179,37 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (!stt)
 		return H_TOO_HARD;
 
+	if (stt->virtmode_only)
+		return H_TOO_HARD;
+
 	/* Emulated IO */
-	return emulated_h_put_tce(stt, ioba, tce);
+	if (!stt->tbl)
+		return emulated_h_put_tce(stt, ioba, tce);
+
+	/* VFIO IOMMU */
+	if (tce & (TCE_PCI_READ | TCE_PCI_WRITE)) {
+		unsigned long hpa, pg_size = 0;
+		pte_t pte = 0;
+
+		hpa = get_real_address(vcpu, tce, tce & TCE_PCI_WRITE,
+				&pte, &pg_size);
+		if (!hpa)
+			return H_TOO_HARD;
+
+		ret = iommu_put_tce_real_mode(stt->tbl, ioba, hpa,
+				pte, pg_size);
+	} else {
+		ret = iommu_clear_tce_real_mode(stt->tbl, ioba, 0, 1);
+	}
+	iommu_flush_tce(stt->tbl);
+
+	if (ret == -EAGAIN)
+		return H_TOO_HARD;
+
+	if (ret < 0)
+		return H_PARAMETER;
+
+	return H_SUCCESS;
 }
 
 long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
@@ -195,15 +225,43 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	if (!stt)
 		return H_TOO_HARD;
 
+	if (stt->virtmode_only)
+		return H_TOO_HARD;
+
 	tces = (void *) get_real_address(vcpu, tce_list, false, NULL, NULL);
 	if (!tces)
 		return H_TOO_HARD;
 
 	/* Emulated IO */
-	for (i = 0; (i < npages) && !ret; ++i, ioba += IOMMU_PAGE_SIZE)
-		ret = emulated_h_put_tce(stt, ioba, tces[i]);
+	if (!stt->tbl) {
+		for (i = 0; (i < npages) && !ret; ++i, ioba += IOMMU_PAGE_SIZE)
+			ret = emulated_h_put_tce(stt, ioba, tces[i]);
+
+		return ret;
+	}
+
+	/* VFIO IOMMU */
+	for (i = 0; (i < npages) && !ret; ++i, ioba += IOMMU_PAGE_SIZE) {
+		unsigned long hpa, pg_size = 0;
+		pte_t pte = 0;
+
+		hpa = get_real_address(vcpu, tces[i], tces[i] & TCE_PCI_WRITE,
+				&pte, &pg_size);
+		if (!hpa)
+			return H_TOO_HARD;
+
+		ret = iommu_put_tce_real_mode(stt->tbl,
+				ioba, hpa, pte, pg_size);
+	}
+	iommu_flush_tce(stt->tbl);
+
+	if (ret == -EAGAIN)
+		return H_TOO_HARD;
+
+	if (ret < 0)
+		return H_PARAMETER;
 
-	return ret;
+	return H_SUCCESS;
 }
 
 long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
@@ -218,11 +276,28 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (!stt)
 		return H_TOO_HARD;
 
+	if (stt->virtmode_only)
+		return H_TOO_HARD;
+
 	/* Emulated IO */
-	for (i = 0; (i < npages) && !ret; ++i, ioba += IOMMU_PAGE_SIZE)
-		ret = emulated_h_put_tce(stt, ioba, tce_value);
+	if (!stt->tbl) {
+		for (i = 0; (i < npages) && !ret; ++i, ioba += IOMMU_PAGE_SIZE)
+			ret = emulated_h_put_tce(stt, ioba, tce_value);
+
+		return ret;
+	}
+
+	/* VFIO IOMMU */
+	ret = iommu_clear_tce_real_mode(stt->tbl, ioba, tce_value, npages);
+	iommu_flush_tce(stt->tbl);
+
+	if (ret == -EAGAIN)
+		return H_TOO_HARD;
+
+	if (ret < 0)
+		return H_PARAMETER;
 
-	return ret;
+	return H_SUCCESS;
 }
 
 /*
@@ -232,8 +307,42 @@ extern long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
 		unsigned long liobn, unsigned long ioba,
 		unsigned long tce)
 {
-	/* At the moment emulated IO is handled the same way */
-	return kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
+	long ret;
+	struct kvmppc_spapr_tce_table *stt;
+
+	stt = find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, pass the request to userspace */
+	if (!stt)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	if (!stt->tbl)
+		return emulated_h_put_tce(stt, ioba, tce);
+
+	/* VFIO IOMMU */
+	if (tce & (TCE_PCI_READ | TCE_PCI_WRITE)) {
+		unsigned long hpa, pg_size = 0;
+		pte_t pte;
+
+		hpa = get_virt_address(vcpu, tce, tce & TCE_PCI_WRITE,
+				&pte, &pg_size);
+		if (!hpa)
+			return H_TOO_HARD;
+
+		ret = iommu_put_tce_virt_mode(stt->tbl, ioba, hpa,
+				pte, pg_size);
+	} else {
+		ret = iommu_clear_tce_user_mode(stt->tbl, ioba, 0, 1);
+	}
+	iommu_flush_tce(stt->tbl);
+
+	if (ret == -EAGAIN)
+		return H_TOO_HARD;
+
+	if (ret < 0)
+		return H_PARAMETER;
+
+	return H_SUCCESS;
 }
 
 extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
@@ -254,16 +363,65 @@ extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 		return H_TOO_HARD;
 
 	/* Emulated IO */
-	for (i = 0; (i < npages) && !ret; ++i, ioba += IOMMU_PAGE_SIZE)
-		ret = emulated_h_put_tce(stt, ioba, tces[i]);
+	if (!stt->tbl) {
+		for (i = 0; (i < npages) && !ret; ++i, ioba += IOMMU_PAGE_SIZE)
+			ret = emulated_h_put_tce(stt, ioba, tces[i]);
+
+		return ret;
+	}
+
+	/* VFIO IOMMU */
+	for (i = 0; (i < npages) && !ret; ++i, ioba += IOMMU_PAGE_SIZE) {
+		unsigned long hpa, pg_size = 0;
+		pte_t pte;
+
+		hpa = get_virt_address(vcpu, tces[i], tces[i] & TCE_PCI_WRITE,
+				&pte, &pg_size);
+		if (!hpa)
+			return H_TOO_HARD;
+
+		ret = iommu_put_tce_virt_mode(stt->tbl,
+				ioba, hpa, pte, pg_size);
+	}
+	iommu_flush_tce(stt->tbl);
+
+	if (ret == -EAGAIN)
+		return H_TOO_HARD;
+
+	if (ret < 0)
+		return H_PARAMETER;
 
-	return ret;
+	return H_SUCCESS;
 }
 
 extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
 		unsigned long liobn, unsigned long ioba,
 		unsigned long tce_value, unsigned long npages)
 {
-	/* At the moment emulated IO is handled the same way */
-	return kvmppc_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages);
+	struct kvmppc_spapr_tce_table *stt;
+	long ret = 0, i;
+
+	stt = find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, pass the request to userspace */
+	if (!stt)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	if (!stt->tbl) {
+		for (i = 0; (i < npages) && !ret; ++i, ioba += IOMMU_PAGE_SIZE)
+			ret = emulated_h_put_tce(stt, ioba, tce_value);
+
+		return ret;
+	}
+
+	/* VFIO IOMMU */
+	ret = iommu_clear_tce_user_mode(stt->tbl, ioba, tce_value, npages);
+
+	if (ret == -EAGAIN)
+		return H_TOO_HARD;
+
+	if (ret < 0)
+		return H_PARAMETER;
+
+	return H_SUCCESS;
 }
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 95614c7..beceb90 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -934,6 +934,17 @@ long kvm_arch_vm_ioctl(struct file *filp,
 		r = kvm_vm_ioctl_create_spapr_tce(kvm, &create_tce);
 		goto out;
 	}
+	case KVM_CREATE_SPAPR_TCE_IOMMU: {
+		struct kvm_create_spapr_tce_iommu create_tce_iommu;
+		struct kvm *kvm = filp->private_data;
+
+		r = -EFAULT;
+		if (copy_from_user(&create_tce_iommu, argp,
+				sizeof(create_tce_iommu)))
+			goto out;
+		r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm, &create_tce_iommu);
+		goto out;
+	}
 #endif /* CONFIG_PPC_BOOK3S_64 */
 
 #ifdef CONFIG_KVM_BOOK3S_64_HV
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 26e2b271..3727ea6 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -863,6 +863,7 @@ struct kvm_s390_ucas_mapping {
 #define KVM_ALLOCATE_RMA	  _IOR(KVMIO,  0xa9, struct kvm_allocate_rma)
 /* Available with KVM_CAP_PPC_HTAB_FD */
 #define KVM_PPC_GET_HTAB_FD	  _IOW(KVMIO,  0xaa, struct kvm_get_htab_fd)
+#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xaf, struct kvm_create_spapr_tce_iommu)
 
 /*
  * ioctls for vcpu fds
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 4/4] vfio powerpc: added real mode support
@ 2013-02-11 12:12   ` aik
  0 siblings, 0 replies; 27+ messages in thread
From: aik @ 2013-02-11 12:12 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alexey Kardashevskiy, Paul Mackerras, Alexander Graf,
	Michael Ellerman, linuxppc-dev, linux-kernel, kvm-ppc, kvm,
	David Gibson

From: Alexey Kardashevskiy <aik@ozlabs.ru>

The patch allows the host kernel to handle H_PUT_TCE requests without
involving QEMU, which should save the time spent switching from the
kernel to QEMU and back.

The patch adds an IOMMU ID parameter to the KVM_CAP_SPAPR_TCE ioctl;
QEMU needs to be updated to support it.

At the moment H_PUT_TCE is processed in virtual mode, as the page to
be mapped may not be present in RAM and paging may be involved, which
can only be done from virtual mode.

Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb Ethernet card).

Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/iommu.h    |   10 ++
 arch/powerpc/include/asm/kvm_host.h |    2 +
 arch/powerpc/include/asm/kvm_ppc.h  |    2 +
 arch/powerpc/include/uapi/asm/kvm.h |    8 ++
 arch/powerpc/kernel/iommu.c         |  253 +++++++++++++++++++++++++++++++++--
 arch/powerpc/kvm/book3s_64_vio.c    |   55 +++++++-
 arch/powerpc/kvm/book3s_64_vio_hv.c |  186 +++++++++++++++++++++++--
 arch/powerpc/kvm/powerpc.c          |   11 ++
 include/uapi/linux/kvm.h            |    1 +
 9 files changed, 503 insertions(+), 25 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 900294b..4a479e6 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -78,6 +78,7 @@ struct iommu_table {
 	unsigned long *it_map;       /* A simple allocation bitmap for now */
 #ifdef CONFIG_IOMMU_API
 	struct iommu_group *it_group;
+	struct list_head it_hugepages;
 #endif
 };
 
@@ -158,6 +159,15 @@ extern long iommu_clear_tce_user_mode(struct iommu_table *tbl,
 		unsigned long npages);
 extern long iommu_put_tce_user_mode(struct iommu_table *tbl,
 		unsigned long ioba, unsigned long tce);
+extern long iommu_put_tce_real_mode(struct iommu_table *tbl,
+		unsigned long ioba, unsigned long tce,
+		pte_t pte, unsigned long pg_size);
+extern long iommu_clear_tce_real_mode(struct iommu_table *tbl,
+		unsigned long ioba, unsigned long tce_value,
+		unsigned long npages);
+extern long iommu_put_tce_virt_mode(struct iommu_table *tbl,
+		unsigned long ioba, unsigned long tce,
+		pte_t pte, unsigned long pg_size);
 
 extern void iommu_flush_tce(struct iommu_table *tbl);
 extern long iommu_lock_table(struct iommu_table *tbl, bool lock);
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index ca9bf45..6fb22f8 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -178,6 +178,8 @@ struct kvmppc_spapr_tce_table {
 	struct kvm *kvm;
 	u64 liobn;
 	u32 window_size;
+	bool virtmode_only;
+	struct iommu_table *tbl;
 	struct page *pages[0];
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 76d133b..45c2a6c 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -134,6 +134,8 @@ extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
 extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce *args);
+extern long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
+				struct kvm_create_spapr_tce_iommu *args);
 extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 			     unsigned long ioba, unsigned long tce);
 extern long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 2fba8a6..9578696 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -291,6 +291,14 @@ struct kvm_create_spapr_tce {
 	__u32 window_size;
 };
 
+/* for KVM_CAP_SPAPR_TCE_IOMMU */
+struct kvm_create_spapr_tce_iommu {
+	__u64 liobn;
+	__u32 iommu_id;
+#define SPAPR_TCE_PUT_TCE_VIRTMODE_ONLY	1 /* for debug purposes */
+	__u32 flags;
+};
+
 /* for KVM_ALLOCATE_RMA */
 struct kvm_allocate_rma {
 	__u64 rma_size;
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index b4fdabc..acb9cdc 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -47,6 +47,8 @@
 #include <asm/fadump.h>
 #include <asm/vio.h>
 #include <asm/tce.h>
+#include <asm/kvm_book3s_64.h>
+#include <asm/page.h>
 
 #define DBG(...)
 
@@ -727,6 +729,7 @@ void iommu_register_group(struct iommu_table * tbl,
 		return;
 	}
 	tbl->it_group = grp;
+	INIT_LIST_HEAD(&tbl->it_hugepages);
 	iommu_group_set_iommudata(grp, tbl, group_release);
 	iommu_group_set_name(grp, kasprintf(GFP_KERNEL, "domain%d-pe%lx",
 			domain_number, pe_num));
@@ -906,6 +909,83 @@ void kvm_iommu_unmap_pages(struct kvm *kvm, struct kvm_memory_slot *slot)
 {
 }
 
+/*
+ * The KVM guest can be backed with 16MB pages (qemu switch
+ * -mem-path /var/lib/hugetlbfs/global/pagesize-16MB/).
+ * In this case, we cannot do page counting from the real mode
+ * as the compound pages are used - they are linked in a list
+ * with pointers as virtual addresses which are inaccessible
+ * in real mode.
+ *
+ * The code below keeps a 16MB pages list and uses page struct
+ * in real mode if it is already locked in RAM and inserted into
+ * the list or switches to the virtual mode where it can be
+ * handled in a usual manner.
+ */
+struct iommu_kvmppc_hugepages {
+	struct list_head list;
+	pte_t pte;		/* Huge page PTE */
+	unsigned long pa;	/* Base phys address used as a real TCE */
+	struct page *page;	/* page struct of the very first subpage */
+	unsigned long size;	/* Huge page size (always 16MB at the moment) */
+	bool dirty;		/* Dirty bit */
+};
+
+static struct iommu_kvmppc_hugepages *find_hp_by_pte(struct iommu_table *tbl,
+		pte_t pte)
+{
+	struct iommu_kvmppc_hugepages *hp;
+
+	list_for_each_entry(hp, &tbl->it_hugepages, list) {
+		if (hp->pte == pte)
+			return hp;
+	}
+
+	return NULL;
+}
+
+static struct iommu_kvmppc_hugepages *find_hp_by_pa(struct iommu_table *tbl,
+		unsigned long pa)
+{
+	struct iommu_kvmppc_hugepages *hp;
+
+	list_for_each_entry(hp, &tbl->it_hugepages, list) {
+		if ((hp->pa <= pa) && (pa < hp->pa + hp->size))
+			return hp;
+	}
+
+	return NULL;
+}
+
+static struct iommu_kvmppc_hugepages *add_hp(struct iommu_table *tbl,
+		pte_t pte, unsigned long va, unsigned long pg_size)
+{
+	int ret;
+	struct iommu_kvmppc_hugepages *hp;
+
+	hp = kzalloc(sizeof(*hp), GFP_KERNEL);
+	if (!hp)
+		return NULL;
+
+	hp->pte = pte;
+	va = va & ~(pg_size - 1);
+	ret = get_user_pages_fast(va, 1, true/*write*/, &hp->page);
+	if ((ret != 1) || !hp->page) {
+		kfree(hp);
+		return NULL;
+	}
+#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
+#error TODO: fix to avoid page_address() here
+#endif
+	hp->pa = __pa((unsigned long) page_address(hp->page));
+
+	hp->size = pg_size;
+
+	list_add(&hp->list, &tbl->it_hugepages);
+
+	return hp;
+}
+
 static enum dma_data_direction tce_direction(unsigned long tce)
 {
 	if ((tce & TCE_PCI_READ) && (tce & TCE_PCI_WRITE))
@@ -974,14 +1054,16 @@ static long tce_put_param_check(struct iommu_table *tbl,
 	return 0;
 }
 
-static long clear_tce(struct iommu_table *tbl,
+static long clear_tce(struct iommu_table *tbl, bool realmode,
 		unsigned long entry, unsigned long pages)
 {
+	long ret = 0;
 	unsigned long oldtce;
 	struct page *page;
 	struct iommu_pool *pool;
+	struct iommu_kvmppc_hugepages *hp;
 
-	for ( ; pages; --pages, ++entry) {
+	for ( ; pages && !ret; --pages, ++entry) {
 		pool = get_pool(tbl, entry);
 		spin_lock(&(pool->lock));
 
@@ -989,12 +1071,32 @@ static long clear_tce(struct iommu_table *tbl,
 		if (oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)) {
 			ppc_md.tce_free(tbl, entry, 1);
 
-			page = pfn_to_page(oldtce >> PAGE_SHIFT);
-			WARN_ON(!page);
-			if (page) {
+			/* Release of huge pages is postponed till KVM's exit */
+			hp = find_hp_by_pa(tbl, oldtce);
+			if (hp) {
 				if (oldtce & TCE_PCI_WRITE)
-					SetPageDirty(page);
-				put_page(page);
+					hp->dirty = true;
+			} else if (realmode) {
+				/* Release a small page in real mode */
+				page = vmemmap_pfn_to_page(
+						oldtce >> PAGE_SHIFT);
+				if (page) {
+					if (oldtce & TCE_PCI_WRITE)
+						SetPageDirty(page);
+					ret = vmemmap_put_page(page);
+				} else {
+					/* Retry in virtual mode */
+					ret = -EAGAIN;
+				}
+			} else {
+				/* Release a small page in virtual mode */
+				page = pfn_to_page(oldtce >> PAGE_SHIFT);
+				WARN_ON(!page);
+				if (page) {
+					if (oldtce & TCE_PCI_WRITE)
+						SetPageDirty(page);
+					put_page(page);
+				}
 			}
 		}
 		spin_unlock(&(pool->lock));
@@ -1011,7 +1113,7 @@ long iommu_clear_tce_user_mode(struct iommu_table *tbl, unsigned long ioba,
 
 	ret = tce_clear_param_check(tbl, ioba, tce_value, npages);
 	if (!ret)
-		ret = clear_tce(tbl, entry, npages);
+		ret = clear_tce(tbl, false, entry, npages);
 
 	if (ret < 0)
 		pr_err("iommu_tce: %s failed ioba=%lx, tce_value=%lx ret=%ld\n",
@@ -1021,6 +1123,24 @@ long iommu_clear_tce_user_mode(struct iommu_table *tbl, unsigned long ioba,
 }
 EXPORT_SYMBOL_GPL(iommu_clear_tce_user_mode);
 
+long iommu_clear_tce_real_mode(struct iommu_table *tbl, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	long ret;
+	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
+
+	ret = tce_clear_param_check(tbl, ioba, tce_value, npages);
+	if (!ret)
+		ret = clear_tce(tbl, true, entry, npages);
+
+	if (ret < 0)
+		pr_err("iommu_tce: %s failed ioba=%lx, tce_value=%lx ret=%ld\n",
+				__func__, ioba, tce_value, ret);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_clear_tce_real_mode);
+
 /* hwaddr is a virtual address here, tce_build converts it to physical */
 static long do_tce_build(struct iommu_table *tbl, unsigned long entry,
 		unsigned long hwaddr, enum dma_data_direction direction)
@@ -1088,6 +1208,112 @@ long iommu_put_tce_user_mode(struct iommu_table *tbl, unsigned long ioba,
 }
 EXPORT_SYMBOL_GPL(iommu_put_tce_user_mode);
 
+static long put_tce_virt_mode(struct iommu_table *tbl,
+		unsigned long entry, unsigned long tce,
+		pte_t pte, unsigned long pg_size)
+{
+	struct iommu_kvmppc_hugepages *hp;
+	enum dma_data_direction direction = tce_direction(tce);
+
+	/* Small page size case, easy to handle... */
+	if (pg_size <= PAGE_SIZE)
+		return put_tce_user_mode(tbl, entry, tce);
+
+	/*
+	 * Hugepages case - manage the hugepage list.
+	 * find_hp_by_pte() may find a huge page if called
+	 * from h_put_tce_indirect call.
+	 */
+	hp = find_hp_by_pte(tbl, pte);
+	if (!hp) {
+		/* This is the first time usage of this huge page */
+		hp = add_hp(tbl, pte, tce, pg_size);
+		if (!hp)
+			return -EFAULT;
+	}
+
+	tce = (unsigned long) __va(hp->pa) + (tce & (pg_size - 1));
+
+	return do_tce_build(tbl, entry, tce, direction);
+}
+
+long iommu_put_tce_virt_mode(struct iommu_table *tbl,
+		unsigned long ioba, unsigned long tce,
+		pte_t pte, unsigned long pg_size)
+{
+	long ret;
+	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
+
+	ret = tce_put_param_check(tbl, ioba, tce);
+	if (!ret)
+		ret = put_tce_virt_mode(tbl, entry, tce, pte, pg_size);
+
+	if (ret < 0)
+		pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%ld\n",
+				__func__, ioba, tce, ret);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_put_tce_virt_mode);
+
+static long put_tce_real_mode(struct iommu_table *tbl,
+		unsigned long entry, unsigned long tce,
+		pte_t pte, unsigned long pg_size)
+{
+	long ret;
+	struct page *page = NULL;
+	struct iommu_kvmppc_hugepages *hp = NULL;
+	enum dma_data_direction direction = tce_direction(tce);
+
+	/* This is a huge page; we continue only if it is already in the list */
+	if (pg_size > PAGE_SIZE) {
+		hp = find_hp_by_pte(tbl, pte);
+
+		/* Go to virt mode to add a hugepage to the list if not found */
+		if (!hp)
+			return -EAGAIN;
+
+		/* tce_build accepts virtual addresses */
+		return do_tce_build(tbl, entry, (unsigned long) __va(tce),
+				direction);
+	}
+
+	/* Small page case, find page struct to increment a counter */
+	page = vmemmap_pfn_to_page(tce >> PAGE_SHIFT);
+	if (!page)
+		return -EAGAIN;
+
+	ret = vmemmap_get_page(page);
+	if (ret)
+		return ret;
+
+	/* tce_build accepts virtual addresses */
+	ret = do_tce_build(tbl, entry, (unsigned long) __va(tce), direction);
+	if (ret)
+		vmemmap_put_page(page);
+
+	return ret;
+}
+
+long iommu_put_tce_real_mode(struct iommu_table *tbl,
+		unsigned long ioba, unsigned long tce,
+		pte_t pte, unsigned long pg_size)
+{
+	long ret;
+	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
+
+	ret = tce_put_param_check(tbl, ioba, tce);
+	if (!ret)
+		ret = put_tce_real_mode(tbl, entry, tce, pte, pg_size);
+
+	if (ret < 0)
+		pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%ld\n",
+				__func__, ioba, tce, ret);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_put_tce_real_mode);
+
 /*
  * Helpers to do locked pages accounting.
  * Called from ioctl so down_write_trylock is not necessary.
@@ -1111,6 +1337,7 @@ long iommu_lock_table(struct iommu_table *tbl, bool lock)
 {
 	unsigned long sz = (tbl->it_size + 7) >> 3;
 	unsigned long locked, lock_limit;
+	struct iommu_kvmppc_hugepages *hp, *tmp;
 
 	if (lock) {
 		/*
@@ -1139,9 +1366,17 @@ long iommu_lock_table(struct iommu_table *tbl, bool lock)
 	}
 
 	/* Clear TCE table */
-	clear_tce(tbl, tbl->it_offset, tbl->it_size);
+	clear_tce(tbl, false, tbl->it_offset, tbl->it_size);
 
 	if (!lock) {
+		list_for_each_entry_safe(hp, tmp, &tbl->it_hugepages, list) {
+			list_del(&hp->list);
+			if (hp->dirty)
+				SetPageDirty(hp->page);
+			put_page(hp->page);
+			kfree(hp);
+		}
+
 		lock_acct(-tbl->it_size);
 		memset(tbl->it_map, 0, sz);
 
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 72ffc89..c3c29a0 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -26,6 +26,8 @@
 #include <linux/hugetlb.h>
 #include <linux/list.h>
 #include <linux/anon_inodes.h>
+#include <linux/pci.h>
+#include <linux/iommu.h>
 
 #include <asm/tlbflush.h>
 #include <asm/kvm_ppc.h>
@@ -36,6 +38,7 @@
 #include <asm/ppc-opcode.h>
 #include <asm/kvm_host.h>
 #include <asm/udbg.h>
+#include <asm/iommu.h>
 
 #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
 
@@ -52,8 +55,10 @@ static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
 
 	mutex_lock(&kvm->lock);
 	list_del(&stt->list);
-	for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
-		__free_page(stt->pages[i]);
+	if (!stt->tbl) {
+		for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
+			__free_page(stt->pages[i]);
+	}
 	kfree(stt);
 	mutex_unlock(&kvm->lock);
 
@@ -148,3 +153,49 @@ fail:
 	}
 	return ret;
 }
+
+long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
+		struct kvm_create_spapr_tce_iommu *args)
+{
+	struct kvmppc_spapr_tce_table *stt = NULL;
+	struct pci_dev *pdev = NULL;
+
+	/* Check this LIOBN hasn't been previously allocated */
+	list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
+		if (stt->liobn == args->liobn)
+			return -EBUSY;
+	}
+
+	stt = kzalloc(sizeof(*stt), GFP_KERNEL);
+	if (!stt)
+		return -ENOMEM;
+
+	stt->liobn = args->liobn;
+	stt->kvm = kvm;
+	stt->virtmode_only = !!(args->flags & SPAPR_TCE_PUT_TCE_VIRTMODE_ONLY);
+
+	/* Find an IOMMU table for the given ID */
+	for_each_pci_dev(pdev) {
+		struct iommu_table *tbl;
+
+		tbl = get_iommu_table_base(&pdev->dev);
+		if (!tbl)
+			continue;
+		if (iommu_group_id(tbl->it_group) != args->iommu_id)
+			continue;
+
+		stt->tbl = tbl;
+		pr_info("LIOBN=%llX hooked to IOMMU %d, virtmode_only=%u\n",
+				stt->liobn, args->iommu_id, stt->virtmode_only);
+		break;
+	}
+
+	kvm_get_kvm(kvm);
+
+	mutex_lock(&kvm->lock);
+	list_add(&stt->list, &kvm->arch.spapr_tce_tables);
+
+	mutex_unlock(&kvm->lock);
+
+	return 0;
+}
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index c38edcd..b2aa957 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -171,6 +171,7 @@ static long emulated_h_put_tce(struct kvmppc_spapr_tce_table *stt,
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba, unsigned long tce)
 {
+	long ret;
 	struct kvmppc_spapr_tce_table *stt;
 
 	stt = find_tce_table(vcpu, liobn);
@@ -178,8 +179,37 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (!stt)
 		return H_TOO_HARD;
 
+	if (stt->virtmode_only)
+		return H_TOO_HARD;
+
 	/* Emulated IO */
-	return emulated_h_put_tce(stt, ioba, tce);
+	if (!stt->tbl)
+		return emulated_h_put_tce(stt, ioba, tce);
+
+	/* VFIO IOMMU */
+	if (tce & (TCE_PCI_READ | TCE_PCI_WRITE)) {
+		unsigned long hpa, pg_size = 0;
+		pte_t pte = 0;
+
+		hpa = get_real_address(vcpu, tce, tce & TCE_PCI_WRITE,
+				&pte, &pg_size);
+		if (!hpa)
+			return H_TOO_HARD;
+
+		ret = iommu_put_tce_real_mode(stt->tbl, ioba, hpa,
+				pte, pg_size);
+	} else {
+		ret = iommu_clear_tce_real_mode(stt->tbl, ioba, 0, 1);
+	}
+	iommu_flush_tce(stt->tbl);
+
+	if (ret == -EAGAIN)
+		return H_TOO_HARD;
+
+	if (ret < 0)
+		return H_PARAMETER;
+
+	return H_SUCCESS;
 }
 
 long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
@@ -195,15 +225,43 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	if (!stt)
 		return H_TOO_HARD;
 
+	if (stt->virtmode_only)
+		return H_TOO_HARD;
+
 	tces = (void *) get_real_address(vcpu, tce_list, false, NULL, NULL);
 	if (!tces)
 		return H_TOO_HARD;
 
 	/* Emulated IO */
-	for (i = 0; (i < npages) && !ret; ++i, ioba += IOMMU_PAGE_SIZE)
-		ret = emulated_h_put_tce(stt, ioba, tces[i]);
+	if (!stt->tbl) {
+		for (i = 0; (i < npages) && !ret; ++i, ioba += IOMMU_PAGE_SIZE)
+			ret = emulated_h_put_tce(stt, ioba, tces[i]);
+
+		return ret;
+	}
+
+	/* VFIO IOMMU */
+	for (i = 0; (i < npages) && !ret; ++i, ioba += IOMMU_PAGE_SIZE) {
+		unsigned long hpa, pg_size = 0;
+		pte_t pte = 0;
+
+		hpa = get_real_address(vcpu, tces[i], tces[i] & TCE_PCI_WRITE,
+				&pte, &pg_size);
+		if (!hpa)
+			return H_TOO_HARD;
+
+		ret = iommu_put_tce_real_mode(stt->tbl,
+				ioba, hpa, pte, pg_size);
+	}
+	iommu_flush_tce(stt->tbl);
+
+	if (ret == -EAGAIN)
+		return H_TOO_HARD;
+
+	if (ret < 0)
+		return H_PARAMETER;
 
-	return ret;
+	return H_SUCCESS;
 }
 
 long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
@@ -218,11 +276,28 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (!stt)
 		return H_TOO_HARD;
 
+	if (stt->virtmode_only)
+		return H_TOO_HARD;
+
 	/* Emulated IO */
-	for (i = 0; (i < npages) && !ret; ++i, ioba += IOMMU_PAGE_SIZE)
-		ret = emulated_h_put_tce(stt, ioba, tce_value);
+	if (!stt->tbl) {
+		for (i = 0; (i < npages) && !ret; ++i, ioba += IOMMU_PAGE_SIZE)
+			ret = emulated_h_put_tce(stt, ioba, tce_value);
+
+		return ret;
+	}
+
+	/* VFIO IOMMU */
+	ret = iommu_clear_tce_real_mode(stt->tbl, ioba, tce_value, npages);
+	iommu_flush_tce(stt->tbl);
+
+	if (ret == -EAGAIN)
+		return H_TOO_HARD;
+
+	if (ret < 0)
+		return H_PARAMETER;
 
-	return ret;
+	return H_SUCCESS;
 }
 
 /*
@@ -232,8 +307,42 @@ extern long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
 		unsigned long liobn, unsigned long ioba,
 		unsigned long tce)
 {
-	/* At the moment emulated IO is handled the same way */
-	return kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
+	long ret;
+	struct kvmppc_spapr_tce_table *stt;
+
+	stt = find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, put it to userspace */
+	if (!stt)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	if (!stt->tbl)
+		return emulated_h_put_tce(stt, ioba, tce);
+
+	/* VFIO IOMMU */
+	if (tce & (TCE_PCI_READ | TCE_PCI_WRITE)) {
+		unsigned long hpa, pg_size = 0;
+		pte_t pte;
+
+		hpa = get_virt_address(vcpu, tce, tce & TCE_PCI_WRITE,
+				&pte, &pg_size);
+		if (!hpa)
+			return H_TOO_HARD;
+
+		ret = iommu_put_tce_virt_mode(stt->tbl, ioba, hpa,
+				pte, pg_size);
+	} else {
+		ret = iommu_clear_tce_user_mode(stt->tbl, ioba, 0, 1);
+	}
+	iommu_flush_tce(stt->tbl);
+
+	if (ret == -EAGAIN)
+		return H_TOO_HARD;
+
+	if (ret < 0)
+		return H_PARAMETER;
+
+	return H_SUCCESS;
 }
 
 extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
@@ -254,16 +363,65 @@ extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 		return H_TOO_HARD;
 
 	/* Emulated IO */
-	for (i = 0; (i < npages) && !ret; ++i, ioba += IOMMU_PAGE_SIZE)
-		ret = emulated_h_put_tce(stt, ioba, tces[i]);
+	if (!stt->tbl) {
+		for (i = 0; (i < npages) && !ret; ++i, ioba += IOMMU_PAGE_SIZE)
+			ret = emulated_h_put_tce(stt, ioba, tces[i]);
+
+		return ret;
+	}
+
+	/* VFIO IOMMU */
+	for (i = 0; (i < npages) && !ret; ++i, ioba += IOMMU_PAGE_SIZE) {
+		unsigned long hpa, pg_size = 0;
+		pte_t pte;
+
+		hpa = get_virt_address(vcpu, tces[i], tces[i] & TCE_PCI_WRITE,
+				&pte, &pg_size);
+		if (!hpa)
+			return H_TOO_HARD;
+
+		ret = iommu_put_tce_virt_mode(stt->tbl,
+				ioba, hpa, pte, pg_size);
+	}
+	iommu_flush_tce(stt->tbl);
+
+	if (ret == -EAGAIN)
+		return H_TOO_HARD;
+
+	if (ret < 0)
+		return H_PARAMETER;
 
-	return ret;
+	return H_SUCCESS;
 }
 
 extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
 		unsigned long liobn, unsigned long ioba,
 		unsigned long tce_value, unsigned long npages)
 {
-	/* At the moment emulated IO is handled the same way */
-	return kvmppc_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages);
+	struct kvmppc_spapr_tce_table *stt;
+	long ret = 0, i;
+
+	stt = find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, put it to userspace */
+	if (!stt)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	if (!stt->tbl) {
+		for (i = 0; (i < npages) && !ret; ++i, ioba += IOMMU_PAGE_SIZE)
+			ret = emulated_h_put_tce(stt, ioba, tce_value);
+
+		return ret;
+	}
+
+	/* VFIO IOMMU */
+	ret = iommu_clear_tce_user_mode(stt->tbl, ioba, tce_value, npages);
+
+	if (ret == -EAGAIN)
+		return H_TOO_HARD;
+
+	if (ret < 0)
+		return H_PARAMETER;
+
+	return H_SUCCESS;
 }
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 95614c7..beceb90 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -934,6 +934,17 @@ long kvm_arch_vm_ioctl(struct file *filp,
 		r = kvm_vm_ioctl_create_spapr_tce(kvm, &create_tce);
 		goto out;
 	}
+	case KVM_CREATE_SPAPR_TCE_IOMMU: {
+		struct kvm_create_spapr_tce_iommu create_tce_iommu;
+		struct kvm *kvm = filp->private_data;
+
+		r = -EFAULT;
+		if (copy_from_user(&create_tce_iommu, argp,
+				sizeof(create_tce_iommu)))
+			goto out;
+		r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm, &create_tce_iommu);
+		goto out;
+	}
 #endif /* CONFIG_PPC_BOOK3S_64 */
 
 #ifdef CONFIG_KVM_BOOK3S_64_HV
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 26e2b271..3727ea6 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -863,6 +863,7 @@ struct kvm_s390_ucas_mapping {
 #define KVM_ALLOCATE_RMA	  _IOR(KVMIO,  0xa9, struct kvm_allocate_rma)
 /* Available with KVM_CAP_PPC_HTAB_FD */
 #define KVM_PPC_GET_HTAB_FD	  _IOW(KVMIO,  0xaa, struct kvm_get_htab_fd)
+#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xaf, struct kvm_create_spapr_tce_iommu)
 
 /*
  * ioctls for vcpu fds
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/4] powerpc: lookup_linux_pte has been made public
  2013-02-11 12:12   ` aik
@ 2013-02-15  3:13     ` Paul Mackerras
  -1 siblings, 0 replies; 27+ messages in thread
From: Paul Mackerras @ 2013-02-15  3:13 UTC (permalink / raw)
  To: aik
  Cc: Benjamin Herrenschmidt, Alexander Graf, Michael Ellerman,
	linuxppc-dev, linux-kernel, kvm-ppc, kvm, David Gibson

On Mon, Feb 11, 2013 at 11:12:40PM +1100, aik@ozlabs.ru wrote:
> From: Alexey Kardashevskiy <aik@ozlabs.ru>
> 
> The lookup_linux_pte() function returns a linux PTE which
> is required to convert KVM guest physical address into host real
> address in real mode.
> 
> This convertion will be used by upcoming support of H_PUT_TCE_INDIRECT
> as TCE list address comes from the guest directly so it is a guest
> physical.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Cc: David Gibson <david@gibson.dropbear.id.au>
> ---
>  arch/powerpc/include/asm/pgtable-ppc64.h |    3 +++
>  arch/powerpc/kvm/book3s_hv_rm_mmu.c      |    4 ++--
>  2 files changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
> index 0182c20..ddcc898 100644
> --- a/arch/powerpc/include/asm/pgtable-ppc64.h
> +++ b/arch/powerpc/include/asm/pgtable-ppc64.h
> @@ -377,6 +377,9 @@ static inline pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
>  }
>  #endif /* !CONFIG_HUGETLB_PAGE */
>  
> +pte_t lookup_linux_pte(pgd_t *pgdir, unsigned long hva,
> +		int writing, unsigned long *pte_sizep);
> +

This seems a slightly odd place to put the declaration of a function
which is defined in the KVM code.  kvm-ppc.h might be a better place.
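
For illustration, the suggested move would simply carry the prototype
over unchanged; a sketch, with the destination header assumed:

	/* arch/powerpc/include/asm/kvm_ppc.h */
	extern pte_t lookup_linux_pte(pgd_t *pgdir, unsigned long hva,
			int writing, unsigned long *pte_sizep);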

Paul.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/4] powerpc kvm: added multiple TCEs requests support
  2013-02-11 12:12   ` aik
@ 2013-02-15  3:24     ` Paul Mackerras
  -1 siblings, 0 replies; 27+ messages in thread
From: Paul Mackerras @ 2013-02-15  3:24 UTC (permalink / raw)
  To: aik
  Cc: Benjamin Herrenschmidt, Alexander Graf, Michael Ellerman,
	linuxppc-dev, linux-kernel, kvm-ppc, kvm, David Gibson

On Mon, Feb 11, 2013 at 11:12:41PM +1100, aik@ozlabs.ru wrote:

> +static long emulated_h_put_tce(struct kvmppc_spapr_tce_table *stt,
> +		unsigned long ioba, unsigned long tce)
> +{
> +	unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
> +	struct page *page;
> +	u64 *tbl;
> +
> +	/* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p  window_size=0x%x\n", */
> +	/* 	    liobn, stt, stt->window_size); */
> +	if (ioba >= stt->window_size) {
> +		pr_err("%s failed on ioba=%lx\n", __func__, ioba);

Doesn't this give the guest a way to spam the host logs?  And in fact
printk in real mode is potentially problematic.  I would just leave
out this statement.
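
(If a diagnostic is wanted at all, a ratelimited print would at least
bound the spam; this is a sketch, not from the patch, and it is still
unsafe to call from real mode:

	if (ioba >= stt->window_size) {
		pr_err_ratelimited("%s: ioba=%lx out of range\n",
				__func__, ioba);
		return H_PARAMETER;
	}

so dropping the statement entirely, as suggested, is the safer fix.)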

> +		return H_PARAMETER;
> +	}
> +
> +	page = stt->pages[idx / TCES_PER_PAGE];
> +	tbl = (u64 *)page_address(page);

I would like to see an explanation of why we are confident that
page_address() will work correctly in real mode, across all the
combinations of config options that we can have for a ppc64 book3s
kernel.

> +
> +	/* FIXME: Need to validate the TCE itself */
> +	/* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
> +	tbl[idx % TCES_PER_PAGE] = tce;
> +
> +	return H_SUCCESS;
> +}
> +
> +/*
> + * Real mode handlers
>   */
>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		      unsigned long ioba, unsigned long tce)
>  {
> -	struct kvm *kvm = vcpu->kvm;
>  	struct kvmppc_spapr_tce_table *stt;
>  
> -	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
> -	/* 	    liobn, ioba, tce); */
> +	stt = find_tce_table(vcpu, liobn);
> +	/* Didn't find the liobn, put it to userspace */
> +	if (!stt)
> +		return H_TOO_HARD;
> +
> +	/* Emulated IO */
> +	return emulated_h_put_tce(stt, ioba, tce);
> +}
> +
> +long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_list,	unsigned long npages)
> +{
> +	struct kvmppc_spapr_tce_table *stt;
> +	long i, ret = 0;
> +	unsigned long *tces;
> +
> +	stt = find_tce_table(vcpu, liobn);
> +	/* Didn't find the liobn, put it to userspace */
> +	if (!stt)
> +		return H_TOO_HARD;
>  
> -	list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
> -		if (stt->liobn == liobn) {
> -			unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
> -			struct page *page;
> -			u64 *tbl;
> +	tces = (void *) get_real_address(vcpu, tce_list, false, NULL, NULL);
> +	if (!tces)
> +		return H_TOO_HARD;
>  
> -			/* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p  window_size=0x%x\n", */
> -			/* 	    liobn, stt, stt->window_size); */
> -			if (ioba >= stt->window_size)
> -				return H_PARAMETER;
> +	/* Emulated IO */
> +	for (i = 0; (i < npages) && !ret; ++i, ioba += IOMMU_PAGE_SIZE)
> +		ret = emulated_h_put_tce(stt, ioba, tces[i]);

So, tces is a pointer to somewhere inside a real page.  Did we check
somewhere that tces[npages-1] is in the same page as tces[0]?  If so,
I missed it.  If we didn't, then we probably should check and do
something about it.
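
(For illustration, a guard of that kind might look like the sketch
below; the H_PARAMETER return value and the cap on npages are my
assumptions, not taken from the patch:

	/* TCE list entries are u64; require the whole list in one page */
	if (npages > PAGE_SIZE / sizeof(u64) ||
	    (tce_list & ~PAGE_MASK) + npages * sizeof(u64) > PAGE_SIZE)
		return H_PARAMETER;

Anything larger would need the loop to re-translate tce_list page by
page.)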

>  
> -			page = stt->pages[idx / TCES_PER_PAGE];
> -			tbl = (u64 *)page_address(page);
> +	return ret;
> +}
>  
> -			/* FIXME: Need to validate the TCE itself */
> -			/* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
> -			tbl[idx % TCES_PER_PAGE] = tce;
> -			return H_SUCCESS;
> -		}
> -	}
> +long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_value, unsigned long npages)
> +{
> +	struct kvmppc_spapr_tce_table *stt;
> +	long i, ret = 0;
> +
> +	stt = find_tce_table(vcpu, liobn);
> +	/* Didn't find the liobn, put it to userspace */
> +	if (!stt)
> +		return H_TOO_HARD;
>  
> -	/* Didn't find the liobn, punt it to userspace */
> -	return H_TOO_HARD;
> +	/* Emulated IO */
> +	for (i = 0; (i < npages) && !ret; ++i, ioba += IOMMU_PAGE_SIZE)
> +		ret = emulated_h_put_tce(stt, ioba, tce_value);
> +
> +	return ret;
> +}
> +
> +/*
> + * Virtual mode handlers
> + */
> +extern long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce)
> +{
> +	/* At the moment emulated IO is handled the same way */
> +	return kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
> +}
> +
> +extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_list, unsigned long npages)
> +{
> +	struct kvmppc_spapr_tce_table *stt;
> +	unsigned long *tces;
> +	long ret = 0, i;
> +
> +	stt = find_tce_table(vcpu, liobn);
> +	/* Didn't find the liobn, put it to userspace */
> +	if (!stt)
> +		return H_TOO_HARD;
> +
> +	tces = (void *) get_virt_address(vcpu, tce_list, false, NULL, NULL);
> +	if (!tces)
> +		return H_TOO_HARD;
> +
> +	/* Emulated IO */
> +	for (i = 0; (i < npages) && !ret; ++i, ioba += IOMMU_PAGE_SIZE)
> +		ret = emulated_h_put_tce(stt, ioba, tces[i]);

Same comment here about tces[i] overflowing a page boundary.

> +
> +	return ret;
> +}
> +
> +extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_value, unsigned long npages)
> +{
> +	/* At the moment emulated IO is handled the same way */
> +	return kvmppc_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages);
>  }
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 71d0c90..13c8436 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -515,6 +515,29 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
>  					kvmppc_get_gpr(vcpu, 5),
>  					kvmppc_get_gpr(vcpu, 6));
>  		break;
> +	case H_PUT_TCE:
> +		ret = kvmppc_virtmode_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
> +					        kvmppc_get_gpr(vcpu, 5),
> +					        kvmppc_get_gpr(vcpu, 6));
> +		if (ret == H_TOO_HARD)
> +			return RESUME_HOST;
> +		break;
> +	case H_PUT_TCE_INDIRECT:
> +		ret = kvmppc_virtmode_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4),
> +					        kvmppc_get_gpr(vcpu, 5),
> +					        kvmppc_get_gpr(vcpu, 6),
> +						kvmppc_get_gpr(vcpu, 7));
> +		if (ret == H_TOO_HARD)
> +			return RESUME_HOST;
> +		break;
> +	case H_STUFF_TCE:
> +		ret = kvmppc_virtmode_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
> +					        kvmppc_get_gpr(vcpu, 5),
> +					        kvmppc_get_gpr(vcpu, 6),
> +						kvmppc_get_gpr(vcpu, 7));
> +		if (ret == H_TOO_HARD)
> +			return RESUME_HOST;
> +		break;
>  	default:
>  		return RESUME_HOST;
>  	}
[snip]
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index 70739a0..95614c7 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -383,6 +383,9 @@ int kvm_dev_ioctl_check_extension(long ext)
>  		r = 1;
>  		break;
>  #endif
> +	case KVM_CAP_PPC_MULTITCE:
> +		r = 1;
> +		break;
>  	default:
>  		r = 0;
>  		break;
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index e6e5d4b..26e2b271 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -635,6 +635,7 @@ struct kvm_ppc_smmu_info {
>  #define KVM_CAP_IRQFD_RESAMPLE 82
>  #define KVM_CAP_PPC_BOOKE_WATCHDOG 83
>  #define KVM_CAP_PPC_HTAB_FD 84
> +#define KVM_CAP_PPC_MULTITCE 87

The capability should be described in
Documentation/virtual/kvm/api.txt.
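
(By way of example only, such an entry usually takes the shape below;
the section number and wording are placeholders, not from the patch:

	4.NN KVM_CAP_PPC_MULTITCE

	Architectures: powerpc
	Parameters: none

	This capability indicates that the kernel handles the
	H_PUT_TCE_INDIRECT and H_STUFF_TCE hypercalls itself instead
	of exiting to userspace for every TCE update.)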

Paul.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/4] powerpc: preparing to support real mode optimization
  2013-02-11 12:12   ` aik
@ 2013-02-15  3:37     ` Paul Mackerras
  -1 siblings, 0 replies; 27+ messages in thread
From: Paul Mackerras @ 2013-02-15  3:37 UTC (permalink / raw)
  To: aik
  Cc: Alexander Graf, Michael Ellerman, linuxppc-dev, linux-kernel,
	kvm-ppc, kvm, David Gibson

On Mon, Feb 11, 2013 at 11:12:42PM +1100, aik@ozlabs.ru wrote:
> From: Alexey Kardashevskiy <aik@ozlabs.ru>
> 
> The current VFIO-on-POWER implementation supports only user mode
> driven mapping, i.e. QEMU is sending requests to map/unmap pages.
> However this approach is really slow in really fast hardware so
> it is better to be moved to the real mode.
> 
> The patch adds an API to increment/decrement page counter as
> get_user_pages API used for user mode mapping does not work
> in the real mode.
> 
> CONFIG_SPARSEMEM_VMEMMAP and CONFIG_FLATMEM are supported.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Cc: David Gibson <david@gibson.dropbear.id.au>
> ---

The names are slightly odd, in that they include "vmemmap_" but exist
and work in the flatmem case as well.  Apart from that...

Reviewed-by: Paul Mackerras <paulus@samba.org>

Paul.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 4/4] vfio powerpc: added real mode support
  2013-02-11 12:12   ` aik
@ 2013-02-15  3:54     ` Paul Mackerras
  -1 siblings, 0 replies; 27+ messages in thread
From: Paul Mackerras @ 2013-02-15  3:54 UTC (permalink / raw)
  To: aik
  Cc: Benjamin Herrenschmidt, Alexander Graf, Michael Ellerman,
	linuxppc-dev, linux-kernel, kvm-ppc, kvm, David Gibson

On Mon, Feb 11, 2013 at 11:12:43PM +1100, aik@ozlabs.ru wrote:
> From: Alexey Kardashevskiy <aik@ozlabs.ru>
> 
> The patch allows the host kernel to handle H_PUT_TCE request
> without involving QEMU in it, which should save time on switching
> from the kernel to QEMU and back.
> 
> The patch adds an IOMMU ID parameter into the KVM_CAP_SPAPR_TCE ioctl,
> QEMU needs to be fixed to support that.
> 
> At the moment H_PUT_TCE is processed in the virtual mode as the page
> to be mapped may not be present in the RAM so paging may be involved as
> it can be done from the virtual mode only.
> 
> Tests show that this patch increases transmission speed from 220MB/s
> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
[snip]
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index b4fdabc..acb9cdc 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -47,6 +47,8 @@
>  #include <asm/fadump.h>
>  #include <asm/vio.h>
>  #include <asm/tce.h>
> +#include <asm/kvm_book3s_64.h>
> +#include <asm/page.h>
>  
>  #define DBG(...)
>  
> @@ -727,6 +729,7 @@ void iommu_register_group(struct iommu_table * tbl,
>  		return;
>  	}
>  	tbl->it_group = grp;
> +	INIT_LIST_HEAD(&tbl->it_hugepages);
>  	iommu_group_set_iommudata(grp, tbl, group_release);
>  	iommu_group_set_name(grp, kasprintf(GFP_KERNEL, "domain%d-pe%lx",
>  			domain_number, pe_num));
> @@ -906,6 +909,83 @@ void kvm_iommu_unmap_pages(struct kvm *kvm, struct kvm_memory_slot *slot)
>  {
>  }
>  
> +/*
> + * The KVM guest can be backed with 16MB pages (qemu switch
> + * -mem-path /var/lib/hugetlbfs/global/pagesize-16MB/).
> + * In this case, we cannot do page counting from the real mode
> + * as the compound pages are used - they are linked in a list
> + * with pointers as virtual addresses which are inaccessible
> + * in real mode.
> + *
> + * The code below keeps a 16MB pages list and uses page struct
> + * in real mode if it is already locked in RAM and inserted into
> + * the list or switches to the virtual mode where it can be
> + * handled in a usual manner.
> + */
> +struct iommu_kvmppc_hugepages {
> +	struct list_head list;
> +	pte_t pte;		/* Huge page PTE */
> +	unsigned long pa;	/* Base phys address used as a real TCE */
> +	struct page *page;	/* page struct of the very first subpage */
> +	unsigned long size;	/* Huge page size (always 16MB at the moment) */
> +	bool dirty;		/* Dirty bit */
> +};
> +
> +static struct iommu_kvmppc_hugepages *find_hp_by_pte(struct iommu_table *tbl,
> +		pte_t pte)
> +{
> +	struct iommu_kvmppc_hugepages *hp;
> +
> +	list_for_each_entry(hp, &tbl->it_hugepages, list) {
> +		if (hp->pte == pte)
> +			return hp;
> +	}
> +
> +	return NULL;
> +}
> +
> +static struct iommu_kvmppc_hugepages *find_hp_by_pa(struct iommu_table *tbl,
> +		unsigned long pa)
> +{
> +	struct iommu_kvmppc_hugepages *hp;
> +
> +	list_for_each_entry(hp, &tbl->it_hugepages, list) {
> +		if ((hp->pa <= pa) && (pa < hp->pa + hp->size))
> +			return hp;
> +	}
> +
> +	return NULL;
> +}
> +
> +static struct iommu_kvmppc_hugepages *add_hp(struct iommu_table *tbl,
> +		pte_t pte, unsigned long va, unsigned long pg_size)
> +{
> +	int ret;
> +	struct iommu_kvmppc_hugepages *hp;
> +
> +	hp = kzalloc(sizeof(*hp), GFP_KERNEL);
> +	if (!hp)
> +		return NULL;
> +
> +	hp->pte = pte;
> +	va = va & ~(pg_size - 1);
> +	ret = get_user_pages_fast(va, 1, true/*write*/, &hp->page);
> +	if ((ret != 1) || !hp->page) {
> +		kfree(hp);
> +		return NULL;
> +	}
> +#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
> +#error TODO: fix to avoid page_address() here
> +#endif
> +	hp->pa = __pa((unsigned long) page_address(hp->page));
> +
> +	hp->size = pg_size;
> +
> +	list_add(&hp->list, &tbl->it_hugepages);
> +
> +	return hp;
> +}

I don't see any locking here.  What stops one cpu doing add_hp() from
racing with another doing find_hp_by_pte() or find_hp_by_pa()?
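
(For illustration, one way to close that race is a per-table lock
around the list walks; a sketch, where the it_hugepages_lock field is
hypothetical, though taking spinlocks from real mode already happens
here, e.g. the pool locks in clear_tce():

	static struct iommu_kvmppc_hugepages *find_hp_by_pte(
			struct iommu_table *tbl, pte_t pte)
	{
		struct iommu_kvmppc_hugepages *hp, *found = NULL;

		spin_lock(&tbl->it_hugepages_lock);
		list_for_each_entry(hp, &tbl->it_hugepages, list) {
			if (hp->pte == pte) {
				found = hp;
				break;
			}
		}
		spin_unlock(&tbl->it_hugepages_lock);

		return found;
	}

with add_hp() taking the same lock around its list_add().)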

[snip]
> @@ -1021,6 +1123,24 @@ long iommu_clear_tce_user_mode(struct iommu_table *tbl, unsigned long ioba,
>  }
>  EXPORT_SYMBOL_GPL(iommu_clear_tce_user_mode);
>  
> +long iommu_clear_tce_real_mode(struct iommu_table *tbl, unsigned long ioba,
> +		unsigned long tce_value, unsigned long npages)
> +{
> +	long ret;
> +	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
> +
> +	ret = tce_clear_param_check(tbl, ioba, tce_value, npages);
> +	if (!ret)
> +		ret = clear_tce(tbl, true, entry, npages);
> +
> +	if (ret < 0)
> +		pr_err("iommu_tce: %s failed ioba=%lx, tce_value=%lx ret=%ld\n",
> +				__func__, ioba, tce_value, ret);

Better to avoid printk in real mode if at all possible, particularly
if they're guest-triggerable.

[snip]
> @@ -195,15 +225,43 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	if (!stt)
>  		return H_TOO_HARD;
>  
> +	if (stt->virtmode_only)
> +		return H_TOO_HARD;
> +
>  	tces = (void *) get_real_address(vcpu, tce_list, false, NULL, NULL);
>  	if (!tces)
>  		return H_TOO_HARD;
>  
>  	/* Emulated IO */
> -	for (i = 0; (i < npages) && !ret; ++i, ioba += IOMMU_PAGE_SIZE)
> -		ret = emulated_h_put_tce(stt, ioba, tces[i]);
> +	if (!stt->tbl) {
> +		for (i = 0; (i < npages) && !ret; ++i, ioba += IOMMU_PAGE_SIZE)
> +			ret = emulated_h_put_tce(stt, ioba, tces[i]);
> +
> +		return ret;
> +	}
> +
> +	/* VFIO IOMMU */
> +	for (i = 0; (i < npages) && !ret; ++i, ioba += IOMMU_PAGE_SIZE) {
> +		unsigned long hpa, pg_size = 0;
> +		pte_t pte = 0;
> +
> +		hpa = get_real_address(vcpu, tces[i], tces[i] & TCE_PCI_WRITE,
> +				&pte, &pg_size);
> +		if (!hpa)
> +			return H_TOO_HARD;
> +
> +		ret = iommu_put_tce_real_mode(stt->tbl,
> +				ioba, hpa, pte, pg_size);

If we get a failure part-way through, should we go back and remove the
entries we put in?
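
(A sketch of such an unwind, purely for illustration; ioba_start is a
hypothetical copy of ioba saved before the loop:

	if (ret) {
		/* back out the entries already installed */
		iommu_clear_tce_real_mode(stt->tbl, ioba_start, 0, i);
		iommu_flush_tce(stt->tbl);
		return ret == -EAGAIN ? H_TOO_HARD : H_PARAMETER;
	}

Whether H_TOO_HARD should then retry the whole list in virtual mode
after the rollback is a separate question.)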

[snip]
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 26e2b271..3727ea6 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -863,6 +863,7 @@ struct kvm_s390_ucas_mapping {
>  #define KVM_ALLOCATE_RMA	  _IOR(KVMIO,  0xa9, struct kvm_allocate_rma)
>  /* Available with KVM_CAP_PPC_HTAB_FD */
>  #define KVM_PPC_GET_HTAB_FD	  _IOW(KVMIO,  0xaa, struct kvm_get_htab_fd)
> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xaf, struct kvm_create_spapr_tce_iommu)

This needs an entry in Documentation/virtual/kvm/api.txt.
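
Something along these lines, in the format api.txt uses elsewhere,
would do (a draft only -- the section number, capability name and exact
semantics would need checking against the final interface):

4.xx KVM_CREATE_SPAPR_TCE_IOMMU

Architectures: ppc
Type: vm ioctl
Parameters: struct kvm_create_spapr_tce_iommu (in)
Returns: 0 on success; -1 on error

Creates a link between the LIOBN of an in-kernel TCE table and a host
IOMMU group so that H_PUT_TCE, H_PUT_TCE_INDIRECT and H_STUFF_TCE for
that LIOBN can be handled in the host kernel rather than in QEMU.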

Paul.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/4] powerpc kvm: added multiple TCEs requests support
  2013-02-15  3:24     ` Paul Mackerras
  (?)
@ 2013-02-18  8:14       ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 27+ messages in thread
From: Alexey Kardashevskiy @ 2013-02-18  8:14 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Benjamin Herrenschmidt, Alexander Graf, Michael Ellerman,
	linuxppc-dev, linux-kernel, kvm-ppc, kvm, David Gibson

On 15/02/13 14:24, Paul Mackerras wrote:
> On Mon, Feb 11, 2013 at 11:12:41PM +1100, aik@ozlabs.ru wrote:
>
>> +static long emulated_h_put_tce(struct kvmppc_spapr_tce_table *stt,
>> +		unsigned long ioba, unsigned long tce)
>> +{
>> +	unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
>> +	struct page *page;
>> +	u64 *tbl;
>> +
>> +	/* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p  window_size=0x%x\n", */
>> +	/* 	    liobn, stt, stt->window_size); */
>> +	if (ioba >= stt->window_size) {
>> +		pr_err("%s failed on ioba=%lx\n", __func__, ioba);
>
> Doesn't this give the guest a way to spam the host logs?  And in fact
> printk in real mode is potentially problematic.  I would just leave
> out this statement.
>
>> +		return H_PARAMETER;
>> +	}
>> +
>> +	page = stt->pages[idx / TCES_PER_PAGE];
>> +	tbl = (u64 *)page_address(page);
>
> I would like to see an explanation of why we are confident that
> page_address() will work correctly in real mode, across all the
> combinations of config options that we can have for a ppc64 book3s
> kernel.

It was there before this patch; I just moved it, so I assume it has been 
explained before :)

There is no config combination on PPC that enables WANT_PAGE_VIRTUAL.
CONFIG_HIGHMEM is supported on PPC32 only, so HASHED_PAGE_VIRTUAL is not 
enabled on PPC64 either.

So this definition is supposed to work on PPC64:
#define page_address(page) lowmem_page_address(page)

where lowmem_page_address() is an arithmetic operation on the page struct's address:
static __always_inline void *lowmem_page_address(const struct page *page)
{
	return __va(PFN_PHYS(page_to_pfn(page)));
}

PPC32 will use page_address() from mm/highmem.c; I still need a lesson on 
32-bit memory layout, but for now I cannot see how it could possibly fail here.



>> +
>> +	/* FIXME: Need to validate the TCE itself */
>> +	/* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
>> +	tbl[idx % TCES_PER_PAGE] = tce;
>> +
>> +	return H_SUCCESS;
>> +}
>> +
>> +/*
>> + * Real mode handlers
>>    */
>>   long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>   		      unsigned long ioba, unsigned long tce)
>>   {
>> -	struct kvm *kvm = vcpu->kvm;
>>   	struct kvmppc_spapr_tce_table *stt;
>>
>> -	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>> -	/* 	    liobn, ioba, tce); */
>> +	stt = find_tce_table(vcpu, liobn);
>> +	/* Didn't find the liobn, put it to userspace */
>> +	if (!stt)
>> +		return H_TOO_HARD;
>> +
>> +	/* Emulated IO */
>> +	return emulated_h_put_tce(stt, ioba, tce);
>> +}
>> +
>> +long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>> +		unsigned long liobn, unsigned long ioba,
>> +		unsigned long tce_list,	unsigned long npages)
>> +{
>> +	struct kvmppc_spapr_tce_table *stt;
>> +	long i, ret = 0;
>> +	unsigned long *tces;
>> +
>> +	stt = find_tce_table(vcpu, liobn);
>> +	/* Didn't find the liobn, put it to userspace */
>> +	if (!stt)
>> +		return H_TOO_HARD;
>>
>> -	list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
>> -		if (stt->liobn == liobn) {
>> -			unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
>> -			struct page *page;
>> -			u64 *tbl;
>> +	tces = (void *) get_real_address(vcpu, tce_list, false, NULL, NULL);
>> +	if (!tces)
>> +		return H_TOO_HARD;
>>
>> -			/* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p  window_size=0x%x\n", */
>> -			/* 	    liobn, stt, stt->window_size); */
>> -			if (ioba >= stt->window_size)
>> -				return H_PARAMETER;
>> +	/* Emulated IO */
>> +	for (i = 0; (i < npages) && !ret; ++i, ioba += IOMMU_PAGE_SIZE)
>> +		ret = emulated_h_put_tce(stt, ioba, tces[i]);
>
> So, tces is a pointer to somewhere inside a real page.  Did we check
> somewhere that tces[npages-1] is in the same page as tces[0]?  If so,
> I missed it.  If we didn't, then we probably should check and do
> something about it.
>
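
A minimal sketch of such a check (assuming the whole list must fit
within the single real page that tces points into) might be:

	/* reject a TCE list that would cross the page it starts in;
	   returning H_PARAMETER here is illustrative, a real-mode
	   handler might prefer to punt with H_TOO_HARD */
	if ((tce_list & (PAGE_SIZE - 1)) + npages * sizeof(*tces) > PAGE_SIZE)
		return H_PARAMETER;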
>>
>> -			page = stt->pages[idx / TCES_PER_PAGE];
>> -			tbl = (u64 *)page_address(page);
>> +	return ret;
>> +}
>>
>> -			/* FIXME: Need to validate the TCE itself */
>> -			/* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
>> -			tbl[idx % TCES_PER_PAGE] = tce;
>> -			return H_SUCCESS;
>> -		}
>> -	}
>> +long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>> +		unsigned long liobn, unsigned long ioba,
>> +		unsigned long tce_value, unsigned long npages)
>> +{
>> +	struct kvmppc_spapr_tce_table *stt;
>> +	long i, ret = 0;
>> +
>> +	stt = find_tce_table(vcpu, liobn);
>> +	/* Didn't find the liobn, put it to userspace */
>> +	if (!stt)
>> +		return H_TOO_HARD;
>>
>> -	/* Didn't find the liobn, punt it to userspace */
>> -	return H_TOO_HARD;
>> +	/* Emulated IO */
>> +	for (i = 0; (i < npages) && !ret; ++i, ioba += IOMMU_PAGE_SIZE)
>> +		ret = emulated_h_put_tce(stt, ioba, tce_value);
>> +
>> +	return ret;
>> +}
>> +
>> +/*
>> + * Virtual mode handlers
>> + */
>> +extern long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
>> +		unsigned long liobn, unsigned long ioba,
>> +		unsigned long tce)
>> +{
>> +	/* At the moment emulated IO is handled the same way */
>> +	return kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
>> +}
>> +
>> +extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>> +		unsigned long liobn, unsigned long ioba,
>> +		unsigned long tce_list, unsigned long npages)
>> +{
>> +	struct kvmppc_spapr_tce_table *stt;
>> +	unsigned long *tces;
>> +	long ret = 0, i;
>> +
>> +	stt = find_tce_table(vcpu, liobn);
>> +	/* Didn't find the liobn, put it to userspace */
>> +	if (!stt)
>> +		return H_TOO_HARD;
>> +
>> +	tces = (void *) get_virt_address(vcpu, tce_list, false, NULL, NULL);
>> +	if (!tces)
>> +		return H_TOO_HARD;
>> +
>> +	/* Emulated IO */
>> +	for (i = 0; (i < npages) && !ret; ++i, ioba += IOMMU_PAGE_SIZE)
>> +		ret = emulated_h_put_tce(stt, ioba, tces[i]);
>
> Same comment here about tces[i] overflowing a page boundary.
>
>> +
>> +	return ret;
>> +}
>> +
>> +extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
>> +		unsigned long liobn, unsigned long ioba,
>> +		unsigned long tce_value, unsigned long npages)
>> +{
>> +	/* At the moment emulated IO is handled the same way */
>> +	return kvmppc_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages);
>>   }
>> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
>> index 71d0c90..13c8436 100644
>> --- a/arch/powerpc/kvm/book3s_hv.c
>> +++ b/arch/powerpc/kvm/book3s_hv.c
>> @@ -515,6 +515,29 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
>>   					kvmppc_get_gpr(vcpu, 5),
>>   					kvmppc_get_gpr(vcpu, 6));
>>   		break;
>> +	case H_PUT_TCE:
>> +		ret = kvmppc_virtmode_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
>> +					        kvmppc_get_gpr(vcpu, 5),
>> +					        kvmppc_get_gpr(vcpu, 6));
>> +		if (ret == H_TOO_HARD)
>> +			return RESUME_HOST;
>> +		break;
>> +	case H_PUT_TCE_INDIRECT:
>> +		ret = kvmppc_virtmode_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4),
>> +					        kvmppc_get_gpr(vcpu, 5),
>> +					        kvmppc_get_gpr(vcpu, 6),
>> +						kvmppc_get_gpr(vcpu, 7));
>> +		if (ret == H_TOO_HARD)
>> +			return RESUME_HOST;
>> +		break;
>> +	case H_STUFF_TCE:
>> +		ret = kvmppc_virtmode_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
>> +					        kvmppc_get_gpr(vcpu, 5),
>> +					        kvmppc_get_gpr(vcpu, 6),
>> +						kvmppc_get_gpr(vcpu, 7));
>> +		if (ret == H_TOO_HARD)
>> +			return RESUME_HOST;
>> +		break;
>>   	default:
>>   		return RESUME_HOST;
>>   	}
> [snip]
>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>> index 70739a0..95614c7 100644
>> --- a/arch/powerpc/kvm/powerpc.c
>> +++ b/arch/powerpc/kvm/powerpc.c
>> @@ -383,6 +383,9 @@ int kvm_dev_ioctl_check_extension(long ext)
>>   		r = 1;
>>   		break;
>>   #endif
>> +	case KVM_CAP_PPC_MULTITCE:
>> +		r = 1;
>> +		break;
>>   	default:
>>   		r = 0;
>>   		break;
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index e6e5d4b..26e2b271 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -635,6 +635,7 @@ struct kvm_ppc_smmu_info {
>>   #define KVM_CAP_IRQFD_RESAMPLE 82
>>   #define KVM_CAP_PPC_BOOKE_WATCHDOG 83
>>   #define KVM_CAP_PPC_HTAB_FD 84
>> +#define KVM_CAP_PPC_MULTITCE 87
>
> The capability should be described in
> Documentation/virtual/kvm/api.txt.

Is this description sufficient?

===
4.79 KVM_CAP_PPC_MULTITCE

Architectures: ppc
Parameters: none
Returns: 0 on success; -1 on error

This capability enables the guest to put/remove multiple TCE entries
per hypercall, which significantly accelerates DMA operations for PPC KVM
guests.

When this capability is enabled, H_PUT_TCE_INDIRECT and H_STUFF_TCE are
expected to be used rather than H_PUT_TCE, which supports only one TCE entry
per call.
===
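
For completeness, userspace would detect this like any other capability
(a hypothetical probe, raw ioctl form):

	/* hypothetical userspace probe for the capability */
	ret = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_PPC_MULTITCE);
	multitce_supported = (ret > 0);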


-- 
Alexey

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2013-02-18  8:14 UTC | newest]

Thread overview: 27+ messages
     [not found] <1360584763-21988-1-git-send-email-a>
2013-02-11 12:12 ` [PATCH 1/4] powerpc: lookup_linux_pte has been made public aik
2013-02-11 12:12   ` aik
2013-02-11 12:12   ` aik
2013-02-15  3:13   ` Paul Mackerras
2013-02-15  3:13     ` Paul Mackerras
2013-02-15  3:13     ` Paul Mackerras
2013-02-11 12:12 ` [PATCH 2/4] powerpc kvm: added multiple TCEs requests support aik
2013-02-11 12:12   ` aik
2013-02-11 12:12   ` aik
2013-02-15  3:24   ` Paul Mackerras
2013-02-15  3:24     ` Paul Mackerras
2013-02-15  3:24     ` Paul Mackerras
2013-02-18  8:14     ` Alexey Kardashevskiy
2013-02-18  8:14       ` Alexey Kardashevskiy
2013-02-18  8:14       ` Alexey Kardashevskiy
2013-02-11 12:12 ` [PATCH 3/4] powerpc: preparing to support real mode optimization aik
2013-02-11 12:12   ` aik
2013-02-11 12:12   ` aik
2013-02-15  3:37   ` Paul Mackerras
2013-02-15  3:37     ` Paul Mackerras
2013-02-15  3:37     ` Paul Mackerras
2013-02-11 12:12 ` [PATCH 4/4] vfio powerpc: added real mode support aik
2013-02-11 12:12   ` aik
2013-02-11 12:12   ` aik
2013-02-15  3:54   ` Paul Mackerras
2013-02-15  3:54     ` Paul Mackerras
2013-02-15  3:54     ` Paul Mackerras
