All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/6] KVM: PPC: IOMMU in-kernel handling
@ 2013-05-06  7:25 ` Alexey Kardashevskiy
  0 siblings, 0 replies; 44+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-06  7:25 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, Benjamin Herrenschmidt,
	Alexander Graf, Paul Mackerras, Joerg Roedel, Alex Williamson,
	kvm, kvm-ppc

This series is supposed to accelerate IOMMU operations in real and virtual
mode in the host kernel for the KVM guest.

The first user is VFIO however this series does not contain any VFIO related
code as the connection between VFIO and the new handlers is to be made in QEMU
via ioctl to the KVM fd.

Although the series compiles, it does not make sense without VFIO patches which
are posted separately.

The "iommu: Add a function to find an iommu group by id" patch has already
gone to linux-next (from iommu tree) but it is not in upstream yet so
I am including it here for the reference.


Alexey Kardashevskiy (6):
  KVM: PPC: Make lookup_linux_pte public
  KVM: PPC: Add support for multiple-TCE hcalls
  powerpc: Prepare to support kernel handling of IOMMU map/unmap
  iommu: Add a function to find an iommu group by id
  KVM: PPC: Add support for IOMMU in-kernel handling
  KVM: PPC: Add hugepage support for IOMMU in-kernel handling

 Documentation/virtual/kvm/api.txt        |   43 +++
 arch/powerpc/include/asm/kvm_host.h      |    4 +
 arch/powerpc/include/asm/kvm_ppc.h       |   44 ++-
 arch/powerpc/include/asm/pgtable-ppc64.h |    4 +
 arch/powerpc/include/uapi/asm/kvm.h      |    7 +
 arch/powerpc/kvm/book3s_64_vio.c         |  433 +++++++++++++++++++++++++++-
 arch/powerpc/kvm/book3s_64_vio_hv.c      |  464 ++++++++++++++++++++++++++++--
 arch/powerpc/kvm/book3s_hv.c             |   23 ++
 arch/powerpc/kvm/book3s_hv_rm_mmu.c      |    5 +-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S  |    6 +
 arch/powerpc/kvm/book3s_pr_papr.c        |   37 ++-
 arch/powerpc/kvm/powerpc.c               |   15 +
 arch/powerpc/mm/init_64.c                |   77 ++++-
 drivers/iommu/iommu.c                    |   29 ++
 include/linux/iommu.h                    |    1 +
 include/uapi/linux/kvm.h                 |    3 +
 16 files changed, 1159 insertions(+), 36 deletions(-)

-- 
1.7.10.4


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 0/6] KVM: PPC: IOMMU in-kernel handling
@ 2013-05-06  7:25 ` Alexey Kardashevskiy
  0 siblings, 0 replies; 44+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-06  7:25 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kvm, Alexey Kardashevskiy, Alexander Graf, kvm-ppc,
	Alex Williamson, Paul Mackerras, Joerg Roedel, David Gibson

This series is supposed to accelerate IOMMU operations in real and virtual
mode in the host kernel for the KVM guest.

The first user is VFIO however this series does not contain any VFIO related
code as the connection between VFIO and the new handlers is to be made in QEMU
via ioctl to the KVM fd.

Although the series compiles, it does not make sense without VFIO patches which
are posted separately.

The "iommu: Add a function to find an iommu group by id" patch has already
gone to linux-next (from iommu tree) but it is not in upstream yet so
I am including it here for the reference.


Alexey Kardashevskiy (6):
  KVM: PPC: Make lookup_linux_pte public
  KVM: PPC: Add support for multiple-TCE hcalls
  powerpc: Prepare to support kernel handling of IOMMU map/unmap
  iommu: Add a function to find an iommu group by id
  KVM: PPC: Add support for IOMMU in-kernel handling
  KVM: PPC: Add hugepage support for IOMMU in-kernel handling

 Documentation/virtual/kvm/api.txt        |   43 +++
 arch/powerpc/include/asm/kvm_host.h      |    4 +
 arch/powerpc/include/asm/kvm_ppc.h       |   44 ++-
 arch/powerpc/include/asm/pgtable-ppc64.h |    4 +
 arch/powerpc/include/uapi/asm/kvm.h      |    7 +
 arch/powerpc/kvm/book3s_64_vio.c         |  433 +++++++++++++++++++++++++++-
 arch/powerpc/kvm/book3s_64_vio_hv.c      |  464 ++++++++++++++++++++++++++++--
 arch/powerpc/kvm/book3s_hv.c             |   23 ++
 arch/powerpc/kvm/book3s_hv_rm_mmu.c      |    5 +-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S  |    6 +
 arch/powerpc/kvm/book3s_pr_papr.c        |   37 ++-
 arch/powerpc/kvm/powerpc.c               |   15 +
 arch/powerpc/mm/init_64.c                |   77 ++++-
 drivers/iommu/iommu.c                    |   29 ++
 include/linux/iommu.h                    |    1 +
 include/uapi/linux/kvm.h                 |    3 +
 16 files changed, 1159 insertions(+), 36 deletions(-)

-- 
1.7.10.4

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 0/6] KVM: PPC: IOMMU in-kernel handling
@ 2013-05-06  7:25 ` Alexey Kardashevskiy
  0 siblings, 0 replies; 44+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-06  7:25 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, Benjamin Herrenschmidt,
	Alexander Graf, Paul Mackerras, Joerg Roedel, Alex Williamson,
	kvm, kvm-ppc

This series is supposed to accelerate IOMMU operations in real and virtual
mode in the host kernel for the KVM guest.

The first user is VFIO however this series does not contain any VFIO related
code as the connection between VFIO and the new handlers is to be made in QEMU
via ioctl to the KVM fd.

Although the series compiles, it does not make sense without VFIO patches which
are posted separately.

The "iommu: Add a function to find an iommu group by id" patch has already
gone to linux-next (from iommu tree) but it is not in upstream yet so
I am including it here for the reference.


Alexey Kardashevskiy (6):
  KVM: PPC: Make lookup_linux_pte public
  KVM: PPC: Add support for multiple-TCE hcalls
  powerpc: Prepare to support kernel handling of IOMMU map/unmap
  iommu: Add a function to find an iommu group by id
  KVM: PPC: Add support for IOMMU in-kernel handling
  KVM: PPC: Add hugepage support for IOMMU in-kernel handling

 Documentation/virtual/kvm/api.txt        |   43 +++
 arch/powerpc/include/asm/kvm_host.h      |    4 +
 arch/powerpc/include/asm/kvm_ppc.h       |   44 ++-
 arch/powerpc/include/asm/pgtable-ppc64.h |    4 +
 arch/powerpc/include/uapi/asm/kvm.h      |    7 +
 arch/powerpc/kvm/book3s_64_vio.c         |  433 +++++++++++++++++++++++++++-
 arch/powerpc/kvm/book3s_64_vio_hv.c      |  464 ++++++++++++++++++++++++++++--
 arch/powerpc/kvm/book3s_hv.c             |   23 ++
 arch/powerpc/kvm/book3s_hv_rm_mmu.c      |    5 +-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S  |    6 +
 arch/powerpc/kvm/book3s_pr_papr.c        |   37 ++-
 arch/powerpc/kvm/powerpc.c               |   15 +
 arch/powerpc/mm/init_64.c                |   77 ++++-
 drivers/iommu/iommu.c                    |   29 ++
 include/linux/iommu.h                    |    1 +
 include/uapi/linux/kvm.h                 |    3 +
 16 files changed, 1159 insertions(+), 36 deletions(-)

-- 
1.7.10.4


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 1/6] KVM: PPC: Make lookup_linux_pte public
  2013-05-06  7:25 ` Alexey Kardashevskiy
  (?)
@ 2013-05-06  7:25   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 44+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-06  7:25 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, Benjamin Herrenschmidt,
	Alexander Graf, Paul Mackerras, Joerg Roedel, Alex Williamson,
	kvm, kvm-ppc

The lookup_linux_pte() function returns a linux PTE which is needed in
the process of converting KVM guest physical address into host real
address in real mode.

This conversion will be used by upcoming support of H_PUT_TCE_INDIRECT,
as the TCE list address comes from the guest and is a guest physical
address.  This makes lookup_linux_pte() public so that code can call
it.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 arch/powerpc/include/asm/kvm_ppc.h  |    3 +++
 arch/powerpc/kvm/book3s_hv_rm_mmu.c |    5 +++--
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 41426c9..99da298 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -379,4 +379,7 @@ static inline ulong kvmppc_get_ea_indexed(struct kvm_vcpu *vcpu, int ra, int rb)
 	return ea;
 }
 
+pte_t lookup_linux_pte(pgd_t *pgdir, unsigned long hva,
+		int writing, unsigned long *pte_sizep);
+
 #endif /* __POWERPC_KVM_PPC_H__ */
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 6dcbb49..18fc382 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -134,8 +134,8 @@ static void remove_revmap_chain(struct kvm *kvm, long pte_index,
 	unlock_rmap(rmap);
 }
 
-static pte_t lookup_linux_pte(pgd_t *pgdir, unsigned long hva,
-			      int writing, unsigned long *pte_sizep)
+pte_t lookup_linux_pte(pgd_t *pgdir, unsigned long hva,
+		       int writing, unsigned long *pte_sizep)
 {
 	pte_t *ptep;
 	unsigned long ps = *pte_sizep;
@@ -154,6 +154,7 @@ static pte_t lookup_linux_pte(pgd_t *pgdir, unsigned long hva,
 		return __pte(0);
 	return kvmppc_read_update_linux_pte(ptep, writing);
 }
+EXPORT_SYMBOL_GPL(lookup_linux_pte);
 
 static inline void unlock_hpte(unsigned long *hpte, unsigned long hpte_v)
 {
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 1/6] KVM: PPC: Make lookup_linux_pte public
@ 2013-05-06  7:25   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 44+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-06  7:25 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kvm, Alexey Kardashevskiy, Alexander Graf, kvm-ppc,
	Alex Williamson, Paul Mackerras, Joerg Roedel, David Gibson

The lookup_linux_pte() function returns a linux PTE which is needed in
the process of converting KVM guest physical address into host real
address in real mode.

This conversion will be used by upcoming support of H_PUT_TCE_INDIRECT,
as the TCE list address comes from the guest and is a guest physical
address.  This makes lookup_linux_pte() public so that code can call
it.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 arch/powerpc/include/asm/kvm_ppc.h  |    3 +++
 arch/powerpc/kvm/book3s_hv_rm_mmu.c |    5 +++--
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 41426c9..99da298 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -379,4 +379,7 @@ static inline ulong kvmppc_get_ea_indexed(struct kvm_vcpu *vcpu, int ra, int rb)
 	return ea;
 }
 
+pte_t lookup_linux_pte(pgd_t *pgdir, unsigned long hva,
+		int writing, unsigned long *pte_sizep);
+
 #endif /* __POWERPC_KVM_PPC_H__ */
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 6dcbb49..18fc382 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -134,8 +134,8 @@ static void remove_revmap_chain(struct kvm *kvm, long pte_index,
 	unlock_rmap(rmap);
 }
 
-static pte_t lookup_linux_pte(pgd_t *pgdir, unsigned long hva,
-			      int writing, unsigned long *pte_sizep)
+pte_t lookup_linux_pte(pgd_t *pgdir, unsigned long hva,
+		       int writing, unsigned long *pte_sizep)
 {
 	pte_t *ptep;
 	unsigned long ps = *pte_sizep;
@@ -154,6 +154,7 @@ static pte_t lookup_linux_pte(pgd_t *pgdir, unsigned long hva,
 		return __pte(0);
 	return kvmppc_read_update_linux_pte(ptep, writing);
 }
+EXPORT_SYMBOL_GPL(lookup_linux_pte);
 
 static inline void unlock_hpte(unsigned long *hpte, unsigned long hpte_v)
 {
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 1/6] KVM: PPC: Make lookup_linux_pte public
@ 2013-05-06  7:25   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 44+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-06  7:25 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, Benjamin Herrenschmidt,
	Alexander Graf, Paul Mackerras, Joerg Roedel, Alex Williamson,
	kvm, kvm-ppc

The lookup_linux_pte() function returns a linux PTE which is needed in
the process of converting KVM guest physical address into host real
address in real mode.

This conversion will be used by upcoming support of H_PUT_TCE_INDIRECT,
as the TCE list address comes from the guest and is a guest physical
address.  This makes lookup_linux_pte() public so that code can call
it.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 arch/powerpc/include/asm/kvm_ppc.h  |    3 +++
 arch/powerpc/kvm/book3s_hv_rm_mmu.c |    5 +++--
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 41426c9..99da298 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -379,4 +379,7 @@ static inline ulong kvmppc_get_ea_indexed(struct kvm_vcpu *vcpu, int ra, int rb)
 	return ea;
 }
 
+pte_t lookup_linux_pte(pgd_t *pgdir, unsigned long hva,
+		int writing, unsigned long *pte_sizep);
+
 #endif /* __POWERPC_KVM_PPC_H__ */
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 6dcbb49..18fc382 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -134,8 +134,8 @@ static void remove_revmap_chain(struct kvm *kvm, long pte_index,
 	unlock_rmap(rmap);
 }
 
-static pte_t lookup_linux_pte(pgd_t *pgdir, unsigned long hva,
-			      int writing, unsigned long *pte_sizep)
+pte_t lookup_linux_pte(pgd_t *pgdir, unsigned long hva,
+		       int writing, unsigned long *pte_sizep)
 {
 	pte_t *ptep;
 	unsigned long ps = *pte_sizep;
@@ -154,6 +154,7 @@ static pte_t lookup_linux_pte(pgd_t *pgdir, unsigned long hva,
 		return __pte(0);
 	return kvmppc_read_update_linux_pte(ptep, writing);
 }
+EXPORT_SYMBOL_GPL(lookup_linux_pte);
 
 static inline void unlock_hpte(unsigned long *hpte, unsigned long hpte_v)
 {
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 2/6] KVM: PPC: Add support for multiple-TCE hcalls
  2013-05-06  7:25 ` Alexey Kardashevskiy
  (?)
@ 2013-05-06  7:25   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 44+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-06  7:25 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, Benjamin Herrenschmidt,
	Alexander Graf, Paul Mackerras, Joerg Roedel, Alex Williamson,
	kvm, kvm-ppc

This adds real mode handlers for the H_PUT_TCE_INDIRECT and
H_STUFF_TCE hypercalls for QEMU emulated devices such as virtio
devices or emulated PCI.  These calls allow adding multiple entries
(up to 512) into the TCE table in one call which saves time on
transition to/from real mode.

This adds a guest physical to host real address converter
and calls the existing H_PUT_TCE handler. The converting function
is going to be fully utilized by upcoming VFIO supporting patches.

This also implements the KVM_CAP_PPC_MULTITCE capability,
so in order to support the functionality of this patch, QEMU
needs to query for this capability and set the "hcall-multi-tce"
hypertas property only if the capability is present, otherwise
there will be serious performance degradation.

Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 Documentation/virtual/kvm/api.txt       |   15 ++
 arch/powerpc/include/asm/kvm_ppc.h      |   15 +-
 arch/powerpc/kvm/book3s_64_vio.c        |  114 +++++++++++++++
 arch/powerpc/kvm/book3s_64_vio_hv.c     |  231 +++++++++++++++++++++++++++----
 arch/powerpc/kvm/book3s_hv.c            |   23 +++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |    6 +
 arch/powerpc/kvm/book3s_pr_papr.c       |   37 ++++-
 arch/powerpc/kvm/powerpc.c              |    3 +
 include/uapi/linux/kvm.h                |    1 +
 9 files changed, 413 insertions(+), 32 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index a4df553..f621cd6 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2463,3 +2463,18 @@ For mmu types KVM_MMU_FSL_BOOKE_NOHV and KVM_MMU_FSL_BOOKE_HV:
    where "num_sets" is the tlb_sizes[] value divided by the tlb_ways[] value.
  - The tsize field of mas1 shall be set to 4K on TLB0, even though the
    hardware ignores this value for TLB0.
+
+
+6.4 KVM_CAP_PPC_MULTITCE
+
+Architectures: ppc
+Parameters: none
+Returns: 0 on success; -1 on error
+
+This capability enables the guest to put/remove multiple TCE entries
+per hypercall which significanly accelerates DMA operations for PPC KVM
+guests.
+
+When this capability is enabled, H_PUT_TCE_INDIRECT and H_STUFF_TCE are
+expected to occur rather than H_PUT_TCE which supports only one TCE entry
+per call.
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 99da298..d501246 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -139,8 +139,19 @@ extern void kvmppc_xics_free(struct kvm *kvm);
 
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce *args);
-extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
-			     unsigned long ioba, unsigned long tce);
+extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
+		struct kvm_vcpu *vcpu, unsigned long liobn);
+extern long kvmppc_emulated_h_put_tce(struct kvmppc_spapr_tce_table *stt,
+		unsigned long ioba, unsigned long tce);
+extern long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce);
+extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_list, unsigned long npages);
+extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages);
 extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
 				struct kvm_allocate_rma *rma);
 extern struct kvmppc_linear_info *kvm_alloc_rma(void);
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 72ffc89..643ac1e 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -14,6 +14,7 @@
  *
  * Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
  * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com>
+ * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com>
  */
 
 #include <linux/types.h>
@@ -36,9 +37,14 @@
 #include <asm/ppc-opcode.h>
 #include <asm/kvm_host.h>
 #include <asm/udbg.h>
+#include <asm/iommu.h>
 
 #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
+#define ERROR_ADDR      (~(unsigned long)0x0)
 
+/*
+ * TCE tables handlers.
+ */
 static long kvmppc_stt_npages(unsigned long window_size)
 {
 	return ALIGN((window_size >> SPAPR_TCE_SHIFT)
@@ -148,3 +154,111 @@ fail:
 	}
 	return ret;
 }
+
+/*
+ * Virtual mode handling of IOMMU map/unmap.
+ */
+/* Converts guest physical address into host virtual */
+static unsigned long get_virt_address(struct kvm_vcpu *vcpu,
+		unsigned long gpa)
+{
+	unsigned long hva, gfn = gpa >> PAGE_SHIFT;
+	struct kvm_memory_slot *memslot;
+
+	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
+	if (!memslot)
+		return ERROR_ADDR;
+
+	/*
+	 * Convert gfn to hva preserving flags and an offset
+	 * within a system page
+	 */
+	hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa & ~PAGE_MASK);
+	return hva;
+}
+
+long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce)
+{
+	struct kvmppc_spapr_tce_table *tt;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, put it to userspace */
+	if (!tt)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	return kvmppc_emulated_h_put_tce(tt, ioba, tce);
+}
+
+long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_list, unsigned long npages)
+{
+	struct kvmppc_spapr_tce_table *tt;
+	long i;
+	unsigned long tces;
+
+	/* The whole table addressed by tce_list resides in 4K page */
+	if (npages > 512)
+		return H_PARAMETER;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, put it to userspace */
+	if (!tt)
+		return H_TOO_HARD;
+
+	tces = get_virt_address(vcpu, tce_list);
+	if (tces == ERROR_ADDR)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i) {
+		unsigned long tce;
+		unsigned long ptce = tces + i * sizeof(unsigned long);
+
+		if (get_user(tce, (unsigned long __user *)ptce))
+			break;
+
+		if (kvmppc_emulated_h_put_tce(tt,
+				ioba + (i << IOMMU_PAGE_SHIFT),	tce))
+			break;
+	}
+	if (i == npages)
+		return H_SUCCESS;
+
+	/* Failed, do cleanup */
+	do {
+		--i;
+		kvmppc_emulated_h_put_tce(tt, ioba + (i << IOMMU_PAGE_SHIFT),
+				0);
+	} while (i);
+
+	return H_PARAMETER;
+}
+
+long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	struct kvmppc_spapr_tce_table *tt;
+	long i;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, put it to userspace */
+	if (!tt)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
+		kvmppc_emulated_h_put_tce(tt, ioba, tce_value);
+
+	return H_SUCCESS;
+}
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 30c2f3b..55fdf7a 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -14,6 +14,7 @@
  *
  * Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
  * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com>
+ * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com>
  */
 
 #include <linux/types.h>
@@ -35,42 +36,214 @@
 #include <asm/ppc-opcode.h>
 #include <asm/kvm_host.h>
 #include <asm/udbg.h>
+#include <asm/iommu.h>
+#include <asm/tce.h>
 
 #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
+#define ERROR_ADDR      (~(unsigned long)0x0)
 
-/* WARNING: This will be called in real-mode on HV KVM and virtual
- *          mode on PR KVM
+/*
+ * Finds a TCE table descriptor by LIOBN.
  */
+struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm_vcpu *vcpu,
+		unsigned long liobn)
+{
+	struct kvmppc_spapr_tce_table *tt;
+
+	list_for_each_entry(tt, &vcpu->kvm->arch.spapr_tce_tables, list) {
+		if (tt->liobn == liobn)
+			return tt;
+	}
+
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(kvmppc_find_tce_table);
+
+/*
+ * kvmppc_emulated_h_put_tce() handles TCE requests for devices emulated
+ * by QEMU. It puts guest TCE values into the table and expects
+ * the QEMU to convert them later in the QEMU device implementation.
+ * Works in both real and virtual modes.
+ */
+long kvmppc_emulated_h_put_tce(struct kvmppc_spapr_tce_table *tt,
+		unsigned long ioba, unsigned long tce)
+{
+	unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
+	struct page *page;
+	u64 *tbl;
+
+	/* udbg_printf("H_PUT_TCE: liobn 0x%lx => tt=%p  window_size=0x%x\n", */
+	/*	    liobn, tt, tt->window_size); */
+	if (ioba >= tt->window_size) {
+		/* pr_err("%s failed on ioba=%lx\n", __func__, ioba); */
+		return H_PARAMETER;
+	}
+	/*
+	 * Note on the use of page_address() in real mode,
+	 *
+	 * It is safe to use page_address() in real mode on ppc64 because
+	 * page_address() is always defined as lowmem_page_address()
+	 * which returns __va(PFN_PHYS(page_to_pfn(page))) which is arithmetial
+	 * operation and does not access page struct.
+	 *
+	 * Theoretically page_address() could be defined different
+	 * but either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL
+	 * should be enabled.
+	 * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64,
+	 * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only
+	 * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP
+	 * is not expected to be enabled on ppc32, page_address()
+	 * is safe for ppc32 as well.
+	 */
+#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
+#error TODO: fix to avoid page_address() here
+#endif
+	page = tt->pages[idx / TCES_PER_PAGE];
+	tbl = (u64 *)page_address(page);
+
+	/*
+	 * Validate TCE address.
+	 * At the moment only flags are validated
+	 * as other check will significantly slow down
+	 * or can make it even impossible to handle TCE requests
+	 * in real mode.
+	 */
+	if (tce & ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ))
+		return H_PARAMETER;
+
+	/* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
+	tbl[idx % TCES_PER_PAGE] = tce;
+
+	return H_SUCCESS;
+}
+EXPORT_SYMBOL_GPL(kvmppc_emulated_h_put_tce);
+
+#ifdef CONFIG_KVM_BOOK3S_64_HV
+/*
+ * Converts guest physical address into host real address.
+ * Also returns pte and page size if the page is present in page table.
+ */
+static unsigned long get_real_address(struct kvm_vcpu *vcpu,
+		unsigned long gpa, bool writing,
+		pte_t *ptep, unsigned long *pg_sizep)
+{
+	struct kvm_memory_slot *memslot;
+	pte_t pte;
+	unsigned long hva, pg_size = 0, hwaddr, offset;
+	unsigned long gfn = gpa >> PAGE_SHIFT;
+
+	/* Find a KVM memslot */
+	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
+	if (!memslot)
+		return ERROR_ADDR;
+
+	/* Convert guest physical address to host virtual */
+	hva = __gfn_to_hva_memslot(memslot, gfn);
+
+	/* Find a PTE and determine the size */
+	pte = lookup_linux_pte(vcpu->arch.pgdir, hva,
+			writing, &pg_size);
+	if (!pte_present(pte))
+		return ERROR_ADDR;
+
+	/* Calculate host phys address keeping flags and offset in the page */
+	offset = gpa & (pg_size - 1);
+
+	/* pte_pfn(pte) should return an address aligned to pg_size */
+	hwaddr = (pte_pfn(pte) << PAGE_SHIFT) + offset;
+
+	/* Copy outer values if required */
+	if (pg_sizep)
+		*pg_sizep = pg_size;
+	if (ptep)
+		*ptep = pte;
+
+	return hwaddr;
+}
+
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba, unsigned long tce)
 {
-	struct kvm *kvm = vcpu->kvm;
-	struct kvmppc_spapr_tce_table *stt;
-
-	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
-	/* 	    liobn, ioba, tce); */
-
-	list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
-		if (stt->liobn == liobn) {
-			unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
-			struct page *page;
-			u64 *tbl;
-
-			/* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p  window_size=0x%x\n", */
-			/* 	    liobn, stt, stt->window_size); */
-			if (ioba >= stt->window_size)
-				return H_PARAMETER;
-
-			page = stt->pages[idx / TCES_PER_PAGE];
-			tbl = (u64 *)page_address(page);
-
-			/* FIXME: Need to validate the TCE itself */
-			/* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
-			tbl[idx % TCES_PER_PAGE] = tce;
-			return H_SUCCESS;
-		}
+	struct kvmppc_spapr_tce_table *tt;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, put it to virtual space */
+	if (!tt)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	return kvmppc_emulated_h_put_tce(tt, ioba, tce);
+}
+
+long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_list,	unsigned long npages)
+{
+	struct kvmppc_spapr_tce_table *tt;
+	long i;
+	unsigned long tces;
+
+	/* The whole table addressed by tce_list resides in 4K page */
+	if (npages > 512)
+		return H_PARAMETER;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, put it to virtual space */
+	if (!tt)
+		return H_TOO_HARD;
+
+	tces = get_real_address(vcpu, tce_list, false, NULL, NULL);
+	if (tces == ERROR_ADDR)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i) {
+		unsigned long tce;
+		unsigned long ptce = tces + i * sizeof(unsigned long);
+
+		if (get_user(tce, (unsigned long __user *)ptce))
+			break;
+
+		if (kvmppc_emulated_h_put_tce(tt,
+				ioba + (i << IOMMU_PAGE_SHIFT), tce))
+			break;
 	}
+	if (i == npages)
+		return H_SUCCESS;
+
+	/* Failed, do cleanup */
+	do {
+		--i;
+		kvmppc_emulated_h_put_tce(tt, ioba + (i << IOMMU_PAGE_SHIFT),
+				0);
+	} while (i);
+
+	return H_PARAMETER;
+}
 
-	/* Didn't find the liobn, punt it to userspace */
-	return H_TOO_HARD;
+long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	struct kvmppc_spapr_tce_table *tt;
+	long i;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, put it to virtual space */
+	if (!tt)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
+		kvmppc_emulated_h_put_tce(tt, ioba, tce_value);
+
+	return H_SUCCESS;
 }
+
+#endif /* CONFIG_KVM_BOOK3S_64_HV */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index e5afdcb..6eb6f44 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -565,6 +565,29 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
 			ret = kvmppc_xics_hcall(vcpu, req);
 			break;
 		} /* fallthrough */
+	case H_PUT_TCE:
+		ret = kvmppc_virtmode_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
+						kvmppc_get_gpr(vcpu, 5),
+						kvmppc_get_gpr(vcpu, 6));
+		if (ret == H_TOO_HARD)
+			return RESUME_HOST;
+		break;
+	case H_PUT_TCE_INDIRECT:
+		ret = kvmppc_virtmode_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4),
+						kvmppc_get_gpr(vcpu, 5),
+						kvmppc_get_gpr(vcpu, 6),
+						kvmppc_get_gpr(vcpu, 7));
+		if (ret == H_TOO_HARD)
+			return RESUME_HOST;
+		break;
+	case H_STUFF_TCE:
+		ret = kvmppc_virtmode_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
+						kvmppc_get_gpr(vcpu, 5),
+						kvmppc_get_gpr(vcpu, 6),
+						kvmppc_get_gpr(vcpu, 7));
+		if (ret == H_TOO_HARD)
+			return RESUME_HOST;
+		break;
 	default:
 		return RESUME_HOST;
 	}
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index d3e26be..4d73406 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1472,6 +1472,12 @@ hcall_real_table:
 	.long	0		/* 0x11c */
 	.long	0		/* 0x120 */
 	.long	.kvmppc_h_bulk_remove - hcall_real_table
+	.long	0		/* 0x128 */
+	.long	0		/* 0x12c */
+	.long	0		/* 0x130 */
+	.long	0		/* 0x134 */
+	.long	.kvmppc_h_stuff_tce - hcall_real_table
+	.long	.kvmppc_h_put_tce_indirect - hcall_real_table
 hcall_real_table_end:
 
 ignore_hdec:
diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c
index 8352cac..603e4f2 100644
--- a/arch/powerpc/kvm/book3s_pr_papr.c
+++ b/arch/powerpc/kvm/book3s_pr_papr.c
@@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu)
 	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
 	long rc;
 
-	rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
+	rc = kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce);
+	if (rc == H_TOO_HARD)
+		return EMULATE_FAIL;
+	kvmppc_set_gpr(vcpu, 3, rc);
+	return EMULATE_DONE;
+}
+
+static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu)
+{
+	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
+	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
+	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
+	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
+	long rc;
+
+	rc = kvmppc_virtmode_h_put_tce_indirect(vcpu, liobn, ioba,
+			tce, npages);
+	if (rc == H_TOO_HARD)
+		return EMULATE_FAIL;
+	kvmppc_set_gpr(vcpu, 3, rc);
+	return EMULATE_DONE;
+}
+
+static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu)
+{
+	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
+	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
+	unsigned long tce_value = kvmppc_get_gpr(vcpu, 6);
+	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
+	long rc;
+
+	rc = kvmppc_virtmode_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages);
 	if (rc == H_TOO_HARD)
 		return EMULATE_FAIL;
 	kvmppc_set_gpr(vcpu, 3, rc);
@@ -249,6 +280,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd)
 		return kvmppc_h_pr_bulk_remove(vcpu);
 	case H_PUT_TCE:
 		return kvmppc_h_pr_put_tce(vcpu);
+	case H_PUT_TCE_INDIRECT:
+		return kvmppc_h_pr_put_tce_indirect(vcpu);
+	case H_STUFF_TCE:
+		return kvmppc_h_pr_stuff_tce(vcpu);
 	case H_CEDE:
 		vcpu->arch.shared->msr |= MSR_EE;
 		kvm_vcpu_block(vcpu);
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index f9c159e..b7ad589 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -384,6 +384,9 @@ int kvm_dev_ioctl_check_extension(long ext)
 		r = 1;
 		break;
 #endif
+	case KVM_CAP_SPAPR_MULTITCE:
+		r = 1;
+		break;
 	default:
 		r = 0;
 		break;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 6f49c87..6c04da1 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -640,6 +640,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_HTAB_FD 84
 #define KVM_CAP_PPC_RTAS (0x100000 + 87)
 #define KVM_CAP_SPAPR_XICS (0x100000 + 88)
+#define KVM_CAP_SPAPR_MULTITCE (0x110000 + 89)
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 2/6] KVM: PPC: Add support for multiple-TCE hcalls
@ 2013-05-06  7:25   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 44+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-06  7:25 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kvm, Alexey Kardashevskiy, Alexander Graf, kvm-ppc,
	Alex Williamson, Paul Mackerras, Joerg Roedel, David Gibson

This adds real mode handlers for the H_PUT_TCE_INDIRECT and
H_STUFF_TCE hypercalls for QEMU emulated devices such as virtio
devices or emulated PCI.  These calls allow adding multiple entries
(up to 512) into the TCE table in one call which saves time on
transition to/from real mode.

This adds a guest physical to host real address converter
and calls the existing H_PUT_TCE handler. The converting function
is going to be fully utilized by upcoming VFIO supporting patches.

This also implements the KVM_CAP_PPC_MULTITCE capability,
so in order to support the functionality of this patch, QEMU
needs to query for this capability and set the "hcall-multi-tce"
hypertas property only if the capability is present, otherwise
there will be serious performance degradation.

Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 Documentation/virtual/kvm/api.txt       |   15 ++
 arch/powerpc/include/asm/kvm_ppc.h      |   15 +-
 arch/powerpc/kvm/book3s_64_vio.c        |  114 +++++++++++++++
 arch/powerpc/kvm/book3s_64_vio_hv.c     |  231 +++++++++++++++++++++++++++----
 arch/powerpc/kvm/book3s_hv.c            |   23 +++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |    6 +
 arch/powerpc/kvm/book3s_pr_papr.c       |   37 ++++-
 arch/powerpc/kvm/powerpc.c              |    3 +
 include/uapi/linux/kvm.h                |    1 +
 9 files changed, 413 insertions(+), 32 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index a4df553..f621cd6 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2463,3 +2463,18 @@ For mmu types KVM_MMU_FSL_BOOKE_NOHV and KVM_MMU_FSL_BOOKE_HV:
    where "num_sets" is the tlb_sizes[] value divided by the tlb_ways[] value.
  - The tsize field of mas1 shall be set to 4K on TLB0, even though the
    hardware ignores this value for TLB0.
+
+
+6.4 KVM_CAP_PPC_MULTITCE
+
+Architectures: ppc
+Parameters: none
+Returns: 0 on success; -1 on error
+
+This capability enables the guest to put/remove multiple TCE entries
+per hypercall which significanly accelerates DMA operations for PPC KVM
+guests.
+
+When this capability is enabled, H_PUT_TCE_INDIRECT and H_STUFF_TCE are
+expected to occur rather than H_PUT_TCE which supports only one TCE entry
+per call.
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 99da298..d501246 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -139,8 +139,19 @@ extern void kvmppc_xics_free(struct kvm *kvm);
 
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce *args);
-extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
-			     unsigned long ioba, unsigned long tce);
+extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
+		struct kvm_vcpu *vcpu, unsigned long liobn);
+extern long kvmppc_emulated_h_put_tce(struct kvmppc_spapr_tce_table *stt,
+		unsigned long ioba, unsigned long tce);
+extern long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce);
+extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_list, unsigned long npages);
+extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages);
 extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
 				struct kvm_allocate_rma *rma);
 extern struct kvmppc_linear_info *kvm_alloc_rma(void);
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 72ffc89..643ac1e 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -14,6 +14,7 @@
  *
  * Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
  * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com>
+ * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com>
  */
 
 #include <linux/types.h>
@@ -36,9 +37,14 @@
 #include <asm/ppc-opcode.h>
 #include <asm/kvm_host.h>
 #include <asm/udbg.h>
+#include <asm/iommu.h>
 
 #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
+#define ERROR_ADDR      (~(unsigned long)0x0)
 
+/*
+ * TCE tables handlers.
+ */
 static long kvmppc_stt_npages(unsigned long window_size)
 {
 	return ALIGN((window_size >> SPAPR_TCE_SHIFT)
@@ -148,3 +154,111 @@ fail:
 	}
 	return ret;
 }
+
+/*
+ * Virtual mode handling of IOMMU map/unmap.
+ */
+/* Converts guest physical address into host virtual */
+static unsigned long get_virt_address(struct kvm_vcpu *vcpu,
+		unsigned long gpa)
+{
+	unsigned long hva, gfn = gpa >> PAGE_SHIFT;
+	struct kvm_memory_slot *memslot;
+
+	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
+	if (!memslot)
+		return ERROR_ADDR;
+
+	/*
+	 * Convert gfn to hva preserving flags and an offset
+	 * within a system page
+	 */
+	hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa & ~PAGE_MASK);
+	return hva;
+}
+
+long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce)
+{
+	struct kvmppc_spapr_tce_table *tt;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, put it to userspace */
+	if (!tt)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	return kvmppc_emulated_h_put_tce(tt, ioba, tce);
+}
+
+long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_list, unsigned long npages)
+{
+	struct kvmppc_spapr_tce_table *tt;
+	long i;
+	unsigned long tces;
+
+	/* The whole table addressed by tce_list resides in 4K page */
+	if (npages > 512)
+		return H_PARAMETER;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, put it to userspace */
+	if (!tt)
+		return H_TOO_HARD;
+
+	tces = get_virt_address(vcpu, tce_list);
+	if (tces == ERROR_ADDR)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i) {
+		unsigned long tce;
+		unsigned long ptce = tces + i * sizeof(unsigned long);
+
+		if (get_user(tce, (unsigned long __user *)ptce))
+			break;
+
+		if (kvmppc_emulated_h_put_tce(tt,
+				ioba + (i << IOMMU_PAGE_SHIFT),	tce))
+			break;
+	}
+	if (i == npages)
+		return H_SUCCESS;
+
+	/* Failed, do cleanup */
+	do {
+		--i;
+		kvmppc_emulated_h_put_tce(tt, ioba + (i << IOMMU_PAGE_SHIFT),
+				0);
+	} while (i);
+
+	return H_PARAMETER;
+}
+
+long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	struct kvmppc_spapr_tce_table *tt;
+	long i;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, put it to userspace */
+	if (!tt)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
+		kvmppc_emulated_h_put_tce(tt, ioba, tce_value);
+
+	return H_SUCCESS;
+}
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 30c2f3b..55fdf7a 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -14,6 +14,7 @@
  *
  * Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
  * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com>
+ * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com>
  */
 
 #include <linux/types.h>
@@ -35,42 +36,214 @@
 #include <asm/ppc-opcode.h>
 #include <asm/kvm_host.h>
 #include <asm/udbg.h>
+#include <asm/iommu.h>
+#include <asm/tce.h>
 
 #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
+#define ERROR_ADDR      (~(unsigned long)0x0)
 
-/* WARNING: This will be called in real-mode on HV KVM and virtual
- *          mode on PR KVM
+/*
+ * Finds a TCE table descriptor by LIOBN.
  */
+struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm_vcpu *vcpu,
+		unsigned long liobn)
+{
+	struct kvmppc_spapr_tce_table *tt;
+
+	list_for_each_entry(tt, &vcpu->kvm->arch.spapr_tce_tables, list) {
+		if (tt->liobn == liobn)
+			return tt;
+	}
+
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(kvmppc_find_tce_table);
+
+/*
+ * kvmppc_emulated_h_put_tce() handles TCE requests for devices emulated
+ * by QEMU. It puts guest TCE values into the table and expects
+ * the QEMU to convert them later in the QEMU device implementation.
+ * Works in both real and virtual modes.
+ */
+long kvmppc_emulated_h_put_tce(struct kvmppc_spapr_tce_table *tt,
+		unsigned long ioba, unsigned long tce)
+{
+	unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
+	struct page *page;
+	u64 *tbl;
+
+	/* udbg_printf("H_PUT_TCE: liobn 0x%lx => tt=%p  window_size=0x%x\n", */
+	/*	    liobn, tt, tt->window_size); */
+	if (ioba >= tt->window_size) {
+		/* pr_err("%s failed on ioba=%lx\n", __func__, ioba); */
+		return H_PARAMETER;
+	}
+	/*
+	 * Note on the use of page_address() in real mode,
+	 *
+	 * It is safe to use page_address() in real mode on ppc64 because
+	 * page_address() is always defined as lowmem_page_address()
+	 * which returns __va(PFN_PHYS(page_to_pfn(page))) which is arithmetial
+	 * operation and does not access page struct.
+	 *
+	 * Theoretically page_address() could be defined different
+	 * but either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL
+	 * should be enabled.
+	 * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64,
+	 * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only
+	 * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP
+	 * is not expected to be enabled on ppc32, page_address()
+	 * is safe for ppc32 as well.
+	 */
+#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
+#error TODO: fix to avoid page_address() here
+#endif
+	page = tt->pages[idx / TCES_PER_PAGE];
+	tbl = (u64 *)page_address(page);
+
+	/*
+	 * Validate TCE address.
+	 * At the moment only flags are validated
+	 * as other check will significantly slow down
+	 * or can make it even impossible to handle TCE requests
+	 * in real mode.
+	 */
+	if (tce & ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ))
+		return H_PARAMETER;
+
+	/* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
+	tbl[idx % TCES_PER_PAGE] = tce;
+
+	return H_SUCCESS;
+}
+EXPORT_SYMBOL_GPL(kvmppc_emulated_h_put_tce);
+
+#ifdef CONFIG_KVM_BOOK3S_64_HV
+/*
+ * Converts guest physical address into host real address.
+ * Also returns pte and page size if the page is present in page table.
+ */
+static unsigned long get_real_address(struct kvm_vcpu *vcpu,
+		unsigned long gpa, bool writing,
+		pte_t *ptep, unsigned long *pg_sizep)
+{
+	struct kvm_memory_slot *memslot;
+	pte_t pte;
+	unsigned long hva, pg_size = 0, hwaddr, offset;
+	unsigned long gfn = gpa >> PAGE_SHIFT;
+
+	/* Find a KVM memslot */
+	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
+	if (!memslot)
+		return ERROR_ADDR;
+
+	/* Convert guest physical address to host virtual */
+	hva = __gfn_to_hva_memslot(memslot, gfn);
+
+	/* Find a PTE and determine the size */
+	pte = lookup_linux_pte(vcpu->arch.pgdir, hva,
+			writing, &pg_size);
+	if (!pte_present(pte))
+		return ERROR_ADDR;
+
+	/* Calculate host phys address keeping flags and offset in the page */
+	offset = gpa & (pg_size - 1);
+
+	/* pte_pfn(pte) should return an address aligned to pg_size */
+	hwaddr = (pte_pfn(pte) << PAGE_SHIFT) + offset;
+
+	/* Copy outer values if required */
+	if (pg_sizep)
+		*pg_sizep = pg_size;
+	if (ptep)
+		*ptep = pte;
+
+	return hwaddr;
+}
+
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba, unsigned long tce)
 {
-	struct kvm *kvm = vcpu->kvm;
-	struct kvmppc_spapr_tce_table *stt;
-
-	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
-	/* 	    liobn, ioba, tce); */
-
-	list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
-		if (stt->liobn == liobn) {
-			unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
-			struct page *page;
-			u64 *tbl;
-
-			/* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p  window_size=0x%x\n", */
-			/* 	    liobn, stt, stt->window_size); */
-			if (ioba >= stt->window_size)
-				return H_PARAMETER;
-
-			page = stt->pages[idx / TCES_PER_PAGE];
-			tbl = (u64 *)page_address(page);
-
-			/* FIXME: Need to validate the TCE itself */
-			/* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
-			tbl[idx % TCES_PER_PAGE] = tce;
-			return H_SUCCESS;
-		}
+	struct kvmppc_spapr_tce_table *tt;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, put it to virtual space */
+	if (!tt)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	return kvmppc_emulated_h_put_tce(tt, ioba, tce);
+}
+
+long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_list,	unsigned long npages)
+{
+	struct kvmppc_spapr_tce_table *tt;
+	long i;
+	unsigned long tces;
+
+	/* The whole table addressed by tce_list resides in 4K page */
+	if (npages > 512)
+		return H_PARAMETER;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, put it to virtual space */
+	if (!tt)
+		return H_TOO_HARD;
+
+	tces = get_real_address(vcpu, tce_list, false, NULL, NULL);
+	if (tces == ERROR_ADDR)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i) {
+		unsigned long tce;
+		unsigned long ptce = tces + i * sizeof(unsigned long);
+
+		if (get_user(tce, (unsigned long __user *)ptce))
+			break;
+
+		if (kvmppc_emulated_h_put_tce(tt,
+				ioba + (i << IOMMU_PAGE_SHIFT), tce))
+			break;
 	}
+	if (i == npages)
+		return H_SUCCESS;
+
+	/* Failed, do cleanup */
+	do {
+		--i;
+		kvmppc_emulated_h_put_tce(tt, ioba + (i << IOMMU_PAGE_SHIFT),
+				0);
+	} while (i);
+
+	return H_PARAMETER;
+}
 
-	/* Didn't find the liobn, punt it to userspace */
-	return H_TOO_HARD;
+long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	struct kvmppc_spapr_tce_table *tt;
+	long i;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, put it to virtual space */
+	if (!tt)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
+		kvmppc_emulated_h_put_tce(tt, ioba, tce_value);
+
+	return H_SUCCESS;
 }
+
+#endif /* CONFIG_KVM_BOOK3S_64_HV */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index e5afdcb..6eb6f44 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -565,6 +565,29 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
 			ret = kvmppc_xics_hcall(vcpu, req);
 			break;
 		} /* fallthrough */
+	case H_PUT_TCE:
+		ret = kvmppc_virtmode_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
+						kvmppc_get_gpr(vcpu, 5),
+						kvmppc_get_gpr(vcpu, 6));
+		if (ret == H_TOO_HARD)
+			return RESUME_HOST;
+		break;
+	case H_PUT_TCE_INDIRECT:
+		ret = kvmppc_virtmode_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4),
+						kvmppc_get_gpr(vcpu, 5),
+						kvmppc_get_gpr(vcpu, 6),
+						kvmppc_get_gpr(vcpu, 7));
+		if (ret == H_TOO_HARD)
+			return RESUME_HOST;
+		break;
+	case H_STUFF_TCE:
+		ret = kvmppc_virtmode_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
+						kvmppc_get_gpr(vcpu, 5),
+						kvmppc_get_gpr(vcpu, 6),
+						kvmppc_get_gpr(vcpu, 7));
+		if (ret == H_TOO_HARD)
+			return RESUME_HOST;
+		break;
 	default:
 		return RESUME_HOST;
 	}
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index d3e26be..4d73406 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1472,6 +1472,12 @@ hcall_real_table:
 	.long	0		/* 0x11c */
 	.long	0		/* 0x120 */
 	.long	.kvmppc_h_bulk_remove - hcall_real_table
+	.long	0		/* 0x128 */
+	.long	0		/* 0x12c */
+	.long	0		/* 0x130 */
+	.long	0		/* 0x134 */
+	.long	.kvmppc_h_stuff_tce - hcall_real_table
+	.long	.kvmppc_h_put_tce_indirect - hcall_real_table
 hcall_real_table_end:
 
 ignore_hdec:
diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c
index 8352cac..603e4f2 100644
--- a/arch/powerpc/kvm/book3s_pr_papr.c
+++ b/arch/powerpc/kvm/book3s_pr_papr.c
@@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu)
 	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
 	long rc;
 
-	rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
+	rc = kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce);
+	if (rc == H_TOO_HARD)
+		return EMULATE_FAIL;
+	kvmppc_set_gpr(vcpu, 3, rc);
+	return EMULATE_DONE;
+}
+
+static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu)
+{
+	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
+	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
+	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
+	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
+	long rc;
+
+	rc = kvmppc_virtmode_h_put_tce_indirect(vcpu, liobn, ioba,
+			tce, npages);
+	if (rc == H_TOO_HARD)
+		return EMULATE_FAIL;
+	kvmppc_set_gpr(vcpu, 3, rc);
+	return EMULATE_DONE;
+}
+
+static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu)
+{
+	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
+	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
+	unsigned long tce_value = kvmppc_get_gpr(vcpu, 6);
+	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
+	long rc;
+
+	rc = kvmppc_virtmode_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages);
 	if (rc == H_TOO_HARD)
 		return EMULATE_FAIL;
 	kvmppc_set_gpr(vcpu, 3, rc);
@@ -249,6 +280,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd)
 		return kvmppc_h_pr_bulk_remove(vcpu);
 	case H_PUT_TCE:
 		return kvmppc_h_pr_put_tce(vcpu);
+	case H_PUT_TCE_INDIRECT:
+		return kvmppc_h_pr_put_tce_indirect(vcpu);
+	case H_STUFF_TCE:
+		return kvmppc_h_pr_stuff_tce(vcpu);
 	case H_CEDE:
 		vcpu->arch.shared->msr |= MSR_EE;
 		kvm_vcpu_block(vcpu);
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index f9c159e..b7ad589 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -384,6 +384,9 @@ int kvm_dev_ioctl_check_extension(long ext)
 		r = 1;
 		break;
 #endif
+	case KVM_CAP_SPAPR_MULTITCE:
+		r = 1;
+		break;
 	default:
 		r = 0;
 		break;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 6f49c87..6c04da1 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -640,6 +640,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_HTAB_FD 84
 #define KVM_CAP_PPC_RTAS (0x100000 + 87)
 #define KVM_CAP_SPAPR_XICS (0x100000 + 88)
+#define KVM_CAP_SPAPR_MULTITCE (0x110000 + 89)
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 2/6] KVM: PPC: Add support for multiple-TCE hcalls
@ 2013-05-06  7:25   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 44+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-06  7:25 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, Benjamin Herrenschmidt,
	Alexander Graf, Paul Mackerras, Joerg Roedel, Alex Williamson,
	kvm, kvm-ppc

This adds real mode handlers for the H_PUT_TCE_INDIRECT and
H_STUFF_TCE hypercalls for QEMU emulated devices such as virtio
devices or emulated PCI.  These calls allow adding multiple entries
(up to 512) into the TCE table in one call which saves time on
transition to/from real mode.

This adds a guest physical to host real address converter
and calls the existing H_PUT_TCE handler. The converting function
is going to be fully utilized by upcoming VFIO supporting patches.

This also implements the KVM_CAP_PPC_MULTITCE capability,
so in order to support the functionality of this patch, QEMU
needs to query for this capability and set the "hcall-multi-tce"
hypertas property only if the capability is present, otherwise
there will be serious performance degradation.

Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 Documentation/virtual/kvm/api.txt       |   15 ++
 arch/powerpc/include/asm/kvm_ppc.h      |   15 +-
 arch/powerpc/kvm/book3s_64_vio.c        |  114 +++++++++++++++
 arch/powerpc/kvm/book3s_64_vio_hv.c     |  231 +++++++++++++++++++++++++++----
 arch/powerpc/kvm/book3s_hv.c            |   23 +++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |    6 +
 arch/powerpc/kvm/book3s_pr_papr.c       |   37 ++++-
 arch/powerpc/kvm/powerpc.c              |    3 +
 include/uapi/linux/kvm.h                |    1 +
 9 files changed, 413 insertions(+), 32 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index a4df553..f621cd6 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2463,3 +2463,18 @@ For mmu types KVM_MMU_FSL_BOOKE_NOHV and KVM_MMU_FSL_BOOKE_HV:
    where "num_sets" is the tlb_sizes[] value divided by the tlb_ways[] value.
  - The tsize field of mas1 shall be set to 4K on TLB0, even though the
    hardware ignores this value for TLB0.
+
+
+6.4 KVM_CAP_PPC_MULTITCE
+
+Architectures: ppc
+Parameters: none
+Returns: 0 on success; -1 on error
+
+This capability enables the guest to put/remove multiple TCE entries
+per hypercall which significanly accelerates DMA operations for PPC KVM
+guests.
+
+When this capability is enabled, H_PUT_TCE_INDIRECT and H_STUFF_TCE are
+expected to occur rather than H_PUT_TCE which supports only one TCE entry
+per call.
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 99da298..d501246 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -139,8 +139,19 @@ extern void kvmppc_xics_free(struct kvm *kvm);
 
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce *args);
-extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
-			     unsigned long ioba, unsigned long tce);
+extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
+		struct kvm_vcpu *vcpu, unsigned long liobn);
+extern long kvmppc_emulated_h_put_tce(struct kvmppc_spapr_tce_table *stt,
+		unsigned long ioba, unsigned long tce);
+extern long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce);
+extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_list, unsigned long npages);
+extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages);
 extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
 				struct kvm_allocate_rma *rma);
 extern struct kvmppc_linear_info *kvm_alloc_rma(void);
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 72ffc89..643ac1e 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -14,6 +14,7 @@
  *
  * Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
  * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com>
+ * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com>
  */
 
 #include <linux/types.h>
@@ -36,9 +37,14 @@
 #include <asm/ppc-opcode.h>
 #include <asm/kvm_host.h>
 #include <asm/udbg.h>
+#include <asm/iommu.h>
 
 #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
+#define ERROR_ADDR      (~(unsigned long)0x0)
 
+/*
+ * TCE tables handlers.
+ */
 static long kvmppc_stt_npages(unsigned long window_size)
 {
 	return ALIGN((window_size >> SPAPR_TCE_SHIFT)
@@ -148,3 +154,111 @@ fail:
 	}
 	return ret;
 }
+
+/*
+ * Virtual mode handling of IOMMU map/unmap.
+ */
+/* Converts guest physical address into host virtual */
+static unsigned long get_virt_address(struct kvm_vcpu *vcpu,
+		unsigned long gpa)
+{
+	unsigned long hva, gfn = gpa >> PAGE_SHIFT;
+	struct kvm_memory_slot *memslot;
+
+	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
+	if (!memslot)
+		return ERROR_ADDR;
+
+	/*
+	 * Convert gfn to hva preserving flags and an offset
+	 * within a system page
+	 */
+	hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa & ~PAGE_MASK);
+	return hva;
+}
+
+long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce)
+{
+	struct kvmppc_spapr_tce_table *tt;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, put it to userspace */
+	if (!tt)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	return kvmppc_emulated_h_put_tce(tt, ioba, tce);
+}
+
+long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_list, unsigned long npages)
+{
+	struct kvmppc_spapr_tce_table *tt;
+	long i;
+	unsigned long tces;
+
+	/* The whole table addressed by tce_list resides in 4K page */
+	if (npages > 512)
+		return H_PARAMETER;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, put it to userspace */
+	if (!tt)
+		return H_TOO_HARD;
+
+	tces = get_virt_address(vcpu, tce_list);
+	if (tces = ERROR_ADDR)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i) {
+		unsigned long tce;
+		unsigned long ptce = tces + i * sizeof(unsigned long);
+
+		if (get_user(tce, (unsigned long __user *)ptce))
+			break;
+
+		if (kvmppc_emulated_h_put_tce(tt,
+				ioba + (i << IOMMU_PAGE_SHIFT),	tce))
+			break;
+	}
+	if (i = npages)
+		return H_SUCCESS;
+
+	/* Failed, do cleanup */
+	do {
+		--i;
+		kvmppc_emulated_h_put_tce(tt, ioba + (i << IOMMU_PAGE_SHIFT),
+				0);
+	} while (i);
+
+	return H_PARAMETER;
+}
+
+long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	struct kvmppc_spapr_tce_table *tt;
+	long i;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, put it to userspace */
+	if (!tt)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
+		kvmppc_emulated_h_put_tce(tt, ioba, tce_value);
+
+	return H_SUCCESS;
+}
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 30c2f3b..55fdf7a 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -14,6 +14,7 @@
  *
  * Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
  * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com>
+ * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com>
  */
 
 #include <linux/types.h>
@@ -35,42 +36,214 @@
 #include <asm/ppc-opcode.h>
 #include <asm/kvm_host.h>
 #include <asm/udbg.h>
+#include <asm/iommu.h>
+#include <asm/tce.h>
 
 #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
+#define ERROR_ADDR      (~(unsigned long)0x0)
 
-/* WARNING: This will be called in real-mode on HV KVM and virtual
- *          mode on PR KVM
+/*
+ * Finds a TCE table descriptor by LIOBN.
  */
+struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm_vcpu *vcpu,
+		unsigned long liobn)
+{
+	struct kvmppc_spapr_tce_table *tt;
+
+	list_for_each_entry(tt, &vcpu->kvm->arch.spapr_tce_tables, list) {
+		if (tt->liobn = liobn)
+			return tt;
+	}
+
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(kvmppc_find_tce_table);
+
+/*
+ * kvmppc_emulated_h_put_tce() handles TCE requests for devices emulated
+ * by QEMU. It puts guest TCE values into the table and expects
+ * the QEMU to convert them later in the QEMU device implementation.
+ * Works in both real and virtual modes.
+ */
+long kvmppc_emulated_h_put_tce(struct kvmppc_spapr_tce_table *tt,
+		unsigned long ioba, unsigned long tce)
+{
+	unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
+	struct page *page;
+	u64 *tbl;
+
+	/* udbg_printf("H_PUT_TCE: liobn 0x%lx => tt=%p  window_size=0x%x\n", */
+	/*	    liobn, tt, tt->window_size); */
+	if (ioba >= tt->window_size) {
+		/* pr_err("%s failed on ioba=%lx\n", __func__, ioba); */
+		return H_PARAMETER;
+	}
+	/*
+	 * Note on the use of page_address() in real mode,
+	 *
+	 * It is safe to use page_address() in real mode on ppc64 because
+	 * page_address() is always defined as lowmem_page_address()
+	 * which returns __va(PFN_PHYS(page_to_pfn(page))) which is arithmetial
+	 * operation and does not access page struct.
+	 *
+	 * Theoretically page_address() could be defined different
+	 * but either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL
+	 * should be enabled.
+	 * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64,
+	 * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only
+	 * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP
+	 * is not expected to be enabled on ppc32, page_address()
+	 * is safe for ppc32 as well.
+	 */
+#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
+#error TODO: fix to avoid page_address() here
+#endif
+	page = tt->pages[idx / TCES_PER_PAGE];
+	tbl = (u64 *)page_address(page);
+
+	/*
+	 * Validate TCE address.
+	 * At the moment only flags are validated
+	 * as other check will significantly slow down
+	 * or can make it even impossible to handle TCE requests
+	 * in real mode.
+	 */
+	if (tce & ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ))
+		return H_PARAMETER;
+
+	/* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
+	tbl[idx % TCES_PER_PAGE] = tce;
+
+	return H_SUCCESS;
+}
+EXPORT_SYMBOL_GPL(kvmppc_emulated_h_put_tce);
+
+#ifdef CONFIG_KVM_BOOK3S_64_HV
+/*
+ * Converts guest physical address into host real address.
+ * Also returns pte and page size if the page is present in page table.
+ */
+static unsigned long get_real_address(struct kvm_vcpu *vcpu,
+		unsigned long gpa, bool writing,
+		pte_t *ptep, unsigned long *pg_sizep)
+{
+	struct kvm_memory_slot *memslot;
+	pte_t pte;
+	unsigned long hva, pg_size = 0, hwaddr, offset;
+	unsigned long gfn = gpa >> PAGE_SHIFT;
+
+	/* Find a KVM memslot */
+	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
+	if (!memslot)
+		return ERROR_ADDR;
+
+	/* Convert guest physical address to host virtual */
+	hva = __gfn_to_hva_memslot(memslot, gfn);
+
+	/* Find a PTE and determine the size */
+	pte = lookup_linux_pte(vcpu->arch.pgdir, hva,
+			writing, &pg_size);
+	if (!pte_present(pte))
+		return ERROR_ADDR;
+
+	/* Calculate host phys address keeping flags and offset in the page */
+	offset = gpa & (pg_size - 1);
+
+	/* pte_pfn(pte) should return an address aligned to pg_size */
+	hwaddr = (pte_pfn(pte) << PAGE_SHIFT) + offset;
+
+	/* Copy outer values if required */
+	if (pg_sizep)
+		*pg_sizep = pg_size;
+	if (ptep)
+		*ptep = pte;
+
+	return hwaddr;
+}
+
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba, unsigned long tce)
 {
-	struct kvm *kvm = vcpu->kvm;
-	struct kvmppc_spapr_tce_table *stt;
-
-	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
-	/* 	    liobn, ioba, tce); */
-
-	list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
-		if (stt->liobn = liobn) {
-			unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
-			struct page *page;
-			u64 *tbl;
-
-			/* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p  window_size=0x%x\n", */
-			/* 	    liobn, stt, stt->window_size); */
-			if (ioba >= stt->window_size)
-				return H_PARAMETER;
-
-			page = stt->pages[idx / TCES_PER_PAGE];
-			tbl = (u64 *)page_address(page);
-
-			/* FIXME: Need to validate the TCE itself */
-			/* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
-			tbl[idx % TCES_PER_PAGE] = tce;
-			return H_SUCCESS;
-		}
+	struct kvmppc_spapr_tce_table *tt;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, put it to virtual space */
+	if (!tt)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	return kvmppc_emulated_h_put_tce(tt, ioba, tce);
+}
+
+long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_list,	unsigned long npages)
+{
+	struct kvmppc_spapr_tce_table *tt;
+	long i;
+	unsigned long tces;
+
+	/* The whole table addressed by tce_list resides in 4K page */
+	if (npages > 512)
+		return H_PARAMETER;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, put it to virtual space */
+	if (!tt)
+		return H_TOO_HARD;
+
+	tces = get_real_address(vcpu, tce_list, false, NULL, NULL);
+	if (tces = ERROR_ADDR)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i) {
+		unsigned long tce;
+		unsigned long ptce = tces + i * sizeof(unsigned long);
+
+		if (get_user(tce, (unsigned long __user *)ptce))
+			break;
+
+		if (kvmppc_emulated_h_put_tce(tt,
+				ioba + (i << IOMMU_PAGE_SHIFT), tce))
+			break;
 	}
+	if (i = npages)
+		return H_SUCCESS;
+
+	/* Failed, do cleanup */
+	do {
+		--i;
+		kvmppc_emulated_h_put_tce(tt, ioba + (i << IOMMU_PAGE_SHIFT),
+				0);
+	} while (i);
+
+	return H_PARAMETER;
+}
 
-	/* Didn't find the liobn, punt it to userspace */
-	return H_TOO_HARD;
+long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	struct kvmppc_spapr_tce_table *tt;
+	long i;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, put it to virtual space */
+	if (!tt)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
+		kvmppc_emulated_h_put_tce(tt, ioba, tce_value);
+
+	return H_SUCCESS;
 }
+
+#endif /* CONFIG_KVM_BOOK3S_64_HV */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index e5afdcb..6eb6f44 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -565,6 +565,29 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
 			ret = kvmppc_xics_hcall(vcpu, req);
 			break;
 		} /* fallthrough */
+	case H_PUT_TCE:
+		ret = kvmppc_virtmode_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
+						kvmppc_get_gpr(vcpu, 5),
+						kvmppc_get_gpr(vcpu, 6));
+		if (ret = H_TOO_HARD)
+			return RESUME_HOST;
+		break;
+	case H_PUT_TCE_INDIRECT:
+		ret = kvmppc_virtmode_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4),
+						kvmppc_get_gpr(vcpu, 5),
+						kvmppc_get_gpr(vcpu, 6),
+						kvmppc_get_gpr(vcpu, 7));
+		if (ret = H_TOO_HARD)
+			return RESUME_HOST;
+		break;
+	case H_STUFF_TCE:
+		ret = kvmppc_virtmode_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
+						kvmppc_get_gpr(vcpu, 5),
+						kvmppc_get_gpr(vcpu, 6),
+						kvmppc_get_gpr(vcpu, 7));
+		if (ret = H_TOO_HARD)
+			return RESUME_HOST;
+		break;
 	default:
 		return RESUME_HOST;
 	}
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index d3e26be..4d73406 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1472,6 +1472,12 @@ hcall_real_table:
 	.long	0		/* 0x11c */
 	.long	0		/* 0x120 */
 	.long	.kvmppc_h_bulk_remove - hcall_real_table
+	.long	0		/* 0x128 */
+	.long	0		/* 0x12c */
+	.long	0		/* 0x130 */
+	.long	0		/* 0x134 */
+	.long	.kvmppc_h_stuff_tce - hcall_real_table
+	.long	.kvmppc_h_put_tce_indirect - hcall_real_table
 hcall_real_table_end:
 
 ignore_hdec:
diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c
index 8352cac..603e4f2 100644
--- a/arch/powerpc/kvm/book3s_pr_papr.c
+++ b/arch/powerpc/kvm/book3s_pr_papr.c
@@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu)
 	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
 	long rc;
 
-	rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
+	rc = kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce);
+	if (rc = H_TOO_HARD)
+		return EMULATE_FAIL;
+	kvmppc_set_gpr(vcpu, 3, rc);
+	return EMULATE_DONE;
+}
+
+static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu)
+{
+	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
+	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
+	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
+	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
+	long rc;
+
+	rc = kvmppc_virtmode_h_put_tce_indirect(vcpu, liobn, ioba,
+			tce, npages);
+	if (rc = H_TOO_HARD)
+		return EMULATE_FAIL;
+	kvmppc_set_gpr(vcpu, 3, rc);
+	return EMULATE_DONE;
+}
+
+static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu)
+{
+	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
+	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
+	unsigned long tce_value = kvmppc_get_gpr(vcpu, 6);
+	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
+	long rc;
+
+	rc = kvmppc_virtmode_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages);
 	if (rc = H_TOO_HARD)
 		return EMULATE_FAIL;
 	kvmppc_set_gpr(vcpu, 3, rc);
@@ -249,6 +280,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd)
 		return kvmppc_h_pr_bulk_remove(vcpu);
 	case H_PUT_TCE:
 		return kvmppc_h_pr_put_tce(vcpu);
+	case H_PUT_TCE_INDIRECT:
+		return kvmppc_h_pr_put_tce_indirect(vcpu);
+	case H_STUFF_TCE:
+		return kvmppc_h_pr_stuff_tce(vcpu);
 	case H_CEDE:
 		vcpu->arch.shared->msr |= MSR_EE;
 		kvm_vcpu_block(vcpu);
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index f9c159e..b7ad589 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -384,6 +384,9 @@ int kvm_dev_ioctl_check_extension(long ext)
 		r = 1;
 		break;
 #endif
+	case KVM_CAP_SPAPR_MULTITCE:
+		r = 1;
+		break;
 	default:
 		r = 0;
 		break;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 6f49c87..6c04da1 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -640,6 +640,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_HTAB_FD 84
 #define KVM_CAP_PPC_RTAS (0x100000 + 87)
 #define KVM_CAP_SPAPR_XICS (0x100000 + 88)
+#define KVM_CAP_SPAPR_MULTITCE (0x110000 + 89)
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 3/6] powerpc: Prepare to support kernel handling of IOMMU map/unmap
  2013-05-06  7:25 ` Alexey Kardashevskiy
  (?)
@ 2013-05-06  7:25   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 44+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-06  7:25 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, Benjamin Herrenschmidt,
	Alexander Graf, Paul Mackerras, Joerg Roedel, Alex Williamson,
	kvm, kvm-ppc

The current VFIO-on-POWER implementation supports only user mode
driven mapping, i.e. QEMU is sending requests to map/unmap pages.
However this approach is really slow, so we want to move that to KVM.
Since H_PUT_TCE can be extremely performance sensitive (especially with
network adapters where each packet needs to be mapped/unmapped) we chose
to implement that as a "fast" hypercall directly in "real
mode" (processor still in the guest context but MMU off).

To be able to do that, we need to provide some facilities to
access the struct page count within that real mode environment as things
like the sparsemem vmemmap mappings aren't accessible.

This adds an API to increment/decrement page counter as
get_user_pages API used for user mode mapping does not work
in the real mode.

CONFIG_SPARSEMEM_VMEMMAP and CONFIG_FLATMEM are supported.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Paul Mackerras <paulus@samba.org>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 arch/powerpc/include/asm/pgtable-ppc64.h |    4 ++
 arch/powerpc/mm/init_64.c                |   77 +++++++++++++++++++++++++++++-
 2 files changed, 80 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index 0182c20..4c56ede 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -377,6 +377,10 @@ static inline pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
 }
 #endif /* !CONFIG_HUGETLB_PAGE */
 
+struct page *realmode_pfn_to_page(unsigned long pfn);
+int realmode_get_page(struct page *page);
+int realmode_put_page(struct page *page);
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* _ASM_POWERPC_PGTABLE_PPC64_H_ */
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 95a4529..838b8ae 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -297,5 +297,80 @@ int __meminit vmemmap_populate(struct page *start_page,
 
 	return 0;
 }
-#endif /* CONFIG_SPARSEMEM_VMEMMAP */
 
+/*
+ * We do not have access to the sparsemem vmemmap, so we fallback to
+ * walking the list of sparsemem blocks which we already maintain for
+ * the sake of crashdump. In the long run, we might want to maintain
+ * a tree if performance of that linear walk becomes a problem.
+ *
+ * Any of realmode_XXXX functions can fail due to:
+ * 1) As real sparsemem blocks do not lay in RAM continously (they
+ * are in virtual address space which is not available in the real mode),
+ * the requested page struct can be split between blocks so get_page/put_page
+ * may fail.
+ * 2) When huge pages are used, the get_page/put_page API will fail
+ * in real mode as the linked addresses in the page struct are virtual
+ * too.
+ * When 1) or 2) takes place, the API returns an error code to cause
+ * an exit to kernel virtual mode where the operation will be completed.
+ */
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+	struct vmemmap_backing *vmem_back;
+	struct page *page;
+	unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;
+	unsigned long pg_va = (unsigned long) pfn_to_page(pfn);
+
+	for (vmem_back = vmemmap_list; vmem_back; vmem_back = vmem_back->list) {
+		if (pg_va < vmem_back->virt_addr)
+			continue;
+
+		/* Check that page struct is not split between real pages */
+		if ((pg_va + sizeof(struct page)) >
+				(vmem_back->virt_addr + page_size))
+			return NULL;
+
+		page = (struct page *) (vmem_back->phys + pg_va -
+				vmem_back->virt_addr);
+		return page;
+	}
+
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#elif defined(CONFIG_FLATMEM)
+
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+	struct page *page = pfn_to_page(pfn);
+	return page;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#endif /* CONFIG_SPARSEMEM_VMEMMAP/CONFIG_FLATMEM */
+
+#if defined(CONFIG_SPARSEMEM_VMEMMAP) || defined(CONFIG_FLATMEM)
+int realmode_get_page(struct page *page)
+{
+	if (PageTail(page))
+		return -EAGAIN;
+
+	get_page(page);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(realmode_get_page);
+
+int realmode_put_page(struct page *page)
+{
+	if (PageCompound(page))
+		return -EAGAIN;
+
+	put_page(page);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(realmode_put_page);
+#endif
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 3/6] powerpc: Prepare to support kernel handling of IOMMU map/unmap
@ 2013-05-06  7:25   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 44+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-06  7:25 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kvm, Alexey Kardashevskiy, Alexander Graf, kvm-ppc,
	Alex Williamson, Paul Mackerras, Joerg Roedel, David Gibson

The current VFIO-on-POWER implementation supports only user mode
driven mapping, i.e. QEMU is sending requests to map/unmap pages.
However this approach is really slow, so we want to move that to KVM.
Since H_PUT_TCE can be extremely performance sensitive (especially with
network adapters where each packet needs to be mapped/unmapped) we chose
to implement that as a "fast" hypercall directly in "real
mode" (processor still in the guest context but MMU off).

To be able to do that, we need to provide some facilities to
access the struct page count within that real mode environment as things
like the sparsemem vmemmap mappings aren't accessible.

This adds an API to increment/decrement page counter as
get_user_pages API used for user mode mapping does not work
in the real mode.

CONFIG_SPARSEMEM_VMEMMAP and CONFIG_FLATMEM are supported.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Paul Mackerras <paulus@samba.org>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 arch/powerpc/include/asm/pgtable-ppc64.h |    4 ++
 arch/powerpc/mm/init_64.c                |   77 +++++++++++++++++++++++++++++-
 2 files changed, 80 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index 0182c20..4c56ede 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -377,6 +377,10 @@ static inline pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
 }
 #endif /* !CONFIG_HUGETLB_PAGE */
 
+struct page *realmode_pfn_to_page(unsigned long pfn);
+int realmode_get_page(struct page *page);
+int realmode_put_page(struct page *page);
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* _ASM_POWERPC_PGTABLE_PPC64_H_ */
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 95a4529..838b8ae 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -297,5 +297,80 @@ int __meminit vmemmap_populate(struct page *start_page,
 
 	return 0;
 }
-#endif /* CONFIG_SPARSEMEM_VMEMMAP */
 
+/*
+ * We do not have access to the sparsemem vmemmap, so we fallback to
+ * walking the list of sparsemem blocks which we already maintain for
+ * the sake of crashdump. In the long run, we might want to maintain
+ * a tree if performance of that linear walk becomes a problem.
+ *
+ * Any of realmode_XXXX functions can fail due to:
+ * 1) As real sparsemem blocks do not lay in RAM continously (they
+ * are in virtual address space which is not available in the real mode),
+ * the requested page struct can be split between blocks so get_page/put_page
+ * may fail.
+ * 2) When huge pages are used, the get_page/put_page API will fail
+ * in real mode as the linked addresses in the page struct are virtual
+ * too.
+ * When 1) or 2) takes place, the API returns an error code to cause
+ * an exit to kernel virtual mode where the operation will be completed.
+ */
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+	struct vmemmap_backing *vmem_back;
+	struct page *page;
+	unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;
+	unsigned long pg_va = (unsigned long) pfn_to_page(pfn);
+
+	for (vmem_back = vmemmap_list; vmem_back; vmem_back = vmem_back->list) {
+		if (pg_va < vmem_back->virt_addr)
+			continue;
+
+		/* Check that page struct is not split between real pages */
+		if ((pg_va + sizeof(struct page)) >
+				(vmem_back->virt_addr + page_size))
+			return NULL;
+
+		page = (struct page *) (vmem_back->phys + pg_va -
+				vmem_back->virt_addr);
+		return page;
+	}
+
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#elif defined(CONFIG_FLATMEM)
+
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+	struct page *page = pfn_to_page(pfn);
+	return page;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#endif /* CONFIG_SPARSEMEM_VMEMMAP/CONFIG_FLATMEM */
+
+#if defined(CONFIG_SPARSEMEM_VMEMMAP) || defined(CONFIG_FLATMEM)
+int realmode_get_page(struct page *page)
+{
+	if (PageTail(page))
+		return -EAGAIN;
+
+	get_page(page);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(realmode_get_page);
+
+int realmode_put_page(struct page *page)
+{
+	if (PageCompound(page))
+		return -EAGAIN;
+
+	put_page(page);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(realmode_put_page);
+#endif
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 3/6] powerpc: Prepare to support kernel handling of IOMMU map/unmap
@ 2013-05-06  7:25   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 44+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-06  7:25 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, Benjamin Herrenschmidt,
	Alexander Graf, Paul Mackerras, Joerg Roedel, Alex Williamson,
	kvm, kvm-ppc

The current VFIO-on-POWER implementation supports only user mode
driven mapping, i.e. QEMU is sending requests to map/unmap pages.
However this approach is really slow, so we want to move that to KVM.
Since H_PUT_TCE can be extremely performance sensitive (especially with
network adapters where each packet needs to be mapped/unmapped) we chose
to implement that as a "fast" hypercall directly in "real
mode" (processor still in the guest context but MMU off).

To be able to do that, we need to provide some facilities to
access the struct page count within that real mode environment as things
like the sparsemem vmemmap mappings aren't accessible.

This adds an API to increment/decrement page counter as
get_user_pages API used for user mode mapping does not work
in the real mode.

CONFIG_SPARSEMEM_VMEMMAP and CONFIG_FLATMEM are supported.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Paul Mackerras <paulus@samba.org>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 arch/powerpc/include/asm/pgtable-ppc64.h |    4 ++
 arch/powerpc/mm/init_64.c                |   77 +++++++++++++++++++++++++++++-
 2 files changed, 80 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index 0182c20..4c56ede 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -377,6 +377,10 @@ static inline pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
 }
 #endif /* !CONFIG_HUGETLB_PAGE */
 
+struct page *realmode_pfn_to_page(unsigned long pfn);
+int realmode_get_page(struct page *page);
+int realmode_put_page(struct page *page);
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* _ASM_POWERPC_PGTABLE_PPC64_H_ */
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 95a4529..838b8ae 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -297,5 +297,80 @@ int __meminit vmemmap_populate(struct page *start_page,
 
 	return 0;
 }
-#endif /* CONFIG_SPARSEMEM_VMEMMAP */
 
+/*
+ * We do not have access to the sparsemem vmemmap, so we fallback to
+ * walking the list of sparsemem blocks which we already maintain for
+ * the sake of crashdump. In the long run, we might want to maintain
+ * a tree if performance of that linear walk becomes a problem.
+ *
+ * Any of realmode_XXXX functions can fail due to:
+ * 1) As real sparsemem blocks do not lay in RAM continously (they
+ * are in virtual address space which is not available in the real mode),
+ * the requested page struct can be split between blocks so get_page/put_page
+ * may fail.
+ * 2) When huge pages are used, the get_page/put_page API will fail
+ * in real mode as the linked addresses in the page struct are virtual
+ * too.
+ * When 1) or 2) takes place, the API returns an error code to cause
+ * an exit to kernel virtual mode where the operation will be completed.
+ */
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+	struct vmemmap_backing *vmem_back;
+	struct page *page;
+	unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;
+	unsigned long pg_va = (unsigned long) pfn_to_page(pfn);
+
+	for (vmem_back = vmemmap_list; vmem_back; vmem_back = vmem_back->list) {
+		if (pg_va < vmem_back->virt_addr)
+			continue;
+
+		/* Check that page struct is not split between real pages */
+		if ((pg_va + sizeof(struct page)) >
+				(vmem_back->virt_addr + page_size))
+			return NULL;
+
+		page = (struct page *) (vmem_back->phys + pg_va -
+				vmem_back->virt_addr);
+		return page;
+	}
+
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#elif defined(CONFIG_FLATMEM)
+
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+	struct page *page = pfn_to_page(pfn);
+	return page;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#endif /* CONFIG_SPARSEMEM_VMEMMAP/CONFIG_FLATMEM */
+
+#if defined(CONFIG_SPARSEMEM_VMEMMAP) || defined(CONFIG_FLATMEM)
+int realmode_get_page(struct page *page)
+{
+	if (PageTail(page))
+		return -EAGAIN;
+
+	get_page(page);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(realmode_get_page);
+
+int realmode_put_page(struct page *page)
+{
+	if (PageCompound(page))
+		return -EAGAIN;
+
+	put_page(page);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(realmode_put_page);
+#endif
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 4/6] iommu: Add a function to find an iommu group by id
  2013-05-06  7:25 ` Alexey Kardashevskiy
  (?)
@ 2013-05-06  7:25   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 44+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-06  7:25 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, Benjamin Herrenschmidt,
	Alexander Graf, Paul Mackerras, Joerg Roedel, Alex Williamson,
	kvm, kvm-ppc

As IOMMU groups are exposed to the user space by their numbers,
the user space can use them in various kernel APIs so the kernel
might need an API to find a group by its ID.

As an example, QEMU VFIO on PPC64 platform needs it to associate
a logical bus number (LIOBN) with a specific IOMMU group in order
to support in-kernel handling of DMA map/unmap requests.

This adds the iommu_group_get_by_id(id) function which performs
this search.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 drivers/iommu/iommu.c |   29 +++++++++++++++++++++++++++++
 include/linux/iommu.h |    1 +
 2 files changed, 30 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index ddbdaca..5514dfa 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -204,6 +204,35 @@ again:
 }
 EXPORT_SYMBOL_GPL(iommu_group_alloc);
 
+struct iommu_group *iommu_group_get_by_id(int id)
+{
+	struct kobject *group_kobj;
+	struct iommu_group *group;
+	const char *name;
+
+	if (!iommu_group_kset)
+		return NULL;
+
+	name = kasprintf(GFP_KERNEL, "%d", id);
+	if (!name)
+		return NULL;
+
+	group_kobj = kset_find_obj(iommu_group_kset, name);
+	kfree(name);
+
+	if (!group_kobj)
+		return NULL;
+
+	group = container_of(group_kobj, struct iommu_group, kobj);
+	BUG_ON(group->id != id);
+
+	kobject_get(group->devices_kobj);
+	kobject_put(&group->kobj);
+
+	return group;
+}
+EXPORT_SYMBOL_GPL(iommu_group_get_by_id);
+
 /**
  * iommu_group_get_iommudata - retrieve iommu_data registered for a group
  * @group: the group
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index f3b99e1..00e5d7d 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -113,6 +113,7 @@ struct iommu_ops {
 extern int bus_set_iommu(struct bus_type *bus, struct iommu_ops *ops);
 extern bool iommu_present(struct bus_type *bus);
 extern struct iommu_domain *iommu_domain_alloc(struct bus_type *bus);
+extern struct iommu_group *iommu_group_get_by_id(int id);
 extern void iommu_domain_free(struct iommu_domain *domain);
 extern int iommu_attach_device(struct iommu_domain *domain,
 			       struct device *dev);
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 4/6] iommu: Add a function to find an iommu group by id
@ 2013-05-06  7:25   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 44+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-06  7:25 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kvm, Alexey Kardashevskiy, Alexander Graf, kvm-ppc,
	Alex Williamson, Paul Mackerras, Joerg Roedel, David Gibson

As IOMMU groups are exposed to the user space by their numbers,
the user space can use them in various kernel APIs so the kernel
might need an API to find a group by its ID.

As an example, QEMU VFIO on PPC64 platform needs it to associate
a logical bus number (LIOBN) with a specific IOMMU group in order
to support in-kernel handling of DMA map/unmap requests.

This adds the iommu_group_get_by_id(id) function which performs
this search.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 drivers/iommu/iommu.c |   29 +++++++++++++++++++++++++++++
 include/linux/iommu.h |    1 +
 2 files changed, 30 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index ddbdaca..5514dfa 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -204,6 +204,35 @@ again:
 }
 EXPORT_SYMBOL_GPL(iommu_group_alloc);
 
+struct iommu_group *iommu_group_get_by_id(int id)
+{
+	struct kobject *group_kobj;
+	struct iommu_group *group;
+	const char *name;
+
+	if (!iommu_group_kset)
+		return NULL;
+
+	name = kasprintf(GFP_KERNEL, "%d", id);
+	if (!name)
+		return NULL;
+
+	group_kobj = kset_find_obj(iommu_group_kset, name);
+	kfree(name);
+
+	if (!group_kobj)
+		return NULL;
+
+	group = container_of(group_kobj, struct iommu_group, kobj);
+	BUG_ON(group->id != id);
+
+	kobject_get(group->devices_kobj);
+	kobject_put(&group->kobj);
+
+	return group;
+}
+EXPORT_SYMBOL_GPL(iommu_group_get_by_id);
+
 /**
  * iommu_group_get_iommudata - retrieve iommu_data registered for a group
  * @group: the group
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index f3b99e1..00e5d7d 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -113,6 +113,7 @@ struct iommu_ops {
 extern int bus_set_iommu(struct bus_type *bus, struct iommu_ops *ops);
 extern bool iommu_present(struct bus_type *bus);
 extern struct iommu_domain *iommu_domain_alloc(struct bus_type *bus);
+extern struct iommu_group *iommu_group_get_by_id(int id);
 extern void iommu_domain_free(struct iommu_domain *domain);
 extern int iommu_attach_device(struct iommu_domain *domain,
 			       struct device *dev);
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 4/6] iommu: Add a function to find an iommu group by id
@ 2013-05-06  7:25   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 44+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-06  7:25 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, Benjamin Herrenschmidt,
	Alexander Graf, Paul Mackerras, Joerg Roedel, Alex Williamson,
	kvm, kvm-ppc

As IOMMU groups are exposed to the user space by their numbers,
the user space can use them in various kernel APIs so the kernel
might need an API to find a group by its ID.

As an example, QEMU VFIO on PPC64 platform needs it to associate
a logical bus number (LIOBN) with a specific IOMMU group in order
to support in-kernel handling of DMA map/unmap requests.

This adds the iommu_group_get_by_id(id) function which performs
this search.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 drivers/iommu/iommu.c |   29 +++++++++++++++++++++++++++++
 include/linux/iommu.h |    1 +
 2 files changed, 30 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index ddbdaca..5514dfa 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -204,6 +204,35 @@ again:
 }
 EXPORT_SYMBOL_GPL(iommu_group_alloc);
 
+struct iommu_group *iommu_group_get_by_id(int id)
+{
+	struct kobject *group_kobj;
+	struct iommu_group *group;
+	const char *name;
+
+	if (!iommu_group_kset)
+		return NULL;
+
+	name = kasprintf(GFP_KERNEL, "%d", id);
+	if (!name)
+		return NULL;
+
+	group_kobj = kset_find_obj(iommu_group_kset, name);
+	kfree(name);
+
+	if (!group_kobj)
+		return NULL;
+
+	group = container_of(group_kobj, struct iommu_group, kobj);
+	BUG_ON(group->id != id);
+
+	kobject_get(group->devices_kobj);
+	kobject_put(&group->kobj);
+
+	return group;
+}
+EXPORT_SYMBOL_GPL(iommu_group_get_by_id);
+
 /**
  * iommu_group_get_iommudata - retrieve iommu_data registered for a group
  * @group: the group
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index f3b99e1..00e5d7d 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -113,6 +113,7 @@ struct iommu_ops {
 extern int bus_set_iommu(struct bus_type *bus, struct iommu_ops *ops);
 extern bool iommu_present(struct bus_type *bus);
 extern struct iommu_domain *iommu_domain_alloc(struct bus_type *bus);
+extern struct iommu_group *iommu_group_get_by_id(int id);
 extern void iommu_domain_free(struct iommu_domain *domain);
 extern int iommu_attach_device(struct iommu_domain *domain,
 			       struct device *dev);
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 5/6] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-05-06  7:25 ` Alexey Kardashevskiy
  (?)
@ 2013-05-06  7:25   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 44+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-06  7:25 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, Benjamin Herrenschmidt,
	Alexander Graf, Paul Mackerras, Joerg Roedel, Alex Williamson,
	kvm, kvm-ppc

This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests without passing them to QEMU, which should
save time on switching to QEMU and back.

Both real and virtual modes are supported - whenever the kernel
fails to handle TCE request, it passes it to the virtual mode.
If it the virtual mode handlers fail, then the request is passed
to the user mode, for example, to QEMU.

This adds a new KVM_CAP_SPAPR_TCE_IOMMU ioctl to asssociate
a virtual PCI bus ID (LIOBN) with an IOMMU group, which enables
in-kernel handling of IOMMU map/unmap.

This adds a special case for huge pages (16MB).  The reference
counting cannot be easily done for such pages in real mode (when
MMU is off) so we added a list of huge pages.  It is populated in
virtual mode and get_page is called just once per a huge page.
Real mode handlers check if the requested page is huge and in the list,
then no reference counting is done, otherwise an exit to virtual mode
happens.  The list is released at KVM exit.  At the moment the fastest
card available for tests uses up to 9 huge pages so walking through this
list is not very expensive.  However this can change and we may want
to optimize this.

This also adds the virt_only parameter to the KVM module
for debug and performance check purposes.

Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).

Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 Documentation/virtual/kvm/api.txt   |   28 ++++
 arch/powerpc/include/asm/kvm_host.h |    2 +
 arch/powerpc/include/asm/kvm_ppc.h  |    2 +
 arch/powerpc/include/uapi/asm/kvm.h |    7 +
 arch/powerpc/kvm/book3s_64_vio.c    |  242 ++++++++++++++++++++++++++++++++++-
 arch/powerpc/kvm/book3s_64_vio_hv.c |  192 +++++++++++++++++++++++++++
 arch/powerpc/kvm/powerpc.c          |   12 ++
 include/uapi/linux/kvm.h            |    2 +
 8 files changed, 485 insertions(+), 2 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index f621cd6..2039767 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2127,6 +2127,34 @@ written, then `n_invalid' invalid entries, invalidating any previously
 valid entries found.
 
 
+4.79 KVM_CREATE_SPAPR_TCE_IOMMU
+
+Capability: KVM_CAP_SPAPR_TCE_IOMMU
+Architectures: powerpc
+Type: vm ioctl
+Parameters: struct kvm_create_spapr_tce_iommu (in)
+Returns: 0 on success, -1 on error
+
+This creates a link between IOMMU group and a hardware TCE (translation
+control entry) table. This link lets the host kernel know what IOMMU
+group (i.e. TCE table) to use for the LIOBN number passed with
+H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls.
+
+/* for KVM_CAP_SPAPR_TCE_IOMMU */
+struct kvm_create_spapr_tce_iommu {
+	__u64 liobn;
+	__u32 iommu_id;
+	__u32 flags;
+};
+
+No flag is supported at the moment.
+
+When the guest issues TCE call on a liobn for which a TCE table has been
+registered, the kernel will handle it in real mode, updating the hardware
+TCE table. TCE table calls for other liobns will cause a vm exit and must
+be handled by userspace.
+
+
 5. The kvm_run structure
 ------------------------
 
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 36ceb0d..2b70cbc 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -178,6 +178,8 @@ struct kvmppc_spapr_tce_table {
 	struct kvm *kvm;
 	u64 liobn;
 	u32 window_size;
+	bool virtmode_only;
+	struct iommu_group *grp;    /* used for IOMMU groups */
 	struct page *pages[0];
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index d501246..bdfa140 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -139,6 +139,8 @@ extern void kvmppc_xics_free(struct kvm *kvm);
 
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce *args);
+extern long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
+				struct kvm_create_spapr_tce_iommu *args);
 extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
 		struct kvm_vcpu *vcpu, unsigned long liobn);
 extern long kvmppc_emulated_h_put_tce(struct kvmppc_spapr_tce_table *stt,
diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 681b314..b67d44b 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -291,6 +291,13 @@ struct kvm_create_spapr_tce {
 	__u32 window_size;
 };
 
+/* for KVM_CAP_SPAPR_TCE_IOMMU */
+struct kvm_create_spapr_tce_iommu {
+	__u64 liobn;
+	__u32 iommu_id;
+	__u32 flags;
+};
+
 /* for KVM_ALLOCATE_RMA */
 struct kvm_allocate_rma {
 	__u64 rma_size;
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 643ac1e..98cf949 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -27,6 +27,9 @@
 #include <linux/hugetlb.h>
 #include <linux/list.h>
 #include <linux/anon_inodes.h>
+#include <linux/pci.h>
+#include <linux/iommu.h>
+#include <linux/module.h>
 
 #include <asm/tlbflush.h>
 #include <asm/kvm_ppc.h>
@@ -38,10 +41,19 @@
 #include <asm/kvm_host.h>
 #include <asm/udbg.h>
 #include <asm/iommu.h>
+#include <asm/tce.h>
+
+#define DRIVER_VERSION	"0.1"
+#define DRIVER_AUTHOR	"Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>"
+#define DRIVER_DESC	"POWERPC KVM driver"
 
 #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
 #define ERROR_ADDR      (~(unsigned long)0x0)
 
+static bool kvmppc_tce_virt_only = false;
+module_param_named(virt_only, kvmppc_tce_virt_only, bool, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(virt_only, "Disable realmode handling of IOMMU map/unmap");
+
 /*
  * TCE tables handlers.
  */
@@ -58,8 +70,13 @@ static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
 
 	mutex_lock(&kvm->lock);
 	list_del(&stt->list);
-	for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
-		__free_page(stt->pages[i]);
+#ifdef CONFIG_IOMMU_API
+	if (stt->grp) {
+		iommu_group_put(stt->grp);
+	} else
+#endif
+		for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
+			__free_page(stt->pages[i]);
 	kfree(stt);
 	mutex_unlock(&kvm->lock);
 
@@ -155,9 +172,127 @@ fail:
 	return ret;
 }
 
+#ifdef CONFIG_IOMMU_API
+static const struct file_operations kvm_spapr_tce_iommu_fops = {
+	.release	= kvm_spapr_tce_release,
+};
+
+long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
+		struct kvm_create_spapr_tce_iommu *args)
+{
+	struct kvmppc_spapr_tce_table *tt = NULL;
+	struct iommu_group *grp;
+	struct iommu_table *tbl;
+
+	/* Find an IOMMU table for the given ID */
+	grp = iommu_group_get_by_id(args->iommu_id);
+	if (!grp)
+		return -ENXIO;
+
+	tbl = iommu_group_get_iommudata(grp);
+	if (!tbl)
+		return -ENXIO;
+
+	/* Check this LIOBN hasn't been previously allocated */
+	list_for_each_entry(tt, &kvm->arch.spapr_tce_tables, list) {
+		if (tt->liobn == args->liobn)
+			return -EBUSY;
+	}
+
+	tt = kzalloc(sizeof(*tt), GFP_KERNEL);
+	if (!tt)
+		return -ENOMEM;
+
+	tt->liobn = args->liobn;
+	tt->kvm = kvm;
+	tt->virtmode_only = kvmppc_tce_virt_only;
+	tt->grp = grp;
+
+	kvm_get_kvm(kvm);
+
+	mutex_lock(&kvm->lock);
+	list_add(&tt->list, &kvm->arch.spapr_tce_tables);
+
+	mutex_unlock(&kvm->lock);
+
+	pr_debug("LIOBN=%llX hooked to IOMMU %d, flags=%u\n",
+			args->liobn, args->iommu_id, args->flags);
+
+	return anon_inode_getfd("kvm-spapr-tce-iommu",
+			&kvm_spapr_tce_iommu_fops, tt, O_RDWR);
+}
+#else
+long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
+		struct kvm_create_spapr_tce_iommu *args)
+{
+	return -ENOSYS;
+}
+#endif /* CONFIG_IOMMU_API */
+
+#ifdef CONFIG_IOMMU_API
 /*
  * Virtual mode handling of IOMMU map/unmap.
  */
+static int clear_tce_virt_mode(struct iommu_table *tbl,
+		unsigned long ioba, unsigned long tce_value,
+		unsigned long npages)
+{
+	int ret;
+	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
+
+	ret = iommu_tce_clear_param_check(tbl, ioba, tce_value, npages);
+	if (ret)
+		return ret;
+
+	ret = iommu_clear_tces_and_put_pages(tbl, entry, npages);
+	if (ret < 0)
+		pr_err("iommu_tce: %s failed ioba=%lx, tce_value=%lx ret=%d\n",
+				__func__, ioba, tce_value, ret);
+
+	return ret;
+}
+
+static int put_tce_virt_mode(struct kvmppc_spapr_tce_table *tt,
+		struct iommu_table *tbl,
+		unsigned long ioba, unsigned long tce,
+		pte_t pte, unsigned long pg_size)
+{
+	int ret;
+	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
+
+	ret = iommu_tce_put_param_check(tbl, ioba, tce);
+	if (ret)
+		return ret;
+
+	/* System page size case, easy to handle */
+	if (pg_size == PAGE_SIZE)
+		return iommu_put_tce_user_mode(tbl, entry, tce);
+
+	return -EAGAIN;
+}
+
+static pte_t va_to_linux_pte(struct kvm_vcpu *vcpu,
+		unsigned long hva, bool writing, unsigned long *pg_sizep)
+{
+#ifdef CONFIG_KVM_BOOK3S_64_HV
+	/* Find out the page pte and size if requested */
+	pte_t pte;
+	unsigned long pg_size = 0;
+
+	pte = lookup_linux_pte(vcpu->arch.pgdir, hva,
+			writing, &pg_size);
+	if (!pte_present(pte))
+		return 0;
+
+	*pg_sizep = pg_size;
+
+	return pte;
+#else
+	return 0;
+#endif
+}
+#endif /* CONFIG_IOMMU_API */
+
 /* Converts guest physical address into host virtual */
 static unsigned long get_virt_address(struct kvm_vcpu *vcpu,
 		unsigned long gpa)
@@ -188,6 +323,43 @@ long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
 	if (!tt)
 		return H_TOO_HARD;
 
+#ifdef CONFIG_IOMMU_API
+	if (tt->grp) {
+		long ret;
+		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+		/* Return error if the group is being destroyed */
+		if (!tbl)
+			return H_RESCINDED;
+
+		if (tce & (TCE_PCI_READ | TCE_PCI_WRITE)) {
+			unsigned long hpa, pg_size = 0;
+			pte_t pte;
+
+			hpa = get_virt_address(vcpu, tce);
+			if (hpa == ERROR_ADDR)
+				return -EFAULT;
+
+			pte = va_to_linux_pte(vcpu, hpa, tce & TCE_PCI_WRITE,
+					&pg_size);
+			if (!pte)
+				return -EFAULT;
+
+			ret = put_tce_virt_mode(tt, tbl, ioba, hpa,
+					pte, pg_size);
+		} else {
+			ret = clear_tce_virt_mode(tbl, ioba, 0, 1);
+		}
+		iommu_flush_tce(tbl);
+
+		WARN_ON(ret == -EAGAIN);
+		if (ret < 0)
+			return H_PARAMETER;
+
+		return H_SUCCESS;
+	}
+#endif
+
 	/* Emulated IO */
 	return kvmppc_emulated_h_put_tce(tt, ioba, tce);
 }
@@ -213,6 +385,52 @@ long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	if (tces == ERROR_ADDR)
 		return H_TOO_HARD;
 
+#ifdef CONFIG_IOMMU_API
+	if (tt->grp) {
+		long ret = 0;
+		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+		/* Return error if the group is being destroyed */
+		if (!tbl)
+			return H_RESCINDED;
+
+		for (i = 0; i < npages; ++i) {
+			unsigned long hpa, pg_size = 0;
+			pte_t pte = 0;
+			unsigned long tce;
+			unsigned long ptce = tces + i * sizeof(unsigned long);
+
+			if (get_user(tce, (unsigned long __user *)ptce))
+				break;
+
+			hpa = get_virt_address(vcpu, tce);
+			if (hpa == ERROR_ADDR)
+				return -EFAULT;
+
+			pte = va_to_linux_pte(vcpu, hpa,
+					tce & TCE_PCI_WRITE, &pg_size);
+			if (!pte)
+				return -EFAULT;
+
+			ret = put_tce_virt_mode(tt, tbl,
+					ioba + (i << IOMMU_PAGE_SHIFT),
+					hpa, pte, pg_size);
+			if (ret)
+				break;
+		}
+		if (ret)
+			clear_tce_virt_mode(tbl, ioba, 0, i);
+
+		iommu_flush_tce(tbl);
+
+		WARN_ON(ret == -EAGAIN);
+		if (ret < 0)
+			return H_PARAMETER;
+
+		return H_SUCCESS;
+	}
+#endif
+
 	/* Emulated IO */
 	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
 		return H_PARAMETER;
@@ -253,6 +471,26 @@ long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (!tt)
 		return H_TOO_HARD;
 
+#ifdef CONFIG_IOMMU_API
+	if (tt->grp) {
+		long ret;
+		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+		/* Return error if the group is being destroyed */
+		if (!tbl)
+			return H_RESCINDED;
+
+		ret = clear_tce_virt_mode(tbl, ioba,
+				tce_value, npages);
+
+		WARN_ON(ret == -EAGAIN);
+		if (ret < 0)
+			return H_PARAMETER;
+
+		return H_SUCCESS;
+	}
+#endif
+
 	/* Emulated IO */
 	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
 		return H_PARAMETER;
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 55fdf7a..c5e5905 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -26,6 +26,7 @@
 #include <linux/slab.h>
 #include <linux/hugetlb.h>
 #include <linux/list.h>
+#include <linux/iommu.h>
 
 #include <asm/tlbflush.h>
 #include <asm/kvm_ppc.h>
@@ -161,6 +162,85 @@ static unsigned long get_real_address(struct kvm_vcpu *vcpu,
 	return hwaddr;
 }
 
+#ifdef CONFIG_IOMMU_API
+static int clear_tce_real_mode(struct iommu_table *tbl,
+		unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	int ret;
+	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
+
+	ret = iommu_tce_clear_param_check(tbl, ioba, tce_value, npages);
+	if (ret)
+		return ret;
+
+	for ( ; npages; --npages, ++entry) {
+		struct page *page;
+		unsigned long oldtce;
+
+		oldtce = iommu_clear_tce(tbl, entry);
+		if (!oldtce)
+			continue;
+
+		page = realmode_pfn_to_page(oldtce >> PAGE_SHIFT);
+		if (!page) {
+			ret = -EAGAIN;
+			break;
+		}
+
+		if (oldtce & TCE_PCI_WRITE)
+			SetPageDirty(page);
+
+		ret = realmode_put_page(page);
+		if (ret)
+			break;
+	}
+	/* if (ret < 0)
+		pr_err("iommu_tce: %s failed ioba=%lx, tce_value=%lx ret=%d\n",
+				__func__, ioba, tce_value, ret); */
+
+	return ret;
+}
+
+static int put_tce_real_mode(struct kvmppc_spapr_tce_table *tt,
+		struct iommu_table *tbl,
+		unsigned long ioba, unsigned long tce,
+		pte_t pte, unsigned long pg_size)
+{
+	int ret;
+	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
+	struct page *page = NULL;
+	enum dma_data_direction direction = iommu_tce_direction(tce);
+
+	ret = iommu_tce_put_param_check(tbl, ioba, tce);
+	if (ret)
+		return ret;
+
+	if (pg_size != PAGE_SIZE)
+		return -EAGAIN;
+
+	/* Small page case, find page struct to increment a counter */
+	page = realmode_pfn_to_page(tce >> PAGE_SHIFT);
+	if (!page)
+		return -EAGAIN;
+
+	ret = realmode_get_page(page);
+	if (ret)
+		return ret;
+
+	/* tce_build accepts virtual addresses */
+	ret = iommu_tce_build(tbl, entry, (unsigned long) __va(tce), direction);
+	if (ret)
+		realmode_put_page(page);
+
+	/* if (ret < 0)
+		pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%d\n",
+				__func__, ioba, tce, ret); */
+
+	return ret;
+}
+#endif /* CONFIG_IOMMU_API */
+
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba, unsigned long tce)
 {
@@ -171,6 +251,44 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (!tt)
 		return H_TOO_HARD;
 
+	if (tt->virtmode_only)
+		return H_TOO_HARD;
+
+#ifdef CONFIG_IOMMU_API
+	if (tt->grp) {
+		long ret;
+		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+		/* Return error if the group is being destroyed */
+		if (!tbl)
+			return H_RESCINDED;
+
+		if (tce & (TCE_PCI_READ | TCE_PCI_WRITE)) {
+			unsigned long hpa, pg_size = 0;
+			pte_t pte = 0;
+
+			hpa = get_real_address(vcpu, tce, tce & TCE_PCI_WRITE,
+					&pte, &pg_size);
+			if (hpa == ERROR_ADDR)
+				return H_TOO_HARD;
+
+			ret = put_tce_real_mode(tt, tbl, ioba,
+					hpa, pte, pg_size);
+		} else {
+			ret = clear_tce_real_mode(tbl, ioba, 0, 1);
+		}
+		iommu_flush_tce(tbl);
+
+		if (ret == -EAGAIN)
+			return H_TOO_HARD;
+
+		if (ret < 0)
+			return H_PARAMETER;
+
+		return H_SUCCESS;
+	}
+#endif
+
 	/* Emulated IO */
 	return kvmppc_emulated_h_put_tce(tt, ioba, tce);
 }
@@ -192,10 +310,58 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	if (!tt)
 		return H_TOO_HARD;
 
+	if (tt->virtmode_only)
+		return H_TOO_HARD;
+
 	tces = get_real_address(vcpu, tce_list, false, NULL, NULL);
 	if (tces == ERROR_ADDR)
 		return H_TOO_HARD;
 
+#ifdef CONFIG_IOMMU_API
+	if (tt->grp) {
+		long ret = 0;
+		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+		/* Return error if the group is being destroyed */
+		if (!tbl)
+			return H_RESCINDED;
+
+		for (i = 0; i < npages; ++i) {
+			unsigned long hpa, pg_size = 0;
+			pte_t pte = 0;
+			unsigned long tce;
+			unsigned long ptce = tces + i * sizeof(unsigned long);
+
+			if (get_user(tce, (unsigned long __user *)ptce))
+				break;
+
+			hpa = get_real_address(vcpu, tce,
+					tce & TCE_PCI_WRITE,
+					&pte, &pg_size);
+			if (hpa == ERROR_ADDR)
+				ret = -EAGAIN;
+			else
+				ret = put_tce_real_mode(tt, tbl,
+						ioba + (i << IOMMU_PAGE_SHIFT),
+						hpa, pte, pg_size);
+			if (ret)
+				break;
+		}
+		if (ret)
+			clear_tce_real_mode(tbl, ioba, 0, i);
+
+		iommu_flush_tce(tbl);
+
+		if (ret == -EAGAIN)
+			return H_TOO_HARD;
+
+		if (ret < 0)
+			return H_PARAMETER;
+
+		return H_SUCCESS;
+	}
+#endif
+
 	/* Emulated IO */
 	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
 		return H_PARAMETER;
@@ -236,6 +402,32 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (!tt)
 		return H_TOO_HARD;
 
+	if (tt->virtmode_only)
+		return H_TOO_HARD;
+
+#ifdef CONFIG_IOMMU_API
+	if (tt->grp) {
+		long ret;
+		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+		/* Return error if the group is being destroyed */
+		if (!tbl)
+			return H_RESCINDED;
+
+		ret = clear_tce_real_mode(tbl, ioba,
+				tce_value, npages);
+		iommu_flush_tce(tbl);
+
+		if (ret == -EAGAIN)
+			return H_TOO_HARD;
+
+		if (ret < 0)
+			return H_PARAMETER;
+
+		return H_SUCCESS;
+	}
+#endif
+
 	/* Emulated IO */
 	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
 		return H_PARAMETER;
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index b7ad589..269b0f6 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -385,6 +385,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 		break;
 #endif
 	case KVM_CAP_SPAPR_MULTITCE:
+	case KVM_CAP_SPAPR_TCE_IOMMU:
 		r = 1;
 		break;
 	default:
@@ -935,6 +936,17 @@ long kvm_arch_vm_ioctl(struct file *filp,
 		r = kvm_vm_ioctl_create_spapr_tce(kvm, &create_tce);
 		goto out;
 	}
+	case KVM_CREATE_SPAPR_TCE_IOMMU: {
+		struct kvm_create_spapr_tce_iommu create_tce_iommu;
+		struct kvm *kvm = filp->private_data;
+
+		r = -EFAULT;
+		if (copy_from_user(&create_tce_iommu, argp,
+				sizeof(create_tce_iommu)))
+			goto out;
+		r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm, &create_tce_iommu);
+		goto out;
+	}
 #endif /* CONFIG_PPC_BOOK3S_64 */
 
 #ifdef CONFIG_KVM_BOOK3S_64_HV
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 6c04da1..161e1d3 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -641,6 +641,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_RTAS (0x100000 + 87)
 #define KVM_CAP_SPAPR_XICS (0x100000 + 88)
 #define KVM_CAP_SPAPR_MULTITCE (0x110000 + 89)
+#define KVM_CAP_SPAPR_TCE_IOMMU (0x110000 + 90)
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -885,6 +886,7 @@ struct kvm_s390_ucas_mapping {
 #define KVM_PPC_GET_HTAB_FD	  _IOW(KVMIO,  0xaa, struct kvm_get_htab_fd)
 /* Available with KVM_CAP_PPC_RTAS */
 #define KVM_PPC_RTAS_DEFINE_TOKEN _IOW(KVMIO,  0xdc, struct kvm_rtas_token_args)
+#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xaf, struct kvm_create_spapr_tce_iommu)
 
 /*
  * ioctls for vcpu fds
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 5/6] KVM: PPC: Add support for IOMMU in-kernel handling
@ 2013-05-06  7:25   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 44+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-06  7:25 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kvm, Alexey Kardashevskiy, Alexander Graf, kvm-ppc,
	Alex Williamson, Paul Mackerras, Joerg Roedel, David Gibson

This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests without passing them to QEMU, which should
save time on switching to QEMU and back.

Both real and virtual modes are supported - whenever the kernel
fails to handle TCE request, it passes it to the virtual mode.
If it the virtual mode handlers fail, then the request is passed
to the user mode, for example, to QEMU.

This adds a new KVM_CAP_SPAPR_TCE_IOMMU ioctl to asssociate
a virtual PCI bus ID (LIOBN) with an IOMMU group, which enables
in-kernel handling of IOMMU map/unmap.

This adds a special case for huge pages (16MB).  The reference
counting cannot be easily done for such pages in real mode (when
MMU is off) so we added a list of huge pages.  It is populated in
virtual mode and get_page is called just once per a huge page.
Real mode handlers check if the requested page is huge and in the list,
then no reference counting is done, otherwise an exit to virtual mode
happens.  The list is released at KVM exit.  At the moment the fastest
card available for tests uses up to 9 huge pages so walking through this
list is not very expensive.  However this can change and we may want
to optimize this.

This also adds the virt_only parameter to the KVM module
for debug and performance check purposes.

Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).

Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 Documentation/virtual/kvm/api.txt   |   28 ++++
 arch/powerpc/include/asm/kvm_host.h |    2 +
 arch/powerpc/include/asm/kvm_ppc.h  |    2 +
 arch/powerpc/include/uapi/asm/kvm.h |    7 +
 arch/powerpc/kvm/book3s_64_vio.c    |  242 ++++++++++++++++++++++++++++++++++-
 arch/powerpc/kvm/book3s_64_vio_hv.c |  192 +++++++++++++++++++++++++++
 arch/powerpc/kvm/powerpc.c          |   12 ++
 include/uapi/linux/kvm.h            |    2 +
 8 files changed, 485 insertions(+), 2 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index f621cd6..2039767 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2127,6 +2127,34 @@ written, then `n_invalid' invalid entries, invalidating any previously
 valid entries found.
 
 
+4.79 KVM_CREATE_SPAPR_TCE_IOMMU
+
+Capability: KVM_CAP_SPAPR_TCE_IOMMU
+Architectures: powerpc
+Type: vm ioctl
+Parameters: struct kvm_create_spapr_tce_iommu (in)
+Returns: 0 on success, -1 on error
+
+This creates a link between IOMMU group and a hardware TCE (translation
+control entry) table. This link lets the host kernel know what IOMMU
+group (i.e. TCE table) to use for the LIOBN number passed with
+H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls.
+
+/* for KVM_CAP_SPAPR_TCE_IOMMU */
+struct kvm_create_spapr_tce_iommu {
+	__u64 liobn;
+	__u32 iommu_id;
+	__u32 flags;
+};
+
+No flag is supported at the moment.
+
+When the guest issues TCE call on a liobn for which a TCE table has been
+registered, the kernel will handle it in real mode, updating the hardware
+TCE table. TCE table calls for other liobns will cause a vm exit and must
+be handled by userspace.
+
+
 5. The kvm_run structure
 ------------------------
 
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 36ceb0d..2b70cbc 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -178,6 +178,8 @@ struct kvmppc_spapr_tce_table {
 	struct kvm *kvm;
 	u64 liobn;
 	u32 window_size;
+	bool virtmode_only;
+	struct iommu_group *grp;    /* used for IOMMU groups */
 	struct page *pages[0];
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index d501246..bdfa140 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -139,6 +139,8 @@ extern void kvmppc_xics_free(struct kvm *kvm);
 
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce *args);
+extern long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
+				struct kvm_create_spapr_tce_iommu *args);
 extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
 		struct kvm_vcpu *vcpu, unsigned long liobn);
 extern long kvmppc_emulated_h_put_tce(struct kvmppc_spapr_tce_table *stt,
diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 681b314..b67d44b 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -291,6 +291,13 @@ struct kvm_create_spapr_tce {
 	__u32 window_size;
 };
 
+/* for KVM_CAP_SPAPR_TCE_IOMMU */
+struct kvm_create_spapr_tce_iommu {
+	__u64 liobn;
+	__u32 iommu_id;
+	__u32 flags;
+};
+
 /* for KVM_ALLOCATE_RMA */
 struct kvm_allocate_rma {
 	__u64 rma_size;
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 643ac1e..98cf949 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -27,6 +27,9 @@
 #include <linux/hugetlb.h>
 #include <linux/list.h>
 #include <linux/anon_inodes.h>
+#include <linux/pci.h>
+#include <linux/iommu.h>
+#include <linux/module.h>
 
 #include <asm/tlbflush.h>
 #include <asm/kvm_ppc.h>
@@ -38,10 +41,19 @@
 #include <asm/kvm_host.h>
 #include <asm/udbg.h>
 #include <asm/iommu.h>
+#include <asm/tce.h>
+
+#define DRIVER_VERSION	"0.1"
+#define DRIVER_AUTHOR	"Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>"
+#define DRIVER_DESC	"POWERPC KVM driver"
 
 #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
 #define ERROR_ADDR      (~(unsigned long)0x0)
 
+static bool kvmppc_tce_virt_only = false;
+module_param_named(virt_only, kvmppc_tce_virt_only, bool, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(virt_only, "Disable realmode handling of IOMMU map/unmap");
+
 /*
  * TCE tables handlers.
  */
@@ -58,8 +70,13 @@ static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
 
 	mutex_lock(&kvm->lock);
 	list_del(&stt->list);
-	for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
-		__free_page(stt->pages[i]);
+#ifdef CONFIG_IOMMU_API
+	if (stt->grp) {
+		iommu_group_put(stt->grp);
+	} else
+#endif
+		for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
+			__free_page(stt->pages[i]);
 	kfree(stt);
 	mutex_unlock(&kvm->lock);
 
@@ -155,9 +172,127 @@ fail:
 	return ret;
 }
 
+#ifdef CONFIG_IOMMU_API
+static const struct file_operations kvm_spapr_tce_iommu_fops = {
+	.release	= kvm_spapr_tce_release,
+};
+
+long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
+		struct kvm_create_spapr_tce_iommu *args)
+{
+	struct kvmppc_spapr_tce_table *tt = NULL;
+	struct iommu_group *grp;
+	struct iommu_table *tbl;
+
+	/* Find an IOMMU table for the given ID */
+	grp = iommu_group_get_by_id(args->iommu_id);
+	if (!grp)
+		return -ENXIO;
+
+	tbl = iommu_group_get_iommudata(grp);
+	if (!tbl)
+		return -ENXIO;
+
+	/* Check this LIOBN hasn't been previously allocated */
+	list_for_each_entry(tt, &kvm->arch.spapr_tce_tables, list) {
+		if (tt->liobn == args->liobn)
+			return -EBUSY;
+	}
+
+	tt = kzalloc(sizeof(*tt), GFP_KERNEL);
+	if (!tt)
+		return -ENOMEM;
+
+	tt->liobn = args->liobn;
+	tt->kvm = kvm;
+	tt->virtmode_only = kvmppc_tce_virt_only;
+	tt->grp = grp;
+
+	kvm_get_kvm(kvm);
+
+	mutex_lock(&kvm->lock);
+	list_add(&tt->list, &kvm->arch.spapr_tce_tables);
+
+	mutex_unlock(&kvm->lock);
+
+	pr_debug("LIOBN=%llX hooked to IOMMU %d, flags=%u\n",
+			args->liobn, args->iommu_id, args->flags);
+
+	return anon_inode_getfd("kvm-spapr-tce-iommu",
+			&kvm_spapr_tce_iommu_fops, tt, O_RDWR);
+}
+#else
+long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
+		struct kvm_create_spapr_tce_iommu *args)
+{
+	return -ENOSYS;
+}
+#endif /* CONFIG_IOMMU_API */
+
+#ifdef CONFIG_IOMMU_API
 /*
  * Virtual mode handling of IOMMU map/unmap.
  */
+static int clear_tce_virt_mode(struct iommu_table *tbl,
+		unsigned long ioba, unsigned long tce_value,
+		unsigned long npages)
+{
+	int ret;
+	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
+
+	ret = iommu_tce_clear_param_check(tbl, ioba, tce_value, npages);
+	if (ret)
+		return ret;
+
+	ret = iommu_clear_tces_and_put_pages(tbl, entry, npages);
+	if (ret < 0)
+		pr_err("iommu_tce: %s failed ioba=%lx, tce_value=%lx ret=%d\n",
+				__func__, ioba, tce_value, ret);
+
+	return ret;
+}
+
+static int put_tce_virt_mode(struct kvmppc_spapr_tce_table *tt,
+		struct iommu_table *tbl,
+		unsigned long ioba, unsigned long tce,
+		pte_t pte, unsigned long pg_size)
+{
+	int ret;
+	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
+
+	ret = iommu_tce_put_param_check(tbl, ioba, tce);
+	if (ret)
+		return ret;
+
+	/* System page size case, easy to handle */
+	if (pg_size == PAGE_SIZE)
+		return iommu_put_tce_user_mode(tbl, entry, tce);
+
+	return -EAGAIN;
+}
+
+static pte_t va_to_linux_pte(struct kvm_vcpu *vcpu,
+		unsigned long hva, bool writing, unsigned long *pg_sizep)
+{
+#ifdef CONFIG_KVM_BOOK3S_64_HV
+	/* Find out the page pte and size if requested */
+	pte_t pte;
+	unsigned long pg_size = 0;
+
+	pte = lookup_linux_pte(vcpu->arch.pgdir, hva,
+			writing, &pg_size);
+	if (!pte_present(pte))
+		return 0;
+
+	*pg_sizep = pg_size;
+
+	return pte;
+#else
+	return 0;
+#endif
+}
+#endif /* CONFIG_IOMMU_API */
+
 /* Converts guest physical address into host virtual */
 static unsigned long get_virt_address(struct kvm_vcpu *vcpu,
 		unsigned long gpa)
@@ -188,6 +323,43 @@ long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
 	if (!tt)
 		return H_TOO_HARD;
 
+#ifdef CONFIG_IOMMU_API
+	if (tt->grp) {
+		long ret;
+		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+		/* Return error if the group is being destroyed */
+		if (!tbl)
+			return H_RESCINDED;
+
+		if (tce & (TCE_PCI_READ | TCE_PCI_WRITE)) {
+			unsigned long hpa, pg_size = 0;
+			pte_t pte;
+
+			hpa = get_virt_address(vcpu, tce);
+			if (hpa == ERROR_ADDR)
+				return -EFAULT;
+
+			pte = va_to_linux_pte(vcpu, hpa, tce & TCE_PCI_WRITE,
+					&pg_size);
+			if (!pte)
+				return -EFAULT;
+
+			ret = put_tce_virt_mode(tt, tbl, ioba, hpa,
+					pte, pg_size);
+		} else {
+			ret = clear_tce_virt_mode(tbl, ioba, 0, 1);
+		}
+		iommu_flush_tce(tbl);
+
+		WARN_ON(ret == -EAGAIN);
+		if (ret < 0)
+			return H_PARAMETER;
+
+		return H_SUCCESS;
+	}
+#endif
+
 	/* Emulated IO */
 	return kvmppc_emulated_h_put_tce(tt, ioba, tce);
 }
@@ -213,6 +385,52 @@ long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	if (tces == ERROR_ADDR)
 		return H_TOO_HARD;
 
+#ifdef CONFIG_IOMMU_API
+	if (tt->grp) {
+		long ret = 0;
+		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+		/* Return error if the group is being destroyed */
+		if (!tbl)
+			return H_RESCINDED;
+
+		for (i = 0; i < npages; ++i) {
+			unsigned long hpa, pg_size = 0;
+			pte_t pte = 0;
+			unsigned long tce;
+			unsigned long ptce = tces + i * sizeof(unsigned long);
+
+			if (get_user(tce, (unsigned long __user *)ptce))
+				break;
+
+			hpa = get_virt_address(vcpu, tce);
+			if (hpa == ERROR_ADDR)
+				return -EFAULT;
+
+			pte = va_to_linux_pte(vcpu, hpa,
+					tce & TCE_PCI_WRITE, &pg_size);
+			if (!pte)
+				return -EFAULT;
+
+			ret = put_tce_virt_mode(tt, tbl,
+					ioba + (i << IOMMU_PAGE_SHIFT),
+					hpa, pte, pg_size);
+			if (ret)
+				break;
+		}
+		if (ret)
+			clear_tce_virt_mode(tbl, ioba, 0, i);
+
+		iommu_flush_tce(tbl);
+
+		WARN_ON(ret == -EAGAIN);
+		if (ret < 0)
+			return H_PARAMETER;
+
+		return H_SUCCESS;
+	}
+#endif
+
 	/* Emulated IO */
 	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
 		return H_PARAMETER;
@@ -253,6 +471,26 @@ long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (!tt)
 		return H_TOO_HARD;
 
+#ifdef CONFIG_IOMMU_API
+	if (tt->grp) {
+		long ret;
+		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+		/* Return error if the group is being destroyed */
+		if (!tbl)
+			return H_RESCINDED;
+
+		ret = clear_tce_virt_mode(tbl, ioba,
+				tce_value, npages);
+
+		WARN_ON(ret == -EAGAIN);
+		if (ret < 0)
+			return H_PARAMETER;
+
+		return H_SUCCESS;
+	}
+#endif
+
 	/* Emulated IO */
 	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
 		return H_PARAMETER;
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 55fdf7a..c5e5905 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -26,6 +26,7 @@
 #include <linux/slab.h>
 #include <linux/hugetlb.h>
 #include <linux/list.h>
+#include <linux/iommu.h>
 
 #include <asm/tlbflush.h>
 #include <asm/kvm_ppc.h>
@@ -161,6 +162,85 @@ static unsigned long get_real_address(struct kvm_vcpu *vcpu,
 	return hwaddr;
 }
 
+#ifdef CONFIG_IOMMU_API
+static int clear_tce_real_mode(struct iommu_table *tbl,
+		unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	int ret;
+	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
+
+	ret = iommu_tce_clear_param_check(tbl, ioba, tce_value, npages);
+	if (ret)
+		return ret;
+
+	for ( ; npages; --npages, ++entry) {
+		struct page *page;
+		unsigned long oldtce;
+
+		oldtce = iommu_clear_tce(tbl, entry);
+		if (!oldtce)
+			continue;
+
+		page = realmode_pfn_to_page(oldtce >> PAGE_SHIFT);
+		if (!page) {
+			ret = -EAGAIN;
+			break;
+		}
+
+		if (oldtce & TCE_PCI_WRITE)
+			SetPageDirty(page);
+
+		ret = realmode_put_page(page);
+		if (ret)
+			break;
+	}
+	/* if (ret < 0)
+		pr_err("iommu_tce: %s failed ioba=%lx, tce_value=%lx ret=%d\n",
+				__func__, ioba, tce_value, ret); */
+
+	return ret;
+}
+
+static int put_tce_real_mode(struct kvmppc_spapr_tce_table *tt,
+		struct iommu_table *tbl,
+		unsigned long ioba, unsigned long tce,
+		pte_t pte, unsigned long pg_size)
+{
+	int ret;
+	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
+	struct page *page = NULL;
+	enum dma_data_direction direction = iommu_tce_direction(tce);
+
+	ret = iommu_tce_put_param_check(tbl, ioba, tce);
+	if (ret)
+		return ret;
+
+	if (pg_size != PAGE_SIZE)
+		return -EAGAIN;
+
+	/* Small page case, find page struct to increment a counter */
+	page = realmode_pfn_to_page(tce >> PAGE_SHIFT);
+	if (!page)
+		return -EAGAIN;
+
+	ret = realmode_get_page(page);
+	if (ret)
+		return ret;
+
+	/* tce_build accepts virtual addresses */
+	ret = iommu_tce_build(tbl, entry, (unsigned long) __va(tce), direction);
+	if (ret)
+		realmode_put_page(page);
+
+	/* if (ret < 0)
+		pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%d\n",
+				__func__, ioba, tce, ret); */
+
+	return ret;
+}
+#endif /* CONFIG_IOMMU_API */
+
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba, unsigned long tce)
 {
@@ -171,6 +251,44 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (!tt)
 		return H_TOO_HARD;
 
+	if (tt->virtmode_only)
+		return H_TOO_HARD;
+
+#ifdef CONFIG_IOMMU_API
+	if (tt->grp) {
+		long ret;
+		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+		/* Return error if the group is being destroyed */
+		if (!tbl)
+			return H_RESCINDED;
+
+		if (tce & (TCE_PCI_READ | TCE_PCI_WRITE)) {
+			unsigned long hpa, pg_size = 0;
+			pte_t pte = 0;
+
+			hpa = get_real_address(vcpu, tce, tce & TCE_PCI_WRITE,
+					&pte, &pg_size);
+			if (hpa == ERROR_ADDR)
+				return H_TOO_HARD;
+
+			ret = put_tce_real_mode(tt, tbl, ioba,
+					hpa, pte, pg_size);
+		} else {
+			ret = clear_tce_real_mode(tbl, ioba, 0, 1);
+		}
+		iommu_flush_tce(tbl);
+
+		if (ret == -EAGAIN)
+			return H_TOO_HARD;
+
+		if (ret < 0)
+			return H_PARAMETER;
+
+		return H_SUCCESS;
+	}
+#endif
+
 	/* Emulated IO */
 	return kvmppc_emulated_h_put_tce(tt, ioba, tce);
 }
@@ -192,10 +310,58 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	if (!tt)
 		return H_TOO_HARD;
 
+	if (tt->virtmode_only)
+		return H_TOO_HARD;
+
 	tces = get_real_address(vcpu, tce_list, false, NULL, NULL);
 	if (tces == ERROR_ADDR)
 		return H_TOO_HARD;
 
+#ifdef CONFIG_IOMMU_API
+	if (tt->grp) {
+		long ret = 0;
+		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+		/* Return error if the group is being destroyed */
+		if (!tbl)
+			return H_RESCINDED;
+
+		for (i = 0; i < npages; ++i) {
+			unsigned long hpa, pg_size = 0;
+			pte_t pte = 0;
+			unsigned long tce;
+			unsigned long ptce = tces + i * sizeof(unsigned long);
+
+			if (get_user(tce, (unsigned long __user *)ptce))
+				break;
+
+			hpa = get_real_address(vcpu, tce,
+					tce & TCE_PCI_WRITE,
+					&pte, &pg_size);
+			if (hpa == ERROR_ADDR)
+				ret = -EAGAIN;
+			else
+				ret = put_tce_real_mode(tt, tbl,
+						ioba + (i << IOMMU_PAGE_SHIFT),
+						hpa, pte, pg_size);
+			if (ret)
+				break;
+		}
+		if (ret)
+			clear_tce_real_mode(tbl, ioba, 0, i);
+
+		iommu_flush_tce(tbl);
+
+		if (ret == -EAGAIN)
+			return H_TOO_HARD;
+
+		if (ret < 0)
+			return H_PARAMETER;
+
+		return H_SUCCESS;
+	}
+#endif
+
 	/* Emulated IO */
 	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
 		return H_PARAMETER;
@@ -236,6 +402,32 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (!tt)
 		return H_TOO_HARD;
 
+	if (tt->virtmode_only)
+		return H_TOO_HARD;
+
+#ifdef CONFIG_IOMMU_API
+	if (tt->grp) {
+		long ret;
+		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+		/* Return error if the group is being destroyed */
+		if (!tbl)
+			return H_RESCINDED;
+
+		ret = clear_tce_real_mode(tbl, ioba,
+				tce_value, npages);
+		iommu_flush_tce(tbl);
+
+		if (ret == -EAGAIN)
+			return H_TOO_HARD;
+
+		if (ret < 0)
+			return H_PARAMETER;
+
+		return H_SUCCESS;
+	}
+#endif
+
 	/* Emulated IO */
 	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
 		return H_PARAMETER;
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index b7ad589..269b0f6 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -385,6 +385,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 		break;
 #endif
 	case KVM_CAP_SPAPR_MULTITCE:
+	case KVM_CAP_SPAPR_TCE_IOMMU:
 		r = 1;
 		break;
 	default:
@@ -935,6 +936,17 @@ long kvm_arch_vm_ioctl(struct file *filp,
 		r = kvm_vm_ioctl_create_spapr_tce(kvm, &create_tce);
 		goto out;
 	}
+	case KVM_CREATE_SPAPR_TCE_IOMMU: {
+		struct kvm_create_spapr_tce_iommu create_tce_iommu;
+		struct kvm *kvm = filp->private_data;
+
+		r = -EFAULT;
+		if (copy_from_user(&create_tce_iommu, argp,
+				sizeof(create_tce_iommu)))
+			goto out;
+		r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm, &create_tce_iommu);
+		goto out;
+	}
 #endif /* CONFIG_PPC_BOOK3S_64 */
 
 #ifdef CONFIG_KVM_BOOK3S_64_HV
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 6c04da1..161e1d3 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -641,6 +641,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_RTAS (0x100000 + 87)
 #define KVM_CAP_SPAPR_XICS (0x100000 + 88)
 #define KVM_CAP_SPAPR_MULTITCE (0x110000 + 89)
+#define KVM_CAP_SPAPR_TCE_IOMMU (0x110000 + 90)
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -885,6 +886,7 @@ struct kvm_s390_ucas_mapping {
 #define KVM_PPC_GET_HTAB_FD	  _IOW(KVMIO,  0xaa, struct kvm_get_htab_fd)
 /* Available with KVM_CAP_PPC_RTAS */
 #define KVM_PPC_RTAS_DEFINE_TOKEN _IOW(KVMIO,  0xdc, struct kvm_rtas_token_args)
+#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xaf, struct kvm_create_spapr_tce_iommu)
 
 /*
  * ioctls for vcpu fds
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 5/6] KVM: PPC: Add support for IOMMU in-kernel handling
@ 2013-05-06  7:25   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 44+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-06  7:25 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, Benjamin Herrenschmidt,
	Alexander Graf, Paul Mackerras, Joerg Roedel, Alex Williamson,
	kvm, kvm-ppc

This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests without passing them to QEMU, which should
save time on switching to QEMU and back.

Both real and virtual modes are supported - whenever the kernel
fails to handle TCE request, it passes it to the virtual mode.
If it the virtual mode handlers fail, then the request is passed
to the user mode, for example, to QEMU.

This adds a new KVM_CAP_SPAPR_TCE_IOMMU ioctl to asssociate
a virtual PCI bus ID (LIOBN) with an IOMMU group, which enables
in-kernel handling of IOMMU map/unmap.

This adds a special case for huge pages (16MB).  The reference
counting cannot be easily done for such pages in real mode (when
MMU is off) so we added a list of huge pages.  It is populated in
virtual mode and get_page is called just once per a huge page.
Real mode handlers check if the requested page is huge and in the list,
then no reference counting is done, otherwise an exit to virtual mode
happens.  The list is released at KVM exit.  At the moment the fastest
card available for tests uses up to 9 huge pages so walking through this
list is not very expensive.  However this can change and we may want
to optimize this.

This also adds the virt_only parameter to the KVM module
for debug and performance check purposes.

Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).

Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 Documentation/virtual/kvm/api.txt   |   28 ++++
 arch/powerpc/include/asm/kvm_host.h |    2 +
 arch/powerpc/include/asm/kvm_ppc.h  |    2 +
 arch/powerpc/include/uapi/asm/kvm.h |    7 +
 arch/powerpc/kvm/book3s_64_vio.c    |  242 ++++++++++++++++++++++++++++++++++-
 arch/powerpc/kvm/book3s_64_vio_hv.c |  192 +++++++++++++++++++++++++++
 arch/powerpc/kvm/powerpc.c          |   12 ++
 include/uapi/linux/kvm.h            |    2 +
 8 files changed, 485 insertions(+), 2 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index f621cd6..2039767 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2127,6 +2127,34 @@ written, then `n_invalid' invalid entries, invalidating any previously
 valid entries found.
 
 
+4.79 KVM_CREATE_SPAPR_TCE_IOMMU
+
+Capability: KVM_CAP_SPAPR_TCE_IOMMU
+Architectures: powerpc
+Type: vm ioctl
+Parameters: struct kvm_create_spapr_tce_iommu (in)
+Returns: 0 on success, -1 on error
+
+This creates a link between IOMMU group and a hardware TCE (translation
+control entry) table. This link lets the host kernel know what IOMMU
+group (i.e. TCE table) to use for the LIOBN number passed with
+H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls.
+
+/* for KVM_CAP_SPAPR_TCE_IOMMU */
+struct kvm_create_spapr_tce_iommu {
+	__u64 liobn;
+	__u32 iommu_id;
+	__u32 flags;
+};
+
+No flag is supported at the moment.
+
+When the guest issues TCE call on a liobn for which a TCE table has been
+registered, the kernel will handle it in real mode, updating the hardware
+TCE table. TCE table calls for other liobns will cause a vm exit and must
+be handled by userspace.
+
+
 5. The kvm_run structure
 ------------------------
 
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 36ceb0d..2b70cbc 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -178,6 +178,8 @@ struct kvmppc_spapr_tce_table {
 	struct kvm *kvm;
 	u64 liobn;
 	u32 window_size;
+	bool virtmode_only;
+	struct iommu_group *grp;    /* used for IOMMU groups */
 	struct page *pages[0];
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index d501246..bdfa140 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -139,6 +139,8 @@ extern void kvmppc_xics_free(struct kvm *kvm);
 
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce *args);
+extern long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
+				struct kvm_create_spapr_tce_iommu *args);
 extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
 		struct kvm_vcpu *vcpu, unsigned long liobn);
 extern long kvmppc_emulated_h_put_tce(struct kvmppc_spapr_tce_table *stt,
diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 681b314..b67d44b 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -291,6 +291,13 @@ struct kvm_create_spapr_tce {
 	__u32 window_size;
 };
 
+/* for KVM_CAP_SPAPR_TCE_IOMMU */
+struct kvm_create_spapr_tce_iommu {
+	__u64 liobn;
+	__u32 iommu_id;
+	__u32 flags;
+};
+
 /* for KVM_ALLOCATE_RMA */
 struct kvm_allocate_rma {
 	__u64 rma_size;
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 643ac1e..98cf949 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -27,6 +27,9 @@
 #include <linux/hugetlb.h>
 #include <linux/list.h>
 #include <linux/anon_inodes.h>
+#include <linux/pci.h>
+#include <linux/iommu.h>
+#include <linux/module.h>
 
 #include <asm/tlbflush.h>
 #include <asm/kvm_ppc.h>
@@ -38,10 +41,19 @@
 #include <asm/kvm_host.h>
 #include <asm/udbg.h>
 #include <asm/iommu.h>
+#include <asm/tce.h>
+
+#define DRIVER_VERSION	"0.1"
+#define DRIVER_AUTHOR	"Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>"
+#define DRIVER_DESC	"POWERPC KVM driver"
 
 #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
 #define ERROR_ADDR      (~(unsigned long)0x0)
 
+static bool kvmppc_tce_virt_only = false;
+module_param_named(virt_only, kvmppc_tce_virt_only, bool, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(virt_only, "Disable realmode handling of IOMMU map/unmap");
+
 /*
  * TCE tables handlers.
  */
@@ -58,8 +70,13 @@ static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
 
 	mutex_lock(&kvm->lock);
 	list_del(&stt->list);
-	for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
-		__free_page(stt->pages[i]);
+#ifdef CONFIG_IOMMU_API
+	if (stt->grp) {
+		iommu_group_put(stt->grp);
+	} else
+#endif
+		for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
+			__free_page(stt->pages[i]);
 	kfree(stt);
 	mutex_unlock(&kvm->lock);
 
@@ -155,9 +172,127 @@ fail:
 	return ret;
 }
 
+#ifdef CONFIG_IOMMU_API
+static const struct file_operations kvm_spapr_tce_iommu_fops = {
+	.release	= kvm_spapr_tce_release,
+};
+
+long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
+		struct kvm_create_spapr_tce_iommu *args)
+{
+	struct kvmppc_spapr_tce_table *tt = NULL;
+	struct iommu_group *grp;
+	struct iommu_table *tbl;
+
+	/* Find an IOMMU table for the given ID */
+	grp = iommu_group_get_by_id(args->iommu_id);
+	if (!grp)
+		return -ENXIO;
+
+	tbl = iommu_group_get_iommudata(grp);
+	if (!tbl)
+		return -ENXIO;
+
+	/* Check this LIOBN hasn't been previously allocated */
+	list_for_each_entry(tt, &kvm->arch.spapr_tce_tables, list) {
+		if (tt->liobn = args->liobn)
+			return -EBUSY;
+	}
+
+	tt = kzalloc(sizeof(*tt), GFP_KERNEL);
+	if (!tt)
+		return -ENOMEM;
+
+	tt->liobn = args->liobn;
+	tt->kvm = kvm;
+	tt->virtmode_only = kvmppc_tce_virt_only;
+	tt->grp = grp;
+
+	kvm_get_kvm(kvm);
+
+	mutex_lock(&kvm->lock);
+	list_add(&tt->list, &kvm->arch.spapr_tce_tables);
+
+	mutex_unlock(&kvm->lock);
+
+	pr_debug("LIOBN=%llX hooked to IOMMU %d, flags=%u\n",
+			args->liobn, args->iommu_id, args->flags);
+
+	return anon_inode_getfd("kvm-spapr-tce-iommu",
+			&kvm_spapr_tce_iommu_fops, tt, O_RDWR);
+}
+#else
+long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
+		struct kvm_create_spapr_tce_iommu *args)
+{
+	return -ENOSYS;
+}
+#endif /* CONFIG_IOMMU_API */
+
+#ifdef CONFIG_IOMMU_API
 /*
  * Virtual mode handling of IOMMU map/unmap.
  */
+static int clear_tce_virt_mode(struct iommu_table *tbl,
+		unsigned long ioba, unsigned long tce_value,
+		unsigned long npages)
+{
+	int ret;
+	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
+
+	ret = iommu_tce_clear_param_check(tbl, ioba, tce_value, npages);
+	if (ret)
+		return ret;
+
+	ret = iommu_clear_tces_and_put_pages(tbl, entry, npages);
+	if (ret < 0)
+		pr_err("iommu_tce: %s failed ioba=%lx, tce_value=%lx ret=%d\n",
+				__func__, ioba, tce_value, ret);
+
+	return ret;
+}
+
+static int put_tce_virt_mode(struct kvmppc_spapr_tce_table *tt,
+		struct iommu_table *tbl,
+		unsigned long ioba, unsigned long tce,
+		pte_t pte, unsigned long pg_size)
+{
+	int ret;
+	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
+
+	ret = iommu_tce_put_param_check(tbl, ioba, tce);
+	if (ret)
+		return ret;
+
+	/* System page size case, easy to handle */
+	if (pg_size = PAGE_SIZE)
+		return iommu_put_tce_user_mode(tbl, entry, tce);
+
+	return -EAGAIN;
+}
+
+static pte_t va_to_linux_pte(struct kvm_vcpu *vcpu,
+		unsigned long hva, bool writing, unsigned long *pg_sizep)
+{
+#ifdef CONFIG_KVM_BOOK3S_64_HV
+	/* Find out the page pte and size if requested */
+	pte_t pte;
+	unsigned long pg_size = 0;
+
+	pte = lookup_linux_pte(vcpu->arch.pgdir, hva,
+			writing, &pg_size);
+	if (!pte_present(pte))
+		return 0;
+
+	*pg_sizep = pg_size;
+
+	return pte;
+#else
+	return 0;
+#endif
+}
+#endif /* CONFIG_IOMMU_API */
+
 /* Converts guest physical address into host virtual */
 static unsigned long get_virt_address(struct kvm_vcpu *vcpu,
 		unsigned long gpa)
@@ -188,6 +323,43 @@ long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
 	if (!tt)
 		return H_TOO_HARD;
 
+#ifdef CONFIG_IOMMU_API
+	if (tt->grp) {
+		long ret;
+		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+		/* Return error if the group is being destroyed */
+		if (!tbl)
+			return H_RESCINDED;
+
+		if (tce & (TCE_PCI_READ | TCE_PCI_WRITE)) {
+			unsigned long hpa, pg_size = 0;
+			pte_t pte;
+
+			hpa = get_virt_address(vcpu, tce);
+			if (hpa = ERROR_ADDR)
+				return -EFAULT;
+
+			pte = va_to_linux_pte(vcpu, hpa, tce & TCE_PCI_WRITE,
+					&pg_size);
+			if (!pte)
+				return -EFAULT;
+
+			ret = put_tce_virt_mode(tt, tbl, ioba, hpa,
+					pte, pg_size);
+		} else {
+			ret = clear_tce_virt_mode(tbl, ioba, 0, 1);
+		}
+		iommu_flush_tce(tbl);
+
+		WARN_ON(ret = -EAGAIN);
+		if (ret < 0)
+			return H_PARAMETER;
+
+		return H_SUCCESS;
+	}
+#endif
+
 	/* Emulated IO */
 	return kvmppc_emulated_h_put_tce(tt, ioba, tce);
 }
@@ -213,6 +385,52 @@ long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	if (tces = ERROR_ADDR)
 		return H_TOO_HARD;
 
+#ifdef CONFIG_IOMMU_API
+	if (tt->grp) {
+		long ret = 0;
+		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+		/* Return error if the group is being destroyed */
+		if (!tbl)
+			return H_RESCINDED;
+
+		for (i = 0; i < npages; ++i) {
+			unsigned long hpa, pg_size = 0;
+			pte_t pte = 0;
+			unsigned long tce;
+			unsigned long ptce = tces + i * sizeof(unsigned long);
+
+			if (get_user(tce, (unsigned long __user *)ptce))
+				break;
+
+			hpa = get_virt_address(vcpu, tce);
+			if (hpa = ERROR_ADDR)
+				return -EFAULT;
+
+			pte = va_to_linux_pte(vcpu, hpa,
+					tce & TCE_PCI_WRITE, &pg_size);
+			if (!pte)
+				return -EFAULT;
+
+			ret = put_tce_virt_mode(tt, tbl,
+					ioba + (i << IOMMU_PAGE_SHIFT),
+					hpa, pte, pg_size);
+			if (ret)
+				break;
+		}
+		if (ret)
+			clear_tce_virt_mode(tbl, ioba, 0, i);
+
+		iommu_flush_tce(tbl);
+
+		WARN_ON(ret = -EAGAIN);
+		if (ret < 0)
+			return H_PARAMETER;
+
+		return H_SUCCESS;
+	}
+#endif
+
 	/* Emulated IO */
 	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
 		return H_PARAMETER;
@@ -253,6 +471,26 @@ long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (!tt)
 		return H_TOO_HARD;
 
+#ifdef CONFIG_IOMMU_API
+	if (tt->grp) {
+		long ret;
+		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+		/* Return error if the group is being destroyed */
+		if (!tbl)
+			return H_RESCINDED;
+
+		ret = clear_tce_virt_mode(tbl, ioba,
+				tce_value, npages);
+
+		WARN_ON(ret = -EAGAIN);
+		if (ret < 0)
+			return H_PARAMETER;
+
+		return H_SUCCESS;
+	}
+#endif
+
 	/* Emulated IO */
 	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
 		return H_PARAMETER;
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 55fdf7a..c5e5905 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -26,6 +26,7 @@
 #include <linux/slab.h>
 #include <linux/hugetlb.h>
 #include <linux/list.h>
+#include <linux/iommu.h>
 
 #include <asm/tlbflush.h>
 #include <asm/kvm_ppc.h>
@@ -161,6 +162,85 @@ static unsigned long get_real_address(struct kvm_vcpu *vcpu,
 	return hwaddr;
 }
 
+#ifdef CONFIG_IOMMU_API
+static int clear_tce_real_mode(struct iommu_table *tbl,
+		unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	int ret;
+	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
+
+	ret = iommu_tce_clear_param_check(tbl, ioba, tce_value, npages);
+	if (ret)
+		return ret;
+
+	for ( ; npages; --npages, ++entry) {
+		struct page *page;
+		unsigned long oldtce;
+
+		oldtce = iommu_clear_tce(tbl, entry);
+		if (!oldtce)
+			continue;
+
+		page = realmode_pfn_to_page(oldtce >> PAGE_SHIFT);
+		if (!page) {
+			ret = -EAGAIN;
+			break;
+		}
+
+		if (oldtce & TCE_PCI_WRITE)
+			SetPageDirty(page);
+
+		ret = realmode_put_page(page);
+		if (ret)
+			break;
+	}
+	/* if (ret < 0)
+		pr_err("iommu_tce: %s failed ioba=%lx, tce_value=%lx ret=%d\n",
+				__func__, ioba, tce_value, ret); */
+
+	return ret;
+}
+
+static int put_tce_real_mode(struct kvmppc_spapr_tce_table *tt,
+		struct iommu_table *tbl,
+		unsigned long ioba, unsigned long tce,
+		pte_t pte, unsigned long pg_size)
+{
+	int ret;
+	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
+	struct page *page = NULL;
+	enum dma_data_direction direction = iommu_tce_direction(tce);
+
+	ret = iommu_tce_put_param_check(tbl, ioba, tce);
+	if (ret)
+		return ret;
+
+	if (pg_size != PAGE_SIZE)
+		return -EAGAIN;
+
+	/* Small page case, find page struct to increment a counter */
+	page = realmode_pfn_to_page(tce >> PAGE_SHIFT);
+	if (!page)
+		return -EAGAIN;
+
+	ret = realmode_get_page(page);
+	if (ret)
+		return ret;
+
+	/* tce_build accepts virtual addresses */
+	ret = iommu_tce_build(tbl, entry, (unsigned long) __va(tce), direction);
+	if (ret)
+		realmode_put_page(page);
+
+	/* if (ret < 0)
+		pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%d\n",
+				__func__, ioba, tce, ret); */
+
+	return ret;
+}
+#endif /* CONFIG_IOMMU_API */
+
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba, unsigned long tce)
 {
@@ -171,6 +251,44 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (!tt)
 		return H_TOO_HARD;
 
+	if (tt->virtmode_only)
+		return H_TOO_HARD;
+
+#ifdef CONFIG_IOMMU_API
+	if (tt->grp) {
+		long ret;
+		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+		/* Return error if the group is being destroyed */
+		if (!tbl)
+			return H_RESCINDED;
+
+		if (tce & (TCE_PCI_READ | TCE_PCI_WRITE)) {
+			unsigned long hpa, pg_size = 0;
+			pte_t pte = 0;
+
+			hpa = get_real_address(vcpu, tce, tce & TCE_PCI_WRITE,
+					&pte, &pg_size);
+			if (hpa = ERROR_ADDR)
+				return H_TOO_HARD;
+
+			ret = put_tce_real_mode(tt, tbl, ioba,
+					hpa, pte, pg_size);
+		} else {
+			ret = clear_tce_real_mode(tbl, ioba, 0, 1);
+		}
+		iommu_flush_tce(tbl);
+
+		if (ret = -EAGAIN)
+			return H_TOO_HARD;
+
+		if (ret < 0)
+			return H_PARAMETER;
+
+		return H_SUCCESS;
+	}
+#endif
+
 	/* Emulated IO */
 	return kvmppc_emulated_h_put_tce(tt, ioba, tce);
 }
@@ -192,10 +310,58 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	if (!tt)
 		return H_TOO_HARD;
 
+	if (tt->virtmode_only)
+		return H_TOO_HARD;
+
 	tces = get_real_address(vcpu, tce_list, false, NULL, NULL);
 	if (tces = ERROR_ADDR)
 		return H_TOO_HARD;
 
+#ifdef CONFIG_IOMMU_API
+	if (tt->grp) {
+		long ret = 0;
+		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+		/* Return error if the group is being destroyed */
+		if (!tbl)
+			return H_RESCINDED;
+
+		for (i = 0; i < npages; ++i) {
+			unsigned long hpa, pg_size = 0;
+			pte_t pte = 0;
+			unsigned long tce;
+			unsigned long ptce = tces + i * sizeof(unsigned long);
+
+			if (get_user(tce, (unsigned long __user *)ptce))
+				break;
+
+			hpa = get_real_address(vcpu, tce,
+					tce & TCE_PCI_WRITE,
+					&pte, &pg_size);
+			if (hpa = ERROR_ADDR)
+				ret = -EAGAIN;
+			else
+				ret = put_tce_real_mode(tt, tbl,
+						ioba + (i << IOMMU_PAGE_SHIFT),
+						hpa, pte, pg_size);
+			if (ret)
+				break;
+		}
+		if (ret)
+			clear_tce_real_mode(tbl, ioba, 0, i);
+
+		iommu_flush_tce(tbl);
+
+		if (ret = -EAGAIN)
+			return H_TOO_HARD;
+
+		if (ret < 0)
+			return H_PARAMETER;
+
+		return H_SUCCESS;
+	}
+#endif
+
 	/* Emulated IO */
 	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
 		return H_PARAMETER;
@@ -236,6 +402,32 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (!tt)
 		return H_TOO_HARD;
 
+	if (tt->virtmode_only)
+		return H_TOO_HARD;
+
+#ifdef CONFIG_IOMMU_API
+	if (tt->grp) {
+		long ret;
+		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+		/* Return error if the group is being destroyed */
+		if (!tbl)
+			return H_RESCINDED;
+
+		ret = clear_tce_real_mode(tbl, ioba,
+				tce_value, npages);
+		iommu_flush_tce(tbl);
+
+		if (ret = -EAGAIN)
+			return H_TOO_HARD;
+
+		if (ret < 0)
+			return H_PARAMETER;
+
+		return H_SUCCESS;
+	}
+#endif
+
 	/* Emulated IO */
 	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
 		return H_PARAMETER;
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index b7ad589..269b0f6 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -385,6 +385,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 		break;
 #endif
 	case KVM_CAP_SPAPR_MULTITCE:
+	case KVM_CAP_SPAPR_TCE_IOMMU:
 		r = 1;
 		break;
 	default:
@@ -935,6 +936,17 @@ long kvm_arch_vm_ioctl(struct file *filp,
 		r = kvm_vm_ioctl_create_spapr_tce(kvm, &create_tce);
 		goto out;
 	}
+	case KVM_CREATE_SPAPR_TCE_IOMMU: {
+		struct kvm_create_spapr_tce_iommu create_tce_iommu;
+		struct kvm *kvm = filp->private_data;
+
+		r = -EFAULT;
+		if (copy_from_user(&create_tce_iommu, argp,
+				sizeof(create_tce_iommu)))
+			goto out;
+		r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm, &create_tce_iommu);
+		goto out;
+	}
 #endif /* CONFIG_PPC_BOOK3S_64 */
 
 #ifdef CONFIG_KVM_BOOK3S_64_HV
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 6c04da1..161e1d3 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -641,6 +641,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_RTAS (0x100000 + 87)
 #define KVM_CAP_SPAPR_XICS (0x100000 + 88)
 #define KVM_CAP_SPAPR_MULTITCE (0x110000 + 89)
+#define KVM_CAP_SPAPR_TCE_IOMMU (0x110000 + 90)
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -885,6 +886,7 @@ struct kvm_s390_ucas_mapping {
 #define KVM_PPC_GET_HTAB_FD	  _IOW(KVMIO,  0xaa, struct kvm_get_htab_fd)
 /* Available with KVM_CAP_PPC_RTAS */
 #define KVM_PPC_RTAS_DEFINE_TOKEN _IOW(KVMIO,  0xdc, struct kvm_rtas_token_args)
+#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xaf, struct kvm_create_spapr_tce_iommu)
 
 /*
  * ioctls for vcpu fds
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 6/6] KVM: PPC: Add hugepage support for IOMMU in-kernel handling
  2013-05-06  7:25 ` Alexey Kardashevskiy
  (?)
@ 2013-05-06  7:25   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 44+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-06  7:25 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, Benjamin Herrenschmidt,
	Alexander Graf, Paul Mackerras, Joerg Roedel, Alex Williamson,
	kvm, kvm-ppc

This adds special support for huge pages (16MB).  The reference
counting cannot be easily done for such pages in real mode (when
MMU is off) so we added a list of huge pages.  It is populated in
virtual mode and get_page is called just once per a huge page.
Real mode handlers check if the requested page is huge and in the list,
then no reference counting is done, otherwise an exit to virtual mode
happens.  The list is released at KVM exit.  At the moment the fastest
card available for tests uses up to 9 huge pages so walking through this
list is not very expensive.  However this can change and we may want
to optimize this.

Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 arch/powerpc/include/asm/kvm_host.h |    2 +
 arch/powerpc/include/asm/kvm_ppc.h  |   24 +++++++++++
 arch/powerpc/kvm/book3s_64_vio.c    |   79 ++++++++++++++++++++++++++++++++++-
 arch/powerpc/kvm/book3s_64_vio_hv.c |   47 ++++++++++++++++++++-
 4 files changed, 149 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 2b70cbc..b6a047e 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -180,6 +180,8 @@ struct kvmppc_spapr_tce_table {
 	u32 window_size;
 	bool virtmode_only;
 	struct iommu_group *grp;    /* used for IOMMU groups */
+	struct list_head hugepages; /* used for IOMMU groups */
+	spinlock_t hugepages_lock;  /* used for IOMMU groups */
 	struct page *pages[0];
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index bdfa140..3c95464 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -154,6 +154,30 @@ extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
 		unsigned long liobn, unsigned long ioba,
 		unsigned long tce_value, unsigned long npages);
+
+/*
+ * The KVM guest can be backed with 16MB pages (qemu switch
+ * -mem-path /var/lib/hugetlbfs/global/pagesize-16MB/).
+ * In this case, we cannot do page counting from the real mode
+ * as the compound pages are used - they are linked in a list
+ * with pointers as virtual addresses which are inaccessible
+ * in real mode.
+ *
+ * The code below keeps a 16MB pages list and uses page struct
+ * in real mode if it is already locked in RAM and inserted into
+ * the list or switches to the virtual mode where it can be
+ * handled in a usual manner.
+ */
+struct iommu_kvmppc_hugepage {
+	struct list_head list;
+	pte_t pte;		/* Huge page PTE */
+	unsigned long pa;	/* Base phys address used as a real TCE */
+	struct page *page;	/* page struct of the very first subpage */
+	unsigned long size;	/* Huge page size (always 16MB at the moment) */
+};
+extern struct iommu_kvmppc_hugepage *kvmppc_iommu_hugepage_find(
+		struct kvmppc_spapr_tce_table *tt, pte_t pte);
+
 extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
 				struct kvm_allocate_rma *rma);
 extern struct kvmppc_linear_info *kvm_alloc_rma(void);
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 98cf949..274458d 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -54,6 +54,59 @@ static bool kvmppc_tce_virt_only = false;
 module_param_named(virt_only, kvmppc_tce_virt_only, bool, S_IRUGO | S_IWUSR);
 MODULE_PARM_DESC(virt_only, "Disable realmode handling of IOMMU map/unmap");
 
+#ifdef CONFIG_IOMMU_API
+/*
+ * Adds a new huge page descriptor to the list.
+ */
+static struct iommu_kvmppc_hugepage *kvmppc_iommu_hugepage_add(
+		struct kvmppc_spapr_tce_table *tt,
+		pte_t pte, unsigned long va, unsigned long pg_size)
+{
+	int ret;
+	struct iommu_kvmppc_hugepage *hp;
+	struct page *p;
+
+	va = va & ~(pg_size - 1);
+	ret = get_user_pages_fast(va, 1, true/*write*/, &p);
+	if ((ret != 1) || !p)
+		return NULL;
+
+	hp = kzalloc(sizeof(*hp), GFP_KERNEL);
+	if (!hp)
+		return NULL;
+
+	hp->page = p;
+	hp->pte = pte;
+	hp->pa = __pa((unsigned long) page_address(hp->page));
+	hp->size = pg_size;
+
+	spin_lock(&tt->hugepages_lock);
+	list_add(&hp->list, &tt->hugepages);
+	spin_unlock(&tt->hugepages_lock);
+
+	return hp;
+}
+
+static void kvmppc_iommu_hugepages_init(struct kvmppc_spapr_tce_table *tt)
+{
+	INIT_LIST_HEAD(&tt->hugepages);
+	spin_lock_init(&tt->hugepages_lock);
+}
+
+static void kvmppc_iommu_hugepages_cleanup(struct kvmppc_spapr_tce_table *tt)
+{
+	struct iommu_kvmppc_hugepage *hp, *tmp;
+
+	spin_lock(&tt->hugepages_lock);
+	list_for_each_entry_safe(hp, tmp, &tt->hugepages, list) {
+		list_del(&hp->list);
+		put_page(hp->page);
+		kfree(hp);
+	}
+	spin_unlock(&tt->hugepages_lock);
+}
+#endif /* CONFIG_IOMMU_API */
+
 /*
  * TCE tables handlers.
  */
@@ -73,6 +126,7 @@ static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
 #ifdef CONFIG_IOMMU_API
 	if (stt->grp) {
 		iommu_group_put(stt->grp);
+		kvmppc_iommu_hugepages_cleanup(stt);
 	} else
 #endif
 		for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
@@ -211,6 +265,7 @@ long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
 	kvm_get_kvm(kvm);
 
 	mutex_lock(&kvm->lock);
+	kvmppc_iommu_hugepages_init(tt);
 	list_add(&tt->list, &kvm->arch.spapr_tce_tables);
 
 	mutex_unlock(&kvm->lock);
@@ -259,6 +314,8 @@ static int put_tce_virt_mode(struct kvmppc_spapr_tce_table *tt,
 {
 	int ret;
 	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
+	struct iommu_kvmppc_hugepage *hp;
+	enum dma_data_direction direction = iommu_tce_direction(tce);
 
 	ret = iommu_tce_put_param_check(tbl, ioba, tce);
 	if (ret)
@@ -268,7 +325,27 @@ static int put_tce_virt_mode(struct kvmppc_spapr_tce_table *tt,
 	if (pg_size == PAGE_SIZE)
 		return iommu_put_tce_user_mode(tbl, entry, tce);
 
-	return -EAGAIN;
+	/*
+	 * Hugepages case - manage the hugepage list.
+	 * kvmppc_iommu_hugepage_find() may find a huge page if called
+	 * from h_put_tce_indirect call.
+	 */
+	hp = kvmppc_iommu_hugepage_find(tt, pte);
+	if (!hp) {
+		/* This is the first time usage of this huge page */
+		hp = kvmppc_iommu_hugepage_add(tt, pte, tce, pg_size);
+		if (!hp)
+			return -EFAULT;
+	}
+
+	tce = (unsigned long) __va(hp->pa) + (tce & (pg_size - 1));
+
+	ret = iommu_tce_build(tbl, entry, tce, direction);
+	if (ret < 0)
+		pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%d\n",
+				__func__, ioba, tce, ret);
+
+	return ret;
 }
 
 static pte_t va_to_linux_pte(struct kvm_vcpu *vcpu,
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index c5e5905..a91ff7b 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -43,6 +43,29 @@
 #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
 #define ERROR_ADDR      (~(unsigned long)0x0)
 
+#ifdef CONFIG_IOMMU_API
+/*
+ * Huge pages trick helper.
+ */
+struct iommu_kvmppc_hugepage *kvmppc_iommu_hugepage_find(
+		struct kvmppc_spapr_tce_table *tt, pte_t pte)
+{
+	struct iommu_kvmppc_hugepage *hp, *ret = NULL;
+
+	spin_lock(&tt->hugepages_lock);
+	list_for_each_entry(hp, &tt->hugepages, list) {
+		if (hp->pte == pte) {
+			ret = hp;
+			break;
+		}
+	}
+	spin_unlock(&tt->hugepages_lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(kvmppc_iommu_hugepage_find);
+#endif /* CONFIG_IOMMU_API */
+
 /*
  * Finds a TCE table descriptor by LIOBN.
  */
@@ -191,6 +214,15 @@ static int clear_tce_real_mode(struct iommu_table *tbl,
 		if (oldtce & TCE_PCI_WRITE)
 			SetPageDirty(page);
 
+		/*
+		 * As get_page is called only once on a HUGE page,
+		 * and it is done in virtual mode,
+		 * we do not release it here, instead we postpone it
+		 * till the KVM exit.
+		 */
+		if (PageCompound(page))
+			continue;
+
 		ret = realmode_put_page(page);
 		if (ret)
 			break;
@@ -210,14 +242,25 @@ static int put_tce_real_mode(struct kvmppc_spapr_tce_table *tt,
 	int ret;
 	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
 	struct page *page = NULL;
+	struct iommu_kvmppc_hugepage *hp = NULL;
 	enum dma_data_direction direction = iommu_tce_direction(tce);
 
 	ret = iommu_tce_put_param_check(tbl, ioba, tce);
 	if (ret)
 		return ret;
 
-	if (pg_size != PAGE_SIZE)
-		return -EAGAIN;
+	/* This is a huge page. we continue only if it is already in the list */
+	if (pg_size != PAGE_SIZE) {
+		hp = kvmppc_iommu_hugepage_find(tt, pte);
+
+		/* Go to virtual mode to add a hugepage to the list if not found */
+		if (!hp)
+			return -EAGAIN;
+
+		/* tce_build receives a kernel virtual addresses */
+		return iommu_tce_build(tbl, entry, (unsigned long) __va(tce),
+				direction);
+	}
 
 	/* Small page case, find page struct to increment a counter */
 	page = realmode_pfn_to_page(tce >> PAGE_SHIFT);
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 6/6] KVM: PPC: Add hugepage support for IOMMU in-kernel handling
@ 2013-05-06  7:25   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 44+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-06  7:25 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kvm, Alexey Kardashevskiy, Alexander Graf, kvm-ppc,
	Alex Williamson, Paul Mackerras, Joerg Roedel, David Gibson

This adds special support for huge pages (16MB).  The reference
counting cannot be easily done for such pages in real mode (when
MMU is off) so we added a list of huge pages.  It is populated in
virtual mode and get_page is called just once per a huge page.
Real mode handlers check if the requested page is huge and in the list,
then no reference counting is done, otherwise an exit to virtual mode
happens.  The list is released at KVM exit.  At the moment the fastest
card available for tests uses up to 9 huge pages so walking through this
list is not very expensive.  However this can change and we may want
to optimize this.

Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 arch/powerpc/include/asm/kvm_host.h |    2 +
 arch/powerpc/include/asm/kvm_ppc.h  |   24 +++++++++++
 arch/powerpc/kvm/book3s_64_vio.c    |   79 ++++++++++++++++++++++++++++++++++-
 arch/powerpc/kvm/book3s_64_vio_hv.c |   47 ++++++++++++++++++++-
 4 files changed, 149 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 2b70cbc..b6a047e 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -180,6 +180,8 @@ struct kvmppc_spapr_tce_table {
 	u32 window_size;
 	bool virtmode_only;
 	struct iommu_group *grp;    /* used for IOMMU groups */
+	struct list_head hugepages; /* used for IOMMU groups */
+	spinlock_t hugepages_lock;  /* used for IOMMU groups */
 	struct page *pages[0];
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index bdfa140..3c95464 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -154,6 +154,30 @@ extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
 		unsigned long liobn, unsigned long ioba,
 		unsigned long tce_value, unsigned long npages);
+
+/*
+ * The KVM guest can be backed with 16MB pages (qemu switch
+ * -mem-path /var/lib/hugetlbfs/global/pagesize-16MB/).
+ * In this case, we cannot do page counting from the real mode
+ * as the compound pages are used - they are linked in a list
+ * with pointers as virtual addresses which are inaccessible
+ * in real mode.
+ *
+ * The code below keeps a 16MB pages list and uses page struct
+ * in real mode if it is already locked in RAM and inserted into
+ * the list or switches to the virtual mode where it can be
+ * handled in a usual manner.
+ */
+struct iommu_kvmppc_hugepage {
+	struct list_head list;
+	pte_t pte;		/* Huge page PTE */
+	unsigned long pa;	/* Base phys address used as a real TCE */
+	struct page *page;	/* page struct of the very first subpage */
+	unsigned long size;	/* Huge page size (always 16MB at the moment) */
+};
+extern struct iommu_kvmppc_hugepage *kvmppc_iommu_hugepage_find(
+		struct kvmppc_spapr_tce_table *tt, pte_t pte);
+
 extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
 				struct kvm_allocate_rma *rma);
 extern struct kvmppc_linear_info *kvm_alloc_rma(void);
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 98cf949..274458d 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -54,6 +54,59 @@ static bool kvmppc_tce_virt_only = false;
 module_param_named(virt_only, kvmppc_tce_virt_only, bool, S_IRUGO | S_IWUSR);
 MODULE_PARM_DESC(virt_only, "Disable realmode handling of IOMMU map/unmap");
 
+#ifdef CONFIG_IOMMU_API
+/*
+ * Adds a new huge page descriptor to the list.
+ */
+static struct iommu_kvmppc_hugepage *kvmppc_iommu_hugepage_add(
+		struct kvmppc_spapr_tce_table *tt,
+		pte_t pte, unsigned long va, unsigned long pg_size)
+{
+	int ret;
+	struct iommu_kvmppc_hugepage *hp;
+	struct page *p;
+
+	va = va & ~(pg_size - 1);
+	ret = get_user_pages_fast(va, 1, true/*write*/, &p);
+	if ((ret != 1) || !p)
+		return NULL;
+
+	hp = kzalloc(sizeof(*hp), GFP_KERNEL);
+	if (!hp)
+		return NULL;
+
+	hp->page = p;
+	hp->pte = pte;
+	hp->pa = __pa((unsigned long) page_address(hp->page));
+	hp->size = pg_size;
+
+	spin_lock(&tt->hugepages_lock);
+	list_add(&hp->list, &tt->hugepages);
+	spin_unlock(&tt->hugepages_lock);
+
+	return hp;
+}
+
+static void kvmppc_iommu_hugepages_init(struct kvmppc_spapr_tce_table *tt)
+{
+	INIT_LIST_HEAD(&tt->hugepages);
+	spin_lock_init(&tt->hugepages_lock);
+}
+
+static void kvmppc_iommu_hugepages_cleanup(struct kvmppc_spapr_tce_table *tt)
+{
+	struct iommu_kvmppc_hugepage *hp, *tmp;
+
+	spin_lock(&tt->hugepages_lock);
+	list_for_each_entry_safe(hp, tmp, &tt->hugepages, list) {
+		list_del(&hp->list);
+		put_page(hp->page);
+		kfree(hp);
+	}
+	spin_unlock(&tt->hugepages_lock);
+}
+#endif /* CONFIG_IOMMU_API */
+
 /*
  * TCE tables handlers.
  */
@@ -73,6 +126,7 @@ static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
 #ifdef CONFIG_IOMMU_API
 	if (stt->grp) {
 		iommu_group_put(stt->grp);
+		kvmppc_iommu_hugepages_cleanup(stt);
 	} else
 #endif
 		for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
@@ -211,6 +265,7 @@ long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
 	kvm_get_kvm(kvm);
 
 	mutex_lock(&kvm->lock);
+	kvmppc_iommu_hugepages_init(tt);
 	list_add(&tt->list, &kvm->arch.spapr_tce_tables);
 
 	mutex_unlock(&kvm->lock);
@@ -259,6 +314,8 @@ static int put_tce_virt_mode(struct kvmppc_spapr_tce_table *tt,
 {
 	int ret;
 	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
+	struct iommu_kvmppc_hugepage *hp;
+	enum dma_data_direction direction = iommu_tce_direction(tce);
 
 	ret = iommu_tce_put_param_check(tbl, ioba, tce);
 	if (ret)
@@ -268,7 +325,27 @@ static int put_tce_virt_mode(struct kvmppc_spapr_tce_table *tt,
 	if (pg_size == PAGE_SIZE)
 		return iommu_put_tce_user_mode(tbl, entry, tce);
 
-	return -EAGAIN;
+	/*
+	 * Hugepages case - manage the hugepage list.
+	 * kvmppc_iommu_hugepage_find() may find a huge page if called
+	 * from h_put_tce_indirect call.
+	 */
+	hp = kvmppc_iommu_hugepage_find(tt, pte);
+	if (!hp) {
+		/* This is the first time usage of this huge page */
+		hp = kvmppc_iommu_hugepage_add(tt, pte, tce, pg_size);
+		if (!hp)
+			return -EFAULT;
+	}
+
+	tce = (unsigned long) __va(hp->pa) + (tce & (pg_size - 1));
+
+	ret = iommu_tce_build(tbl, entry, tce, direction);
+	if (ret < 0)
+		pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%d\n",
+				__func__, ioba, tce, ret);
+
+	return ret;
 }
 
 static pte_t va_to_linux_pte(struct kvm_vcpu *vcpu,
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index c5e5905..a91ff7b 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -43,6 +43,29 @@
 #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
 #define ERROR_ADDR      (~(unsigned long)0x0)
 
+#ifdef CONFIG_IOMMU_API
+/*
+ * Huge pages trick helper.
+ */
+struct iommu_kvmppc_hugepage *kvmppc_iommu_hugepage_find(
+		struct kvmppc_spapr_tce_table *tt, pte_t pte)
+{
+	struct iommu_kvmppc_hugepage *hp, *ret = NULL;
+
+	spin_lock(&tt->hugepages_lock);
+	list_for_each_entry(hp, &tt->hugepages, list) {
+		if (hp->pte == pte) {
+			ret = hp;
+			break;
+		}
+	}
+	spin_unlock(&tt->hugepages_lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(kvmppc_iommu_hugepage_find);
+#endif /* CONFIG_IOMMU_API */
+
 /*
  * Finds a TCE table descriptor by LIOBN.
  */
@@ -191,6 +214,15 @@ static int clear_tce_real_mode(struct iommu_table *tbl,
 		if (oldtce & TCE_PCI_WRITE)
 			SetPageDirty(page);
 
+		/*
+		 * As get_page is called only once on a HUGE page,
+		 * and it is done in virtual mode,
+		 * we do not release it here, instead we postpone it
+		 * till the KVM exit.
+		 */
+		if (PageCompound(page))
+			continue;
+
 		ret = realmode_put_page(page);
 		if (ret)
 			break;
@@ -210,14 +242,25 @@ static int put_tce_real_mode(struct kvmppc_spapr_tce_table *tt,
 	int ret;
 	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
 	struct page *page = NULL;
+	struct iommu_kvmppc_hugepage *hp = NULL;
 	enum dma_data_direction direction = iommu_tce_direction(tce);
 
 	ret = iommu_tce_put_param_check(tbl, ioba, tce);
 	if (ret)
 		return ret;
 
-	if (pg_size != PAGE_SIZE)
-		return -EAGAIN;
+	/* This is a huge page. we continue only if it is already in the list */
+	if (pg_size != PAGE_SIZE) {
+		hp = kvmppc_iommu_hugepage_find(tt, pte);
+
+		/* Go to virtual mode to add a hugepage to the list if not found */
+		if (!hp)
+			return -EAGAIN;
+
+		/* tce_build receives a kernel virtual addresses */
+		return iommu_tce_build(tbl, entry, (unsigned long) __va(tce),
+				direction);
+	}
 
 	/* Small page case, find page struct to increment a counter */
 	page = realmode_pfn_to_page(tce >> PAGE_SHIFT);
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 6/6] KVM: PPC: Add hugepage support for IOMMU in-kernel handling
@ 2013-05-06  7:25   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 44+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-06  7:25 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, Benjamin Herrenschmidt,
	Alexander Graf, Paul Mackerras, Joerg Roedel, Alex Williamson,
	kvm, kvm-ppc

This adds special support for huge pages (16MB).  The reference
counting cannot be easily done for such pages in real mode (when
MMU is off) so we added a list of huge pages.  It is populated in
virtual mode and get_page is called just once per a huge page.
Real mode handlers check if the requested page is huge and in the list,
then no reference counting is done, otherwise an exit to virtual mode
happens.  The list is released at KVM exit.  At the moment the fastest
card available for tests uses up to 9 huge pages so walking through this
list is not very expensive.  However this can change and we may want
to optimize this.

Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 arch/powerpc/include/asm/kvm_host.h |    2 +
 arch/powerpc/include/asm/kvm_ppc.h  |   24 +++++++++++
 arch/powerpc/kvm/book3s_64_vio.c    |   79 ++++++++++++++++++++++++++++++++++-
 arch/powerpc/kvm/book3s_64_vio_hv.c |   47 ++++++++++++++++++++-
 4 files changed, 149 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 2b70cbc..b6a047e 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -180,6 +180,8 @@ struct kvmppc_spapr_tce_table {
 	u32 window_size;
 	bool virtmode_only;
 	struct iommu_group *grp;    /* used for IOMMU groups */
+	struct list_head hugepages; /* used for IOMMU groups */
+	spinlock_t hugepages_lock;  /* used for IOMMU groups */
 	struct page *pages[0];
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index bdfa140..3c95464 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -154,6 +154,30 @@ extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
 		unsigned long liobn, unsigned long ioba,
 		unsigned long tce_value, unsigned long npages);
+
+/*
+ * The KVM guest can be backed with 16MB pages (qemu switch
+ * -mem-path /var/lib/hugetlbfs/global/pagesize-16MB/).
+ * In this case, we cannot do page counting from the real mode
+ * as the compound pages are used - they are linked in a list
+ * with pointers as virtual addresses which are inaccessible
+ * in real mode.
+ *
+ * The code below keeps a 16MB pages list and uses page struct
+ * in real mode if it is already locked in RAM and inserted into
+ * the list or switches to the virtual mode where it can be
+ * handled in a usual manner.
+ */
+struct iommu_kvmppc_hugepage {
+	struct list_head list;
+	pte_t pte;		/* Huge page PTE */
+	unsigned long pa;	/* Base phys address used as a real TCE */
+	struct page *page;	/* page struct of the very first subpage */
+	unsigned long size;	/* Huge page size (always 16MB at the moment) */
+};
+extern struct iommu_kvmppc_hugepage *kvmppc_iommu_hugepage_find(
+		struct kvmppc_spapr_tce_table *tt, pte_t pte);
+
 extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
 				struct kvm_allocate_rma *rma);
 extern struct kvmppc_linear_info *kvm_alloc_rma(void);
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 98cf949..274458d 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -54,6 +54,59 @@ static bool kvmppc_tce_virt_only = false;
 module_param_named(virt_only, kvmppc_tce_virt_only, bool, S_IRUGO | S_IWUSR);
 MODULE_PARM_DESC(virt_only, "Disable realmode handling of IOMMU map/unmap");
 
+#ifdef CONFIG_IOMMU_API
+/*
+ * Adds a new huge page descriptor to the list.
+ */
+static struct iommu_kvmppc_hugepage *kvmppc_iommu_hugepage_add(
+		struct kvmppc_spapr_tce_table *tt,
+		pte_t pte, unsigned long va, unsigned long pg_size)
+{
+	int ret;
+	struct iommu_kvmppc_hugepage *hp;
+	struct page *p;
+
+	va = va & ~(pg_size - 1);
+	ret = get_user_pages_fast(va, 1, true/*write*/, &p);
+	if ((ret != 1) || !p)
+		return NULL;
+
+	hp = kzalloc(sizeof(*hp), GFP_KERNEL);
+	if (!hp)
+		return NULL;
+
+	hp->page = p;
+	hp->pte = pte;
+	hp->pa = __pa((unsigned long) page_address(hp->page));
+	hp->size = pg_size;
+
+	spin_lock(&tt->hugepages_lock);
+	list_add(&hp->list, &tt->hugepages);
+	spin_unlock(&tt->hugepages_lock);
+
+	return hp;
+}
+
+static void kvmppc_iommu_hugepages_init(struct kvmppc_spapr_tce_table *tt)
+{
+	INIT_LIST_HEAD(&tt->hugepages);
+	spin_lock_init(&tt->hugepages_lock);
+}
+
+static void kvmppc_iommu_hugepages_cleanup(struct kvmppc_spapr_tce_table *tt)
+{
+	struct iommu_kvmppc_hugepage *hp, *tmp;
+
+	spin_lock(&tt->hugepages_lock);
+	list_for_each_entry_safe(hp, tmp, &tt->hugepages, list) {
+		list_del(&hp->list);
+		put_page(hp->page);
+		kfree(hp);
+	}
+	spin_unlock(&tt->hugepages_lock);
+}
+#endif /* CONFIG_IOMMU_API */
+
 /*
  * TCE tables handlers.
  */
@@ -73,6 +126,7 @@ static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
 #ifdef CONFIG_IOMMU_API
 	if (stt->grp) {
 		iommu_group_put(stt->grp);
+		kvmppc_iommu_hugepages_cleanup(stt);
 	} else
 #endif
 		for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
@@ -211,6 +265,7 @@ long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
 	kvm_get_kvm(kvm);
 
 	mutex_lock(&kvm->lock);
+	kvmppc_iommu_hugepages_init(tt);
 	list_add(&tt->list, &kvm->arch.spapr_tce_tables);
 
 	mutex_unlock(&kvm->lock);
@@ -259,6 +314,8 @@ static int put_tce_virt_mode(struct kvmppc_spapr_tce_table *tt,
 {
 	int ret;
 	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
+	struct iommu_kvmppc_hugepage *hp;
+	enum dma_data_direction direction = iommu_tce_direction(tce);
 
 	ret = iommu_tce_put_param_check(tbl, ioba, tce);
 	if (ret)
@@ -268,7 +325,27 @@ static int put_tce_virt_mode(struct kvmppc_spapr_tce_table *tt,
 	if (pg_size = PAGE_SIZE)
 		return iommu_put_tce_user_mode(tbl, entry, tce);
 
-	return -EAGAIN;
+	/*
+	 * Hugepages case - manage the hugepage list.
+	 * kvmppc_iommu_hugepage_find() may find a huge page if called
+	 * from h_put_tce_indirect call.
+	 */
+	hp = kvmppc_iommu_hugepage_find(tt, pte);
+	if (!hp) {
+		/* This is the first time usage of this huge page */
+		hp = kvmppc_iommu_hugepage_add(tt, pte, tce, pg_size);
+		if (!hp)
+			return -EFAULT;
+	}
+
+	tce = (unsigned long) __va(hp->pa) + (tce & (pg_size - 1));
+
+	ret = iommu_tce_build(tbl, entry, tce, direction);
+	if (ret < 0)
+		pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%d\n",
+				__func__, ioba, tce, ret);
+
+	return ret;
 }
 
 static pte_t va_to_linux_pte(struct kvm_vcpu *vcpu,
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index c5e5905..a91ff7b 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -43,6 +43,29 @@
 #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
 #define ERROR_ADDR      (~(unsigned long)0x0)
 
+#ifdef CONFIG_IOMMU_API
+/*
+ * Huge pages trick helper.
+ */
+struct iommu_kvmppc_hugepage *kvmppc_iommu_hugepage_find(
+		struct kvmppc_spapr_tce_table *tt, pte_t pte)
+{
+	struct iommu_kvmppc_hugepage *hp, *ret = NULL;
+
+	spin_lock(&tt->hugepages_lock);
+	list_for_each_entry(hp, &tt->hugepages, list) {
+		if (hp->pte = pte) {
+			ret = hp;
+			break;
+		}
+	}
+	spin_unlock(&tt->hugepages_lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(kvmppc_iommu_hugepage_find);
+#endif /* CONFIG_IOMMU_API */
+
 /*
  * Finds a TCE table descriptor by LIOBN.
  */
@@ -191,6 +214,15 @@ static int clear_tce_real_mode(struct iommu_table *tbl,
 		if (oldtce & TCE_PCI_WRITE)
 			SetPageDirty(page);
 
+		/*
+		 * As get_page is called only once on a HUGE page,
+		 * and it is done in virtual mode,
+		 * we do not release it here, instead we postpone it
+		 * till the KVM exit.
+		 */
+		if (PageCompound(page))
+			continue;
+
 		ret = realmode_put_page(page);
 		if (ret)
 			break;
@@ -210,14 +242,25 @@ static int put_tce_real_mode(struct kvmppc_spapr_tce_table *tt,
 	int ret;
 	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
 	struct page *page = NULL;
+	struct iommu_kvmppc_hugepage *hp = NULL;
 	enum dma_data_direction direction = iommu_tce_direction(tce);
 
 	ret = iommu_tce_put_param_check(tbl, ioba, tce);
 	if (ret)
 		return ret;
 
-	if (pg_size != PAGE_SIZE)
-		return -EAGAIN;
+	/* This is a huge page. we continue only if it is already in the list */
+	if (pg_size != PAGE_SIZE) {
+		hp = kvmppc_iommu_hugepage_find(tt, pte);
+
+		/* Go to virtual mode to add a hugepage to the list if not found */
+		if (!hp)
+			return -EAGAIN;
+
+		/* tce_build receives a kernel virtual addresses */
+		return iommu_tce_build(tbl, entry, (unsigned long) __va(tce),
+				direction);
+	}
 
 	/* Small page case, find page struct to increment a counter */
 	page = realmode_pfn_to_page(tce >> PAGE_SHIFT);
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/6] KVM: PPC: Add support for multiple-TCE hcalls
  2013-05-06  7:25   ` Alexey Kardashevskiy
  (?)
@ 2013-05-07  5:05     ` David Gibson
  -1 siblings, 0 replies; 44+ messages in thread
From: David Gibson @ 2013-05-07  5:05 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, kvm, Alexander Graf, kvm-ppc, Alex Williamson,
	Paul Mackerras, Joerg Roedel

[-- Attachment #1: Type: text/plain, Size: 1350 bytes --]

On Mon, May 06, 2013 at 05:25:53PM +1000, Alexey Kardashevskiy wrote:
> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
> H_STUFF_TCE hypercalls for QEMU emulated devices such as virtio
> devices or emulated PCI.  These calls allow adding multiple entries
> (up to 512) into the TCE table in one call which saves time on
> transition to/from real mode.
> 
> This adds a guest physical to host real address converter
> and calls the existing H_PUT_TCE handler. The converting function
> is going to be fully utilized by upcoming VFIO supporting patches.
> 
> This also implements the KVM_CAP_PPC_MULTITCE capability,
> so in order to support the functionality of this patch, QEMU
> needs to query for this capability and set the "hcall-multi-tce"
> hypertas property only if the capability is present, otherwise
> there will be serious performance degradation.
> 
> Cc: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Signed-off-by: Paul Mackerras <paulus@samba.org>

Fwiw, it would be nice to get this patch merged, regardless of the
rest of the VFIO/powerpc patches.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/6] KVM: PPC: Add support for multiple-TCE hcalls
@ 2013-05-07  5:05     ` David Gibson
  0 siblings, 0 replies; 44+ messages in thread
From: David Gibson @ 2013-05-07  5:05 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Joerg Roedel, Alexander Graf, kvm-ppc, Alex Williamson,
	Paul Mackerras, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 1350 bytes --]

On Mon, May 06, 2013 at 05:25:53PM +1000, Alexey Kardashevskiy wrote:
> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
> H_STUFF_TCE hypercalls for QEMU emulated devices such as virtio
> devices or emulated PCI.  These calls allow adding multiple entries
> (up to 512) into the TCE table in one call which saves time on
> transition to/from real mode.
> 
> This adds a guest physical to host real address converter
> and calls the existing H_PUT_TCE handler. The converting function
> is going to be fully utilized by upcoming VFIO supporting patches.
> 
> This also implements the KVM_CAP_PPC_MULTITCE capability,
> so in order to support the functionality of this patch, QEMU
> needs to query for this capability and set the "hcall-multi-tce"
> hypertas property only if the capability is present, otherwise
> there will be serious performance degradation.
> 
> Cc: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Signed-off-by: Paul Mackerras <paulus@samba.org>

Fwiw, it would be nice to get this patch merged, regardless of the
rest of the VFIO/powerpc patches.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/6] KVM: PPC: Add support for multiple-TCE hcalls
@ 2013-05-07  5:05     ` David Gibson
  0 siblings, 0 replies; 44+ messages in thread
From: David Gibson @ 2013-05-07  5:05 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, kvm, Alexander Graf, kvm-ppc, Alex Williamson,
	Paul Mackerras, Joerg Roedel

[-- Attachment #1: Type: text/plain, Size: 1350 bytes --]

On Mon, May 06, 2013 at 05:25:53PM +1000, Alexey Kardashevskiy wrote:
> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
> H_STUFF_TCE hypercalls for QEMU emulated devices such as virtio
> devices or emulated PCI.  These calls allow adding multiple entries
> (up to 512) into the TCE table in one call which saves time on
> transition to/from real mode.
> 
> This adds a guest physical to host real address converter
> and calls the existing H_PUT_TCE handler. The converting function
> is going to be fully utilized by upcoming VFIO supporting patches.
> 
> This also implements the KVM_CAP_PPC_MULTITCE capability,
> so in order to support the functionality of this patch, QEMU
> needs to query for this capability and set the "hcall-multi-tce"
> hypertas property only if the capability is present, otherwise
> there will be serious performance degradation.
> 
> Cc: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Signed-off-by: Paul Mackerras <paulus@samba.org>

Fwiw, it would be nice to get this patch merged, regardless of the
rest of the VFIO/powerpc patches.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/6] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-05-06  7:25   ` Alexey Kardashevskiy
  (?)
@ 2013-05-07  5:29     ` David Gibson
  -1 siblings, 0 replies; 44+ messages in thread
From: David Gibson @ 2013-05-07  5:29 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, kvm, Alexander Graf, kvm-ppc, Alex Williamson,
	Paul Mackerras, Joerg Roedel

[-- Attachment #1: Type: text/plain, Size: 21989 bytes --]

On Mon, May 06, 2013 at 05:25:56PM +1000, Alexey Kardashevskiy wrote:
> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> and H_STUFF_TCE requests without passing them to QEMU, which should
> save time on switching to QEMU and back.
> 
> Both real and virtual modes are supported - whenever the kernel
> fails to handle TCE request, it passes it to the virtual mode.
> If it the virtual mode handlers fail, then the request is passed
> to the user mode, for example, to QEMU.
> 
> This adds a new KVM_CAP_SPAPR_TCE_IOMMU ioctl to asssociate
> a virtual PCI bus ID (LIOBN) with an IOMMU group, which enables
> in-kernel handling of IOMMU map/unmap.
> 
> This adds a special case for huge pages (16MB).  The reference
> counting cannot be easily done for such pages in real mode (when
> MMU is off) so we added a list of huge pages.  It is populated in
> virtual mode and get_page is called just once per a huge page.
> Real mode handlers check if the requested page is huge and in the list,
> then no reference counting is done, otherwise an exit to virtual mode
> happens.  The list is released at KVM exit.  At the moment the fastest
> card available for tests uses up to 9 huge pages so walking through this
> list is not very expensive.  However this can change and we may want
> to optimize this.
> 
> This also adds the virt_only parameter to the KVM module
> for debug and performance check purposes.
> 
> Tests show that this patch increases transmission speed from 220MB/s
> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
> 
> Cc: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Signed-off-by: Paul Mackerras <paulus@samba.org>
> ---
>  Documentation/virtual/kvm/api.txt   |   28 ++++
>  arch/powerpc/include/asm/kvm_host.h |    2 +
>  arch/powerpc/include/asm/kvm_ppc.h  |    2 +
>  arch/powerpc/include/uapi/asm/kvm.h |    7 +
>  arch/powerpc/kvm/book3s_64_vio.c    |  242 ++++++++++++++++++++++++++++++++++-
>  arch/powerpc/kvm/book3s_64_vio_hv.c |  192 +++++++++++++++++++++++++++
>  arch/powerpc/kvm/powerpc.c          |   12 ++
>  include/uapi/linux/kvm.h            |    2 +
>  8 files changed, 485 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index f621cd6..2039767 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -2127,6 +2127,34 @@ written, then `n_invalid' invalid entries, invalidating any previously
>  valid entries found.
>  
>  
> +4.79 KVM_CREATE_SPAPR_TCE_IOMMU
> +
> +Capability: KVM_CAP_SPAPR_TCE_IOMMU
> +Architectures: powerpc
> +Type: vm ioctl
> +Parameters: struct kvm_create_spapr_tce_iommu (in)
> +Returns: 0 on success, -1 on error
> +
> +This creates a link between IOMMU group and a hardware TCE (translation
> +control entry) table. This link lets the host kernel know what IOMMU
> +group (i.e. TCE table) to use for the LIOBN number passed with
> +H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls.
> +
> +/* for KVM_CAP_SPAPR_TCE_IOMMU */
> +struct kvm_create_spapr_tce_iommu {
> +	__u64 liobn;
> +	__u32 iommu_id;

Wouldn't it be more in keeping 

> +	__u32 flags;
> +};
> +
> +No flag is supported at the moment.
> +
> +When the guest issues TCE call on a liobn for which a TCE table has been
> +registered, the kernel will handle it in real mode, updating the hardware
> +TCE table. TCE table calls for other liobns will cause a vm exit and must
> +be handled by userspace.
> +
> +
>  5. The kvm_run structure
>  ------------------------
>  
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index 36ceb0d..2b70cbc 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -178,6 +178,8 @@ struct kvmppc_spapr_tce_table {
>  	struct kvm *kvm;
>  	u64 liobn;
>  	u32 window_size;
> +	bool virtmode_only;

I see this is now initialized from the global parameter, but I think
it would be better to just check the global (debug) parameter
directly, rather than duplicating it here.

> +	struct iommu_group *grp;    /* used for IOMMU groups */
>  	struct page *pages[0];
>  };
>  
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> index d501246..bdfa140 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -139,6 +139,8 @@ extern void kvmppc_xics_free(struct kvm *kvm);
>  
>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  				struct kvm_create_spapr_tce *args);
> +extern long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
> +				struct kvm_create_spapr_tce_iommu *args);
>  extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
>  		struct kvm_vcpu *vcpu, unsigned long liobn);
>  extern long kvmppc_emulated_h_put_tce(struct kvmppc_spapr_tce_table *stt,
> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
> index 681b314..b67d44b 100644
> --- a/arch/powerpc/include/uapi/asm/kvm.h
> +++ b/arch/powerpc/include/uapi/asm/kvm.h
> @@ -291,6 +291,13 @@ struct kvm_create_spapr_tce {
>  	__u32 window_size;
>  };
>  
> +/* for KVM_CAP_SPAPR_TCE_IOMMU */
> +struct kvm_create_spapr_tce_iommu {
> +	__u64 liobn;
> +	__u32 iommu_id;
> +	__u32 flags;
> +};
> +
>  /* for KVM_ALLOCATE_RMA */
>  struct kvm_allocate_rma {
>  	__u64 rma_size;
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> index 643ac1e..98cf949 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -27,6 +27,9 @@
>  #include <linux/hugetlb.h>
>  #include <linux/list.h>
>  #include <linux/anon_inodes.h>
> +#include <linux/pci.h>
> +#include <linux/iommu.h>
> +#include <linux/module.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/kvm_ppc.h>
> @@ -38,10 +41,19 @@
>  #include <asm/kvm_host.h>
>  #include <asm/udbg.h>
>  #include <asm/iommu.h>
> +#include <asm/tce.h>
> +
> +#define DRIVER_VERSION	"0.1"
> +#define DRIVER_AUTHOR	"Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>"
> +#define DRIVER_DESC	"POWERPC KVM driver"

Really?

>  
>  #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
>  #define ERROR_ADDR      (~(unsigned long)0x0)
>  
> +static bool kvmppc_tce_virt_only = false;
> +module_param_named(virt_only, kvmppc_tce_virt_only, bool, S_IRUGO | S_IWUSR);
> +MODULE_PARM_DESC(virt_only, "Disable realmode handling of IOMMU map/unmap");
> +
>  /*
>   * TCE tables handlers.
>   */
> @@ -58,8 +70,13 @@ static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
>  
>  	mutex_lock(&kvm->lock);
>  	list_del(&stt->list);
> -	for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
> -		__free_page(stt->pages[i]);
> +#ifdef CONFIG_IOMMU_API
> +	if (stt->grp) {
> +		iommu_group_put(stt->grp);
> +	} else
> +#endif
> +		for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
> +			__free_page(stt->pages[i]);
>  	kfree(stt);
>  	mutex_unlock(&kvm->lock);
>  
> @@ -155,9 +172,127 @@ fail:
>  	return ret;
>  }
>  
> +#ifdef CONFIG_IOMMU_API
> +static const struct file_operations kvm_spapr_tce_iommu_fops = {
> +	.release	= kvm_spapr_tce_release,
> +};
> +
> +long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
> +		struct kvm_create_spapr_tce_iommu *args)
> +{
> +	struct kvmppc_spapr_tce_table *tt = NULL;
> +	struct iommu_group *grp;
> +	struct iommu_table *tbl;
> +
> +	/* Find an IOMMU table for the given ID */
> +	grp = iommu_group_get_by_id(args->iommu_id);
> +	if (!grp)
> +		return -ENXIO;
> +
> +	tbl = iommu_group_get_iommudata(grp);
> +	if (!tbl)
> +		return -ENXIO;
> +
> +	/* Check this LIOBN hasn't been previously allocated */
> +	list_for_each_entry(tt, &kvm->arch.spapr_tce_tables, list) {
> +		if (tt->liobn == args->liobn)
> +			return -EBUSY;
> +	}
> +
> +	tt = kzalloc(sizeof(*tt), GFP_KERNEL);
> +	if (!tt)
> +		return -ENOMEM;
> +
> +	tt->liobn = args->liobn;
> +	tt->kvm = kvm;
> +	tt->virtmode_only = kvmppc_tce_virt_only;
> +	tt->grp = grp;
> +
> +	kvm_get_kvm(kvm);
> +
> +	mutex_lock(&kvm->lock);
> +	list_add(&tt->list, &kvm->arch.spapr_tce_tables);
> +
> +	mutex_unlock(&kvm->lock);
> +
> +	pr_debug("LIOBN=%llX hooked to IOMMU %d, flags=%u\n",
> +			args->liobn, args->iommu_id, args->flags);
> +
> +	return anon_inode_getfd("kvm-spapr-tce-iommu",
> +			&kvm_spapr_tce_iommu_fops, tt, O_RDWR);
> +}
> +#else
> +long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
> +		struct kvm_create_spapr_tce_iommu *args)
> +{
> +	return -ENOSYS;
> +}
> +#endif /* CONFIG_IOMMU_API */
> +
> +#ifdef CONFIG_IOMMU_API
>  /*
>   * Virtual mode handling of IOMMU map/unmap.
>   */
> +static int clear_tce_virt_mode(struct iommu_table *tbl,
> +		unsigned long ioba, unsigned long tce_value,
> +		unsigned long npages)
> +{
> +	int ret;
> +	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
> +
> +	ret = iommu_tce_clear_param_check(tbl, ioba, tce_value, npages);
> +	if (ret)
> +		return ret;
> +
> +	ret = iommu_clear_tces_and_put_pages(tbl, entry, npages);
> +	if (ret < 0)
> +		pr_err("iommu_tce: %s failed ioba=%lx, tce_value=%lx ret=%d\n",
> +				__func__, ioba, tce_value, ret);
> +
> +	return ret;
> +}
> +
> +static int put_tce_virt_mode(struct kvmppc_spapr_tce_table *tt,
> +		struct iommu_table *tbl,
> +		unsigned long ioba, unsigned long tce,
> +		pte_t pte, unsigned long pg_size)
> +{
> +	int ret;
> +	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
> +
> +	ret = iommu_tce_put_param_check(tbl, ioba, tce);
> +	if (ret)
> +		return ret;
> +
> +	/* System page size case, easy to handle */
> +	if (pg_size == PAGE_SIZE)
> +		return iommu_put_tce_user_mode(tbl, entry, tce);
> +
> +	return -EAGAIN;
> +}
> +
> +static pte_t va_to_linux_pte(struct kvm_vcpu *vcpu,
> +		unsigned long hva, bool writing, unsigned long *pg_sizep)
> +{
> +#ifdef CONFIG_KVM_BOOK3S_64_HV
> +	/* Find out the page pte and size if requested */
> +	pte_t pte;
> +	unsigned long pg_size = 0;
> +
> +	pte = lookup_linux_pte(vcpu->arch.pgdir, hva,
> +			writing, &pg_size);
> +	if (!pte_present(pte))
> +		return 0;
> +
> +	*pg_sizep = pg_size;
> +
> +	return pte;
> +#else
> +	return 0;
> +#endif
> +}
> +#endif /* CONFIG_IOMMU_API */
> +
>  /* Converts guest physical address into host virtual */
>  static unsigned long get_virt_address(struct kvm_vcpu *vcpu,
>  		unsigned long gpa)
> @@ -188,6 +323,43 @@ long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
>  	if (!tt)
>  		return H_TOO_HARD;
>  
> +#ifdef CONFIG_IOMMU_API
> +	if (tt->grp) {
> +		long ret;
> +		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
> +
> +		/* Return error if the group is being destroyed */
> +		if (!tbl)
> +			return H_RESCINDED;
> +
> +		if (tce & (TCE_PCI_READ | TCE_PCI_WRITE)) {
> +			unsigned long hpa, pg_size = 0;
> +			pte_t pte;
> +
> +			hpa = get_virt_address(vcpu, tce);
> +			if (hpa == ERROR_ADDR)
> +				return -EFAULT;
> +
> +			pte = va_to_linux_pte(vcpu, hpa, tce & TCE_PCI_WRITE,
> +					&pg_size);
> +			if (!pte)
> +				return -EFAULT;
> +
> +			ret = put_tce_virt_mode(tt, tbl, ioba, hpa,
> +					pte, pg_size);
> +		} else {
> +			ret = clear_tce_virt_mode(tbl, ioba, 0, 1);
> +		}
> +		iommu_flush_tce(tbl);
> +
> +		WARN_ON(ret == -EAGAIN);
> +		if (ret < 0)
> +			return H_PARAMETER;
> +
> +		return H_SUCCESS;
> +	}
> +#endif
> +
>  	/* Emulated IO */
>  	return kvmppc_emulated_h_put_tce(tt, ioba, tce);
>  }
> @@ -213,6 +385,52 @@ long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	if (tces == ERROR_ADDR)
>  		return H_TOO_HARD;
>  
> +#ifdef CONFIG_IOMMU_API
> +	if (tt->grp) {
> +		long ret = 0;
> +		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
> +
> +		/* Return error if the group is being destroyed */
> +		if (!tbl)
> +			return H_RESCINDED;
> +
> +		for (i = 0; i < npages; ++i) {
> +			unsigned long hpa, pg_size = 0;
> +			pte_t pte = 0;
> +			unsigned long tce;
> +			unsigned long ptce = tces + i * sizeof(unsigned long);
> +
> +			if (get_user(tce, (unsigned long __user *)ptce))
> +				break;
> +
> +			hpa = get_virt_address(vcpu, tce);
> +			if (hpa == ERROR_ADDR)
> +				return -EFAULT;
> +
> +			pte = va_to_linux_pte(vcpu, hpa,
> +					tce & TCE_PCI_WRITE, &pg_size);
> +			if (!pte)
> +				return -EFAULT;
> +
> +			ret = put_tce_virt_mode(tt, tbl,
> +					ioba + (i << IOMMU_PAGE_SHIFT),
> +					hpa, pte, pg_size);
> +			if (ret)
> +				break;
> +		}
> +		if (ret)
> +			clear_tce_virt_mode(tbl, ioba, 0, i);
> +
> +		iommu_flush_tce(tbl);
> +
> +		WARN_ON(ret == -EAGAIN);
> +		if (ret < 0)
> +			return H_PARAMETER;
> +
> +		return H_SUCCESS;
> +	}
> +#endif
> +
>  	/* Emulated IO */
>  	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
>  		return H_PARAMETER;
> @@ -253,6 +471,26 @@ long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
>  	if (!tt)
>  		return H_TOO_HARD;
>  
> +#ifdef CONFIG_IOMMU_API
> +	if (tt->grp) {
> +		long ret;
> +		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
> +
> +		/* Return error if the group is being destroyed */
> +		if (!tbl)
> +			return H_RESCINDED;
> +
> +		ret = clear_tce_virt_mode(tbl, ioba,
> +				tce_value, npages);
> +
> +		WARN_ON(ret == -EAGAIN);
> +		if (ret < 0)
> +			return H_PARAMETER;
> +
> +		return H_SUCCESS;
> +	}
> +#endif
> +
>  	/* Emulated IO */
>  	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
>  		return H_PARAMETER;
> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> index 55fdf7a..c5e5905 100644
> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -26,6 +26,7 @@
>  #include <linux/slab.h>
>  #include <linux/hugetlb.h>
>  #include <linux/list.h>
> +#include <linux/iommu.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/kvm_ppc.h>
> @@ -161,6 +162,85 @@ static unsigned long get_real_address(struct kvm_vcpu *vcpu,
>  	return hwaddr;
>  }
>  
> +#ifdef CONFIG_IOMMU_API
> +static int clear_tce_real_mode(struct iommu_table *tbl,
> +		unsigned long ioba,
> +		unsigned long tce_value, unsigned long npages)
> +{
> +	int ret;
> +	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
> +
> +	ret = iommu_tce_clear_param_check(tbl, ioba, tce_value, npages);
> +	if (ret)
> +		return ret;
> +
> +	for ( ; npages; --npages, ++entry) {
> +		struct page *page;
> +		unsigned long oldtce;
> +
> +		oldtce = iommu_clear_tce(tbl, entry);
> +		if (!oldtce)
> +			continue;
> +
> +		page = realmode_pfn_to_page(oldtce >> PAGE_SHIFT);
> +		if (!page) {
> +			ret = -EAGAIN;
> +			break;
> +		}
> +
> +		if (oldtce & TCE_PCI_WRITE)
> +			SetPageDirty(page);
> +
> +		ret = realmode_put_page(page);
> +		if (ret)
> +			break;
> +	}
> +	/* if (ret < 0)
> +		pr_err("iommu_tce: %s failed ioba=%lx, tce_value=%lx ret=%d\n",
> +				__func__, ioba, tce_value, ret); */
> +
> +	return ret;
> +}
> +
> +static int put_tce_real_mode(struct kvmppc_spapr_tce_table *tt,
> +		struct iommu_table *tbl,
> +		unsigned long ioba, unsigned long tce,
> +		pte_t pte, unsigned long pg_size)
> +{
> +	int ret;
> +	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
> +	struct page *page = NULL;
> +	enum dma_data_direction direction = iommu_tce_direction(tce);
> +
> +	ret = iommu_tce_put_param_check(tbl, ioba, tce);
> +	if (ret)
> +		return ret;
> +
> +	if (pg_size != PAGE_SIZE)
> +		return -EAGAIN;
> +
> +	/* Small page case, find page struct to increment a counter */
> +	page = realmode_pfn_to_page(tce >> PAGE_SHIFT);
> +	if (!page)
> +		return -EAGAIN;
> +
> +	ret = realmode_get_page(page);
> +	if (ret)
> +		return ret;
> +
> +	/* tce_build accepts virtual addresses */
> +	ret = iommu_tce_build(tbl, entry, (unsigned long) __va(tce), direction);
> +	if (ret)
> +		realmode_put_page(page);
> +
> +	/* if (ret < 0)
> +		pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%d\n",
> +				__func__, ioba, tce, ret); */
> +
> +	return ret;
> +}
> +#endif /* CONFIG_IOMMU_API */
> +
>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		      unsigned long ioba, unsigned long tce)
>  {
> @@ -171,6 +251,44 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  	if (!tt)
>  		return H_TOO_HARD;
>  
> +	if (tt->virtmode_only)
> +		return H_TOO_HARD;
> +
> +#ifdef CONFIG_IOMMU_API
> +	if (tt->grp) {
> +		long ret;
> +		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
> +
> +		/* Return error if the group is being destroyed */
> +		if (!tbl)
> +			return H_RESCINDED;
> +
> +		if (tce & (TCE_PCI_READ | TCE_PCI_WRITE)) {
> +			unsigned long hpa, pg_size = 0;
> +			pte_t pte = 0;
> +
> +			hpa = get_real_address(vcpu, tce, tce & TCE_PCI_WRITE,
> +					&pte, &pg_size);
> +			if (hpa == ERROR_ADDR)
> +				return H_TOO_HARD;
> +
> +			ret = put_tce_real_mode(tt, tbl, ioba,
> +					hpa, pte, pg_size);
> +		} else {
> +			ret = clear_tce_real_mode(tbl, ioba, 0, 1);
> +		}
> +		iommu_flush_tce(tbl);
> +
> +		if (ret == -EAGAIN)
> +			return H_TOO_HARD;
> +
> +		if (ret < 0)
> +			return H_PARAMETER;
> +
> +		return H_SUCCESS;
> +	}
> +#endif
> +
>  	/* Emulated IO */
>  	return kvmppc_emulated_h_put_tce(tt, ioba, tce);
>  }
> @@ -192,10 +310,58 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	if (!tt)
>  		return H_TOO_HARD;
>  
> +	if (tt->virtmode_only)
> +		return H_TOO_HARD;
> +
>  	tces = get_real_address(vcpu, tce_list, false, NULL, NULL);
>  	if (tces == ERROR_ADDR)
>  		return H_TOO_HARD;
>  
> +#ifdef CONFIG_IOMMU_API
> +	if (tt->grp) {
> +		long ret = 0;
> +		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
> +
> +		/* Return error if the group is being destroyed */
> +		if (!tbl)
> +			return H_RESCINDED;
> +
> +		for (i = 0; i < npages; ++i) {
> +			unsigned long hpa, pg_size = 0;
> +			pte_t pte = 0;
> +			unsigned long tce;
> +			unsigned long ptce = tces + i * sizeof(unsigned long);
> +
> +			if (get_user(tce, (unsigned long __user *)ptce))
> +				break;
> +
> +			hpa = get_real_address(vcpu, tce,
> +					tce & TCE_PCI_WRITE,
> +					&pte, &pg_size);
> +			if (hpa == ERROR_ADDR)
> +				ret = -EAGAIN;
> +			else
> +				ret = put_tce_real_mode(tt, tbl,
> +						ioba + (i << IOMMU_PAGE_SHIFT),
> +						hpa, pte, pg_size);
> +			if (ret)
> +				break;
> +		}
> +		if (ret)
> +			clear_tce_real_mode(tbl, ioba, 0, i);
> +
> +		iommu_flush_tce(tbl);
> +
> +		if (ret == -EAGAIN)
> +			return H_TOO_HARD;
> +
> +		if (ret < 0)
> +			return H_PARAMETER;
> +
> +		return H_SUCCESS;
> +	}
> +#endif
> +
>  	/* Emulated IO */
>  	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
>  		return H_PARAMETER;
> @@ -236,6 +402,32 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>  	if (!tt)
>  		return H_TOO_HARD;
>  
> +	if (tt->virtmode_only)
> +		return H_TOO_HARD;
> +
> +#ifdef CONFIG_IOMMU_API
> +	if (tt->grp) {
> +		long ret;
> +		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
> +
> +		/* Return error if the group is being destroyed */
> +		if (!tbl)
> +			return H_RESCINDED;
> +
> +		ret = clear_tce_real_mode(tbl, ioba,
> +				tce_value, npages);
> +		iommu_flush_tce(tbl);
> +
> +		if (ret == -EAGAIN)
> +			return H_TOO_HARD;
> +
> +		if (ret < 0)
> +			return H_PARAMETER;
> +
> +		return H_SUCCESS;
> +	}
> +#endif
> +
>  	/* Emulated IO */
>  	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
>  		return H_PARAMETER;
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index b7ad589..269b0f6 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -385,6 +385,7 @@ int kvm_dev_ioctl_check_extension(long ext)
>  		break;
>  #endif
>  	case KVM_CAP_SPAPR_MULTITCE:
> +	case KVM_CAP_SPAPR_TCE_IOMMU:
>  		r = 1;
>  		break;
>  	default:
> @@ -935,6 +936,17 @@ long kvm_arch_vm_ioctl(struct file *filp,
>  		r = kvm_vm_ioctl_create_spapr_tce(kvm, &create_tce);
>  		goto out;
>  	}
> +	case KVM_CREATE_SPAPR_TCE_IOMMU: {
> +		struct kvm_create_spapr_tce_iommu create_tce_iommu;
> +		struct kvm *kvm = filp->private_data;
> +
> +		r = -EFAULT;
> +		if (copy_from_user(&create_tce_iommu, argp,
> +				sizeof(create_tce_iommu)))
> +			goto out;
> +		r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm, &create_tce_iommu);
> +		goto out;
> +	}
>  #endif /* CONFIG_PPC_BOOK3S_64 */
>  
>  #ifdef CONFIG_KVM_BOOK3S_64_HV
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 6c04da1..161e1d3 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -641,6 +641,7 @@ struct kvm_ppc_smmu_info {
>  #define KVM_CAP_PPC_RTAS (0x100000 + 87)
>  #define KVM_CAP_SPAPR_XICS (0x100000 + 88)
>  #define KVM_CAP_SPAPR_MULTITCE (0x110000 + 89)
> +#define KVM_CAP_SPAPR_TCE_IOMMU (0x110000 + 90)
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> @@ -885,6 +886,7 @@ struct kvm_s390_ucas_mapping {
>  #define KVM_PPC_GET_HTAB_FD	  _IOW(KVMIO,  0xaa, struct kvm_get_htab_fd)
>  /* Available with KVM_CAP_PPC_RTAS */
>  #define KVM_PPC_RTAS_DEFINE_TOKEN _IOW(KVMIO,  0xdc, struct kvm_rtas_token_args)
> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xaf, struct kvm_create_spapr_tce_iommu)
>  
>  /*
>   * ioctls for vcpu fds

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/6] KVM: PPC: Add support for IOMMU in-kernel handling
@ 2013-05-07  5:29     ` David Gibson
  0 siblings, 0 replies; 44+ messages in thread
From: David Gibson @ 2013-05-07  5:29 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Joerg Roedel, Alexander Graf, kvm-ppc, Alex Williamson,
	Paul Mackerras, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 21989 bytes --]

On Mon, May 06, 2013 at 05:25:56PM +1000, Alexey Kardashevskiy wrote:
> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> and H_STUFF_TCE requests without passing them to QEMU, which should
> save time on switching to QEMU and back.
> 
> Both real and virtual modes are supported - whenever the kernel
> fails to handle TCE request, it passes it to the virtual mode.
> If it the virtual mode handlers fail, then the request is passed
> to the user mode, for example, to QEMU.
> 
> This adds a new KVM_CAP_SPAPR_TCE_IOMMU ioctl to asssociate
> a virtual PCI bus ID (LIOBN) with an IOMMU group, which enables
> in-kernel handling of IOMMU map/unmap.
> 
> This adds a special case for huge pages (16MB).  The reference
> counting cannot be easily done for such pages in real mode (when
> MMU is off) so we added a list of huge pages.  It is populated in
> virtual mode and get_page is called just once per a huge page.
> Real mode handlers check if the requested page is huge and in the list,
> then no reference counting is done, otherwise an exit to virtual mode
> happens.  The list is released at KVM exit.  At the moment the fastest
> card available for tests uses up to 9 huge pages so walking through this
> list is not very expensive.  However this can change and we may want
> to optimize this.
> 
> This also adds the virt_only parameter to the KVM module
> for debug and performance check purposes.
> 
> Tests show that this patch increases transmission speed from 220MB/s
> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
> 
> Cc: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Signed-off-by: Paul Mackerras <paulus@samba.org>
> ---
>  Documentation/virtual/kvm/api.txt   |   28 ++++
>  arch/powerpc/include/asm/kvm_host.h |    2 +
>  arch/powerpc/include/asm/kvm_ppc.h  |    2 +
>  arch/powerpc/include/uapi/asm/kvm.h |    7 +
>  arch/powerpc/kvm/book3s_64_vio.c    |  242 ++++++++++++++++++++++++++++++++++-
>  arch/powerpc/kvm/book3s_64_vio_hv.c |  192 +++++++++++++++++++++++++++
>  arch/powerpc/kvm/powerpc.c          |   12 ++
>  include/uapi/linux/kvm.h            |    2 +
>  8 files changed, 485 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index f621cd6..2039767 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -2127,6 +2127,34 @@ written, then `n_invalid' invalid entries, invalidating any previously
>  valid entries found.
>  
>  
> +4.79 KVM_CREATE_SPAPR_TCE_IOMMU
> +
> +Capability: KVM_CAP_SPAPR_TCE_IOMMU
> +Architectures: powerpc
> +Type: vm ioctl
> +Parameters: struct kvm_create_spapr_tce_iommu (in)
> +Returns: 0 on success, -1 on error
> +
> +This creates a link between IOMMU group and a hardware TCE (translation
> +control entry) table. This link lets the host kernel know what IOMMU
> +group (i.e. TCE table) to use for the LIOBN number passed with
> +H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls.
> +
> +/* for KVM_CAP_SPAPR_TCE_IOMMU */
> +struct kvm_create_spapr_tce_iommu {
> +	__u64 liobn;
> +	__u32 iommu_id;

Wouldn't it be more in keeping 

> +	__u32 flags;
> +};
> +
> +No flag is supported at the moment.
> +
> +When the guest issues TCE call on a liobn for which a TCE table has been
> +registered, the kernel will handle it in real mode, updating the hardware
> +TCE table. TCE table calls for other liobns will cause a vm exit and must
> +be handled by userspace.
> +
> +
>  5. The kvm_run structure
>  ------------------------
>  
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index 36ceb0d..2b70cbc 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -178,6 +178,8 @@ struct kvmppc_spapr_tce_table {
>  	struct kvm *kvm;
>  	u64 liobn;
>  	u32 window_size;
> +	bool virtmode_only;

I see this is now initialized from the global parameter, but I think
it would be better to just check the global (debug) parameter
directly, rather than duplicating it here.

> +	struct iommu_group *grp;    /* used for IOMMU groups */
>  	struct page *pages[0];
>  };
>  
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> index d501246..bdfa140 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -139,6 +139,8 @@ extern void kvmppc_xics_free(struct kvm *kvm);
>  
>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  				struct kvm_create_spapr_tce *args);
> +extern long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
> +				struct kvm_create_spapr_tce_iommu *args);
>  extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
>  		struct kvm_vcpu *vcpu, unsigned long liobn);
>  extern long kvmppc_emulated_h_put_tce(struct kvmppc_spapr_tce_table *stt,
> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
> index 681b314..b67d44b 100644
> --- a/arch/powerpc/include/uapi/asm/kvm.h
> +++ b/arch/powerpc/include/uapi/asm/kvm.h
> @@ -291,6 +291,13 @@ struct kvm_create_spapr_tce {
>  	__u32 window_size;
>  };
>  
> +/* for KVM_CAP_SPAPR_TCE_IOMMU */
> +struct kvm_create_spapr_tce_iommu {
> +	__u64 liobn;
> +	__u32 iommu_id;
> +	__u32 flags;
> +};
> +
>  /* for KVM_ALLOCATE_RMA */
>  struct kvm_allocate_rma {
>  	__u64 rma_size;
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> index 643ac1e..98cf949 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -27,6 +27,9 @@
>  #include <linux/hugetlb.h>
>  #include <linux/list.h>
>  #include <linux/anon_inodes.h>
> +#include <linux/pci.h>
> +#include <linux/iommu.h>
> +#include <linux/module.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/kvm_ppc.h>
> @@ -38,10 +41,19 @@
>  #include <asm/kvm_host.h>
>  #include <asm/udbg.h>
>  #include <asm/iommu.h>
> +#include <asm/tce.h>
> +
> +#define DRIVER_VERSION	"0.1"
> +#define DRIVER_AUTHOR	"Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>"
> +#define DRIVER_DESC	"POWERPC KVM driver"

Really?

>  
>  #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
>  #define ERROR_ADDR      (~(unsigned long)0x0)
>  
> +static bool kvmppc_tce_virt_only = false;
> +module_param_named(virt_only, kvmppc_tce_virt_only, bool, S_IRUGO | S_IWUSR);
> +MODULE_PARM_DESC(virt_only, "Disable realmode handling of IOMMU map/unmap");
> +
>  /*
>   * TCE tables handlers.
>   */
> @@ -58,8 +70,13 @@ static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
>  
>  	mutex_lock(&kvm->lock);
>  	list_del(&stt->list);
> -	for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
> -		__free_page(stt->pages[i]);
> +#ifdef CONFIG_IOMMU_API
> +	if (stt->grp) {
> +		iommu_group_put(stt->grp);
> +	} else
> +#endif
> +		for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
> +			__free_page(stt->pages[i]);
>  	kfree(stt);
>  	mutex_unlock(&kvm->lock);
>  
> @@ -155,9 +172,127 @@ fail:
>  	return ret;
>  }
>  
> +#ifdef CONFIG_IOMMU_API
> +static const struct file_operations kvm_spapr_tce_iommu_fops = {
> +	.release	= kvm_spapr_tce_release,
> +};
> +
> +long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
> +		struct kvm_create_spapr_tce_iommu *args)
> +{
> +	struct kvmppc_spapr_tce_table *tt = NULL;
> +	struct iommu_group *grp;
> +	struct iommu_table *tbl;
> +
> +	/* Find an IOMMU table for the given ID */
> +	grp = iommu_group_get_by_id(args->iommu_id);
> +	if (!grp)
> +		return -ENXIO;
> +
> +	tbl = iommu_group_get_iommudata(grp);
> +	if (!tbl)
> +		return -ENXIO;
> +
> +	/* Check this LIOBN hasn't been previously allocated */
> +	list_for_each_entry(tt, &kvm->arch.spapr_tce_tables, list) {
> +		if (tt->liobn == args->liobn)
> +			return -EBUSY;
> +	}
> +
> +	tt = kzalloc(sizeof(*tt), GFP_KERNEL);
> +	if (!tt)
> +		return -ENOMEM;
> +
> +	tt->liobn = args->liobn;
> +	tt->kvm = kvm;
> +	tt->virtmode_only = kvmppc_tce_virt_only;
> +	tt->grp = grp;
> +
> +	kvm_get_kvm(kvm);
> +
> +	mutex_lock(&kvm->lock);
> +	list_add(&tt->list, &kvm->arch.spapr_tce_tables);
> +
> +	mutex_unlock(&kvm->lock);
> +
> +	pr_debug("LIOBN=%llX hooked to IOMMU %d, flags=%u\n",
> +			args->liobn, args->iommu_id, args->flags);
> +
> +	return anon_inode_getfd("kvm-spapr-tce-iommu",
> +			&kvm_spapr_tce_iommu_fops, tt, O_RDWR);
> +}
> +#else
> +long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
> +		struct kvm_create_spapr_tce_iommu *args)
> +{
> +	return -ENOSYS;
> +}
> +#endif /* CONFIG_IOMMU_API */
> +
> +#ifdef CONFIG_IOMMU_API
>  /*
>   * Virtual mode handling of IOMMU map/unmap.
>   */
> +static int clear_tce_virt_mode(struct iommu_table *tbl,
> +		unsigned long ioba, unsigned long tce_value,
> +		unsigned long npages)
> +{
> +	int ret;
> +	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
> +
> +	ret = iommu_tce_clear_param_check(tbl, ioba, tce_value, npages);
> +	if (ret)
> +		return ret;
> +
> +	ret = iommu_clear_tces_and_put_pages(tbl, entry, npages);
> +	if (ret < 0)
> +		pr_err("iommu_tce: %s failed ioba=%lx, tce_value=%lx ret=%d\n",
> +				__func__, ioba, tce_value, ret);
> +
> +	return ret;
> +}
> +
> +static int put_tce_virt_mode(struct kvmppc_spapr_tce_table *tt,
> +		struct iommu_table *tbl,
> +		unsigned long ioba, unsigned long tce,
> +		pte_t pte, unsigned long pg_size)
> +{
> +	int ret;
> +	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
> +
> +	ret = iommu_tce_put_param_check(tbl, ioba, tce);
> +	if (ret)
> +		return ret;
> +
> +	/* System page size case, easy to handle */
> +	if (pg_size == PAGE_SIZE)
> +		return iommu_put_tce_user_mode(tbl, entry, tce);
> +
> +	return -EAGAIN;
> +}
> +
> +static pte_t va_to_linux_pte(struct kvm_vcpu *vcpu,
> +		unsigned long hva, bool writing, unsigned long *pg_sizep)
> +{
> +#ifdef CONFIG_KVM_BOOK3S_64_HV
> +	/* Find out the page pte and size if requested */
> +	pte_t pte;
> +	unsigned long pg_size = 0;
> +
> +	pte = lookup_linux_pte(vcpu->arch.pgdir, hva,
> +			writing, &pg_size);
> +	if (!pte_present(pte))
> +		return 0;
> +
> +	*pg_sizep = pg_size;
> +
> +	return pte;
> +#else
> +	return 0;
> +#endif
> +}
> +#endif /* CONFIG_IOMMU_API */
> +
>  /* Converts guest physical address into host virtual */
>  static unsigned long get_virt_address(struct kvm_vcpu *vcpu,
>  		unsigned long gpa)
> @@ -188,6 +323,43 @@ long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
>  	if (!tt)
>  		return H_TOO_HARD;
>  
> +#ifdef CONFIG_IOMMU_API
> +	if (tt->grp) {
> +		long ret;
> +		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
> +
> +		/* Return error if the group is being destroyed */
> +		if (!tbl)
> +			return H_RESCINDED;
> +
> +		if (tce & (TCE_PCI_READ | TCE_PCI_WRITE)) {
> +			unsigned long hpa, pg_size = 0;
> +			pte_t pte;
> +
> +			hpa = get_virt_address(vcpu, tce);
> +			if (hpa == ERROR_ADDR)
> +				return -EFAULT;
> +
> +			pte = va_to_linux_pte(vcpu, hpa, tce & TCE_PCI_WRITE,
> +					&pg_size);
> +			if (!pte)
> +				return -EFAULT;
> +
> +			ret = put_tce_virt_mode(tt, tbl, ioba, hpa,
> +					pte, pg_size);
> +		} else {
> +			ret = clear_tce_virt_mode(tbl, ioba, 0, 1);
> +		}
> +		iommu_flush_tce(tbl);
> +
> +		WARN_ON(ret == -EAGAIN);
> +		if (ret < 0)
> +			return H_PARAMETER;
> +
> +		return H_SUCCESS;
> +	}
> +#endif
> +
>  	/* Emulated IO */
>  	return kvmppc_emulated_h_put_tce(tt, ioba, tce);
>  }
> @@ -213,6 +385,52 @@ long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	if (tces == ERROR_ADDR)
>  		return H_TOO_HARD;
>  
> +#ifdef CONFIG_IOMMU_API
> +	if (tt->grp) {
> +		long ret = 0;
> +		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
> +
> +		/* Return error if the group is being destroyed */
> +		if (!tbl)
> +			return H_RESCINDED;
> +
> +		for (i = 0; i < npages; ++i) {
> +			unsigned long hpa, pg_size = 0;
> +			pte_t pte = 0;
> +			unsigned long tce;
> +			unsigned long ptce = tces + i * sizeof(unsigned long);
> +
> +			if (get_user(tce, (unsigned long __user *)ptce))
> +				break;
> +
> +			hpa = get_virt_address(vcpu, tce);
> +			if (hpa == ERROR_ADDR)
> +				return -EFAULT;
> +
> +			pte = va_to_linux_pte(vcpu, hpa,
> +					tce & TCE_PCI_WRITE, &pg_size);
> +			if (!pte)
> +				return -EFAULT;
> +
> +			ret = put_tce_virt_mode(tt, tbl,
> +					ioba + (i << IOMMU_PAGE_SHIFT),
> +					hpa, pte, pg_size);
> +			if (ret)
> +				break;
> +		}
> +		if (ret)
> +			clear_tce_virt_mode(tbl, ioba, 0, i);
> +
> +		iommu_flush_tce(tbl);
> +
> +		WARN_ON(ret == -EAGAIN);
> +		if (ret < 0)
> +			return H_PARAMETER;
> +
> +		return H_SUCCESS;
> +	}
> +#endif
> +
>  	/* Emulated IO */
>  	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
>  		return H_PARAMETER;
> @@ -253,6 +471,26 @@ long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
>  	if (!tt)
>  		return H_TOO_HARD;
>  
> +#ifdef CONFIG_IOMMU_API
> +	if (tt->grp) {
> +		long ret;
> +		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
> +
> +		/* Return error if the group is being destroyed */
> +		if (!tbl)
> +			return H_RESCINDED;
> +
> +		ret = clear_tce_virt_mode(tbl, ioba,
> +				tce_value, npages);
> +
> +		WARN_ON(ret == -EAGAIN);
> +		if (ret < 0)
> +			return H_PARAMETER;
> +
> +		return H_SUCCESS;
> +	}
> +#endif
> +
>  	/* Emulated IO */
>  	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
>  		return H_PARAMETER;
> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> index 55fdf7a..c5e5905 100644
> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -26,6 +26,7 @@
>  #include <linux/slab.h>
>  #include <linux/hugetlb.h>
>  #include <linux/list.h>
> +#include <linux/iommu.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/kvm_ppc.h>
> @@ -161,6 +162,85 @@ static unsigned long get_real_address(struct kvm_vcpu *vcpu,
>  	return hwaddr;
>  }
>  
> +#ifdef CONFIG_IOMMU_API
> +static int clear_tce_real_mode(struct iommu_table *tbl,
> +		unsigned long ioba,
> +		unsigned long tce_value, unsigned long npages)
> +{
> +	int ret;
> +	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
> +
> +	ret = iommu_tce_clear_param_check(tbl, ioba, tce_value, npages);
> +	if (ret)
> +		return ret;
> +
> +	for ( ; npages; --npages, ++entry) {
> +		struct page *page;
> +		unsigned long oldtce;
> +
> +		oldtce = iommu_clear_tce(tbl, entry);
> +		if (!oldtce)
> +			continue;
> +
> +		page = realmode_pfn_to_page(oldtce >> PAGE_SHIFT);
> +		if (!page) {
> +			ret = -EAGAIN;
> +			break;
> +		}
> +
> +		if (oldtce & TCE_PCI_WRITE)
> +			SetPageDirty(page);
> +
> +		ret = realmode_put_page(page);
> +		if (ret)
> +			break;
> +	}
> +	/* if (ret < 0)
> +		pr_err("iommu_tce: %s failed ioba=%lx, tce_value=%lx ret=%d\n",
> +				__func__, ioba, tce_value, ret); */
> +
> +	return ret;
> +}
> +
> +static int put_tce_real_mode(struct kvmppc_spapr_tce_table *tt,
> +		struct iommu_table *tbl,
> +		unsigned long ioba, unsigned long tce,
> +		pte_t pte, unsigned long pg_size)
> +{
> +	int ret;
> +	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
> +	struct page *page = NULL;
> +	enum dma_data_direction direction = iommu_tce_direction(tce);
> +
> +	ret = iommu_tce_put_param_check(tbl, ioba, tce);
> +	if (ret)
> +		return ret;
> +
> +	if (pg_size != PAGE_SIZE)
> +		return -EAGAIN;
> +
> +	/* Small page case, find page struct to increment a counter */
> +	page = realmode_pfn_to_page(tce >> PAGE_SHIFT);
> +	if (!page)
> +		return -EAGAIN;
> +
> +	ret = realmode_get_page(page);
> +	if (ret)
> +		return ret;
> +
> +	/* tce_build accepts virtual addresses */
> +	ret = iommu_tce_build(tbl, entry, (unsigned long) __va(tce), direction);
> +	if (ret)
> +		realmode_put_page(page);
> +
> +	/* if (ret < 0)
> +		pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%d\n",
> +				__func__, ioba, tce, ret); */
> +
> +	return ret;
> +}
> +#endif /* CONFIG_IOMMU_API */
> +
>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		      unsigned long ioba, unsigned long tce)
>  {
> @@ -171,6 +251,44 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  	if (!tt)
>  		return H_TOO_HARD;
>  
> +	if (tt->virtmode_only)
> +		return H_TOO_HARD;
> +
> +#ifdef CONFIG_IOMMU_API
> +	if (tt->grp) {
> +		long ret;
> +		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
> +
> +		/* Return error if the group is being destroyed */
> +		if (!tbl)
> +			return H_RESCINDED;
> +
> +		if (tce & (TCE_PCI_READ | TCE_PCI_WRITE)) {
> +			unsigned long hpa, pg_size = 0;
> +			pte_t pte = 0;
> +
> +			hpa = get_real_address(vcpu, tce, tce & TCE_PCI_WRITE,
> +					&pte, &pg_size);
> +			if (hpa == ERROR_ADDR)
> +				return H_TOO_HARD;
> +
> +			ret = put_tce_real_mode(tt, tbl, ioba,
> +					hpa, pte, pg_size);
> +		} else {
> +			ret = clear_tce_real_mode(tbl, ioba, 0, 1);
> +		}
> +		iommu_flush_tce(tbl);
> +
> +		if (ret == -EAGAIN)
> +			return H_TOO_HARD;
> +
> +		if (ret < 0)
> +			return H_PARAMETER;
> +
> +		return H_SUCCESS;
> +	}
> +#endif
> +
>  	/* Emulated IO */
>  	return kvmppc_emulated_h_put_tce(tt, ioba, tce);
>  }
> @@ -192,10 +310,58 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	if (!tt)
>  		return H_TOO_HARD;
>  
> +	if (tt->virtmode_only)
> +		return H_TOO_HARD;
> +
>  	tces = get_real_address(vcpu, tce_list, false, NULL, NULL);
>  	if (tces == ERROR_ADDR)
>  		return H_TOO_HARD;
>  
> +#ifdef CONFIG_IOMMU_API
> +	if (tt->grp) {
> +		long ret = 0;
> +		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
> +
> +		/* Return error if the group is being destroyed */
> +		if (!tbl)
> +			return H_RESCINDED;
> +
> +		for (i = 0; i < npages; ++i) {
> +			unsigned long hpa, pg_size = 0;
> +			pte_t pte = 0;
> +			unsigned long tce;
> +			unsigned long ptce = tces + i * sizeof(unsigned long);
> +
> +			if (get_user(tce, (unsigned long __user *)ptce))
> +				break;
> +
> +			hpa = get_real_address(vcpu, tce,
> +					tce & TCE_PCI_WRITE,
> +					&pte, &pg_size);
> +			if (hpa == ERROR_ADDR)
> +				ret = -EAGAIN;
> +			else
> +				ret = put_tce_real_mode(tt, tbl,
> +						ioba + (i << IOMMU_PAGE_SHIFT),
> +						hpa, pte, pg_size);
> +			if (ret)
> +				break;
> +		}
> +		if (ret)
> +			clear_tce_real_mode(tbl, ioba, 0, i);
> +
> +		iommu_flush_tce(tbl);
> +
> +		if (ret == -EAGAIN)
> +			return H_TOO_HARD;
> +
> +		if (ret < 0)
> +			return H_PARAMETER;
> +
> +		return H_SUCCESS;
> +	}
> +#endif
> +
>  	/* Emulated IO */
>  	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
>  		return H_PARAMETER;
> @@ -236,6 +402,32 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>  	if (!tt)
>  		return H_TOO_HARD;
>  
> +	if (tt->virtmode_only)
> +		return H_TOO_HARD;
> +
> +#ifdef CONFIG_IOMMU_API
> +	if (tt->grp) {
> +		long ret;
> +		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
> +
> +		/* Return error if the group is being destroyed */
> +		if (!tbl)
> +			return H_RESCINDED;
> +
> +		ret = clear_tce_real_mode(tbl, ioba,
> +				tce_value, npages);
> +		iommu_flush_tce(tbl);
> +
> +		if (ret == -EAGAIN)
> +			return H_TOO_HARD;
> +
> +		if (ret < 0)
> +			return H_PARAMETER;
> +
> +		return H_SUCCESS;
> +	}
> +#endif
> +
>  	/* Emulated IO */
>  	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
>  		return H_PARAMETER;
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index b7ad589..269b0f6 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -385,6 +385,7 @@ int kvm_dev_ioctl_check_extension(long ext)
>  		break;
>  #endif
>  	case KVM_CAP_SPAPR_MULTITCE:
> +	case KVM_CAP_SPAPR_TCE_IOMMU:
>  		r = 1;
>  		break;
>  	default:
> @@ -935,6 +936,17 @@ long kvm_arch_vm_ioctl(struct file *filp,
>  		r = kvm_vm_ioctl_create_spapr_tce(kvm, &create_tce);
>  		goto out;
>  	}
> +	case KVM_CREATE_SPAPR_TCE_IOMMU: {
> +		struct kvm_create_spapr_tce_iommu create_tce_iommu;
> +		struct kvm *kvm = filp->private_data;
> +
> +		r = -EFAULT;
> +		if (copy_from_user(&create_tce_iommu, argp,
> +				sizeof(create_tce_iommu)))
> +			goto out;
> +		r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm, &create_tce_iommu);
> +		goto out;
> +	}
>  #endif /* CONFIG_PPC_BOOK3S_64 */
>  
>  #ifdef CONFIG_KVM_BOOK3S_64_HV
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 6c04da1..161e1d3 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -641,6 +641,7 @@ struct kvm_ppc_smmu_info {
>  #define KVM_CAP_PPC_RTAS (0x100000 + 87)
>  #define KVM_CAP_SPAPR_XICS (0x100000 + 88)
>  #define KVM_CAP_SPAPR_MULTITCE (0x110000 + 89)
> +#define KVM_CAP_SPAPR_TCE_IOMMU (0x110000 + 90)
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> @@ -885,6 +886,7 @@ struct kvm_s390_ucas_mapping {
>  #define KVM_PPC_GET_HTAB_FD	  _IOW(KVMIO,  0xaa, struct kvm_get_htab_fd)
>  /* Available with KVM_CAP_PPC_RTAS */
>  #define KVM_PPC_RTAS_DEFINE_TOKEN _IOW(KVMIO,  0xdc, struct kvm_rtas_token_args)
> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xaf, struct kvm_create_spapr_tce_iommu)
>  
>  /*
>   * ioctls for vcpu fds

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/6] KVM: PPC: Add support for IOMMU in-kernel handling
@ 2013-05-07  5:29     ` David Gibson
  0 siblings, 0 replies; 44+ messages in thread
From: David Gibson @ 2013-05-07  5:29 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, kvm, Alexander Graf, kvm-ppc, Alex Williamson,
	Paul Mackerras, Joerg Roedel

[-- Attachment #1: Type: text/plain, Size: 21989 bytes --]

On Mon, May 06, 2013 at 05:25:56PM +1000, Alexey Kardashevskiy wrote:
> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> and H_STUFF_TCE requests without passing them to QEMU, which should
> save time on switching to QEMU and back.
> 
> Both real and virtual modes are supported - whenever the kernel
> fails to handle TCE request, it passes it to the virtual mode.
> If it the virtual mode handlers fail, then the request is passed
> to the user mode, for example, to QEMU.
> 
> This adds a new KVM_CAP_SPAPR_TCE_IOMMU ioctl to asssociate
> a virtual PCI bus ID (LIOBN) with an IOMMU group, which enables
> in-kernel handling of IOMMU map/unmap.
> 
> This adds a special case for huge pages (16MB).  The reference
> counting cannot be easily done for such pages in real mode (when
> MMU is off) so we added a list of huge pages.  It is populated in
> virtual mode and get_page is called just once per a huge page.
> Real mode handlers check if the requested page is huge and in the list,
> then no reference counting is done, otherwise an exit to virtual mode
> happens.  The list is released at KVM exit.  At the moment the fastest
> card available for tests uses up to 9 huge pages so walking through this
> list is not very expensive.  However this can change and we may want
> to optimize this.
> 
> This also adds the virt_only parameter to the KVM module
> for debug and performance check purposes.
> 
> Tests show that this patch increases transmission speed from 220MB/s
> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
> 
> Cc: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Signed-off-by: Paul Mackerras <paulus@samba.org>
> ---
>  Documentation/virtual/kvm/api.txt   |   28 ++++
>  arch/powerpc/include/asm/kvm_host.h |    2 +
>  arch/powerpc/include/asm/kvm_ppc.h  |    2 +
>  arch/powerpc/include/uapi/asm/kvm.h |    7 +
>  arch/powerpc/kvm/book3s_64_vio.c    |  242 ++++++++++++++++++++++++++++++++++-
>  arch/powerpc/kvm/book3s_64_vio_hv.c |  192 +++++++++++++++++++++++++++
>  arch/powerpc/kvm/powerpc.c          |   12 ++
>  include/uapi/linux/kvm.h            |    2 +
>  8 files changed, 485 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index f621cd6..2039767 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -2127,6 +2127,34 @@ written, then `n_invalid' invalid entries, invalidating any previously
>  valid entries found.
>  
>  
> +4.79 KVM_CREATE_SPAPR_TCE_IOMMU
> +
> +Capability: KVM_CAP_SPAPR_TCE_IOMMU
> +Architectures: powerpc
> +Type: vm ioctl
> +Parameters: struct kvm_create_spapr_tce_iommu (in)
> +Returns: 0 on success, -1 on error
> +
> +This creates a link between IOMMU group and a hardware TCE (translation
> +control entry) table. This link lets the host kernel know what IOMMU
> +group (i.e. TCE table) to use for the LIOBN number passed with
> +H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls.
> +
> +/* for KVM_CAP_SPAPR_TCE_IOMMU */
> +struct kvm_create_spapr_tce_iommu {
> +	__u64 liobn;
> +	__u32 iommu_id;

Wouldn't it be more in keeping 

> +	__u32 flags;
> +};
> +
> +No flag is supported at the moment.
> +
> +When the guest issues TCE call on a liobn for which a TCE table has been
> +registered, the kernel will handle it in real mode, updating the hardware
> +TCE table. TCE table calls for other liobns will cause a vm exit and must
> +be handled by userspace.
> +
> +
>  5. The kvm_run structure
>  ------------------------
>  
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index 36ceb0d..2b70cbc 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -178,6 +178,8 @@ struct kvmppc_spapr_tce_table {
>  	struct kvm *kvm;
>  	u64 liobn;
>  	u32 window_size;
> +	bool virtmode_only;

I see this is now initialized from the global parameter, but I think
it would be better to just check the global (debug) parameter
directly, rather than duplicating it here.

> +	struct iommu_group *grp;    /* used for IOMMU groups */
>  	struct page *pages[0];
>  };
>  
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> index d501246..bdfa140 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -139,6 +139,8 @@ extern void kvmppc_xics_free(struct kvm *kvm);
>  
>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  				struct kvm_create_spapr_tce *args);
> +extern long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
> +				struct kvm_create_spapr_tce_iommu *args);
>  extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
>  		struct kvm_vcpu *vcpu, unsigned long liobn);
>  extern long kvmppc_emulated_h_put_tce(struct kvmppc_spapr_tce_table *stt,
> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
> index 681b314..b67d44b 100644
> --- a/arch/powerpc/include/uapi/asm/kvm.h
> +++ b/arch/powerpc/include/uapi/asm/kvm.h
> @@ -291,6 +291,13 @@ struct kvm_create_spapr_tce {
>  	__u32 window_size;
>  };
>  
> +/* for KVM_CAP_SPAPR_TCE_IOMMU */
> +struct kvm_create_spapr_tce_iommu {
> +	__u64 liobn;
> +	__u32 iommu_id;
> +	__u32 flags;
> +};
> +
>  /* for KVM_ALLOCATE_RMA */
>  struct kvm_allocate_rma {
>  	__u64 rma_size;
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> index 643ac1e..98cf949 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -27,6 +27,9 @@
>  #include <linux/hugetlb.h>
>  #include <linux/list.h>
>  #include <linux/anon_inodes.h>
> +#include <linux/pci.h>
> +#include <linux/iommu.h>
> +#include <linux/module.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/kvm_ppc.h>
> @@ -38,10 +41,19 @@
>  #include <asm/kvm_host.h>
>  #include <asm/udbg.h>
>  #include <asm/iommu.h>
> +#include <asm/tce.h>
> +
> +#define DRIVER_VERSION	"0.1"
> +#define DRIVER_AUTHOR	"Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>"
> +#define DRIVER_DESC	"POWERPC KVM driver"

Really?

>  
>  #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
>  #define ERROR_ADDR      (~(unsigned long)0x0)
>  
> +static bool kvmppc_tce_virt_only = false;
> +module_param_named(virt_only, kvmppc_tce_virt_only, bool, S_IRUGO | S_IWUSR);
> +MODULE_PARM_DESC(virt_only, "Disable realmode handling of IOMMU map/unmap");
> +
>  /*
>   * TCE tables handlers.
>   */
> @@ -58,8 +70,13 @@ static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
>  
>  	mutex_lock(&kvm->lock);
>  	list_del(&stt->list);
> -	for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
> -		__free_page(stt->pages[i]);
> +#ifdef CONFIG_IOMMU_API
> +	if (stt->grp) {
> +		iommu_group_put(stt->grp);
> +	} else
> +#endif
> +		for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
> +			__free_page(stt->pages[i]);
>  	kfree(stt);
>  	mutex_unlock(&kvm->lock);
>  
> @@ -155,9 +172,127 @@ fail:
>  	return ret;
>  }
>  
> +#ifdef CONFIG_IOMMU_API
> +static const struct file_operations kvm_spapr_tce_iommu_fops = {
> +	.release	= kvm_spapr_tce_release,
> +};
> +
> +long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
> +		struct kvm_create_spapr_tce_iommu *args)
> +{
> +	struct kvmppc_spapr_tce_table *tt = NULL;
> +	struct iommu_group *grp;
> +	struct iommu_table *tbl;
> +
> +	/* Find an IOMMU table for the given ID */
> +	grp = iommu_group_get_by_id(args->iommu_id);
> +	if (!grp)
> +		return -ENXIO;
> +
> +	tbl = iommu_group_get_iommudata(grp);
> +	if (!tbl)
> +		return -ENXIO;
> +
> +	/* Check this LIOBN hasn't been previously allocated */
> +	list_for_each_entry(tt, &kvm->arch.spapr_tce_tables, list) {
> +		if (tt->liobn == args->liobn)
> +			return -EBUSY;
> +	}
> +
> +	tt = kzalloc(sizeof(*tt), GFP_KERNEL);
> +	if (!tt)
> +		return -ENOMEM;
> +
> +	tt->liobn = args->liobn;
> +	tt->kvm = kvm;
> +	tt->virtmode_only = kvmppc_tce_virt_only;
> +	tt->grp = grp;
> +
> +	kvm_get_kvm(kvm);
> +
> +	mutex_lock(&kvm->lock);
> +	list_add(&tt->list, &kvm->arch.spapr_tce_tables);
> +
> +	mutex_unlock(&kvm->lock);
> +
> +	pr_debug("LIOBN=%llX hooked to IOMMU %d, flags=%u\n",
> +			args->liobn, args->iommu_id, args->flags);
> +
> +	return anon_inode_getfd("kvm-spapr-tce-iommu",
> +			&kvm_spapr_tce_iommu_fops, tt, O_RDWR);
> +}
> +#else
> +long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
> +		struct kvm_create_spapr_tce_iommu *args)
> +{
> +	return -ENOSYS;
> +}
> +#endif /* CONFIG_IOMMU_API */
> +
> +#ifdef CONFIG_IOMMU_API
>  /*
>   * Virtual mode handling of IOMMU map/unmap.
>   */
> +static int clear_tce_virt_mode(struct iommu_table *tbl,
> +		unsigned long ioba, unsigned long tce_value,
> +		unsigned long npages)
> +{
> +	int ret;
> +	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
> +
> +	ret = iommu_tce_clear_param_check(tbl, ioba, tce_value, npages);
> +	if (ret)
> +		return ret;
> +
> +	ret = iommu_clear_tces_and_put_pages(tbl, entry, npages);
> +	if (ret < 0)
> +		pr_err("iommu_tce: %s failed ioba=%lx, tce_value=%lx ret=%d\n",
> +				__func__, ioba, tce_value, ret);
> +
> +	return ret;
> +}
> +
> +static int put_tce_virt_mode(struct kvmppc_spapr_tce_table *tt,
> +		struct iommu_table *tbl,
> +		unsigned long ioba, unsigned long tce,
> +		pte_t pte, unsigned long pg_size)
> +{
> +	int ret;
> +	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
> +
> +	ret = iommu_tce_put_param_check(tbl, ioba, tce);
> +	if (ret)
> +		return ret;
> +
> +	/* System page size case, easy to handle */
> +	if (pg_size == PAGE_SIZE)
> +		return iommu_put_tce_user_mode(tbl, entry, tce);
> +
> +	return -EAGAIN;
> +}
> +
> +static pte_t va_to_linux_pte(struct kvm_vcpu *vcpu,
> +		unsigned long hva, bool writing, unsigned long *pg_sizep)
> +{
> +#ifdef CONFIG_KVM_BOOK3S_64_HV
> +	/* Find out the page pte and size if requested */
> +	pte_t pte;
> +	unsigned long pg_size = 0;
> +
> +	pte = lookup_linux_pte(vcpu->arch.pgdir, hva,
> +			writing, &pg_size);
> +	if (!pte_present(pte))
> +		return 0;
> +
> +	*pg_sizep = pg_size;
> +
> +	return pte;
> +#else
> +	return 0;
> +#endif
> +}
> +#endif /* CONFIG_IOMMU_API */
> +
>  /* Converts guest physical address into host virtual */
>  static unsigned long get_virt_address(struct kvm_vcpu *vcpu,
>  		unsigned long gpa)
> @@ -188,6 +323,43 @@ long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
>  	if (!tt)
>  		return H_TOO_HARD;
>  
> +#ifdef CONFIG_IOMMU_API
> +	if (tt->grp) {
> +		long ret;
> +		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
> +
> +		/* Return error if the group is being destroyed */
> +		if (!tbl)
> +			return H_RESCINDED;
> +
> +		if (tce & (TCE_PCI_READ | TCE_PCI_WRITE)) {
> +			unsigned long hpa, pg_size = 0;
> +			pte_t pte;
> +
> +			hpa = get_virt_address(vcpu, tce);
> +			if (hpa == ERROR_ADDR)
> +				return -EFAULT;
> +
> +			pte = va_to_linux_pte(vcpu, hpa, tce & TCE_PCI_WRITE,
> +					&pg_size);
> +			if (!pte)
> +				return -EFAULT;
> +
> +			ret = put_tce_virt_mode(tt, tbl, ioba, hpa,
> +					pte, pg_size);
> +		} else {
> +			ret = clear_tce_virt_mode(tbl, ioba, 0, 1);
> +		}
> +		iommu_flush_tce(tbl);
> +
> +		WARN_ON(ret == -EAGAIN);
> +		if (ret < 0)
> +			return H_PARAMETER;
> +
> +		return H_SUCCESS;
> +	}
> +#endif
> +
>  	/* Emulated IO */
>  	return kvmppc_emulated_h_put_tce(tt, ioba, tce);
>  }
> @@ -213,6 +385,52 @@ long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	if (tces == ERROR_ADDR)
>  		return H_TOO_HARD;
>  
> +#ifdef CONFIG_IOMMU_API
> +	if (tt->grp) {
> +		long ret = 0;
> +		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
> +
> +		/* Return error if the group is being destroyed */
> +		if (!tbl)
> +			return H_RESCINDED;
> +
> +		for (i = 0; i < npages; ++i) {
> +			unsigned long hpa, pg_size = 0;
> +			pte_t pte = 0;
> +			unsigned long tce;
> +			unsigned long ptce = tces + i * sizeof(unsigned long);
> +
> +			if (get_user(tce, (unsigned long __user *)ptce))
> +				break;
> +
> +			hpa = get_virt_address(vcpu, tce);
> +			if (hpa == ERROR_ADDR)
> +				return -EFAULT;
> +
> +			pte = va_to_linux_pte(vcpu, hpa,
> +					tce & TCE_PCI_WRITE, &pg_size);
> +			if (!pte)
> +				return -EFAULT;
> +
> +			ret = put_tce_virt_mode(tt, tbl,
> +					ioba + (i << IOMMU_PAGE_SHIFT),
> +					hpa, pte, pg_size);
> +			if (ret)
> +				break;
> +		}
> +		if (ret)
> +			clear_tce_virt_mode(tbl, ioba, 0, i);
> +
> +		iommu_flush_tce(tbl);
> +
> +		WARN_ON(ret == -EAGAIN);
> +		if (ret < 0)
> +			return H_PARAMETER;
> +
> +		return H_SUCCESS;
> +	}
> +#endif
> +
>  	/* Emulated IO */
>  	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
>  		return H_PARAMETER;
> @@ -253,6 +471,26 @@ long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
>  	if (!tt)
>  		return H_TOO_HARD;
>  
> +#ifdef CONFIG_IOMMU_API
> +	if (tt->grp) {
> +		long ret;
> +		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
> +
> +		/* Return error if the group is being destroyed */
> +		if (!tbl)
> +			return H_RESCINDED;
> +
> +		ret = clear_tce_virt_mode(tbl, ioba,
> +				tce_value, npages);
> +
> +		WARN_ON(ret == -EAGAIN);
> +		if (ret < 0)
> +			return H_PARAMETER;
> +
> +		return H_SUCCESS;
> +	}
> +#endif
> +
>  	/* Emulated IO */
>  	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
>  		return H_PARAMETER;
> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> index 55fdf7a..c5e5905 100644
> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -26,6 +26,7 @@
>  #include <linux/slab.h>
>  #include <linux/hugetlb.h>
>  #include <linux/list.h>
> +#include <linux/iommu.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/kvm_ppc.h>
> @@ -161,6 +162,85 @@ static unsigned long get_real_address(struct kvm_vcpu *vcpu,
>  	return hwaddr;
>  }
>  
> +#ifdef CONFIG_IOMMU_API
> +static int clear_tce_real_mode(struct iommu_table *tbl,
> +		unsigned long ioba,
> +		unsigned long tce_value, unsigned long npages)
> +{
> +	int ret;
> +	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
> +
> +	ret = iommu_tce_clear_param_check(tbl, ioba, tce_value, npages);
> +	if (ret)
> +		return ret;
> +
> +	for ( ; npages; --npages, ++entry) {
> +		struct page *page;
> +		unsigned long oldtce;
> +
> +		oldtce = iommu_clear_tce(tbl, entry);
> +		if (!oldtce)
> +			continue;
> +
> +		page = realmode_pfn_to_page(oldtce >> PAGE_SHIFT);
> +		if (!page) {
> +			ret = -EAGAIN;
> +			break;
> +		}
> +
> +		if (oldtce & TCE_PCI_WRITE)
> +			SetPageDirty(page);
> +
> +		ret = realmode_put_page(page);
> +		if (ret)
> +			break;
> +	}
> +	/* if (ret < 0)
> +		pr_err("iommu_tce: %s failed ioba=%lx, tce_value=%lx ret=%d\n",
> +				__func__, ioba, tce_value, ret); */
> +
> +	return ret;
> +}
> +
> +static int put_tce_real_mode(struct kvmppc_spapr_tce_table *tt,
> +		struct iommu_table *tbl,
> +		unsigned long ioba, unsigned long tce,
> +		pte_t pte, unsigned long pg_size)
> +{
> +	int ret;
> +	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
> +	struct page *page = NULL;
> +	enum dma_data_direction direction = iommu_tce_direction(tce);
> +
> +	ret = iommu_tce_put_param_check(tbl, ioba, tce);
> +	if (ret)
> +		return ret;
> +
> +	if (pg_size != PAGE_SIZE)
> +		return -EAGAIN;
> +
> +	/* Small page case, find page struct to increment a counter */
> +	page = realmode_pfn_to_page(tce >> PAGE_SHIFT);
> +	if (!page)
> +		return -EAGAIN;
> +
> +	ret = realmode_get_page(page);
> +	if (ret)
> +		return ret;
> +
> +	/* tce_build accepts virtual addresses */
> +	ret = iommu_tce_build(tbl, entry, (unsigned long) __va(tce), direction);
> +	if (ret)
> +		realmode_put_page(page);
> +
> +	/* if (ret < 0)
> +		pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%d\n",
> +				__func__, ioba, tce, ret); */
> +
> +	return ret;
> +}
> +#endif /* CONFIG_IOMMU_API */
> +
>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		      unsigned long ioba, unsigned long tce)
>  {
> @@ -171,6 +251,44 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  	if (!tt)
>  		return H_TOO_HARD;
>  
> +	if (tt->virtmode_only)
> +		return H_TOO_HARD;
> +
> +#ifdef CONFIG_IOMMU_API
> +	if (tt->grp) {
> +		long ret;
> +		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
> +
> +		/* Return error if the group is being destroyed */
> +		if (!tbl)
> +			return H_RESCINDED;
> +
> +		if (tce & (TCE_PCI_READ | TCE_PCI_WRITE)) {
> +			unsigned long hpa, pg_size = 0;
> +			pte_t pte = 0;
> +
> +			hpa = get_real_address(vcpu, tce, tce & TCE_PCI_WRITE,
> +					&pte, &pg_size);
> +			if (hpa == ERROR_ADDR)
> +				return H_TOO_HARD;
> +
> +			ret = put_tce_real_mode(tt, tbl, ioba,
> +					hpa, pte, pg_size);
> +		} else {
> +			ret = clear_tce_real_mode(tbl, ioba, 0, 1);
> +		}
> +		iommu_flush_tce(tbl);
> +
> +		if (ret == -EAGAIN)
> +			return H_TOO_HARD;
> +
> +		if (ret < 0)
> +			return H_PARAMETER;
> +
> +		return H_SUCCESS;
> +	}
> +#endif
> +
>  	/* Emulated IO */
>  	return kvmppc_emulated_h_put_tce(tt, ioba, tce);
>  }
> @@ -192,10 +310,58 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	if (!tt)
>  		return H_TOO_HARD;
>  
> +	if (tt->virtmode_only)
> +		return H_TOO_HARD;
> +
>  	tces = get_real_address(vcpu, tce_list, false, NULL, NULL);
>  	if (tces == ERROR_ADDR)
>  		return H_TOO_HARD;
>  
> +#ifdef CONFIG_IOMMU_API
> +	if (tt->grp) {
> +		long ret = 0;
> +		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
> +
> +		/* Return error if the group is being destroyed */
> +		if (!tbl)
> +			return H_RESCINDED;
> +
> +		for (i = 0; i < npages; ++i) {
> +			unsigned long hpa, pg_size = 0;
> +			pte_t pte = 0;
> +			unsigned long tce;
> +			unsigned long ptce = tces + i * sizeof(unsigned long);
> +
> +			if (get_user(tce, (unsigned long __user *)ptce))
> +				break;
> +
> +			hpa = get_real_address(vcpu, tce,
> +					tce & TCE_PCI_WRITE,
> +					&pte, &pg_size);
> +			if (hpa == ERROR_ADDR)
> +				ret = -EAGAIN;
> +			else
> +				ret = put_tce_real_mode(tt, tbl,
> +						ioba + (i << IOMMU_PAGE_SHIFT),
> +						hpa, pte, pg_size);
> +			if (ret)
> +				break;
> +		}
> +		if (ret)
> +			clear_tce_real_mode(tbl, ioba, 0, i);
> +
> +		iommu_flush_tce(tbl);
> +
> +		if (ret == -EAGAIN)
> +			return H_TOO_HARD;
> +
> +		if (ret < 0)
> +			return H_PARAMETER;
> +
> +		return H_SUCCESS;
> +	}
> +#endif
> +
>  	/* Emulated IO */
>  	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
>  		return H_PARAMETER;
> @@ -236,6 +402,32 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>  	if (!tt)
>  		return H_TOO_HARD;
>  
> +	if (tt->virtmode_only)
> +		return H_TOO_HARD;
> +
> +#ifdef CONFIG_IOMMU_API
> +	if (tt->grp) {
> +		long ret;
> +		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
> +
> +		/* Return error if the group is being destroyed */
> +		if (!tbl)
> +			return H_RESCINDED;
> +
> +		ret = clear_tce_real_mode(tbl, ioba,
> +				tce_value, npages);
> +		iommu_flush_tce(tbl);
> +
> +		if (ret == -EAGAIN)
> +			return H_TOO_HARD;
> +
> +		if (ret < 0)
> +			return H_PARAMETER;
> +
> +		return H_SUCCESS;
> +	}
> +#endif
> +
>  	/* Emulated IO */
>  	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
>  		return H_PARAMETER;
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index b7ad589..269b0f6 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -385,6 +385,7 @@ int kvm_dev_ioctl_check_extension(long ext)
>  		break;
>  #endif
>  	case KVM_CAP_SPAPR_MULTITCE:
> +	case KVM_CAP_SPAPR_TCE_IOMMU:
>  		r = 1;
>  		break;
>  	default:
> @@ -935,6 +936,17 @@ long kvm_arch_vm_ioctl(struct file *filp,
>  		r = kvm_vm_ioctl_create_spapr_tce(kvm, &create_tce);
>  		goto out;
>  	}
> +	case KVM_CREATE_SPAPR_TCE_IOMMU: {
> +		struct kvm_create_spapr_tce_iommu create_tce_iommu;
> +		struct kvm *kvm = filp->private_data;
> +
> +		r = -EFAULT;
> +		if (copy_from_user(&create_tce_iommu, argp,
> +				sizeof(create_tce_iommu)))
> +			goto out;
> +		r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm, &create_tce_iommu);
> +		goto out;
> +	}
>  #endif /* CONFIG_PPC_BOOK3S_64 */
>  
>  #ifdef CONFIG_KVM_BOOK3S_64_HV
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 6c04da1..161e1d3 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -641,6 +641,7 @@ struct kvm_ppc_smmu_info {
>  #define KVM_CAP_PPC_RTAS (0x100000 + 87)
>  #define KVM_CAP_SPAPR_XICS (0x100000 + 88)
>  #define KVM_CAP_SPAPR_MULTITCE (0x110000 + 89)
> +#define KVM_CAP_SPAPR_TCE_IOMMU (0x110000 + 90)
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> @@ -885,6 +886,7 @@ struct kvm_s390_ucas_mapping {
>  #define KVM_PPC_GET_HTAB_FD	  _IOW(KVMIO,  0xaa, struct kvm_get_htab_fd)
>  /* Available with KVM_CAP_PPC_RTAS */
>  #define KVM_PPC_RTAS_DEFINE_TOKEN _IOW(KVMIO,  0xdc, struct kvm_rtas_token_args)
> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xaf, struct kvm_create_spapr_tce_iommu)
>  
>  /*
>   * ioctls for vcpu fds

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/6] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-05-07  5:29     ` David Gibson
  (?)
@ 2013-05-07  5:51       ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 44+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-07  5:51 UTC (permalink / raw)
  To: David Gibson
  Cc: linuxppc-dev, kvm, Alexander Graf, kvm-ppc, Alex Williamson,
	Paul Mackerras, Joerg Roedel

On 05/07/2013 03:29 PM, David Gibson wrote:
> On Mon, May 06, 2013 at 05:25:56PM +1000, Alexey Kardashevskiy wrote:
>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
>> and H_STUFF_TCE requests without passing them to QEMU, which should
>> save time on switching to QEMU and back.
>>
>> Both real and virtual modes are supported - whenever the kernel
>> fails to handle TCE request, it passes it to the virtual mode.
>> If it the virtual mode handlers fail, then the request is passed
>> to the user mode, for example, to QEMU.
>>
>> This adds a new KVM_CAP_SPAPR_TCE_IOMMU ioctl to asssociate
>> a virtual PCI bus ID (LIOBN) with an IOMMU group, which enables
>> in-kernel handling of IOMMU map/unmap.
>>
>> This adds a special case for huge pages (16MB).  The reference
>> counting cannot be easily done for such pages in real mode (when
>> MMU is off) so we added a list of huge pages.  It is populated in
>> virtual mode and get_page is called just once per a huge page.
>> Real mode handlers check if the requested page is huge and in the list,
>> then no reference counting is done, otherwise an exit to virtual mode
>> happens.  The list is released at KVM exit.  At the moment the fastest
>> card available for tests uses up to 9 huge pages so walking through this
>> list is not very expensive.  However this can change and we may want
>> to optimize this.
>>
>> This also adds the virt_only parameter to the KVM module
>> for debug and performance check purposes.
>>
>> Tests show that this patch increases transmission speed from 220MB/s
>> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
>>
>> Cc: David Gibson <david@gibson.dropbear.id.au>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> Signed-off-by: Paul Mackerras <paulus@samba.org>
>> ---
>>  Documentation/virtual/kvm/api.txt   |   28 ++++
>>  arch/powerpc/include/asm/kvm_host.h |    2 +
>>  arch/powerpc/include/asm/kvm_ppc.h  |    2 +
>>  arch/powerpc/include/uapi/asm/kvm.h |    7 +
>>  arch/powerpc/kvm/book3s_64_vio.c    |  242 ++++++++++++++++++++++++++++++++++-
>>  arch/powerpc/kvm/book3s_64_vio_hv.c |  192 +++++++++++++++++++++++++++
>>  arch/powerpc/kvm/powerpc.c          |   12 ++
>>  include/uapi/linux/kvm.h            |    2 +
>>  8 files changed, 485 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
>> index f621cd6..2039767 100644
>> --- a/Documentation/virtual/kvm/api.txt
>> +++ b/Documentation/virtual/kvm/api.txt
>> @@ -2127,6 +2127,34 @@ written, then `n_invalid' invalid entries, invalidating any previously
>>  valid entries found.
>>  
>>  
>> +4.79 KVM_CREATE_SPAPR_TCE_IOMMU
>> +
>> +Capability: KVM_CAP_SPAPR_TCE_IOMMU
>> +Architectures: powerpc
>> +Type: vm ioctl
>> +Parameters: struct kvm_create_spapr_tce_iommu (in)
>> +Returns: 0 on success, -1 on error
>> +
>> +This creates a link between IOMMU group and a hardware TCE (translation
>> +control entry) table. This link lets the host kernel know what IOMMU
>> +group (i.e. TCE table) to use for the LIOBN number passed with
>> +H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls.
>> +
>> +/* for KVM_CAP_SPAPR_TCE_IOMMU */
>> +struct kvm_create_spapr_tce_iommu {
>> +	__u64 liobn;
>> +	__u32 iommu_id;
> 
> Wouldn't it be more in keeping 


pardon?



>> +	__u32 flags;
>> +};
>> +
>> +No flag is supported at the moment.
>> +
>> +When the guest issues TCE call on a liobn for which a TCE table has been
>> +registered, the kernel will handle it in real mode, updating the hardware
>> +TCE table. TCE table calls for other liobns will cause a vm exit and must
>> +be handled by userspace.
>> +
>> +
>>  5. The kvm_run structure
>>  ------------------------
>>  
>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
>> index 36ceb0d..2b70cbc 100644
>> --- a/arch/powerpc/include/asm/kvm_host.h
>> +++ b/arch/powerpc/include/asm/kvm_host.h
>> @@ -178,6 +178,8 @@ struct kvmppc_spapr_tce_table {
>>  	struct kvm *kvm;
>>  	u64 liobn;
>>  	u32 window_size;
>> +	bool virtmode_only;
> 
> I see this is now initialized from the global parameter, but I think
> it would be better to just check the global (debug) parameter
> directly, rather than duplicating it here.


The global parameter is in kvm.ko and the struct above is in the real mode
part which cannot go to the module.



>> +	struct iommu_group *grp;    /* used for IOMMU groups */
>>  	struct page *pages[0];
>>  };
>>  
>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
>> index d501246..bdfa140 100644
>> --- a/arch/powerpc/include/asm/kvm_ppc.h
>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
>> @@ -139,6 +139,8 @@ extern void kvmppc_xics_free(struct kvm *kvm);
>>  
>>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>  				struct kvm_create_spapr_tce *args);
>> +extern long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
>> +				struct kvm_create_spapr_tce_iommu *args);
>>  extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
>>  		struct kvm_vcpu *vcpu, unsigned long liobn);
>>  extern long kvmppc_emulated_h_put_tce(struct kvmppc_spapr_tce_table *stt,
>> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
>> index 681b314..b67d44b 100644
>> --- a/arch/powerpc/include/uapi/asm/kvm.h
>> +++ b/arch/powerpc/include/uapi/asm/kvm.h
>> @@ -291,6 +291,13 @@ struct kvm_create_spapr_tce {
>>  	__u32 window_size;
>>  };
>>  
>> +/* for KVM_CAP_SPAPR_TCE_IOMMU */
>> +struct kvm_create_spapr_tce_iommu {
>> +	__u64 liobn;
>> +	__u32 iommu_id;
>> +	__u32 flags;
>> +};
>> +
>>  /* for KVM_ALLOCATE_RMA */
>>  struct kvm_allocate_rma {
>>  	__u64 rma_size;
>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
>> index 643ac1e..98cf949 100644
>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>> @@ -27,6 +27,9 @@
>>  #include <linux/hugetlb.h>
>>  #include <linux/list.h>
>>  #include <linux/anon_inodes.h>
>> +#include <linux/pci.h>
>> +#include <linux/iommu.h>
>> +#include <linux/module.h>
>>  
>>  #include <asm/tlbflush.h>
>>  #include <asm/kvm_ppc.h>
>> @@ -38,10 +41,19 @@
>>  #include <asm/kvm_host.h>
>>  #include <asm/udbg.h>
>>  #include <asm/iommu.h>
>> +#include <asm/tce.h>
>> +
>> +#define DRIVER_VERSION	"0.1"
>> +#define DRIVER_AUTHOR	"Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>"
>> +#define DRIVER_DESC	"POWERPC KVM driver"
> 
> Really?


What is wrong here?



-- 
Alexey

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/6] KVM: PPC: Add support for IOMMU in-kernel handling
@ 2013-05-07  5:51       ` Alexey Kardashevskiy
  0 siblings, 0 replies; 44+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-07  5:51 UTC (permalink / raw)
  To: David Gibson
  Cc: kvm, Joerg Roedel, Alexander Graf, kvm-ppc, Alex Williamson,
	Paul Mackerras, linuxppc-dev

On 05/07/2013 03:29 PM, David Gibson wrote:
> On Mon, May 06, 2013 at 05:25:56PM +1000, Alexey Kardashevskiy wrote:
>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
>> and H_STUFF_TCE requests without passing them to QEMU, which should
>> save time on switching to QEMU and back.
>>
>> Both real and virtual modes are supported - whenever the kernel
>> fails to handle TCE request, it passes it to the virtual mode.
>> If it the virtual mode handlers fail, then the request is passed
>> to the user mode, for example, to QEMU.
>>
>> This adds a new KVM_CAP_SPAPR_TCE_IOMMU ioctl to asssociate
>> a virtual PCI bus ID (LIOBN) with an IOMMU group, which enables
>> in-kernel handling of IOMMU map/unmap.
>>
>> This adds a special case for huge pages (16MB).  The reference
>> counting cannot be easily done for such pages in real mode (when
>> MMU is off) so we added a list of huge pages.  It is populated in
>> virtual mode and get_page is called just once per a huge page.
>> Real mode handlers check if the requested page is huge and in the list,
>> then no reference counting is done, otherwise an exit to virtual mode
>> happens.  The list is released at KVM exit.  At the moment the fastest
>> card available for tests uses up to 9 huge pages so walking through this
>> list is not very expensive.  However this can change and we may want
>> to optimize this.
>>
>> This also adds the virt_only parameter to the KVM module
>> for debug and performance check purposes.
>>
>> Tests show that this patch increases transmission speed from 220MB/s
>> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
>>
>> Cc: David Gibson <david@gibson.dropbear.id.au>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> Signed-off-by: Paul Mackerras <paulus@samba.org>
>> ---
>>  Documentation/virtual/kvm/api.txt   |   28 ++++
>>  arch/powerpc/include/asm/kvm_host.h |    2 +
>>  arch/powerpc/include/asm/kvm_ppc.h  |    2 +
>>  arch/powerpc/include/uapi/asm/kvm.h |    7 +
>>  arch/powerpc/kvm/book3s_64_vio.c    |  242 ++++++++++++++++++++++++++++++++++-
>>  arch/powerpc/kvm/book3s_64_vio_hv.c |  192 +++++++++++++++++++++++++++
>>  arch/powerpc/kvm/powerpc.c          |   12 ++
>>  include/uapi/linux/kvm.h            |    2 +
>>  8 files changed, 485 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
>> index f621cd6..2039767 100644
>> --- a/Documentation/virtual/kvm/api.txt
>> +++ b/Documentation/virtual/kvm/api.txt
>> @@ -2127,6 +2127,34 @@ written, then `n_invalid' invalid entries, invalidating any previously
>>  valid entries found.
>>  
>>  
>> +4.79 KVM_CREATE_SPAPR_TCE_IOMMU
>> +
>> +Capability: KVM_CAP_SPAPR_TCE_IOMMU
>> +Architectures: powerpc
>> +Type: vm ioctl
>> +Parameters: struct kvm_create_spapr_tce_iommu (in)
>> +Returns: 0 on success, -1 on error
>> +
>> +This creates a link between IOMMU group and a hardware TCE (translation
>> +control entry) table. This link lets the host kernel know what IOMMU
>> +group (i.e. TCE table) to use for the LIOBN number passed with
>> +H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls.
>> +
>> +/* for KVM_CAP_SPAPR_TCE_IOMMU */
>> +struct kvm_create_spapr_tce_iommu {
>> +	__u64 liobn;
>> +	__u32 iommu_id;
> 
> Wouldn't it be more in keeping 


pardon?



>> +	__u32 flags;
>> +};
>> +
>> +No flag is supported at the moment.
>> +
>> +When the guest issues TCE call on a liobn for which a TCE table has been
>> +registered, the kernel will handle it in real mode, updating the hardware
>> +TCE table. TCE table calls for other liobns will cause a vm exit and must
>> +be handled by userspace.
>> +
>> +
>>  5. The kvm_run structure
>>  ------------------------
>>  
>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
>> index 36ceb0d..2b70cbc 100644
>> --- a/arch/powerpc/include/asm/kvm_host.h
>> +++ b/arch/powerpc/include/asm/kvm_host.h
>> @@ -178,6 +178,8 @@ struct kvmppc_spapr_tce_table {
>>  	struct kvm *kvm;
>>  	u64 liobn;
>>  	u32 window_size;
>> +	bool virtmode_only;
> 
> I see this is now initialized from the global parameter, but I think
> it would be better to just check the global (debug) parameter
> directly, rather than duplicating it here.


The global parameter is in kvm.ko and the struct above is in the real mode
part which cannot go to the module.



>> +	struct iommu_group *grp;    /* used for IOMMU groups */
>>  	struct page *pages[0];
>>  };
>>  
>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
>> index d501246..bdfa140 100644
>> --- a/arch/powerpc/include/asm/kvm_ppc.h
>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
>> @@ -139,6 +139,8 @@ extern void kvmppc_xics_free(struct kvm *kvm);
>>  
>>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>  				struct kvm_create_spapr_tce *args);
>> +extern long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
>> +				struct kvm_create_spapr_tce_iommu *args);
>>  extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
>>  		struct kvm_vcpu *vcpu, unsigned long liobn);
>>  extern long kvmppc_emulated_h_put_tce(struct kvmppc_spapr_tce_table *stt,
>> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
>> index 681b314..b67d44b 100644
>> --- a/arch/powerpc/include/uapi/asm/kvm.h
>> +++ b/arch/powerpc/include/uapi/asm/kvm.h
>> @@ -291,6 +291,13 @@ struct kvm_create_spapr_tce {
>>  	__u32 window_size;
>>  };
>>  
>> +/* for KVM_CAP_SPAPR_TCE_IOMMU */
>> +struct kvm_create_spapr_tce_iommu {
>> +	__u64 liobn;
>> +	__u32 iommu_id;
>> +	__u32 flags;
>> +};
>> +
>>  /* for KVM_ALLOCATE_RMA */
>>  struct kvm_allocate_rma {
>>  	__u64 rma_size;
>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
>> index 643ac1e..98cf949 100644
>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>> @@ -27,6 +27,9 @@
>>  #include <linux/hugetlb.h>
>>  #include <linux/list.h>
>>  #include <linux/anon_inodes.h>
>> +#include <linux/pci.h>
>> +#include <linux/iommu.h>
>> +#include <linux/module.h>
>>  
>>  #include <asm/tlbflush.h>
>>  #include <asm/kvm_ppc.h>
>> @@ -38,10 +41,19 @@
>>  #include <asm/kvm_host.h>
>>  #include <asm/udbg.h>
>>  #include <asm/iommu.h>
>> +#include <asm/tce.h>
>> +
>> +#define DRIVER_VERSION	"0.1"
>> +#define DRIVER_AUTHOR	"Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>"
>> +#define DRIVER_DESC	"POWERPC KVM driver"
> 
> Really?


What is wrong here?



-- 
Alexey

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/6] KVM: PPC: Add support for IOMMU in-kernel handling
@ 2013-05-07  5:51       ` Alexey Kardashevskiy
  0 siblings, 0 replies; 44+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-07  5:51 UTC (permalink / raw)
  To: David Gibson
  Cc: linuxppc-dev, kvm, Alexander Graf, kvm-ppc, Alex Williamson,
	Paul Mackerras, Joerg Roedel

On 05/07/2013 03:29 PM, David Gibson wrote:
> On Mon, May 06, 2013 at 05:25:56PM +1000, Alexey Kardashevskiy wrote:
>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
>> and H_STUFF_TCE requests without passing them to QEMU, which should
>> save time on switching to QEMU and back.
>>
>> Both real and virtual modes are supported - whenever the kernel
>> fails to handle TCE request, it passes it to the virtual mode.
>> If it the virtual mode handlers fail, then the request is passed
>> to the user mode, for example, to QEMU.
>>
>> This adds a new KVM_CAP_SPAPR_TCE_IOMMU ioctl to asssociate
>> a virtual PCI bus ID (LIOBN) with an IOMMU group, which enables
>> in-kernel handling of IOMMU map/unmap.
>>
>> This adds a special case for huge pages (16MB).  The reference
>> counting cannot be easily done for such pages in real mode (when
>> MMU is off) so we added a list of huge pages.  It is populated in
>> virtual mode and get_page is called just once per a huge page.
>> Real mode handlers check if the requested page is huge and in the list,
>> then no reference counting is done, otherwise an exit to virtual mode
>> happens.  The list is released at KVM exit.  At the moment the fastest
>> card available for tests uses up to 9 huge pages so walking through this
>> list is not very expensive.  However this can change and we may want
>> to optimize this.
>>
>> This also adds the virt_only parameter to the KVM module
>> for debug and performance check purposes.
>>
>> Tests show that this patch increases transmission speed from 220MB/s
>> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
>>
>> Cc: David Gibson <david@gibson.dropbear.id.au>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> Signed-off-by: Paul Mackerras <paulus@samba.org>
>> ---
>>  Documentation/virtual/kvm/api.txt   |   28 ++++
>>  arch/powerpc/include/asm/kvm_host.h |    2 +
>>  arch/powerpc/include/asm/kvm_ppc.h  |    2 +
>>  arch/powerpc/include/uapi/asm/kvm.h |    7 +
>>  arch/powerpc/kvm/book3s_64_vio.c    |  242 ++++++++++++++++++++++++++++++++++-
>>  arch/powerpc/kvm/book3s_64_vio_hv.c |  192 +++++++++++++++++++++++++++
>>  arch/powerpc/kvm/powerpc.c          |   12 ++
>>  include/uapi/linux/kvm.h            |    2 +
>>  8 files changed, 485 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
>> index f621cd6..2039767 100644
>> --- a/Documentation/virtual/kvm/api.txt
>> +++ b/Documentation/virtual/kvm/api.txt
>> @@ -2127,6 +2127,34 @@ written, then `n_invalid' invalid entries, invalidating any previously
>>  valid entries found.
>>  
>>  
>> +4.79 KVM_CREATE_SPAPR_TCE_IOMMU
>> +
>> +Capability: KVM_CAP_SPAPR_TCE_IOMMU
>> +Architectures: powerpc
>> +Type: vm ioctl
>> +Parameters: struct kvm_create_spapr_tce_iommu (in)
>> +Returns: 0 on success, -1 on error
>> +
>> +This creates a link between IOMMU group and a hardware TCE (translation
>> +control entry) table. This link lets the host kernel know what IOMMU
>> +group (i.e. TCE table) to use for the LIOBN number passed with
>> +H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls.
>> +
>> +/* for KVM_CAP_SPAPR_TCE_IOMMU */
>> +struct kvm_create_spapr_tce_iommu {
>> +	__u64 liobn;
>> +	__u32 iommu_id;
> 
> Wouldn't it be more in keeping 


pardon?



>> +	__u32 flags;
>> +};
>> +
>> +No flag is supported at the moment.
>> +
>> +When the guest issues TCE call on a liobn for which a TCE table has been
>> +registered, the kernel will handle it in real mode, updating the hardware
>> +TCE table. TCE table calls for other liobns will cause a vm exit and must
>> +be handled by userspace.
>> +
>> +
>>  5. The kvm_run structure
>>  ------------------------
>>  
>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
>> index 36ceb0d..2b70cbc 100644
>> --- a/arch/powerpc/include/asm/kvm_host.h
>> +++ b/arch/powerpc/include/asm/kvm_host.h
>> @@ -178,6 +178,8 @@ struct kvmppc_spapr_tce_table {
>>  	struct kvm *kvm;
>>  	u64 liobn;
>>  	u32 window_size;
>> +	bool virtmode_only;
> 
> I see this is now initialized from the global parameter, but I think
> it would be better to just check the global (debug) parameter
> directly, rather than duplicating it here.


The global parameter is in kvm.ko and the struct above is in the real mode
part which cannot go to the module.



>> +	struct iommu_group *grp;    /* used for IOMMU groups */
>>  	struct page *pages[0];
>>  };
>>  
>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
>> index d501246..bdfa140 100644
>> --- a/arch/powerpc/include/asm/kvm_ppc.h
>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
>> @@ -139,6 +139,8 @@ extern void kvmppc_xics_free(struct kvm *kvm);
>>  
>>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>  				struct kvm_create_spapr_tce *args);
>> +extern long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
>> +				struct kvm_create_spapr_tce_iommu *args);
>>  extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
>>  		struct kvm_vcpu *vcpu, unsigned long liobn);
>>  extern long kvmppc_emulated_h_put_tce(struct kvmppc_spapr_tce_table *stt,
>> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
>> index 681b314..b67d44b 100644
>> --- a/arch/powerpc/include/uapi/asm/kvm.h
>> +++ b/arch/powerpc/include/uapi/asm/kvm.h
>> @@ -291,6 +291,13 @@ struct kvm_create_spapr_tce {
>>  	__u32 window_size;
>>  };
>>  
>> +/* for KVM_CAP_SPAPR_TCE_IOMMU */
>> +struct kvm_create_spapr_tce_iommu {
>> +	__u64 liobn;
>> +	__u32 iommu_id;
>> +	__u32 flags;
>> +};
>> +
>>  /* for KVM_ALLOCATE_RMA */
>>  struct kvm_allocate_rma {
>>  	__u64 rma_size;
>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
>> index 643ac1e..98cf949 100644
>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>> @@ -27,6 +27,9 @@
>>  #include <linux/hugetlb.h>
>>  #include <linux/list.h>
>>  #include <linux/anon_inodes.h>
>> +#include <linux/pci.h>
>> +#include <linux/iommu.h>
>> +#include <linux/module.h>
>>  
>>  #include <asm/tlbflush.h>
>>  #include <asm/kvm_ppc.h>
>> @@ -38,10 +41,19 @@
>>  #include <asm/kvm_host.h>
>>  #include <asm/udbg.h>
>>  #include <asm/iommu.h>
>> +#include <asm/tce.h>
>> +
>> +#define DRIVER_VERSION	"0.1"
>> +#define DRIVER_AUTHOR	"Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>"
>> +#define DRIVER_DESC	"POWERPC KVM driver"
> 
> Really?


What is wrong here?



-- 
Alexey

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/6] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-05-07  5:51       ` Alexey Kardashevskiy
  (?)
@ 2013-05-07  6:02         ` David Gibson
  -1 siblings, 0 replies; 44+ messages in thread
From: David Gibson @ 2013-05-07  6:02 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, kvm, Alexander Graf, kvm-ppc, Alex Williamson,
	Paul Mackerras, Joerg Roedel

[-- Attachment #1: Type: text/plain, Size: 7585 bytes --]

On Tue, May 07, 2013 at 03:51:31PM +1000, Alexey Kardashevskiy wrote:
> On 05/07/2013 03:29 PM, David Gibson wrote:
> > On Mon, May 06, 2013 at 05:25:56PM +1000, Alexey Kardashevskiy wrote:
> >> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> >> and H_STUFF_TCE requests without passing them to QEMU, which should
> >> save time on switching to QEMU and back.
> >>
> >> Both real and virtual modes are supported - whenever the kernel
> >> fails to handle TCE request, it passes it to the virtual mode.
> >> If it the virtual mode handlers fail, then the request is passed
> >> to the user mode, for example, to QEMU.
> >>
> >> This adds a new KVM_CAP_SPAPR_TCE_IOMMU ioctl to asssociate
> >> a virtual PCI bus ID (LIOBN) with an IOMMU group, which enables
> >> in-kernel handling of IOMMU map/unmap.
> >>
> >> This adds a special case for huge pages (16MB).  The reference
> >> counting cannot be easily done for such pages in real mode (when
> >> MMU is off) so we added a list of huge pages.  It is populated in
> >> virtual mode and get_page is called just once per a huge page.
> >> Real mode handlers check if the requested page is huge and in the list,
> >> then no reference counting is done, otherwise an exit to virtual mode
> >> happens.  The list is released at KVM exit.  At the moment the fastest
> >> card available for tests uses up to 9 huge pages so walking through this
> >> list is not very expensive.  However this can change and we may want
> >> to optimize this.
> >>
> >> This also adds the virt_only parameter to the KVM module
> >> for debug and performance check purposes.
> >>
> >> Tests show that this patch increases transmission speed from 220MB/s
> >> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
> >>
> >> Cc: David Gibson <david@gibson.dropbear.id.au>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> Signed-off-by: Paul Mackerras <paulus@samba.org>
> >> ---
> >>  Documentation/virtual/kvm/api.txt   |   28 ++++
> >>  arch/powerpc/include/asm/kvm_host.h |    2 +
> >>  arch/powerpc/include/asm/kvm_ppc.h  |    2 +
> >>  arch/powerpc/include/uapi/asm/kvm.h |    7 +
> >>  arch/powerpc/kvm/book3s_64_vio.c    |  242 ++++++++++++++++++++++++++++++++++-
> >>  arch/powerpc/kvm/book3s_64_vio_hv.c |  192 +++++++++++++++++++++++++++
> >>  arch/powerpc/kvm/powerpc.c          |   12 ++
> >>  include/uapi/linux/kvm.h            |    2 +
> >>  8 files changed, 485 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> >> index f621cd6..2039767 100644
> >> --- a/Documentation/virtual/kvm/api.txt
> >> +++ b/Documentation/virtual/kvm/api.txt
> >> @@ -2127,6 +2127,34 @@ written, then `n_invalid' invalid entries, invalidating any previously
> >>  valid entries found.
> >>  
> >>  
> >> +4.79 KVM_CREATE_SPAPR_TCE_IOMMU
> >> +
> >> +Capability: KVM_CAP_SPAPR_TCE_IOMMU
> >> +Architectures: powerpc
> >> +Type: vm ioctl
> >> +Parameters: struct kvm_create_spapr_tce_iommu (in)
> >> +Returns: 0 on success, -1 on error
> >> +
> >> +This creates a link between IOMMU group and a hardware TCE (translation
> >> +control entry) table. This link lets the host kernel know what IOMMU
> >> +group (i.e. TCE table) to use for the LIOBN number passed with
> >> +H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls.
> >> +
> >> +/* for KVM_CAP_SPAPR_TCE_IOMMU */
> >> +struct kvm_create_spapr_tce_iommu {
> >> +	__u64 liobn;
> >> +	__u32 iommu_id;
> > 
> > Wouldn't it be more in keeping 
> 
> 
> pardon?

Sorry, I was going to suggest a change, but then realised it wasn't
actually any better than what you have now.

> >> +	__u32 flags;
> >> +};
> >> +
> >> +No flag is supported at the moment.
> >> +
> >> +When the guest issues TCE call on a liobn for which a TCE table has been
> >> +registered, the kernel will handle it in real mode, updating the hardware
> >> +TCE table. TCE table calls for other liobns will cause a vm exit and must
> >> +be handled by userspace.
> >> +
> >> +
> >>  5. The kvm_run structure
> >>  ------------------------
> >>  
> >> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> >> index 36ceb0d..2b70cbc 100644
> >> --- a/arch/powerpc/include/asm/kvm_host.h
> >> +++ b/arch/powerpc/include/asm/kvm_host.h
> >> @@ -178,6 +178,8 @@ struct kvmppc_spapr_tce_table {
> >>  	struct kvm *kvm;
> >>  	u64 liobn;
> >>  	u32 window_size;
> >> +	bool virtmode_only;
> > 
> > I see this is now initialized from the global parameter, but I think
> > it would be better to just check the global (debug) parameter
> > directly, rather than duplicating it here.
> 
> 
> The global parameter is in kvm.ko and the struct above is in the real mode
> part which cannot go to the module.

Ah, ok.  I'm half inclined to just drop the virtmode_only thing
entirely.

> >> +	struct iommu_group *grp;    /* used for IOMMU groups */
> >>  	struct page *pages[0];
> >>  };
> >>  
> >> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> >> index d501246..bdfa140 100644
> >> --- a/arch/powerpc/include/asm/kvm_ppc.h
> >> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> >> @@ -139,6 +139,8 @@ extern void kvmppc_xics_free(struct kvm *kvm);
> >>  
> >>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> >>  				struct kvm_create_spapr_tce *args);
> >> +extern long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
> >> +				struct kvm_create_spapr_tce_iommu *args);
> >>  extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
> >>  		struct kvm_vcpu *vcpu, unsigned long liobn);
> >>  extern long kvmppc_emulated_h_put_tce(struct kvmppc_spapr_tce_table *stt,
> >> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
> >> index 681b314..b67d44b 100644
> >> --- a/arch/powerpc/include/uapi/asm/kvm.h
> >> +++ b/arch/powerpc/include/uapi/asm/kvm.h
> >> @@ -291,6 +291,13 @@ struct kvm_create_spapr_tce {
> >>  	__u32 window_size;
> >>  };
> >>  
> >> +/* for KVM_CAP_SPAPR_TCE_IOMMU */
> >> +struct kvm_create_spapr_tce_iommu {
> >> +	__u64 liobn;
> >> +	__u32 iommu_id;
> >> +	__u32 flags;
> >> +};
> >> +
> >>  /* for KVM_ALLOCATE_RMA */
> >>  struct kvm_allocate_rma {
> >>  	__u64 rma_size;
> >> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> >> index 643ac1e..98cf949 100644
> >> --- a/arch/powerpc/kvm/book3s_64_vio.c
> >> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> >> @@ -27,6 +27,9 @@
> >>  #include <linux/hugetlb.h>
> >>  #include <linux/list.h>
> >>  #include <linux/anon_inodes.h>
> >> +#include <linux/pci.h>
> >> +#include <linux/iommu.h>
> >> +#include <linux/module.h>
> >>  
> >>  #include <asm/tlbflush.h>
> >>  #include <asm/kvm_ppc.h>
> >> @@ -38,10 +41,19 @@
> >>  #include <asm/kvm_host.h>
> >>  #include <asm/udbg.h>
> >>  #include <asm/iommu.h>
> >> +#include <asm/tce.h>
> >> +
> >> +#define DRIVER_VERSION	"0.1"
> >> +#define DRIVER_AUTHOR	"Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>"
> >> +#define DRIVER_DESC	"POWERPC KVM driver"
> > 
> > Really?
> 
> 
> What is wrong here?

Well, it seems entirely unrelated to the rest of the changes, and not
obviously accurate.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/6] KVM: PPC: Add support for IOMMU in-kernel handling
@ 2013-05-07  6:02         ` David Gibson
  0 siblings, 0 replies; 44+ messages in thread
From: David Gibson @ 2013-05-07  6:02 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Joerg Roedel, Alexander Graf, kvm-ppc, Alex Williamson,
	Paul Mackerras, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 7585 bytes --]

On Tue, May 07, 2013 at 03:51:31PM +1000, Alexey Kardashevskiy wrote:
> On 05/07/2013 03:29 PM, David Gibson wrote:
> > On Mon, May 06, 2013 at 05:25:56PM +1000, Alexey Kardashevskiy wrote:
> >> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> >> and H_STUFF_TCE requests without passing them to QEMU, which should
> >> save time on switching to QEMU and back.
> >>
> >> Both real and virtual modes are supported - whenever the kernel
> >> fails to handle TCE request, it passes it to the virtual mode.
> >> If it the virtual mode handlers fail, then the request is passed
> >> to the user mode, for example, to QEMU.
> >>
> >> This adds a new KVM_CAP_SPAPR_TCE_IOMMU ioctl to asssociate
> >> a virtual PCI bus ID (LIOBN) with an IOMMU group, which enables
> >> in-kernel handling of IOMMU map/unmap.
> >>
> >> This adds a special case for huge pages (16MB).  The reference
> >> counting cannot be easily done for such pages in real mode (when
> >> MMU is off) so we added a list of huge pages.  It is populated in
> >> virtual mode and get_page is called just once per a huge page.
> >> Real mode handlers check if the requested page is huge and in the list,
> >> then no reference counting is done, otherwise an exit to virtual mode
> >> happens.  The list is released at KVM exit.  At the moment the fastest
> >> card available for tests uses up to 9 huge pages so walking through this
> >> list is not very expensive.  However this can change and we may want
> >> to optimize this.
> >>
> >> This also adds the virt_only parameter to the KVM module
> >> for debug and performance check purposes.
> >>
> >> Tests show that this patch increases transmission speed from 220MB/s
> >> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
> >>
> >> Cc: David Gibson <david@gibson.dropbear.id.au>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> Signed-off-by: Paul Mackerras <paulus@samba.org>
> >> ---
> >>  Documentation/virtual/kvm/api.txt   |   28 ++++
> >>  arch/powerpc/include/asm/kvm_host.h |    2 +
> >>  arch/powerpc/include/asm/kvm_ppc.h  |    2 +
> >>  arch/powerpc/include/uapi/asm/kvm.h |    7 +
> >>  arch/powerpc/kvm/book3s_64_vio.c    |  242 ++++++++++++++++++++++++++++++++++-
> >>  arch/powerpc/kvm/book3s_64_vio_hv.c |  192 +++++++++++++++++++++++++++
> >>  arch/powerpc/kvm/powerpc.c          |   12 ++
> >>  include/uapi/linux/kvm.h            |    2 +
> >>  8 files changed, 485 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> >> index f621cd6..2039767 100644
> >> --- a/Documentation/virtual/kvm/api.txt
> >> +++ b/Documentation/virtual/kvm/api.txt
> >> @@ -2127,6 +2127,34 @@ written, then `n_invalid' invalid entries, invalidating any previously
> >>  valid entries found.
> >>  
> >>  
> >> +4.79 KVM_CREATE_SPAPR_TCE_IOMMU
> >> +
> >> +Capability: KVM_CAP_SPAPR_TCE_IOMMU
> >> +Architectures: powerpc
> >> +Type: vm ioctl
> >> +Parameters: struct kvm_create_spapr_tce_iommu (in)
> >> +Returns: 0 on success, -1 on error
> >> +
> >> +This creates a link between IOMMU group and a hardware TCE (translation
> >> +control entry) table. This link lets the host kernel know what IOMMU
> >> +group (i.e. TCE table) to use for the LIOBN number passed with
> >> +H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls.
> >> +
> >> +/* for KVM_CAP_SPAPR_TCE_IOMMU */
> >> +struct kvm_create_spapr_tce_iommu {
> >> +	__u64 liobn;
> >> +	__u32 iommu_id;
> > 
> > Wouldn't it be more in keeping 
> 
> 
> pardon?

Sorry, I was going to suggest a change, but then realised it wasn't
actually any better than what you have now.

> >> +	__u32 flags;
> >> +};
> >> +
> >> +No flag is supported at the moment.
> >> +
> >> +When the guest issues TCE call on a liobn for which a TCE table has been
> >> +registered, the kernel will handle it in real mode, updating the hardware
> >> +TCE table. TCE table calls for other liobns will cause a vm exit and must
> >> +be handled by userspace.
> >> +
> >> +
> >>  5. The kvm_run structure
> >>  ------------------------
> >>  
> >> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> >> index 36ceb0d..2b70cbc 100644
> >> --- a/arch/powerpc/include/asm/kvm_host.h
> >> +++ b/arch/powerpc/include/asm/kvm_host.h
> >> @@ -178,6 +178,8 @@ struct kvmppc_spapr_tce_table {
> >>  	struct kvm *kvm;
> >>  	u64 liobn;
> >>  	u32 window_size;
> >> +	bool virtmode_only;
> > 
> > I see this is now initialized from the global parameter, but I think
> > it would be better to just check the global (debug) parameter
> > directly, rather than duplicating it here.
> 
> 
> The global parameter is in kvm.ko and the struct above is in the real mode
> part which cannot go to the module.

Ah, ok.  I'm half inclined to just drop the virtmode_only thing
entirely.

> >> +	struct iommu_group *grp;    /* used for IOMMU groups */
> >>  	struct page *pages[0];
> >>  };
> >>  
> >> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> >> index d501246..bdfa140 100644
> >> --- a/arch/powerpc/include/asm/kvm_ppc.h
> >> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> >> @@ -139,6 +139,8 @@ extern void kvmppc_xics_free(struct kvm *kvm);
> >>  
> >>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> >>  				struct kvm_create_spapr_tce *args);
> >> +extern long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
> >> +				struct kvm_create_spapr_tce_iommu *args);
> >>  extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
> >>  		struct kvm_vcpu *vcpu, unsigned long liobn);
> >>  extern long kvmppc_emulated_h_put_tce(struct kvmppc_spapr_tce_table *stt,
> >> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
> >> index 681b314..b67d44b 100644
> >> --- a/arch/powerpc/include/uapi/asm/kvm.h
> >> +++ b/arch/powerpc/include/uapi/asm/kvm.h
> >> @@ -291,6 +291,13 @@ struct kvm_create_spapr_tce {
> >>  	__u32 window_size;
> >>  };
> >>  
> >> +/* for KVM_CAP_SPAPR_TCE_IOMMU */
> >> +struct kvm_create_spapr_tce_iommu {
> >> +	__u64 liobn;
> >> +	__u32 iommu_id;
> >> +	__u32 flags;
> >> +};
> >> +
> >>  /* for KVM_ALLOCATE_RMA */
> >>  struct kvm_allocate_rma {
> >>  	__u64 rma_size;
> >> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> >> index 643ac1e..98cf949 100644
> >> --- a/arch/powerpc/kvm/book3s_64_vio.c
> >> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> >> @@ -27,6 +27,9 @@
> >>  #include <linux/hugetlb.h>
> >>  #include <linux/list.h>
> >>  #include <linux/anon_inodes.h>
> >> +#include <linux/pci.h>
> >> +#include <linux/iommu.h>
> >> +#include <linux/module.h>
> >>  
> >>  #include <asm/tlbflush.h>
> >>  #include <asm/kvm_ppc.h>
> >> @@ -38,10 +41,19 @@
> >>  #include <asm/kvm_host.h>
> >>  #include <asm/udbg.h>
> >>  #include <asm/iommu.h>
> >> +#include <asm/tce.h>
> >> +
> >> +#define DRIVER_VERSION	"0.1"
> >> +#define DRIVER_AUTHOR	"Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>"
> >> +#define DRIVER_DESC	"POWERPC KVM driver"
> > 
> > Really?
> 
> 
> What is wrong here?

Well, it seems entirely unrelated to the rest of the changes, and not
obviously accurate.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/6] KVM: PPC: Add support for IOMMU in-kernel handling
@ 2013-05-07  6:02         ` David Gibson
  0 siblings, 0 replies; 44+ messages in thread
From: David Gibson @ 2013-05-07  6:02 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, kvm, Alexander Graf, kvm-ppc, Alex Williamson,
	Paul Mackerras, Joerg Roedel

[-- Attachment #1: Type: text/plain, Size: 7585 bytes --]

On Tue, May 07, 2013 at 03:51:31PM +1000, Alexey Kardashevskiy wrote:
> On 05/07/2013 03:29 PM, David Gibson wrote:
> > On Mon, May 06, 2013 at 05:25:56PM +1000, Alexey Kardashevskiy wrote:
> >> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> >> and H_STUFF_TCE requests without passing them to QEMU, which should
> >> save time on switching to QEMU and back.
> >>
> >> Both real and virtual modes are supported - whenever the kernel
> >> fails to handle TCE request, it passes it to the virtual mode.
> >> If it the virtual mode handlers fail, then the request is passed
> >> to the user mode, for example, to QEMU.
> >>
> >> This adds a new KVM_CAP_SPAPR_TCE_IOMMU ioctl to asssociate
> >> a virtual PCI bus ID (LIOBN) with an IOMMU group, which enables
> >> in-kernel handling of IOMMU map/unmap.
> >>
> >> This adds a special case for huge pages (16MB).  The reference
> >> counting cannot be easily done for such pages in real mode (when
> >> MMU is off) so we added a list of huge pages.  It is populated in
> >> virtual mode and get_page is called just once per a huge page.
> >> Real mode handlers check if the requested page is huge and in the list,
> >> then no reference counting is done, otherwise an exit to virtual mode
> >> happens.  The list is released at KVM exit.  At the moment the fastest
> >> card available for tests uses up to 9 huge pages so walking through this
> >> list is not very expensive.  However this can change and we may want
> >> to optimize this.
> >>
> >> This also adds the virt_only parameter to the KVM module
> >> for debug and performance check purposes.
> >>
> >> Tests show that this patch increases transmission speed from 220MB/s
> >> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
> >>
> >> Cc: David Gibson <david@gibson.dropbear.id.au>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> Signed-off-by: Paul Mackerras <paulus@samba.org>
> >> ---
> >>  Documentation/virtual/kvm/api.txt   |   28 ++++
> >>  arch/powerpc/include/asm/kvm_host.h |    2 +
> >>  arch/powerpc/include/asm/kvm_ppc.h  |    2 +
> >>  arch/powerpc/include/uapi/asm/kvm.h |    7 +
> >>  arch/powerpc/kvm/book3s_64_vio.c    |  242 ++++++++++++++++++++++++++++++++++-
> >>  arch/powerpc/kvm/book3s_64_vio_hv.c |  192 +++++++++++++++++++++++++++
> >>  arch/powerpc/kvm/powerpc.c          |   12 ++
> >>  include/uapi/linux/kvm.h            |    2 +
> >>  8 files changed, 485 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> >> index f621cd6..2039767 100644
> >> --- a/Documentation/virtual/kvm/api.txt
> >> +++ b/Documentation/virtual/kvm/api.txt
> >> @@ -2127,6 +2127,34 @@ written, then `n_invalid' invalid entries, invalidating any previously
> >>  valid entries found.
> >>  
> >>  
> >> +4.79 KVM_CREATE_SPAPR_TCE_IOMMU
> >> +
> >> +Capability: KVM_CAP_SPAPR_TCE_IOMMU
> >> +Architectures: powerpc
> >> +Type: vm ioctl
> >> +Parameters: struct kvm_create_spapr_tce_iommu (in)
> >> +Returns: 0 on success, -1 on error
> >> +
> >> +This creates a link between IOMMU group and a hardware TCE (translation
> >> +control entry) table. This link lets the host kernel know what IOMMU
> >> +group (i.e. TCE table) to use for the LIOBN number passed with
> >> +H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls.
> >> +
> >> +/* for KVM_CAP_SPAPR_TCE_IOMMU */
> >> +struct kvm_create_spapr_tce_iommu {
> >> +	__u64 liobn;
> >> +	__u32 iommu_id;
> > 
> > Wouldn't it be more in keeping 
> 
> 
> pardon?

Sorry, I was going to suggest a change, but then realised it wasn't
actually any better than what you have now.

> >> +	__u32 flags;
> >> +};
> >> +
> >> +No flag is supported at the moment.
> >> +
> >> +When the guest issues TCE call on a liobn for which a TCE table has been
> >> +registered, the kernel will handle it in real mode, updating the hardware
> >> +TCE table. TCE table calls for other liobns will cause a vm exit and must
> >> +be handled by userspace.
> >> +
> >> +
> >>  5. The kvm_run structure
> >>  ------------------------
> >>  
> >> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> >> index 36ceb0d..2b70cbc 100644
> >> --- a/arch/powerpc/include/asm/kvm_host.h
> >> +++ b/arch/powerpc/include/asm/kvm_host.h
> >> @@ -178,6 +178,8 @@ struct kvmppc_spapr_tce_table {
> >>  	struct kvm *kvm;
> >>  	u64 liobn;
> >>  	u32 window_size;
> >> +	bool virtmode_only;
> > 
> > I see this is now initialized from the global parameter, but I think
> > it would be better to just check the global (debug) parameter
> > directly, rather than duplicating it here.
> 
> 
> The global parameter is in kvm.ko and the struct above is in the real mode
> part which cannot go to the module.

Ah, ok.  I'm half inclined to just drop the virtmode_only thing
entirely.

> >> +	struct iommu_group *grp;    /* used for IOMMU groups */
> >>  	struct page *pages[0];
> >>  };
> >>  
> >> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> >> index d501246..bdfa140 100644
> >> --- a/arch/powerpc/include/asm/kvm_ppc.h
> >> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> >> @@ -139,6 +139,8 @@ extern void kvmppc_xics_free(struct kvm *kvm);
> >>  
> >>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> >>  				struct kvm_create_spapr_tce *args);
> >> +extern long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
> >> +				struct kvm_create_spapr_tce_iommu *args);
> >>  extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
> >>  		struct kvm_vcpu *vcpu, unsigned long liobn);
> >>  extern long kvmppc_emulated_h_put_tce(struct kvmppc_spapr_tce_table *stt,
> >> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
> >> index 681b314..b67d44b 100644
> >> --- a/arch/powerpc/include/uapi/asm/kvm.h
> >> +++ b/arch/powerpc/include/uapi/asm/kvm.h
> >> @@ -291,6 +291,13 @@ struct kvm_create_spapr_tce {
> >>  	__u32 window_size;
> >>  };
> >>  
> >> +/* for KVM_CAP_SPAPR_TCE_IOMMU */
> >> +struct kvm_create_spapr_tce_iommu {
> >> +	__u64 liobn;
> >> +	__u32 iommu_id;
> >> +	__u32 flags;
> >> +};
> >> +
> >>  /* for KVM_ALLOCATE_RMA */
> >>  struct kvm_allocate_rma {
> >>  	__u64 rma_size;
> >> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> >> index 643ac1e..98cf949 100644
> >> --- a/arch/powerpc/kvm/book3s_64_vio.c
> >> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> >> @@ -27,6 +27,9 @@
> >>  #include <linux/hugetlb.h>
> >>  #include <linux/list.h>
> >>  #include <linux/anon_inodes.h>
> >> +#include <linux/pci.h>
> >> +#include <linux/iommu.h>
> >> +#include <linux/module.h>
> >>  
> >>  #include <asm/tlbflush.h>
> >>  #include <asm/kvm_ppc.h>
> >> @@ -38,10 +41,19 @@
> >>  #include <asm/kvm_host.h>
> >>  #include <asm/udbg.h>
> >>  #include <asm/iommu.h>
> >> +#include <asm/tce.h>
> >> +
> >> +#define DRIVER_VERSION	"0.1"
> >> +#define DRIVER_AUTHOR	"Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>"
> >> +#define DRIVER_DESC	"POWERPC KVM driver"
> > 
> > Really?
> 
> 
> What is wrong here?

Well, it seems entirely unrelated to the rest of the changes, and not
obviously accurate.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/6] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-05-07  6:02         ` David Gibson
  (?)
@ 2013-05-07  6:27           ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 44+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-07  6:27 UTC (permalink / raw)
  To: David Gibson
  Cc: linuxppc-dev, kvm, Alexander Graf, kvm-ppc, Alex Williamson,
	Paul Mackerras, Joerg Roedel

On 05/07/2013 04:02 PM, David Gibson wrote:
> On Tue, May 07, 2013 at 03:51:31PM +1000, Alexey Kardashevskiy wrote:
>> On 05/07/2013 03:29 PM, David Gibson wrote:
>>> On Mon, May 06, 2013 at 05:25:56PM +1000, Alexey Kardashevskiy wrote:
>>>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
>>>> and H_STUFF_TCE requests without passing them to QEMU, which should
>>>> save time on switching to QEMU and back.
>>>>
>>>> Both real and virtual modes are supported - whenever the kernel
>>>> fails to handle TCE request, it passes it to the virtual mode.
>>>> If it the virtual mode handlers fail, then the request is passed
>>>> to the user mode, for example, to QEMU.
>>>>
>>>> This adds a new KVM_CAP_SPAPR_TCE_IOMMU ioctl to asssociate
>>>> a virtual PCI bus ID (LIOBN) with an IOMMU group, which enables
>>>> in-kernel handling of IOMMU map/unmap.
>>>>
>>>> This adds a special case for huge pages (16MB).  The reference
>>>> counting cannot be easily done for such pages in real mode (when
>>>> MMU is off) so we added a list of huge pages.  It is populated in
>>>> virtual mode and get_page is called just once per a huge page.
>>>> Real mode handlers check if the requested page is huge and in the list,
>>>> then no reference counting is done, otherwise an exit to virtual mode
>>>> happens.  The list is released at KVM exit.  At the moment the fastest
>>>> card available for tests uses up to 9 huge pages so walking through this
>>>> list is not very expensive.  However this can change and we may want
>>>> to optimize this.
>>>>
>>>> This also adds the virt_only parameter to the KVM module
>>>> for debug and performance check purposes.
>>>>
>>>> Tests show that this patch increases transmission speed from 220MB/s
>>>> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
>>>>
>>>> Cc: David Gibson <david@gibson.dropbear.id.au>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> Signed-off-by: Paul Mackerras <paulus@samba.org>
>>>> ---
>>>>  Documentation/virtual/kvm/api.txt   |   28 ++++
>>>>  arch/powerpc/include/asm/kvm_host.h |    2 +
>>>>  arch/powerpc/include/asm/kvm_ppc.h  |    2 +
>>>>  arch/powerpc/include/uapi/asm/kvm.h |    7 +
>>>>  arch/powerpc/kvm/book3s_64_vio.c    |  242 ++++++++++++++++++++++++++++++++++-
>>>>  arch/powerpc/kvm/book3s_64_vio_hv.c |  192 +++++++++++++++++++++++++++
>>>>  arch/powerpc/kvm/powerpc.c          |   12 ++
>>>>  include/uapi/linux/kvm.h            |    2 +
>>>>  8 files changed, 485 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
>>>> index f621cd6..2039767 100644
>>>> --- a/Documentation/virtual/kvm/api.txt
>>>> +++ b/Documentation/virtual/kvm/api.txt
>>>> @@ -2127,6 +2127,34 @@ written, then `n_invalid' invalid entries, invalidating any previously
>>>>  valid entries found.
>>>>  
>>>>  
>>>> +4.79 KVM_CREATE_SPAPR_TCE_IOMMU
>>>> +
>>>> +Capability: KVM_CAP_SPAPR_TCE_IOMMU
>>>> +Architectures: powerpc
>>>> +Type: vm ioctl
>>>> +Parameters: struct kvm_create_spapr_tce_iommu (in)
>>>> +Returns: 0 on success, -1 on error
>>>> +
>>>> +This creates a link between IOMMU group and a hardware TCE (translation
>>>> +control entry) table. This link lets the host kernel know what IOMMU
>>>> +group (i.e. TCE table) to use for the LIOBN number passed with
>>>> +H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls.
>>>> +
>>>> +/* for KVM_CAP_SPAPR_TCE_IOMMU */
>>>> +struct kvm_create_spapr_tce_iommu {
>>>> +	__u64 liobn;
>>>> +	__u32 iommu_id;
>>>
>>> Wouldn't it be more in keeping 
>>
>>
>> pardon?
> 
> Sorry, I was going to suggest a change, but then realised it wasn't
> actually any better than what you have now.
> 
>>>> +	__u32 flags;
>>>> +};
>>>> +
>>>> +No flag is supported at the moment.
>>>> +
>>>> +When the guest issues TCE call on a liobn for which a TCE table has been
>>>> +registered, the kernel will handle it in real mode, updating the hardware
>>>> +TCE table. TCE table calls for other liobns will cause a vm exit and must
>>>> +be handled by userspace.
>>>> +
>>>> +
>>>>  5. The kvm_run structure
>>>>  ------------------------
>>>>  
>>>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
>>>> index 36ceb0d..2b70cbc 100644
>>>> --- a/arch/powerpc/include/asm/kvm_host.h
>>>> +++ b/arch/powerpc/include/asm/kvm_host.h
>>>> @@ -178,6 +178,8 @@ struct kvmppc_spapr_tce_table {
>>>>  	struct kvm *kvm;
>>>>  	u64 liobn;
>>>>  	u32 window_size;
>>>> +	bool virtmode_only;
>>>
>>> I see this is now initialized from the global parameter, but I think
>>> it would be better to just check the global (debug) parameter
>>> directly, rather than duplicating it here.
>>
>>
>> The global parameter is in kvm.ko and the struct above is in the real mode
>> part which cannot go to the module.
> 
> Ah, ok.  I'm half inclined to just drop the virtmode_only thing
> entirely.
> 
>>>> +	struct iommu_group *grp;    /* used for IOMMU groups */
>>>>  	struct page *pages[0];
>>>>  };
>>>>  
>>>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
>>>> index d501246..bdfa140 100644
>>>> --- a/arch/powerpc/include/asm/kvm_ppc.h
>>>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
>>>> @@ -139,6 +139,8 @@ extern void kvmppc_xics_free(struct kvm *kvm);
>>>>  
>>>>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>>>  				struct kvm_create_spapr_tce *args);
>>>> +extern long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
>>>> +				struct kvm_create_spapr_tce_iommu *args);
>>>>  extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
>>>>  		struct kvm_vcpu *vcpu, unsigned long liobn);
>>>>  extern long kvmppc_emulated_h_put_tce(struct kvmppc_spapr_tce_table *stt,
>>>> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
>>>> index 681b314..b67d44b 100644
>>>> --- a/arch/powerpc/include/uapi/asm/kvm.h
>>>> +++ b/arch/powerpc/include/uapi/asm/kvm.h
>>>> @@ -291,6 +291,13 @@ struct kvm_create_spapr_tce {
>>>>  	__u32 window_size;
>>>>  };
>>>>  
>>>> +/* for KVM_CAP_SPAPR_TCE_IOMMU */
>>>> +struct kvm_create_spapr_tce_iommu {
>>>> +	__u64 liobn;
>>>> +	__u32 iommu_id;
>>>> +	__u32 flags;
>>>> +};
>>>> +
>>>>  /* for KVM_ALLOCATE_RMA */
>>>>  struct kvm_allocate_rma {
>>>>  	__u64 rma_size;
>>>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
>>>> index 643ac1e..98cf949 100644
>>>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>>>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>>>> @@ -27,6 +27,9 @@
>>>>  #include <linux/hugetlb.h>
>>>>  #include <linux/list.h>
>>>>  #include <linux/anon_inodes.h>
>>>> +#include <linux/pci.h>
>>>> +#include <linux/iommu.h>
>>>> +#include <linux/module.h>
>>>>  
>>>>  #include <asm/tlbflush.h>
>>>>  #include <asm/kvm_ppc.h>
>>>> @@ -38,10 +41,19 @@
>>>>  #include <asm/kvm_host.h>
>>>>  #include <asm/udbg.h>
>>>>  #include <asm/iommu.h>
>>>> +#include <asm/tce.h>
>>>> +
>>>> +#define DRIVER_VERSION	"0.1"
>>>> +#define DRIVER_AUTHOR	"Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>"
>>>> +#define DRIVER_DESC	"POWERPC KVM driver"
>>>
>>> Really?
>>
>>
>> What is wrong here?
> 
> Well, it seems entirely unrelated to the rest of the changes, 


The patch adds a module parameter so I had to add those DRIVER_xxx.


> and not obviously accurate.

Let's fix it then. How? Paul signed it...



-- 
Alexey

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/6] KVM: PPC: Add support for IOMMU in-kernel handling
@ 2013-05-07  6:27           ` Alexey Kardashevskiy
  0 siblings, 0 replies; 44+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-07  6:27 UTC (permalink / raw)
  To: David Gibson
  Cc: kvm, Joerg Roedel, Alexander Graf, kvm-ppc, Alex Williamson,
	Paul Mackerras, linuxppc-dev

On 05/07/2013 04:02 PM, David Gibson wrote:
> On Tue, May 07, 2013 at 03:51:31PM +1000, Alexey Kardashevskiy wrote:
>> On 05/07/2013 03:29 PM, David Gibson wrote:
>>> On Mon, May 06, 2013 at 05:25:56PM +1000, Alexey Kardashevskiy wrote:
>>>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
>>>> and H_STUFF_TCE requests without passing them to QEMU, which should
>>>> save time on switching to QEMU and back.
>>>>
>>>> Both real and virtual modes are supported - whenever the kernel
>>>> fails to handle TCE request, it passes it to the virtual mode.
>>>> If it the virtual mode handlers fail, then the request is passed
>>>> to the user mode, for example, to QEMU.
>>>>
>>>> This adds a new KVM_CAP_SPAPR_TCE_IOMMU ioctl to asssociate
>>>> a virtual PCI bus ID (LIOBN) with an IOMMU group, which enables
>>>> in-kernel handling of IOMMU map/unmap.
>>>>
>>>> This adds a special case for huge pages (16MB).  The reference
>>>> counting cannot be easily done for such pages in real mode (when
>>>> MMU is off) so we added a list of huge pages.  It is populated in
>>>> virtual mode and get_page is called just once per a huge page.
>>>> Real mode handlers check if the requested page is huge and in the list,
>>>> then no reference counting is done, otherwise an exit to virtual mode
>>>> happens.  The list is released at KVM exit.  At the moment the fastest
>>>> card available for tests uses up to 9 huge pages so walking through this
>>>> list is not very expensive.  However this can change and we may want
>>>> to optimize this.
>>>>
>>>> This also adds the virt_only parameter to the KVM module
>>>> for debug and performance check purposes.
>>>>
>>>> Tests show that this patch increases transmission speed from 220MB/s
>>>> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
>>>>
>>>> Cc: David Gibson <david@gibson.dropbear.id.au>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> Signed-off-by: Paul Mackerras <paulus@samba.org>
>>>> ---
>>>>  Documentation/virtual/kvm/api.txt   |   28 ++++
>>>>  arch/powerpc/include/asm/kvm_host.h |    2 +
>>>>  arch/powerpc/include/asm/kvm_ppc.h  |    2 +
>>>>  arch/powerpc/include/uapi/asm/kvm.h |    7 +
>>>>  arch/powerpc/kvm/book3s_64_vio.c    |  242 ++++++++++++++++++++++++++++++++++-
>>>>  arch/powerpc/kvm/book3s_64_vio_hv.c |  192 +++++++++++++++++++++++++++
>>>>  arch/powerpc/kvm/powerpc.c          |   12 ++
>>>>  include/uapi/linux/kvm.h            |    2 +
>>>>  8 files changed, 485 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
>>>> index f621cd6..2039767 100644
>>>> --- a/Documentation/virtual/kvm/api.txt
>>>> +++ b/Documentation/virtual/kvm/api.txt
>>>> @@ -2127,6 +2127,34 @@ written, then `n_invalid' invalid entries, invalidating any previously
>>>>  valid entries found.
>>>>  
>>>>  
>>>> +4.79 KVM_CREATE_SPAPR_TCE_IOMMU
>>>> +
>>>> +Capability: KVM_CAP_SPAPR_TCE_IOMMU
>>>> +Architectures: powerpc
>>>> +Type: vm ioctl
>>>> +Parameters: struct kvm_create_spapr_tce_iommu (in)
>>>> +Returns: 0 on success, -1 on error
>>>> +
>>>> +This creates a link between IOMMU group and a hardware TCE (translation
>>>> +control entry) table. This link lets the host kernel know what IOMMU
>>>> +group (i.e. TCE table) to use for the LIOBN number passed with
>>>> +H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls.
>>>> +
>>>> +/* for KVM_CAP_SPAPR_TCE_IOMMU */
>>>> +struct kvm_create_spapr_tce_iommu {
>>>> +	__u64 liobn;
>>>> +	__u32 iommu_id;
>>>
>>> Wouldn't it be more in keeping 
>>
>>
>> pardon?
> 
> Sorry, I was going to suggest a change, but then realised it wasn't
> actually any better than what you have now.
> 
>>>> +	__u32 flags;
>>>> +};
>>>> +
>>>> +No flag is supported at the moment.
>>>> +
>>>> +When the guest issues TCE call on a liobn for which a TCE table has been
>>>> +registered, the kernel will handle it in real mode, updating the hardware
>>>> +TCE table. TCE table calls for other liobns will cause a vm exit and must
>>>> +be handled by userspace.
>>>> +
>>>> +
>>>>  5. The kvm_run structure
>>>>  ------------------------
>>>>  
>>>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
>>>> index 36ceb0d..2b70cbc 100644
>>>> --- a/arch/powerpc/include/asm/kvm_host.h
>>>> +++ b/arch/powerpc/include/asm/kvm_host.h
>>>> @@ -178,6 +178,8 @@ struct kvmppc_spapr_tce_table {
>>>>  	struct kvm *kvm;
>>>>  	u64 liobn;
>>>>  	u32 window_size;
>>>> +	bool virtmode_only;
>>>
>>> I see this is now initialized from the global parameter, but I think
>>> it would be better to just check the global (debug) parameter
>>> directly, rather than duplicating it here.
>>
>>
>> The global parameter is in kvm.ko and the struct above is in the real mode
>> part which cannot go to the module.
> 
> Ah, ok.  I'm half inclined to just drop the virtmode_only thing
> entirely.
> 
>>>> +	struct iommu_group *grp;    /* used for IOMMU groups */
>>>>  	struct page *pages[0];
>>>>  };
>>>>  
>>>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
>>>> index d501246..bdfa140 100644
>>>> --- a/arch/powerpc/include/asm/kvm_ppc.h
>>>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
>>>> @@ -139,6 +139,8 @@ extern void kvmppc_xics_free(struct kvm *kvm);
>>>>  
>>>>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>>>  				struct kvm_create_spapr_tce *args);
>>>> +extern long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
>>>> +				struct kvm_create_spapr_tce_iommu *args);
>>>>  extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
>>>>  		struct kvm_vcpu *vcpu, unsigned long liobn);
>>>>  extern long kvmppc_emulated_h_put_tce(struct kvmppc_spapr_tce_table *stt,
>>>> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
>>>> index 681b314..b67d44b 100644
>>>> --- a/arch/powerpc/include/uapi/asm/kvm.h
>>>> +++ b/arch/powerpc/include/uapi/asm/kvm.h
>>>> @@ -291,6 +291,13 @@ struct kvm_create_spapr_tce {
>>>>  	__u32 window_size;
>>>>  };
>>>>  
>>>> +/* for KVM_CAP_SPAPR_TCE_IOMMU */
>>>> +struct kvm_create_spapr_tce_iommu {
>>>> +	__u64 liobn;
>>>> +	__u32 iommu_id;
>>>> +	__u32 flags;
>>>> +};
>>>> +
>>>>  /* for KVM_ALLOCATE_RMA */
>>>>  struct kvm_allocate_rma {
>>>>  	__u64 rma_size;
>>>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
>>>> index 643ac1e..98cf949 100644
>>>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>>>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>>>> @@ -27,6 +27,9 @@
>>>>  #include <linux/hugetlb.h>
>>>>  #include <linux/list.h>
>>>>  #include <linux/anon_inodes.h>
>>>> +#include <linux/pci.h>
>>>> +#include <linux/iommu.h>
>>>> +#include <linux/module.h>
>>>>  
>>>>  #include <asm/tlbflush.h>
>>>>  #include <asm/kvm_ppc.h>
>>>> @@ -38,10 +41,19 @@
>>>>  #include <asm/kvm_host.h>
>>>>  #include <asm/udbg.h>
>>>>  #include <asm/iommu.h>
>>>> +#include <asm/tce.h>
>>>> +
>>>> +#define DRIVER_VERSION	"0.1"
>>>> +#define DRIVER_AUTHOR	"Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>"
>>>> +#define DRIVER_DESC	"POWERPC KVM driver"
>>>
>>> Really?
>>
>>
>> What is wrong here?
> 
> Well, it seems entirely unrelated to the rest of the changes, 


The patch adds a module parameter so I had to add those DRIVER_xxx.


> and not obviously accurate.

Let's fix it then. How? Paul signed it...



-- 
Alexey

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/6] KVM: PPC: Add support for IOMMU in-kernel handling
@ 2013-05-07  6:27           ` Alexey Kardashevskiy
  0 siblings, 0 replies; 44+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-07  6:27 UTC (permalink / raw)
  To: David Gibson
  Cc: linuxppc-dev, kvm, Alexander Graf, kvm-ppc, Alex Williamson,
	Paul Mackerras, Joerg Roedel

On 05/07/2013 04:02 PM, David Gibson wrote:
> On Tue, May 07, 2013 at 03:51:31PM +1000, Alexey Kardashevskiy wrote:
>> On 05/07/2013 03:29 PM, David Gibson wrote:
>>> On Mon, May 06, 2013 at 05:25:56PM +1000, Alexey Kardashevskiy wrote:
>>>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
>>>> and H_STUFF_TCE requests without passing them to QEMU, which should
>>>> save time on switching to QEMU and back.
>>>>
>>>> Both real and virtual modes are supported - whenever the kernel
>>>> fails to handle TCE request, it passes it to the virtual mode.
>>>> If it the virtual mode handlers fail, then the request is passed
>>>> to the user mode, for example, to QEMU.
>>>>
>>>> This adds a new KVM_CAP_SPAPR_TCE_IOMMU ioctl to asssociate
>>>> a virtual PCI bus ID (LIOBN) with an IOMMU group, which enables
>>>> in-kernel handling of IOMMU map/unmap.
>>>>
>>>> This adds a special case for huge pages (16MB).  The reference
>>>> counting cannot be easily done for such pages in real mode (when
>>>> MMU is off) so we added a list of huge pages.  It is populated in
>>>> virtual mode and get_page is called just once per a huge page.
>>>> Real mode handlers check if the requested page is huge and in the list,
>>>> then no reference counting is done, otherwise an exit to virtual mode
>>>> happens.  The list is released at KVM exit.  At the moment the fastest
>>>> card available for tests uses up to 9 huge pages so walking through this
>>>> list is not very expensive.  However this can change and we may want
>>>> to optimize this.
>>>>
>>>> This also adds the virt_only parameter to the KVM module
>>>> for debug and performance check purposes.
>>>>
>>>> Tests show that this patch increases transmission speed from 220MB/s
>>>> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
>>>>
>>>> Cc: David Gibson <david@gibson.dropbear.id.au>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> Signed-off-by: Paul Mackerras <paulus@samba.org>
>>>> ---
>>>>  Documentation/virtual/kvm/api.txt   |   28 ++++
>>>>  arch/powerpc/include/asm/kvm_host.h |    2 +
>>>>  arch/powerpc/include/asm/kvm_ppc.h  |    2 +
>>>>  arch/powerpc/include/uapi/asm/kvm.h |    7 +
>>>>  arch/powerpc/kvm/book3s_64_vio.c    |  242 ++++++++++++++++++++++++++++++++++-
>>>>  arch/powerpc/kvm/book3s_64_vio_hv.c |  192 +++++++++++++++++++++++++++
>>>>  arch/powerpc/kvm/powerpc.c          |   12 ++
>>>>  include/uapi/linux/kvm.h            |    2 +
>>>>  8 files changed, 485 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
>>>> index f621cd6..2039767 100644
>>>> --- a/Documentation/virtual/kvm/api.txt
>>>> +++ b/Documentation/virtual/kvm/api.txt
>>>> @@ -2127,6 +2127,34 @@ written, then `n_invalid' invalid entries, invalidating any previously
>>>>  valid entries found.
>>>>  
>>>>  
>>>> +4.79 KVM_CREATE_SPAPR_TCE_IOMMU
>>>> +
>>>> +Capability: KVM_CAP_SPAPR_TCE_IOMMU
>>>> +Architectures: powerpc
>>>> +Type: vm ioctl
>>>> +Parameters: struct kvm_create_spapr_tce_iommu (in)
>>>> +Returns: 0 on success, -1 on error
>>>> +
>>>> +This creates a link between IOMMU group and a hardware TCE (translation
>>>> +control entry) table. This link lets the host kernel know what IOMMU
>>>> +group (i.e. TCE table) to use for the LIOBN number passed with
>>>> +H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls.
>>>> +
>>>> +/* for KVM_CAP_SPAPR_TCE_IOMMU */
>>>> +struct kvm_create_spapr_tce_iommu {
>>>> +	__u64 liobn;
>>>> +	__u32 iommu_id;
>>>
>>> Wouldn't it be more in keeping 
>>
>>
>> pardon?
> 
> Sorry, I was going to suggest a change, but then realised it wasn't
> actually any better than what you have now.
> 
>>>> +	__u32 flags;
>>>> +};
>>>> +
>>>> +No flag is supported at the moment.
>>>> +
>>>> +When the guest issues TCE call on a liobn for which a TCE table has been
>>>> +registered, the kernel will handle it in real mode, updating the hardware
>>>> +TCE table. TCE table calls for other liobns will cause a vm exit and must
>>>> +be handled by userspace.
>>>> +
>>>> +
>>>>  5. The kvm_run structure
>>>>  ------------------------
>>>>  
>>>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
>>>> index 36ceb0d..2b70cbc 100644
>>>> --- a/arch/powerpc/include/asm/kvm_host.h
>>>> +++ b/arch/powerpc/include/asm/kvm_host.h
>>>> @@ -178,6 +178,8 @@ struct kvmppc_spapr_tce_table {
>>>>  	struct kvm *kvm;
>>>>  	u64 liobn;
>>>>  	u32 window_size;
>>>> +	bool virtmode_only;
>>>
>>> I see this is now initialized from the global parameter, but I think
>>> it would be better to just check the global (debug) parameter
>>> directly, rather than duplicating it here.
>>
>>
>> The global parameter is in kvm.ko and the struct above is in the real mode
>> part which cannot go to the module.
> 
> Ah, ok.  I'm half inclined to just drop the virtmode_only thing
> entirely.
> 
>>>> +	struct iommu_group *grp;    /* used for IOMMU groups */
>>>>  	struct page *pages[0];
>>>>  };
>>>>  
>>>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
>>>> index d501246..bdfa140 100644
>>>> --- a/arch/powerpc/include/asm/kvm_ppc.h
>>>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
>>>> @@ -139,6 +139,8 @@ extern void kvmppc_xics_free(struct kvm *kvm);
>>>>  
>>>>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>>>  				struct kvm_create_spapr_tce *args);
>>>> +extern long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
>>>> +				struct kvm_create_spapr_tce_iommu *args);
>>>>  extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
>>>>  		struct kvm_vcpu *vcpu, unsigned long liobn);
>>>>  extern long kvmppc_emulated_h_put_tce(struct kvmppc_spapr_tce_table *stt,
>>>> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
>>>> index 681b314..b67d44b 100644
>>>> --- a/arch/powerpc/include/uapi/asm/kvm.h
>>>> +++ b/arch/powerpc/include/uapi/asm/kvm.h
>>>> @@ -291,6 +291,13 @@ struct kvm_create_spapr_tce {
>>>>  	__u32 window_size;
>>>>  };
>>>>  
>>>> +/* for KVM_CAP_SPAPR_TCE_IOMMU */
>>>> +struct kvm_create_spapr_tce_iommu {
>>>> +	__u64 liobn;
>>>> +	__u32 iommu_id;
>>>> +	__u32 flags;
>>>> +};
>>>> +
>>>>  /* for KVM_ALLOCATE_RMA */
>>>>  struct kvm_allocate_rma {
>>>>  	__u64 rma_size;
>>>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
>>>> index 643ac1e..98cf949 100644
>>>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>>>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>>>> @@ -27,6 +27,9 @@
>>>>  #include <linux/hugetlb.h>
>>>>  #include <linux/list.h>
>>>>  #include <linux/anon_inodes.h>
>>>> +#include <linux/pci.h>
>>>> +#include <linux/iommu.h>
>>>> +#include <linux/module.h>
>>>>  
>>>>  #include <asm/tlbflush.h>
>>>>  #include <asm/kvm_ppc.h>
>>>> @@ -38,10 +41,19 @@
>>>>  #include <asm/kvm_host.h>
>>>>  #include <asm/udbg.h>
>>>>  #include <asm/iommu.h>
>>>> +#include <asm/tce.h>
>>>> +
>>>> +#define DRIVER_VERSION	"0.1"
>>>> +#define DRIVER_AUTHOR	"Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>"
>>>> +#define DRIVER_DESC	"POWERPC KVM driver"
>>>
>>> Really?
>>
>>
>> What is wrong here?
> 
> Well, it seems entirely unrelated to the rest of the changes, 


The patch adds a module parameter so I had to add those DRIVER_xxx.


> and not obviously accurate.

Let's fix it then. How? Paul signed it...



-- 
Alexey

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/6] KVM: PPC: Add support for IOMMU in-kernel handling
  2013-05-07  6:27           ` Alexey Kardashevskiy
@ 2013-05-07  6:54             ` David Gibson
  -1 siblings, 0 replies; 44+ messages in thread
From: David Gibson @ 2013-05-07  6:54 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Joerg Roedel, Alexander Graf, kvm-ppc, Alex Williamson,
	Paul Mackerras, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 1021 bytes --]

On Tue, May 07, 2013 at 04:27:49PM +1000, Alexey Kardashevskiy wrote:
> On 05/07/2013 04:02 PM, David Gibson wrote:
> > On Tue, May 07, 2013 at 03:51:31PM +1000, Alexey Kardashevskiy wrote:
> >> On 05/07/2013 03:29 PM, David Gibson wrote:
> >>> On Mon, May 06, 2013 at 05:25:56PM +1000, Alexey Kardashevskiy wrote:
[snip]
> >>>> +#define DRIVER_VERSION	"0.1"
> >>>> +#define DRIVER_AUTHOR	"Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>"
> >>>> +#define DRIVER_DESC	"POWERPC KVM driver"
> >>>
> >>> Really?
> >>
> >>
> >> What is wrong here?
> > 
> > Well, it seems entirely unrelated to the rest of the changes, 
> 
> 
> The patch adds a module parameter so I had to add those DRIVER_xxx.

Ah, ok.

> > and not obviously accurate.
> 
> Let's fix it then. How? Paul signed it...

Fair enough then.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/6] KVM: PPC: Add support for IOMMU in-kernel handling
@ 2013-05-07  6:54             ` David Gibson
  0 siblings, 0 replies; 44+ messages in thread
From: David Gibson @ 2013-05-07  6:54 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Joerg Roedel, Alexander Graf, kvm-ppc, Alex Williamson,
	Paul Mackerras, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 1021 bytes --]

On Tue, May 07, 2013 at 04:27:49PM +1000, Alexey Kardashevskiy wrote:
> On 05/07/2013 04:02 PM, David Gibson wrote:
> > On Tue, May 07, 2013 at 03:51:31PM +1000, Alexey Kardashevskiy wrote:
> >> On 05/07/2013 03:29 PM, David Gibson wrote:
> >>> On Mon, May 06, 2013 at 05:25:56PM +1000, Alexey Kardashevskiy wrote:
[snip]
> >>>> +#define DRIVER_VERSION	"0.1"
> >>>> +#define DRIVER_AUTHOR	"Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>"
> >>>> +#define DRIVER_DESC	"POWERPC KVM driver"
> >>>
> >>> Really?
> >>
> >>
> >> What is wrong here?
> > 
> > Well, it seems entirely unrelated to the rest of the changes, 
> 
> 
> The patch adds a module parameter so I had to add those DRIVER_xxx.

Ah, ok.

> > and not obviously accurate.
> 
> Let's fix it then. How? Paul signed it...

Fair enough then.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/6] KVM: PPC: Add support for multiple-TCE hcalls
  2013-05-06  7:25   ` Alexey Kardashevskiy
  (?)
@ 2013-05-10  6:51     ` David Gibson
  -1 siblings, 0 replies; 44+ messages in thread
From: David Gibson @ 2013-05-10  6:51 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, kvm, Alexander Graf, kvm-ppc, Alex Williamson,
	Paul Mackerras, Joerg Roedel

[-- Attachment #1: Type: text/plain, Size: 19022 bytes --]

On Mon, May 06, 2013 at 05:25:53PM +1000, Alexey Kardashevskiy wrote:
> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
> H_STUFF_TCE hypercalls for QEMU emulated devices such as virtio
> devices or emulated PCI.  These calls allow adding multiple entries
> (up to 512) into the TCE table in one call which saves time on
> transition to/from real mode.
> 
> This adds a guest physical to host real address converter
> and calls the existing H_PUT_TCE handler. The converting function
> is going to be fully utilized by upcoming VFIO supporting patches.
> 
> This also implements the KVM_CAP_PPC_MULTITCE capability,
> so in order to support the functionality of this patch, QEMU
> needs to query for this capability and set the "hcall-multi-tce"
> hypertas property only if the capability is present, otherwise
> there will be serious performance degradation.


Hrm.  Clearly I didn't read this carefully enough before.  There are
some problems here.

[snip]
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> index 72ffc89..643ac1e 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -14,6 +14,7 @@
>   *
>   * Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
>   * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com>
> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com>
>   */
>  
>  #include <linux/types.h>
> @@ -36,9 +37,14 @@
>  #include <asm/ppc-opcode.h>
>  #include <asm/kvm_host.h>
>  #include <asm/udbg.h>
> +#include <asm/iommu.h>
>  
>  #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
> +#define ERROR_ADDR      (~(unsigned long)0x0)
>  
> +/*
> + * TCE tables handlers.
> + */
>  static long kvmppc_stt_npages(unsigned long window_size)
>  {
>  	return ALIGN((window_size >> SPAPR_TCE_SHIFT)
> @@ -148,3 +154,111 @@ fail:
>  	}
>  	return ret;
>  }
> +
> +/*
> + * Virtual mode handling of IOMMU map/unmap.
> + */
> +/* Converts guest physical address into host virtual */
> +static unsigned long get_virt_address(struct kvm_vcpu *vcpu,
> +		unsigned long gpa)

This should probably return a void * rather than an unsigned long.
Well, actually a void __user *.

> +{
> +	unsigned long hva, gfn = gpa >> PAGE_SHIFT;
> +	struct kvm_memory_slot *memslot;
> +
> +	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
> +	if (!memslot)
> +		return ERROR_ADDR;
> +
> +	/*
> +	 * Convert gfn to hva preserving flags and an offset
> +	 * within a system page
> +	 */
> +	hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa & ~PAGE_MASK);
> +	return hva;
> +}
> +
> +long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce)
> +{
> +	struct kvmppc_spapr_tce_table *tt;
> +
> +	tt = kvmppc_find_tce_table(vcpu, liobn);
> +	/* Didn't find the liobn, put it to userspace */
> +	if (!tt)
> +		return H_TOO_HARD;
> +
> +	/* Emulated IO */
> +	return kvmppc_emulated_h_put_tce(tt, ioba, tce);
> +}
> +
> +long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_list, unsigned long npages)
> +{
> +	struct kvmppc_spapr_tce_table *tt;
> +	long i;
> +	unsigned long tces;
> +
> +	/* The whole table addressed by tce_list resides in 4K page */
> +	if (npages > 512)
> +		return H_PARAMETER;

So, that doesn't actually verify what the comment says it does - only
that the list is < 4K in total.  You need to check the alignment of
tce_list as well.

> +
> +	tt = kvmppc_find_tce_table(vcpu, liobn);
> +	/* Didn't find the liobn, put it to userspace */
> +	if (!tt)
> +		return H_TOO_HARD;
> +
> +	tces = get_virt_address(vcpu, tce_list);
> +	if (tces == ERROR_ADDR)
> +		return H_TOO_HARD;
> +
> +	/* Emulated IO */

This comment doesn't seem to have any bearing on the test which
follows it.

> +	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
> +		return H_PARAMETER;
> +
> +	for (i = 0; i < npages; ++i) {
> +		unsigned long tce;
> +		unsigned long ptce = tces + i * sizeof(unsigned long);
> +
> +		if (get_user(tce, (unsigned long __user *)ptce))
> +			break;
> +
> +		if (kvmppc_emulated_h_put_tce(tt,
> +				ioba + (i << IOMMU_PAGE_SHIFT),	tce))
> +			break;
> +	}
> +	if (i == npages)
> +		return H_SUCCESS;
> +
> +	/* Failed, do cleanup */
> +	do {
> +		--i;
> +		kvmppc_emulated_h_put_tce(tt, ioba + (i << IOMMU_PAGE_SHIFT),
> +				0);
> +	} while (i);

Hrm, so, actually PAPR specifies that this hcall is supposed to first
copy the given tces to hypervisor memory, then translate (and
validate) them all, and only then touch the actual TCE table.  Rather
more complicated to do, but I guess we should - that would get rid of
the need for this partial cleanup in the failure case.

> +
> +	return H_PARAMETER;
> +}
> +
> +long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_value, unsigned long npages)
> +{
> +	struct kvmppc_spapr_tce_table *tt;
> +	long i;
> +
> +	tt = kvmppc_find_tce_table(vcpu, liobn);
> +	/* Didn't find the liobn, put it to userspace */
> +	if (!tt)
> +		return H_TOO_HARD;
> +
> +	/* Emulated IO */
> +	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
> +		return H_PARAMETER;
> +
> +	for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
> +		kvmppc_emulated_h_put_tce(tt, ioba, tce_value);
> +
> +	return H_SUCCESS;
> +}
> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> index 30c2f3b..55fdf7a 100644
> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -14,6 +14,7 @@
>   *
>   * Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
>   * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com>
> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com>
>   */
>  
>  #include <linux/types.h>
> @@ -35,42 +36,214 @@
>  #include <asm/ppc-opcode.h>
>  #include <asm/kvm_host.h>
>  #include <asm/udbg.h>
> +#include <asm/iommu.h>
> +#include <asm/tce.h>
>  
>  #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
> +#define ERROR_ADDR      (~(unsigned long)0x0)
>  
> -/* WARNING: This will be called in real-mode on HV KVM and virtual
> - *          mode on PR KVM
> +/*
> + * Finds a TCE table descriptor by LIOBN.
>   */
> +struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm_vcpu *vcpu,
> +		unsigned long liobn)
> +{
> +	struct kvmppc_spapr_tce_table *tt;
> +
> +	list_for_each_entry(tt, &vcpu->kvm->arch.spapr_tce_tables, list) {
> +		if (tt->liobn == liobn)
> +			return tt;
> +	}
> +
> +	return NULL;
> +}
> +EXPORT_SYMBOL_GPL(kvmppc_find_tce_table);
> +
> +/*
> + * kvmppc_emulated_h_put_tce() handles TCE requests for devices emulated
> + * by QEMU. It puts guest TCE values into the table and expects
> + * the QEMU to convert them later in the QEMU device implementation.
> + * Works in both real and virtual modes.
> + */
> +long kvmppc_emulated_h_put_tce(struct kvmppc_spapr_tce_table *tt,
> +		unsigned long ioba, unsigned long tce)
> +{
> +	unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
> +	struct page *page;
> +	u64 *tbl;
> +
> +	/* udbg_printf("H_PUT_TCE: liobn 0x%lx => tt=%p  window_size=0x%x\n", */
> +	/*	    liobn, tt, tt->window_size); */
> +	if (ioba >= tt->window_size) {
> +		/* pr_err("%s failed on ioba=%lx\n", __func__, ioba); */
> +		return H_PARAMETER;
> +	}
> +	/*
> +	 * Note on the use of page_address() in real mode,
> +	 *
> +	 * It is safe to use page_address() in real mode on ppc64 because
> +	 * page_address() is always defined as lowmem_page_address()
> +	 * which returns __va(PFN_PHYS(page_to_pfn(page))) which is arithmetial
> +	 * operation and does not access page struct.
> +	 *
> +	 * Theoretically page_address() could be defined different
> +	 * but either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL
> +	 * should be enabled.
> +	 * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64,
> +	 * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only
> +	 * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP
> +	 * is not expected to be enabled on ppc32, page_address()
> +	 * is safe for ppc32 as well.
> +	 */
> +#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
> +#error TODO: fix to avoid page_address() here
> +#endif
> +	page = tt->pages[idx / TCES_PER_PAGE];
> +	tbl = (u64 *)page_address(page);
> +
> +	/*
> +	 * Validate TCE address.
> +	 * At the moment only flags are validated
> +	 * as other check will significantly slow down
> +	 * or can make it even impossible to handle TCE requests
> +	 * in real mode.
> +	 */
> +	if (tce & ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ))
> +		return H_PARAMETER;
> +
> +	/* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
> +	tbl[idx % TCES_PER_PAGE] = tce;
> +
> +	return H_SUCCESS;
> +}
> +EXPORT_SYMBOL_GPL(kvmppc_emulated_h_put_tce);
> +
> +#ifdef CONFIG_KVM_BOOK3S_64_HV
> +/*
> + * Converts guest physical address into host real address.
> + * Also returns pte and page size if the page is present in page table.
> + */
> +static unsigned long get_real_address(struct kvm_vcpu *vcpu,
> +		unsigned long gpa, bool writing,
> +		pte_t *ptep, unsigned long *pg_sizep)

The only caller doesn't use the ptep and pg_sizep pointers, so there's
no point implementing them.

> +{
> +	struct kvm_memory_slot *memslot;
> +	pte_t pte;
> +	unsigned long hva, pg_size = 0, hwaddr, offset;
> +	unsigned long gfn = gpa >> PAGE_SHIFT;
> +
> +	/* Find a KVM memslot */
> +	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
> +	if (!memslot)
> +		return ERROR_ADDR;
> +
> +	/* Convert guest physical address to host virtual */
> +	hva = __gfn_to_hva_memslot(memslot, gfn);
> +
> +	/* Find a PTE and determine the size */
> +	pte = lookup_linux_pte(vcpu->arch.pgdir, hva,
> +			writing, &pg_size);
> +	if (!pte_present(pte))
> +		return ERROR_ADDR;
> +
> +	/* Calculate host phys address keeping flags and offset in the page */
> +	offset = gpa & (pg_size - 1);
> +
> +	/* pte_pfn(pte) should return an address aligned to pg_size */
> +	hwaddr = (pte_pfn(pte) << PAGE_SHIFT) + offset;
> +
> +	/* Copy outer values if required */
> +	if (pg_sizep)
> +		*pg_sizep = pg_size;
> +	if (ptep)
> +		*ptep = pte;
> +
> +	return hwaddr;
> +}
> +
>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		      unsigned long ioba, unsigned long tce)
>  {
> -	struct kvm *kvm = vcpu->kvm;
> -	struct kvmppc_spapr_tce_table *stt;
> -
> -	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
> -	/* 	    liobn, ioba, tce); */
> -
> -	list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
> -		if (stt->liobn == liobn) {
> -			unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
> -			struct page *page;
> -			u64 *tbl;
> -
> -			/* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p  window_size=0x%x\n", */
> -			/* 	    liobn, stt, stt->window_size); */
> -			if (ioba >= stt->window_size)
> -				return H_PARAMETER;
> -
> -			page = stt->pages[idx / TCES_PER_PAGE];
> -			tbl = (u64 *)page_address(page);
> -
> -			/* FIXME: Need to validate the TCE itself */
> -			/* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
> -			tbl[idx % TCES_PER_PAGE] = tce;
> -			return H_SUCCESS;
> -		}
> +	struct kvmppc_spapr_tce_table *tt;
> +
> +	tt = kvmppc_find_tce_table(vcpu, liobn);
> +	/* Didn't find the liobn, put it to virtual space */
> +	if (!tt)
> +		return H_TOO_HARD;
> +
> +	/* Emulated IO */
> +	return kvmppc_emulated_h_put_tce(tt, ioba, tce);
> +}
> +
> +long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_list,	unsigned long npages)
> +{
> +	struct kvmppc_spapr_tce_table *tt;
> +	long i;
> +	unsigned long tces;
> +
> +	/* The whole table addressed by tce_list resides in 4K page */
> +	if (npages > 512)
> +		return H_PARAMETER;
> +
> +	tt = kvmppc_find_tce_table(vcpu, liobn);
> +	/* Didn't find the liobn, put it to virtual space */
> +	if (!tt)
> +		return H_TOO_HARD;
> +
> +	tces = get_real_address(vcpu, tce_list, false, NULL, NULL);
> +	if (tces == ERROR_ADDR)
> +		return H_TOO_HARD;
> +
> +	/* Emulated IO */
> +	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
> +		return H_PARAMETER;
> +
> +	for (i = 0; i < npages; ++i) {
> +		unsigned long tce;
> +		unsigned long ptce = tces + i * sizeof(unsigned long);
> +
> +		if (get_user(tce, (unsigned long __user *)ptce))
> +			break;

Uh.  get_user() should not be used from real mode.  You'll need to
dereference the pointer directly here.

> +
> +		if (kvmppc_emulated_h_put_tce(tt,
> +				ioba + (i << IOMMU_PAGE_SHIFT), tce))
> +			break;
>  	}
> +	if (i == npages)
> +		return H_SUCCESS;
> +
> +	/* Failed, do cleanup */
> +	do {
> +		--i;
> +		kvmppc_emulated_h_put_tce(tt, ioba + (i << IOMMU_PAGE_SHIFT),
> +				0);
> +	} while (i);
> +
> +	return H_PARAMETER;
> +}
>  
> -	/* Didn't find the liobn, punt it to userspace */
> -	return H_TOO_HARD;
> +long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_value, unsigned long npages)
> +{
> +	struct kvmppc_spapr_tce_table *tt;
> +	long i;
> +
> +	tt = kvmppc_find_tce_table(vcpu, liobn);
> +	/* Didn't find the liobn, put it to virtual space */
> +	if (!tt)
> +		return H_TOO_HARD;
> +
> +	/* Emulated IO */
> +	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
> +		return H_PARAMETER;
> +
> +	for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
> +		kvmppc_emulated_h_put_tce(tt, ioba, tce_value);
> +
> +	return H_SUCCESS;
>  }
> +
> +#endif /* CONFIG_KVM_BOOK3S_64_HV */
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index e5afdcb..6eb6f44 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -565,6 +565,29 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
>  			ret = kvmppc_xics_hcall(vcpu, req);
>  			break;
>  		} /* fallthrough */


As paulus already pointed out, inserting this code here breaks the no
kernel xics case badly because of the fallthrough above.

> +	case H_PUT_TCE:
> +		ret = kvmppc_virtmode_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
> +						kvmppc_get_gpr(vcpu, 5),
> +						kvmppc_get_gpr(vcpu, 6));
> +		if (ret == H_TOO_HARD)
> +			return RESUME_HOST;
> +		break;
> +	case H_PUT_TCE_INDIRECT:
> +		ret = kvmppc_virtmode_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4),
> +						kvmppc_get_gpr(vcpu, 5),
> +						kvmppc_get_gpr(vcpu, 6),
> +						kvmppc_get_gpr(vcpu, 7));
> +		if (ret == H_TOO_HARD)
> +			return RESUME_HOST;
> +		break;
> +	case H_STUFF_TCE:
> +		ret = kvmppc_virtmode_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
> +						kvmppc_get_gpr(vcpu, 5),
> +						kvmppc_get_gpr(vcpu, 6),
> +						kvmppc_get_gpr(vcpu, 7));
> +		if (ret == H_TOO_HARD)
> +			return RESUME_HOST;
> +		break;
>  	default:
>  		return RESUME_HOST;
>  	}
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index d3e26be..4d73406 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -1472,6 +1472,12 @@ hcall_real_table:
>  	.long	0		/* 0x11c */
>  	.long	0		/* 0x120 */
>  	.long	.kvmppc_h_bulk_remove - hcall_real_table
> +	.long	0		/* 0x128 */
> +	.long	0		/* 0x12c */
> +	.long	0		/* 0x130 */
> +	.long	0		/* 0x134 */
> +	.long	.kvmppc_h_stuff_tce - hcall_real_table
> +	.long	.kvmppc_h_put_tce_indirect - hcall_real_table
>  hcall_real_table_end:
>  
>  ignore_hdec:
> diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c
> index 8352cac..603e4f2 100644
> --- a/arch/powerpc/kvm/book3s_pr_papr.c
> +++ b/arch/powerpc/kvm/book3s_pr_papr.c
> @@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu)
>  	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
>  	long rc;
>  
> -	rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
> +	rc = kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce);
> +	if (rc == H_TOO_HARD)
> +		return EMULATE_FAIL;
> +	kvmppc_set_gpr(vcpu, 3, rc);
> +	return EMULATE_DONE;
> +}
> +
> +static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
> +	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
> +	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
> +	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
> +	long rc;
> +
> +	rc = kvmppc_virtmode_h_put_tce_indirect(vcpu, liobn, ioba,
> +			tce, npages);
> +	if (rc == H_TOO_HARD)
> +		return EMULATE_FAIL;
> +	kvmppc_set_gpr(vcpu, 3, rc);
> +	return EMULATE_DONE;
> +}
> +
> +static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
> +	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
> +	unsigned long tce_value = kvmppc_get_gpr(vcpu, 6);
> +	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
> +	long rc;
> +
> +	rc = kvmppc_virtmode_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages);
>  	if (rc == H_TOO_HARD)
>  		return EMULATE_FAIL;
>  	kvmppc_set_gpr(vcpu, 3, rc);
> @@ -249,6 +280,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd)
>  		return kvmppc_h_pr_bulk_remove(vcpu);
>  	case H_PUT_TCE:
>  		return kvmppc_h_pr_put_tce(vcpu);
> +	case H_PUT_TCE_INDIRECT:
> +		return kvmppc_h_pr_put_tce_indirect(vcpu);
> +	case H_STUFF_TCE:
> +		return kvmppc_h_pr_stuff_tce(vcpu);
>  	case H_CEDE:
>  		vcpu->arch.shared->msr |= MSR_EE;
>  		kvm_vcpu_block(vcpu);
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index f9c159e..b7ad589 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -384,6 +384,9 @@ int kvm_dev_ioctl_check_extension(long ext)
>  		r = 1;
>  		break;
>  #endif
> +	case KVM_CAP_SPAPR_MULTITCE:
> +		r = 1;
> +		break;
>  	default:
>  		r = 0;
>  		break;
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 6f49c87..6c04da1 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -640,6 +640,7 @@ struct kvm_ppc_smmu_info {
>  #define KVM_CAP_PPC_HTAB_FD 84
>  #define KVM_CAP_PPC_RTAS (0x100000 + 87)
>  #define KVM_CAP_SPAPR_XICS (0x100000 + 88)
> +#define KVM_CAP_SPAPR_MULTITCE (0x110000 + 89)
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/6] KVM: PPC: Add support for multiple-TCE hcalls
@ 2013-05-10  6:51     ` David Gibson
  0 siblings, 0 replies; 44+ messages in thread
From: David Gibson @ 2013-05-10  6:51 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Joerg Roedel, Alexander Graf, kvm-ppc, Alex Williamson,
	Paul Mackerras, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 19022 bytes --]

On Mon, May 06, 2013 at 05:25:53PM +1000, Alexey Kardashevskiy wrote:
> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
> H_STUFF_TCE hypercalls for QEMU emulated devices such as virtio
> devices or emulated PCI.  These calls allow adding multiple entries
> (up to 512) into the TCE table in one call which saves time on
> transition to/from real mode.
> 
> This adds a guest physical to host real address converter
> and calls the existing H_PUT_TCE handler. The converting function
> is going to be fully utilized by upcoming VFIO supporting patches.
> 
> This also implements the KVM_CAP_PPC_MULTITCE capability,
> so in order to support the functionality of this patch, QEMU
> needs to query for this capability and set the "hcall-multi-tce"
> hypertas property only if the capability is present, otherwise
> there will be serious performance degradation.


Hrm.  Clearly I didn't read this carefully enough before.  There are
some problems here.

[snip]
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> index 72ffc89..643ac1e 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -14,6 +14,7 @@
>   *
>   * Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
>   * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com>
> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com>
>   */
>  
>  #include <linux/types.h>
> @@ -36,9 +37,14 @@
>  #include <asm/ppc-opcode.h>
>  #include <asm/kvm_host.h>
>  #include <asm/udbg.h>
> +#include <asm/iommu.h>
>  
>  #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
> +#define ERROR_ADDR      (~(unsigned long)0x0)
>  
> +/*
> + * TCE tables handlers.
> + */
>  static long kvmppc_stt_npages(unsigned long window_size)
>  {
>  	return ALIGN((window_size >> SPAPR_TCE_SHIFT)
> @@ -148,3 +154,111 @@ fail:
>  	}
>  	return ret;
>  }
> +
> +/*
> + * Virtual mode handling of IOMMU map/unmap.
> + */
> +/* Converts guest physical address into host virtual */
> +static unsigned long get_virt_address(struct kvm_vcpu *vcpu,
> +		unsigned long gpa)

This should probably return a void * rather than an unsigned long.
Well, actually a void __user *.

> +{
> +	unsigned long hva, gfn = gpa >> PAGE_SHIFT;
> +	struct kvm_memory_slot *memslot;
> +
> +	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
> +	if (!memslot)
> +		return ERROR_ADDR;
> +
> +	/*
> +	 * Convert gfn to hva preserving flags and an offset
> +	 * within a system page
> +	 */
> +	hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa & ~PAGE_MASK);
> +	return hva;
> +}
> +
> +long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce)
> +{
> +	struct kvmppc_spapr_tce_table *tt;
> +
> +	tt = kvmppc_find_tce_table(vcpu, liobn);
> +	/* Didn't find the liobn, put it to userspace */
> +	if (!tt)
> +		return H_TOO_HARD;
> +
> +	/* Emulated IO */
> +	return kvmppc_emulated_h_put_tce(tt, ioba, tce);
> +}
> +
> +long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_list, unsigned long npages)
> +{
> +	struct kvmppc_spapr_tce_table *tt;
> +	long i;
> +	unsigned long tces;
> +
> +	/* The whole table addressed by tce_list resides in 4K page */
> +	if (npages > 512)
> +		return H_PARAMETER;

So, that doesn't actually verify what the comment says it does - only
that the list is < 4K in total.  You need to check the alignment of
tce_list as well.

> +
> +	tt = kvmppc_find_tce_table(vcpu, liobn);
> +	/* Didn't find the liobn, put it to userspace */
> +	if (!tt)
> +		return H_TOO_HARD;
> +
> +	tces = get_virt_address(vcpu, tce_list);
> +	if (tces == ERROR_ADDR)
> +		return H_TOO_HARD;
> +
> +	/* Emulated IO */

This comment doesn't seem to have any bearing on the test which
follows it.

> +	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
> +		return H_PARAMETER;
> +
> +	for (i = 0; i < npages; ++i) {
> +		unsigned long tce;
> +		unsigned long ptce = tces + i * sizeof(unsigned long);
> +
> +		if (get_user(tce, (unsigned long __user *)ptce))
> +			break;
> +
> +		if (kvmppc_emulated_h_put_tce(tt,
> +				ioba + (i << IOMMU_PAGE_SHIFT),	tce))
> +			break;
> +	}
> +	if (i == npages)
> +		return H_SUCCESS;
> +
> +	/* Failed, do cleanup */
> +	do {
> +		--i;
> +		kvmppc_emulated_h_put_tce(tt, ioba + (i << IOMMU_PAGE_SHIFT),
> +				0);
> +	} while (i);

Hrm, so, actually PAPR specifies that this hcall is supposed to first
copy the given tces to hypervisor memory, then translate (and
validate) them all, and only then touch the actual TCE table.  Rather
more complicated to do, but I guess we should - that would get rid of
the need for this partial cleanup in the failure case.

> +
> +	return H_PARAMETER;
> +}
> +
> +long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_value, unsigned long npages)
> +{
> +	struct kvmppc_spapr_tce_table *tt;
> +	long i;
> +
> +	tt = kvmppc_find_tce_table(vcpu, liobn);
> +	/* Didn't find the liobn, put it to userspace */
> +	if (!tt)
> +		return H_TOO_HARD;
> +
> +	/* Emulated IO */
> +	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
> +		return H_PARAMETER;
> +
> +	for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
> +		kvmppc_emulated_h_put_tce(tt, ioba, tce_value);
> +
> +	return H_SUCCESS;
> +}
> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> index 30c2f3b..55fdf7a 100644
> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -14,6 +14,7 @@
>   *
>   * Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
>   * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com>
> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com>
>   */
>  
>  #include <linux/types.h>
> @@ -35,42 +36,214 @@
>  #include <asm/ppc-opcode.h>
>  #include <asm/kvm_host.h>
>  #include <asm/udbg.h>
> +#include <asm/iommu.h>
> +#include <asm/tce.h>
>  
>  #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
> +#define ERROR_ADDR      (~(unsigned long)0x0)
>  
> -/* WARNING: This will be called in real-mode on HV KVM and virtual
> - *          mode on PR KVM
> +/*
> + * Finds a TCE table descriptor by LIOBN.
>   */
> +struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm_vcpu *vcpu,
> +		unsigned long liobn)
> +{
> +	struct kvmppc_spapr_tce_table *tt;
> +
> +	list_for_each_entry(tt, &vcpu->kvm->arch.spapr_tce_tables, list) {
> +		if (tt->liobn == liobn)
> +			return tt;
> +	}
> +
> +	return NULL;
> +}
> +EXPORT_SYMBOL_GPL(kvmppc_find_tce_table);
> +
> +/*
> + * kvmppc_emulated_h_put_tce() handles TCE requests for devices emulated
> + * by QEMU. It puts guest TCE values into the table and expects
> + * the QEMU to convert them later in the QEMU device implementation.
> + * Works in both real and virtual modes.
> + */
> +long kvmppc_emulated_h_put_tce(struct kvmppc_spapr_tce_table *tt,
> +		unsigned long ioba, unsigned long tce)
> +{
> +	unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
> +	struct page *page;
> +	u64 *tbl;
> +
> +	/* udbg_printf("H_PUT_TCE: liobn 0x%lx => tt=%p  window_size=0x%x\n", */
> +	/*	    liobn, tt, tt->window_size); */
> +	if (ioba >= tt->window_size) {
> +		/* pr_err("%s failed on ioba=%lx\n", __func__, ioba); */
> +		return H_PARAMETER;
> +	}
> +	/*
> +	 * Note on the use of page_address() in real mode,
> +	 *
> +	 * It is safe to use page_address() in real mode on ppc64 because
> +	 * page_address() is always defined as lowmem_page_address()
> +	 * which returns __va(PFN_PHYS(page_to_pfn(page))) which is arithmetial
> +	 * operation and does not access page struct.
> +	 *
> +	 * Theoretically page_address() could be defined different
> +	 * but either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL
> +	 * should be enabled.
> +	 * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64,
> +	 * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only
> +	 * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP
> +	 * is not expected to be enabled on ppc32, page_address()
> +	 * is safe for ppc32 as well.
> +	 */
> +#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
> +#error TODO: fix to avoid page_address() here
> +#endif
> +	page = tt->pages[idx / TCES_PER_PAGE];
> +	tbl = (u64 *)page_address(page);
> +
> +	/*
> +	 * Validate TCE address.
> +	 * At the moment only flags are validated
> +	 * as other check will significantly slow down
> +	 * or can make it even impossible to handle TCE requests
> +	 * in real mode.
> +	 */
> +	if (tce & ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ))
> +		return H_PARAMETER;
> +
> +	/* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
> +	tbl[idx % TCES_PER_PAGE] = tce;
> +
> +	return H_SUCCESS;
> +}
> +EXPORT_SYMBOL_GPL(kvmppc_emulated_h_put_tce);
> +
> +#ifdef CONFIG_KVM_BOOK3S_64_HV
> +/*
> + * Converts guest physical address into host real address.
> + * Also returns pte and page size if the page is present in page table.
> + */
> +static unsigned long get_real_address(struct kvm_vcpu *vcpu,
> +		unsigned long gpa, bool writing,
> +		pte_t *ptep, unsigned long *pg_sizep)

The only caller doesn't use the ptep and pg_sizep pointers, so there's
no point implementing them.

> +{
> +	struct kvm_memory_slot *memslot;
> +	pte_t pte;
> +	unsigned long hva, pg_size = 0, hwaddr, offset;
> +	unsigned long gfn = gpa >> PAGE_SHIFT;
> +
> +	/* Find a KVM memslot */
> +	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
> +	if (!memslot)
> +		return ERROR_ADDR;
> +
> +	/* Convert guest physical address to host virtual */
> +	hva = __gfn_to_hva_memslot(memslot, gfn);
> +
> +	/* Find a PTE and determine the size */
> +	pte = lookup_linux_pte(vcpu->arch.pgdir, hva,
> +			writing, &pg_size);
> +	if (!pte_present(pte))
> +		return ERROR_ADDR;
> +
> +	/* Calculate host phys address keeping flags and offset in the page */
> +	offset = gpa & (pg_size - 1);
> +
> +	/* pte_pfn(pte) should return an address aligned to pg_size */
> +	hwaddr = (pte_pfn(pte) << PAGE_SHIFT) + offset;
> +
> +	/* Copy outer values if required */
> +	if (pg_sizep)
> +		*pg_sizep = pg_size;
> +	if (ptep)
> +		*ptep = pte;
> +
> +	return hwaddr;
> +}
> +
>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		      unsigned long ioba, unsigned long tce)
>  {
> -	struct kvm *kvm = vcpu->kvm;
> -	struct kvmppc_spapr_tce_table *stt;
> -
> -	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
> -	/* 	    liobn, ioba, tce); */
> -
> -	list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
> -		if (stt->liobn == liobn) {
> -			unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
> -			struct page *page;
> -			u64 *tbl;
> -
> -			/* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p  window_size=0x%x\n", */
> -			/* 	    liobn, stt, stt->window_size); */
> -			if (ioba >= stt->window_size)
> -				return H_PARAMETER;
> -
> -			page = stt->pages[idx / TCES_PER_PAGE];
> -			tbl = (u64 *)page_address(page);
> -
> -			/* FIXME: Need to validate the TCE itself */
> -			/* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
> -			tbl[idx % TCES_PER_PAGE] = tce;
> -			return H_SUCCESS;
> -		}
> +	struct kvmppc_spapr_tce_table *tt;
> +
> +	tt = kvmppc_find_tce_table(vcpu, liobn);
> +	/* Didn't find the liobn, put it to virtual space */
> +	if (!tt)
> +		return H_TOO_HARD;
> +
> +	/* Emulated IO */
> +	return kvmppc_emulated_h_put_tce(tt, ioba, tce);
> +}
> +
> +long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_list,	unsigned long npages)
> +{
> +	struct kvmppc_spapr_tce_table *tt;
> +	long i;
> +	unsigned long tces;
> +
> +	/* The whole table addressed by tce_list resides in 4K page */
> +	if (npages > 512)
> +		return H_PARAMETER;
> +
> +	tt = kvmppc_find_tce_table(vcpu, liobn);
> +	/* Didn't find the liobn, put it to virtual space */
> +	if (!tt)
> +		return H_TOO_HARD;
> +
> +	tces = get_real_address(vcpu, tce_list, false, NULL, NULL);
> +	if (tces == ERROR_ADDR)
> +		return H_TOO_HARD;
> +
> +	/* Emulated IO */
> +	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
> +		return H_PARAMETER;
> +
> +	for (i = 0; i < npages; ++i) {
> +		unsigned long tce;
> +		unsigned long ptce = tces + i * sizeof(unsigned long);
> +
> +		if (get_user(tce, (unsigned long __user *)ptce))
> +			break;

Uh.  get_user() should not be used from real mode.  You'll need to
dereference the pointer directly here.

> +
> +		if (kvmppc_emulated_h_put_tce(tt,
> +				ioba + (i << IOMMU_PAGE_SHIFT), tce))
> +			break;
>  	}
> +	if (i == npages)
> +		return H_SUCCESS;
> +
> +	/* Failed, do cleanup */
> +	do {
> +		--i;
> +		kvmppc_emulated_h_put_tce(tt, ioba + (i << IOMMU_PAGE_SHIFT),
> +				0);
> +	} while (i);
> +
> +	return H_PARAMETER;
> +}
>  
> -	/* Didn't find the liobn, punt it to userspace */
> -	return H_TOO_HARD;
> +long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_value, unsigned long npages)
> +{
> +	struct kvmppc_spapr_tce_table *tt;
> +	long i;
> +
> +	tt = kvmppc_find_tce_table(vcpu, liobn);
> +	/* Didn't find the liobn, put it to virtual space */
> +	if (!tt)
> +		return H_TOO_HARD;
> +
> +	/* Emulated IO */
> +	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
> +		return H_PARAMETER;
> +
> +	for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
> +		kvmppc_emulated_h_put_tce(tt, ioba, tce_value);
> +
> +	return H_SUCCESS;
>  }
> +
> +#endif /* CONFIG_KVM_BOOK3S_64_HV */
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index e5afdcb..6eb6f44 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -565,6 +565,29 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
>  			ret = kvmppc_xics_hcall(vcpu, req);
>  			break;
>  		} /* fallthrough */


As paulus already pointed out, inserting this code here breaks the no
kernel xics case badly because of the fallthrough above.

> +	case H_PUT_TCE:
> +		ret = kvmppc_virtmode_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
> +						kvmppc_get_gpr(vcpu, 5),
> +						kvmppc_get_gpr(vcpu, 6));
> +		if (ret == H_TOO_HARD)
> +			return RESUME_HOST;
> +		break;
> +	case H_PUT_TCE_INDIRECT:
> +		ret = kvmppc_virtmode_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4),
> +						kvmppc_get_gpr(vcpu, 5),
> +						kvmppc_get_gpr(vcpu, 6),
> +						kvmppc_get_gpr(vcpu, 7));
> +		if (ret == H_TOO_HARD)
> +			return RESUME_HOST;
> +		break;
> +	case H_STUFF_TCE:
> +		ret = kvmppc_virtmode_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
> +						kvmppc_get_gpr(vcpu, 5),
> +						kvmppc_get_gpr(vcpu, 6),
> +						kvmppc_get_gpr(vcpu, 7));
> +		if (ret == H_TOO_HARD)
> +			return RESUME_HOST;
> +		break;
>  	default:
>  		return RESUME_HOST;
>  	}
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index d3e26be..4d73406 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -1472,6 +1472,12 @@ hcall_real_table:
>  	.long	0		/* 0x11c */
>  	.long	0		/* 0x120 */
>  	.long	.kvmppc_h_bulk_remove - hcall_real_table
> +	.long	0		/* 0x128 */
> +	.long	0		/* 0x12c */
> +	.long	0		/* 0x130 */
> +	.long	0		/* 0x134 */
> +	.long	.kvmppc_h_stuff_tce - hcall_real_table
> +	.long	.kvmppc_h_put_tce_indirect - hcall_real_table
>  hcall_real_table_end:
>  
>  ignore_hdec:
> diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c
> index 8352cac..603e4f2 100644
> --- a/arch/powerpc/kvm/book3s_pr_papr.c
> +++ b/arch/powerpc/kvm/book3s_pr_papr.c
> @@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu)
>  	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
>  	long rc;
>  
> -	rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
> +	rc = kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce);
> +	if (rc == H_TOO_HARD)
> +		return EMULATE_FAIL;
> +	kvmppc_set_gpr(vcpu, 3, rc);
> +	return EMULATE_DONE;
> +}
> +
> +static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
> +	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
> +	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
> +	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
> +	long rc;
> +
> +	rc = kvmppc_virtmode_h_put_tce_indirect(vcpu, liobn, ioba,
> +			tce, npages);
> +	if (rc == H_TOO_HARD)
> +		return EMULATE_FAIL;
> +	kvmppc_set_gpr(vcpu, 3, rc);
> +	return EMULATE_DONE;
> +}
> +
> +static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
> +	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
> +	unsigned long tce_value = kvmppc_get_gpr(vcpu, 6);
> +	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
> +	long rc;
> +
> +	rc = kvmppc_virtmode_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages);
>  	if (rc == H_TOO_HARD)
>  		return EMULATE_FAIL;
>  	kvmppc_set_gpr(vcpu, 3, rc);
> @@ -249,6 +280,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd)
>  		return kvmppc_h_pr_bulk_remove(vcpu);
>  	case H_PUT_TCE:
>  		return kvmppc_h_pr_put_tce(vcpu);
> +	case H_PUT_TCE_INDIRECT:
> +		return kvmppc_h_pr_put_tce_indirect(vcpu);
> +	case H_STUFF_TCE:
> +		return kvmppc_h_pr_stuff_tce(vcpu);
>  	case H_CEDE:
>  		vcpu->arch.shared->msr |= MSR_EE;
>  		kvm_vcpu_block(vcpu);
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index f9c159e..b7ad589 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -384,6 +384,9 @@ int kvm_dev_ioctl_check_extension(long ext)
>  		r = 1;
>  		break;
>  #endif
> +	case KVM_CAP_SPAPR_MULTITCE:
> +		r = 1;
> +		break;
>  	default:
>  		r = 0;
>  		break;
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 6f49c87..6c04da1 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -640,6 +640,7 @@ struct kvm_ppc_smmu_info {
>  #define KVM_CAP_PPC_HTAB_FD 84
>  #define KVM_CAP_PPC_RTAS (0x100000 + 87)
>  #define KVM_CAP_SPAPR_XICS (0x100000 + 88)
> +#define KVM_CAP_SPAPR_MULTITCE (0x110000 + 89)
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/6] KVM: PPC: Add support for multiple-TCE hcalls
@ 2013-05-10  6:51     ` David Gibson
  0 siblings, 0 replies; 44+ messages in thread
From: David Gibson @ 2013-05-10  6:51 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, kvm, Alexander Graf, kvm-ppc, Alex Williamson,
	Paul Mackerras, Joerg Roedel

[-- Attachment #1: Type: text/plain, Size: 19022 bytes --]

On Mon, May 06, 2013 at 05:25:53PM +1000, Alexey Kardashevskiy wrote:
> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
> H_STUFF_TCE hypercalls for QEMU emulated devices such as virtio
> devices or emulated PCI.  These calls allow adding multiple entries
> (up to 512) into the TCE table in one call which saves time on
> transition to/from real mode.
> 
> This adds a guest physical to host real address converter
> and calls the existing H_PUT_TCE handler. The converting function
> is going to be fully utilized by upcoming VFIO supporting patches.
> 
> This also implements the KVM_CAP_PPC_MULTITCE capability,
> so in order to support the functionality of this patch, QEMU
> needs to query for this capability and set the "hcall-multi-tce"
> hypertas property only if the capability is present, otherwise
> there will be serious performance degradation.


Hrm.  Clearly I didn't read this carefully enough before.  There are
some problems here.

[snip]
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> index 72ffc89..643ac1e 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -14,6 +14,7 @@
>   *
>   * Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
>   * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com>
> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com>
>   */
>  
>  #include <linux/types.h>
> @@ -36,9 +37,14 @@
>  #include <asm/ppc-opcode.h>
>  #include <asm/kvm_host.h>
>  #include <asm/udbg.h>
> +#include <asm/iommu.h>
>  
>  #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
> +#define ERROR_ADDR      (~(unsigned long)0x0)
>  
> +/*
> + * TCE tables handlers.
> + */
>  static long kvmppc_stt_npages(unsigned long window_size)
>  {
>  	return ALIGN((window_size >> SPAPR_TCE_SHIFT)
> @@ -148,3 +154,111 @@ fail:
>  	}
>  	return ret;
>  }
> +
> +/*
> + * Virtual mode handling of IOMMU map/unmap.
> + */
> +/* Converts guest physical address into host virtual */
> +static unsigned long get_virt_address(struct kvm_vcpu *vcpu,
> +		unsigned long gpa)

This should probably return a void * rather than an unsigned long.
Well, actually a void __user *.

> +{
> +	unsigned long hva, gfn = gpa >> PAGE_SHIFT;
> +	struct kvm_memory_slot *memslot;
> +
> +	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
> +	if (!memslot)
> +		return ERROR_ADDR;
> +
> +	/*
> +	 * Convert gfn to hva preserving flags and an offset
> +	 * within a system page
> +	 */
> +	hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa & ~PAGE_MASK);
> +	return hva;
> +}
> +
> +long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce)
> +{
> +	struct kvmppc_spapr_tce_table *tt;
> +
> +	tt = kvmppc_find_tce_table(vcpu, liobn);
> +	/* Didn't find the liobn, put it to userspace */
> +	if (!tt)
> +		return H_TOO_HARD;
> +
> +	/* Emulated IO */
> +	return kvmppc_emulated_h_put_tce(tt, ioba, tce);
> +}
> +
> +long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_list, unsigned long npages)
> +{
> +	struct kvmppc_spapr_tce_table *tt;
> +	long i;
> +	unsigned long tces;
> +
> +	/* The whole table addressed by tce_list resides in 4K page */
> +	if (npages > 512)
> +		return H_PARAMETER;

So, that doesn't actually verify what the comment says it does - only
that the list is < 4K in total.  You need to check the alignment of
tce_list as well.

> +
> +	tt = kvmppc_find_tce_table(vcpu, liobn);
> +	/* Didn't find the liobn, put it to userspace */
> +	if (!tt)
> +		return H_TOO_HARD;
> +
> +	tces = get_virt_address(vcpu, tce_list);
> +	if (tces == ERROR_ADDR)
> +		return H_TOO_HARD;
> +
> +	/* Emulated IO */

This comment doesn't seem to have any bearing on the test which
follows it.

> +	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
> +		return H_PARAMETER;
> +
> +	for (i = 0; i < npages; ++i) {
> +		unsigned long tce;
> +		unsigned long ptce = tces + i * sizeof(unsigned long);
> +
> +		if (get_user(tce, (unsigned long __user *)ptce))
> +			break;
> +
> +		if (kvmppc_emulated_h_put_tce(tt,
> +				ioba + (i << IOMMU_PAGE_SHIFT),	tce))
> +			break;
> +	}
> +	if (i == npages)
> +		return H_SUCCESS;
> +
> +	/* Failed, do cleanup */
> +	do {
> +		--i;
> +		kvmppc_emulated_h_put_tce(tt, ioba + (i << IOMMU_PAGE_SHIFT),
> +				0);
> +	} while (i);

Hrm, so, actually PAPR specifies that this hcall is supposed to first
copy the given tces to hypervisor memory, then translate (and
validate) them all, and only then touch the actual TCE table.  Rather
more complicated to do, but I guess we should - that would get rid of
the need for this partial cleanup in the failure case.

> +
> +	return H_PARAMETER;
> +}
> +
> +long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_value, unsigned long npages)
> +{
> +	struct kvmppc_spapr_tce_table *tt;
> +	long i;
> +
> +	tt = kvmppc_find_tce_table(vcpu, liobn);
> +	/* Didn't find the liobn, put it to userspace */
> +	if (!tt)
> +		return H_TOO_HARD;
> +
> +	/* Emulated IO */
> +	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
> +		return H_PARAMETER;
> +
> +	for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
> +		kvmppc_emulated_h_put_tce(tt, ioba, tce_value);
> +
> +	return H_SUCCESS;
> +}
> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> index 30c2f3b..55fdf7a 100644
> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -14,6 +14,7 @@
>   *
>   * Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
>   * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com>
> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com>
>   */
>  
>  #include <linux/types.h>
> @@ -35,42 +36,214 @@
>  #include <asm/ppc-opcode.h>
>  #include <asm/kvm_host.h>
>  #include <asm/udbg.h>
> +#include <asm/iommu.h>
> +#include <asm/tce.h>
>  
>  #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
> +#define ERROR_ADDR      (~(unsigned long)0x0)
>  
> -/* WARNING: This will be called in real-mode on HV KVM and virtual
> - *          mode on PR KVM
> +/*
> + * Finds a TCE table descriptor by LIOBN.
>   */
> +struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm_vcpu *vcpu,
> +		unsigned long liobn)
> +{
> +	struct kvmppc_spapr_tce_table *tt;
> +
> +	list_for_each_entry(tt, &vcpu->kvm->arch.spapr_tce_tables, list) {
> +		if (tt->liobn == liobn)
> +			return tt;
> +	}
> +
> +	return NULL;
> +}
> +EXPORT_SYMBOL_GPL(kvmppc_find_tce_table);
> +
> +/*
> + * kvmppc_emulated_h_put_tce() handles TCE requests for devices emulated
> + * by QEMU. It puts guest TCE values into the table and expects
> + * the QEMU to convert them later in the QEMU device implementation.
> + * Works in both real and virtual modes.
> + */
> +long kvmppc_emulated_h_put_tce(struct kvmppc_spapr_tce_table *tt,
> +		unsigned long ioba, unsigned long tce)
> +{
> +	unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
> +	struct page *page;
> +	u64 *tbl;
> +
> +	/* udbg_printf("H_PUT_TCE: liobn 0x%lx => tt=%p  window_size=0x%x\n", */
> +	/*	    liobn, tt, tt->window_size); */
> +	if (ioba >= tt->window_size) {
> +		/* pr_err("%s failed on ioba=%lx\n", __func__, ioba); */
> +		return H_PARAMETER;
> +	}
> +	/*
> +	 * Note on the use of page_address() in real mode,
> +	 *
> +	 * It is safe to use page_address() in real mode on ppc64 because
> +	 * page_address() is always defined as lowmem_page_address()
> +	 * which returns __va(PFN_PHYS(page_to_pfn(page))) which is arithmetial
> +	 * operation and does not access page struct.
> +	 *
> +	 * Theoretically page_address() could be defined different
> +	 * but either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL
> +	 * should be enabled.
> +	 * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64,
> +	 * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only
> +	 * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP
> +	 * is not expected to be enabled on ppc32, page_address()
> +	 * is safe for ppc32 as well.
> +	 */
> +#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
> +#error TODO: fix to avoid page_address() here
> +#endif
> +	page = tt->pages[idx / TCES_PER_PAGE];
> +	tbl = (u64 *)page_address(page);
> +
> +	/*
> +	 * Validate TCE address.
> +	 * At the moment only flags are validated
> +	 * as other check will significantly slow down
> +	 * or can make it even impossible to handle TCE requests
> +	 * in real mode.
> +	 */
> +	if (tce & ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ))
> +		return H_PARAMETER;
> +
> +	/* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
> +	tbl[idx % TCES_PER_PAGE] = tce;
> +
> +	return H_SUCCESS;
> +}
> +EXPORT_SYMBOL_GPL(kvmppc_emulated_h_put_tce);
> +
> +#ifdef CONFIG_KVM_BOOK3S_64_HV
> +/*
> + * Converts guest physical address into host real address.
> + * Also returns pte and page size if the page is present in page table.
> + */
> +static unsigned long get_real_address(struct kvm_vcpu *vcpu,
> +		unsigned long gpa, bool writing,
> +		pte_t *ptep, unsigned long *pg_sizep)

The only caller doesn't use the ptep and pg_sizep pointers, so there's
no point implementing them.

> +{
> +	struct kvm_memory_slot *memslot;
> +	pte_t pte;
> +	unsigned long hva, pg_size = 0, hwaddr, offset;
> +	unsigned long gfn = gpa >> PAGE_SHIFT;
> +
> +	/* Find a KVM memslot */
> +	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
> +	if (!memslot)
> +		return ERROR_ADDR;
> +
> +	/* Convert guest physical address to host virtual */
> +	hva = __gfn_to_hva_memslot(memslot, gfn);
> +
> +	/* Find a PTE and determine the size */
> +	pte = lookup_linux_pte(vcpu->arch.pgdir, hva,
> +			writing, &pg_size);
> +	if (!pte_present(pte))
> +		return ERROR_ADDR;
> +
> +	/* Calculate host phys address keeping flags and offset in the page */
> +	offset = gpa & (pg_size - 1);
> +
> +	/* pte_pfn(pte) should return an address aligned to pg_size */
> +	hwaddr = (pte_pfn(pte) << PAGE_SHIFT) + offset;
> +
> +	/* Copy outer values if required */
> +	if (pg_sizep)
> +		*pg_sizep = pg_size;
> +	if (ptep)
> +		*ptep = pte;
> +
> +	return hwaddr;
> +}
> +
>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		      unsigned long ioba, unsigned long tce)
>  {
> -	struct kvm *kvm = vcpu->kvm;
> -	struct kvmppc_spapr_tce_table *stt;
> -
> -	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
> -	/* 	    liobn, ioba, tce); */
> -
> -	list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
> -		if (stt->liobn == liobn) {
> -			unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
> -			struct page *page;
> -			u64 *tbl;
> -
> -			/* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p  window_size=0x%x\n", */
> -			/* 	    liobn, stt, stt->window_size); */
> -			if (ioba >= stt->window_size)
> -				return H_PARAMETER;
> -
> -			page = stt->pages[idx / TCES_PER_PAGE];
> -			tbl = (u64 *)page_address(page);
> -
> -			/* FIXME: Need to validate the TCE itself */
> -			/* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
> -			tbl[idx % TCES_PER_PAGE] = tce;
> -			return H_SUCCESS;
> -		}
> +	struct kvmppc_spapr_tce_table *tt;
> +
> +	tt = kvmppc_find_tce_table(vcpu, liobn);
> +	/* Didn't find the liobn, put it to virtual space */
> +	if (!tt)
> +		return H_TOO_HARD;
> +
> +	/* Emulated IO */
> +	return kvmppc_emulated_h_put_tce(tt, ioba, tce);
> +}
> +
> +long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_list,	unsigned long npages)
> +{
> +	struct kvmppc_spapr_tce_table *tt;
> +	long i;
> +	unsigned long tces;
> +
> +	/* The whole table addressed by tce_list resides in 4K page */
> +	if (npages > 512)
> +		return H_PARAMETER;
> +
> +	tt = kvmppc_find_tce_table(vcpu, liobn);
> +	/* Didn't find the liobn, put it to virtual space */
> +	if (!tt)
> +		return H_TOO_HARD;
> +
> +	tces = get_real_address(vcpu, tce_list, false, NULL, NULL);
> +	if (tces == ERROR_ADDR)
> +		return H_TOO_HARD;
> +
> +	/* Emulated IO */
> +	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
> +		return H_PARAMETER;
> +
> +	for (i = 0; i < npages; ++i) {
> +		unsigned long tce;
> +		unsigned long ptce = tces + i * sizeof(unsigned long);
> +
> +		if (get_user(tce, (unsigned long __user *)ptce))
> +			break;

Uh.  get_user() should not be used from real mode.  You'll need to
dereference the pointer directly here.

> +
> +		if (kvmppc_emulated_h_put_tce(tt,
> +				ioba + (i << IOMMU_PAGE_SHIFT), tce))
> +			break;
>  	}
> +	if (i == npages)
> +		return H_SUCCESS;
> +
> +	/* Failed, do cleanup */
> +	do {
> +		--i;
> +		kvmppc_emulated_h_put_tce(tt, ioba + (i << IOMMU_PAGE_SHIFT),
> +				0);
> +	} while (i);
> +
> +	return H_PARAMETER;
> +}
>  
> -	/* Didn't find the liobn, punt it to userspace */
> -	return H_TOO_HARD;
> +long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_value, unsigned long npages)
> +{
> +	struct kvmppc_spapr_tce_table *tt;
> +	long i;
> +
> +	tt = kvmppc_find_tce_table(vcpu, liobn);
> +	/* Didn't find the liobn, put it to virtual space */
> +	if (!tt)
> +		return H_TOO_HARD;
> +
> +	/* Emulated IO */
> +	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
> +		return H_PARAMETER;
> +
> +	for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
> +		kvmppc_emulated_h_put_tce(tt, ioba, tce_value);
> +
> +	return H_SUCCESS;
>  }
> +
> +#endif /* CONFIG_KVM_BOOK3S_64_HV */
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index e5afdcb..6eb6f44 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -565,6 +565,29 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
>  			ret = kvmppc_xics_hcall(vcpu, req);
>  			break;
>  		} /* fallthrough */


As paulus already pointed out, inserting this code here breaks the no
kernel xics case badly because of the fallthrough above.

> +	case H_PUT_TCE:
> +		ret = kvmppc_virtmode_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
> +						kvmppc_get_gpr(vcpu, 5),
> +						kvmppc_get_gpr(vcpu, 6));
> +		if (ret == H_TOO_HARD)
> +			return RESUME_HOST;
> +		break;
> +	case H_PUT_TCE_INDIRECT:
> +		ret = kvmppc_virtmode_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4),
> +						kvmppc_get_gpr(vcpu, 5),
> +						kvmppc_get_gpr(vcpu, 6),
> +						kvmppc_get_gpr(vcpu, 7));
> +		if (ret == H_TOO_HARD)
> +			return RESUME_HOST;
> +		break;
> +	case H_STUFF_TCE:
> +		ret = kvmppc_virtmode_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
> +						kvmppc_get_gpr(vcpu, 5),
> +						kvmppc_get_gpr(vcpu, 6),
> +						kvmppc_get_gpr(vcpu, 7));
> +		if (ret == H_TOO_HARD)
> +			return RESUME_HOST;
> +		break;
>  	default:
>  		return RESUME_HOST;
>  	}
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index d3e26be..4d73406 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -1472,6 +1472,12 @@ hcall_real_table:
>  	.long	0		/* 0x11c */
>  	.long	0		/* 0x120 */
>  	.long	.kvmppc_h_bulk_remove - hcall_real_table
> +	.long	0		/* 0x128 */
> +	.long	0		/* 0x12c */
> +	.long	0		/* 0x130 */
> +	.long	0		/* 0x134 */
> +	.long	.kvmppc_h_stuff_tce - hcall_real_table
> +	.long	.kvmppc_h_put_tce_indirect - hcall_real_table
>  hcall_real_table_end:
>  
>  ignore_hdec:
> diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c
> index 8352cac..603e4f2 100644
> --- a/arch/powerpc/kvm/book3s_pr_papr.c
> +++ b/arch/powerpc/kvm/book3s_pr_papr.c
> @@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu)
>  	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
>  	long rc;
>  
> -	rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
> +	rc = kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce);
> +	if (rc == H_TOO_HARD)
> +		return EMULATE_FAIL;
> +	kvmppc_set_gpr(vcpu, 3, rc);
> +	return EMULATE_DONE;
> +}
> +
> +static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
> +	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
> +	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
> +	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
> +	long rc;
> +
> +	rc = kvmppc_virtmode_h_put_tce_indirect(vcpu, liobn, ioba,
> +			tce, npages);
> +	if (rc == H_TOO_HARD)
> +		return EMULATE_FAIL;
> +	kvmppc_set_gpr(vcpu, 3, rc);
> +	return EMULATE_DONE;
> +}
> +
> +static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
> +	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
> +	unsigned long tce_value = kvmppc_get_gpr(vcpu, 6);
> +	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
> +	long rc;
> +
> +	rc = kvmppc_virtmode_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages);
>  	if (rc == H_TOO_HARD)
>  		return EMULATE_FAIL;
>  	kvmppc_set_gpr(vcpu, 3, rc);
> @@ -249,6 +280,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd)
>  		return kvmppc_h_pr_bulk_remove(vcpu);
>  	case H_PUT_TCE:
>  		return kvmppc_h_pr_put_tce(vcpu);
> +	case H_PUT_TCE_INDIRECT:
> +		return kvmppc_h_pr_put_tce_indirect(vcpu);
> +	case H_STUFF_TCE:
> +		return kvmppc_h_pr_stuff_tce(vcpu);
>  	case H_CEDE:
>  		vcpu->arch.shared->msr |= MSR_EE;
>  		kvm_vcpu_block(vcpu);
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index f9c159e..b7ad589 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -384,6 +384,9 @@ int kvm_dev_ioctl_check_extension(long ext)
>  		r = 1;
>  		break;
>  #endif
> +	case KVM_CAP_SPAPR_MULTITCE:
> +		r = 1;
> +		break;
>  	default:
>  		r = 0;
>  		break;
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 6f49c87..6c04da1 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -640,6 +640,7 @@ struct kvm_ppc_smmu_info {
>  #define KVM_CAP_PPC_HTAB_FD 84
>  #define KVM_CAP_PPC_RTAS (0x100000 + 87)
>  #define KVM_CAP_SPAPR_XICS (0x100000 + 88)
> +#define KVM_CAP_SPAPR_MULTITCE (0x110000 + 89)
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/6] KVM: PPC: Add support for multiple-TCE hcalls
  2013-05-10  6:51     ` David Gibson
  (?)
@ 2013-05-10  7:53       ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 44+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-10  7:53 UTC (permalink / raw)
  To: David Gibson
  Cc: linuxppc-dev, kvm, Alexander Graf, kvm-ppc, Alex Williamson,
	Paul Mackerras, Joerg Roedel

On 05/10/2013 04:51 PM, David Gibson wrote:
> On Mon, May 06, 2013 at 05:25:53PM +1000, Alexey Kardashevskiy wrote:
>> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
>> H_STUFF_TCE hypercalls for QEMU emulated devices such as virtio
>> devices or emulated PCI.  These calls allow adding multiple entries
>> (up to 512) into the TCE table in one call which saves time on
>> transition to/from real mode.
>>
>> This adds a guest physical to host real address converter
>> and calls the existing H_PUT_TCE handler. The converting function
>> is going to be fully utilized by upcoming VFIO supporting patches.
>>
>> This also implements the KVM_CAP_PPC_MULTITCE capability,
>> so in order to support the functionality of this patch, QEMU
>> needs to query for this capability and set the "hcall-multi-tce"
>> hypertas property only if the capability is present, otherwise
>> there will be serious performance degradation.
> 
> 
> Hrm.  Clearly I didn't read this carefully enough before.  There are
> some problems here.

?


> [snip]
>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
>> index 72ffc89..643ac1e 100644
>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>> @@ -14,6 +14,7 @@
>>   *
>>   * Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
>>   * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com>
>> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com>
>>   */
>>  
>>  #include <linux/types.h>
>> @@ -36,9 +37,14 @@
>>  #include <asm/ppc-opcode.h>
>>  #include <asm/kvm_host.h>
>>  #include <asm/udbg.h>
>> +#include <asm/iommu.h>
>>  
>>  #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
>> +#define ERROR_ADDR      (~(unsigned long)0x0)
>>  
>> +/*
>> + * TCE tables handlers.
>> + */
>>  static long kvmppc_stt_npages(unsigned long window_size)
>>  {
>>  	return ALIGN((window_size >> SPAPR_TCE_SHIFT)
>> @@ -148,3 +154,111 @@ fail:
>>  	}
>>  	return ret;
>>  }
>> +
>> +/*
>> + * Virtual mode handling of IOMMU map/unmap.
>> + */
>> +/* Converts guest physical address into host virtual */
>> +static unsigned long get_virt_address(struct kvm_vcpu *vcpu,
>> +		unsigned long gpa)
> 
> This should probably return a void * rather than an unsigned long.
> Well, actually a void __user *.
> 
>> +{
>> +	unsigned long hva, gfn = gpa >> PAGE_SHIFT;
>> +	struct kvm_memory_slot *memslot;
>> +
>> +	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
>> +	if (!memslot)
>> +		return ERROR_ADDR;
>> +
>> +	/*
>> +	 * Convert gfn to hva preserving flags and an offset
>> +	 * within a system page
>> +	 */
>> +	hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa & ~PAGE_MASK);
>> +	return hva;
>> +}
>> +
>> +long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
>> +		unsigned long liobn, unsigned long ioba,
>> +		unsigned long tce)
>> +{
>> +	struct kvmppc_spapr_tce_table *tt;
>> +
>> +	tt = kvmppc_find_tce_table(vcpu, liobn);
>> +	/* Didn't find the liobn, put it to userspace */
>> +	if (!tt)
>> +		return H_TOO_HARD;
>> +
>> +	/* Emulated IO */
>> +	return kvmppc_emulated_h_put_tce(tt, ioba, tce);
>> +}
>> +
>> +long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>> +		unsigned long liobn, unsigned long ioba,
>> +		unsigned long tce_list, unsigned long npages)
>> +{
>> +	struct kvmppc_spapr_tce_table *tt;
>> +	long i;
>> +	unsigned long tces;
>> +
>> +	/* The whole table addressed by tce_list resides in 4K page */
>> +	if (npages > 512)
>> +		return H_PARAMETER;
> 
> So, that doesn't actually verify what the comment says it does - only
> that the list is < 4K in total.  You need to check the alignment of
> tce_list as well.



The spec says to return H_PARAMETER if >512. I.e. it takes just 1 page and
I do not need to bother if pages may not lay continuously in RAM (matters
for real mode).

/*
 * As the spec is saying that maximum possible number of TCEs is 512,
 * the whole TCE page is no more than 4K. Therefore we do not have to
 * worry if pages do not lie continuously in the RAM
 */
Any better?...


>> +
>> +	tt = kvmppc_find_tce_table(vcpu, liobn);
>> +	/* Didn't find the liobn, put it to userspace */
>> +	if (!tt)
>> +		return H_TOO_HARD;
>> +
>> +	tces = get_virt_address(vcpu, tce_list);
>> +	if (tces == ERROR_ADDR)
>> +		return H_TOO_HARD;
>> +
>> +	/* Emulated IO */
> 
> This comment doesn't seem to have any bearing on the test which
> follows it.
> 
>> +	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
>> +		return H_PARAMETER;
>> +
>> +	for (i = 0; i < npages; ++i) {
>> +		unsigned long tce;
>> +		unsigned long ptce = tces + i * sizeof(unsigned long);
>> +
>> +		if (get_user(tce, (unsigned long __user *)ptce))
>> +			break;
>> +
>> +		if (kvmppc_emulated_h_put_tce(tt,
>> +				ioba + (i << IOMMU_PAGE_SHIFT),	tce))
>> +			break;
>> +	}
>> +	if (i == npages)
>> +		return H_SUCCESS;
>> +
>> +	/* Failed, do cleanup */
>> +	do {
>> +		--i;
>> +		kvmppc_emulated_h_put_tce(tt, ioba + (i << IOMMU_PAGE_SHIFT),
>> +				0);
>> +	} while (i);
> 
> Hrm, so, actually PAPR specifies that this hcall is supposed to first
> copy the given tces to hypervisor memory, then translate (and
> validate) them all, and only then touch the actual TCE table.  Rather
> more complicated to do, but I guess we should - that would get rid of
> the need for this partial cleanup in the failure case.


So we have to kmalloc(4K) on every PUT_INDIRECT. Or we can put tces on the
stack (4K is quire a lot for the kernel, no)?



>> +
>> +	return H_PARAMETER;
>> +}
>> +
>> +long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
>> +		unsigned long liobn, unsigned long ioba,
>> +		unsigned long tce_value, unsigned long npages)
>> +{
>> +	struct kvmppc_spapr_tce_table *tt;
>> +	long i;
>> +
>> +	tt = kvmppc_find_tce_table(vcpu, liobn);
>> +	/* Didn't find the liobn, put it to userspace */
>> +	if (!tt)
>> +		return H_TOO_HARD;
>> +
>> +	/* Emulated IO */
>> +	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
>> +		return H_PARAMETER;
>> +
>> +	for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
>> +		kvmppc_emulated_h_put_tce(tt, ioba, tce_value);
>> +
>> +	return H_SUCCESS;
>> +}
>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
>> index 30c2f3b..55fdf7a 100644
>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
>> @@ -14,6 +14,7 @@
>>   *
>>   * Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
>>   * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com>
>> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com>
>>   */
>>  
>>  #include <linux/types.h>
>> @@ -35,42 +36,214 @@
>>  #include <asm/ppc-opcode.h>
>>  #include <asm/kvm_host.h>
>>  #include <asm/udbg.h>
>> +#include <asm/iommu.h>
>> +#include <asm/tce.h>
>>  
>>  #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
>> +#define ERROR_ADDR      (~(unsigned long)0x0)
>>  
>> -/* WARNING: This will be called in real-mode on HV KVM and virtual
>> - *          mode on PR KVM
>> +/*
>> + * Finds a TCE table descriptor by LIOBN.
>>   */
>> +struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm_vcpu *vcpu,
>> +		unsigned long liobn)
>> +{
>> +	struct kvmppc_spapr_tce_table *tt;
>> +
>> +	list_for_each_entry(tt, &vcpu->kvm->arch.spapr_tce_tables, list) {
>> +		if (tt->liobn == liobn)
>> +			return tt;
>> +	}
>> +
>> +	return NULL;
>> +}
>> +EXPORT_SYMBOL_GPL(kvmppc_find_tce_table);
>> +
>> +/*
>> + * kvmppc_emulated_h_put_tce() handles TCE requests for devices emulated
>> + * by QEMU. It puts guest TCE values into the table and expects
>> + * the QEMU to convert them later in the QEMU device implementation.
>> + * Works in both real and virtual modes.
>> + */
>> +long kvmppc_emulated_h_put_tce(struct kvmppc_spapr_tce_table *tt,
>> +		unsigned long ioba, unsigned long tce)
>> +{
>> +	unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
>> +	struct page *page;
>> +	u64 *tbl;
>> +
>> +	/* udbg_printf("H_PUT_TCE: liobn 0x%lx => tt=%p  window_size=0x%x\n", */
>> +	/*	    liobn, tt, tt->window_size); */
>> +	if (ioba >= tt->window_size) {
>> +		/* pr_err("%s failed on ioba=%lx\n", __func__, ioba); */
>> +		return H_PARAMETER;
>> +	}
>> +	/*
>> +	 * Note on the use of page_address() in real mode,
>> +	 *
>> +	 * It is safe to use page_address() in real mode on ppc64 because
>> +	 * page_address() is always defined as lowmem_page_address()
>> +	 * which returns __va(PFN_PHYS(page_to_pfn(page))) which is arithmetial
>> +	 * operation and does not access page struct.
>> +	 *
>> +	 * Theoretically page_address() could be defined different
>> +	 * but either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL
>> +	 * should be enabled.
>> +	 * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64,
>> +	 * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only
>> +	 * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP
>> +	 * is not expected to be enabled on ppc32, page_address()
>> +	 * is safe for ppc32 as well.
>> +	 */
>> +#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
>> +#error TODO: fix to avoid page_address() here
>> +#endif
>> +	page = tt->pages[idx / TCES_PER_PAGE];
>> +	tbl = (u64 *)page_address(page);
>> +
>> +	/*
>> +	 * Validate TCE address.
>> +	 * At the moment only flags are validated
>> +	 * as other check will significantly slow down
>> +	 * or can make it even impossible to handle TCE requests
>> +	 * in real mode.
>> +	 */
>> +	if (tce & ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ))
>> +		return H_PARAMETER;
>> +
>> +	/* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
>> +	tbl[idx % TCES_PER_PAGE] = tce;
>> +
>> +	return H_SUCCESS;
>> +}
>> +EXPORT_SYMBOL_GPL(kvmppc_emulated_h_put_tce);
>> +
>> +#ifdef CONFIG_KVM_BOOK3S_64_HV
>> +/*
>> + * Converts guest physical address into host real address.
>> + * Also returns pte and page size if the page is present in page table.
>> + */
>> +static unsigned long get_real_address(struct kvm_vcpu *vcpu,
>> +		unsigned long gpa, bool writing,
>> +		pte_t *ptep, unsigned long *pg_sizep)
> 
> The only caller doesn't use the ptep and pg_sizep pointers, so there's
> no point implementing them.


"KVM: PPC: Add support for IOMMU in-kernel handling" will. Is there much
sense in splitting this quite small function between patches?




-- 
Alexey

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/6] KVM: PPC: Add support for multiple-TCE hcalls
@ 2013-05-10  7:53       ` Alexey Kardashevskiy
  0 siblings, 0 replies; 44+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-10  7:53 UTC (permalink / raw)
  To: David Gibson
  Cc: kvm, Joerg Roedel, Alexander Graf, kvm-ppc, Alex Williamson,
	Paul Mackerras, linuxppc-dev

On 05/10/2013 04:51 PM, David Gibson wrote:
> On Mon, May 06, 2013 at 05:25:53PM +1000, Alexey Kardashevskiy wrote:
>> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
>> H_STUFF_TCE hypercalls for QEMU emulated devices such as virtio
>> devices or emulated PCI.  These calls allow adding multiple entries
>> (up to 512) into the TCE table in one call which saves time on
>> transition to/from real mode.
>>
>> This adds a guest physical to host real address converter
>> and calls the existing H_PUT_TCE handler. The converting function
>> is going to be fully utilized by upcoming VFIO supporting patches.
>>
>> This also implements the KVM_CAP_PPC_MULTITCE capability,
>> so in order to support the functionality of this patch, QEMU
>> needs to query for this capability and set the "hcall-multi-tce"
>> hypertas property only if the capability is present, otherwise
>> there will be serious performance degradation.
> 
> 
> Hrm.  Clearly I didn't read this carefully enough before.  There are
> some problems here.

?


> [snip]
>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
>> index 72ffc89..643ac1e 100644
>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>> @@ -14,6 +14,7 @@
>>   *
>>   * Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
>>   * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com>
>> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com>
>>   */
>>  
>>  #include <linux/types.h>
>> @@ -36,9 +37,14 @@
>>  #include <asm/ppc-opcode.h>
>>  #include <asm/kvm_host.h>
>>  #include <asm/udbg.h>
>> +#include <asm/iommu.h>
>>  
>>  #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
>> +#define ERROR_ADDR      (~(unsigned long)0x0)
>>  
>> +/*
>> + * TCE tables handlers.
>> + */
>>  static long kvmppc_stt_npages(unsigned long window_size)
>>  {
>>  	return ALIGN((window_size >> SPAPR_TCE_SHIFT)
>> @@ -148,3 +154,111 @@ fail:
>>  	}
>>  	return ret;
>>  }
>> +
>> +/*
>> + * Virtual mode handling of IOMMU map/unmap.
>> + */
>> +/* Converts guest physical address into host virtual */
>> +static unsigned long get_virt_address(struct kvm_vcpu *vcpu,
>> +		unsigned long gpa)
> 
> This should probably return a void * rather than an unsigned long.
> Well, actually a void __user *.
> 
>> +{
>> +	unsigned long hva, gfn = gpa >> PAGE_SHIFT;
>> +	struct kvm_memory_slot *memslot;
>> +
>> +	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
>> +	if (!memslot)
>> +		return ERROR_ADDR;
>> +
>> +	/*
>> +	 * Convert gfn to hva preserving flags and an offset
>> +	 * within a system page
>> +	 */
>> +	hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa & ~PAGE_MASK);
>> +	return hva;
>> +}
>> +
>> +long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
>> +		unsigned long liobn, unsigned long ioba,
>> +		unsigned long tce)
>> +{
>> +	struct kvmppc_spapr_tce_table *tt;
>> +
>> +	tt = kvmppc_find_tce_table(vcpu, liobn);
>> +	/* Didn't find the liobn, put it to userspace */
>> +	if (!tt)
>> +		return H_TOO_HARD;
>> +
>> +	/* Emulated IO */
>> +	return kvmppc_emulated_h_put_tce(tt, ioba, tce);
>> +}
>> +
>> +long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>> +		unsigned long liobn, unsigned long ioba,
>> +		unsigned long tce_list, unsigned long npages)
>> +{
>> +	struct kvmppc_spapr_tce_table *tt;
>> +	long i;
>> +	unsigned long tces;
>> +
>> +	/* The whole table addressed by tce_list resides in 4K page */
>> +	if (npages > 512)
>> +		return H_PARAMETER;
> 
> So, that doesn't actually verify what the comment says it does - only
> that the list is < 4K in total.  You need to check the alignment of
> tce_list as well.



The spec says to return H_PARAMETER if >512. I.e. it takes just 1 page and
I do not need to bother if pages may not lay continuously in RAM (matters
for real mode).

/*
 * As the spec is saying that maximum possible number of TCEs is 512,
 * the whole TCE page is no more than 4K. Therefore we do not have to
 * worry if pages do not lie continuously in the RAM
 */
Any better?...


>> +
>> +	tt = kvmppc_find_tce_table(vcpu, liobn);
>> +	/* Didn't find the liobn, put it to userspace */
>> +	if (!tt)
>> +		return H_TOO_HARD;
>> +
>> +	tces = get_virt_address(vcpu, tce_list);
>> +	if (tces == ERROR_ADDR)
>> +		return H_TOO_HARD;
>> +
>> +	/* Emulated IO */
> 
> This comment doesn't seem to have any bearing on the test which
> follows it.
> 
>> +	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
>> +		return H_PARAMETER;
>> +
>> +	for (i = 0; i < npages; ++i) {
>> +		unsigned long tce;
>> +		unsigned long ptce = tces + i * sizeof(unsigned long);
>> +
>> +		if (get_user(tce, (unsigned long __user *)ptce))
>> +			break;
>> +
>> +		if (kvmppc_emulated_h_put_tce(tt,
>> +				ioba + (i << IOMMU_PAGE_SHIFT),	tce))
>> +			break;
>> +	}
>> +	if (i == npages)
>> +		return H_SUCCESS;
>> +
>> +	/* Failed, do cleanup */
>> +	do {
>> +		--i;
>> +		kvmppc_emulated_h_put_tce(tt, ioba + (i << IOMMU_PAGE_SHIFT),
>> +				0);
>> +	} while (i);
> 
> Hrm, so, actually PAPR specifies that this hcall is supposed to first
> copy the given tces to hypervisor memory, then translate (and
> validate) them all, and only then touch the actual TCE table.  Rather
> more complicated to do, but I guess we should - that would get rid of
> the need for this partial cleanup in the failure case.


So we have to kmalloc(4K) on every PUT_INDIRECT. Or we can put tces on the
stack (4K is quire a lot for the kernel, no)?



>> +
>> +	return H_PARAMETER;
>> +}
>> +
>> +long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
>> +		unsigned long liobn, unsigned long ioba,
>> +		unsigned long tce_value, unsigned long npages)
>> +{
>> +	struct kvmppc_spapr_tce_table *tt;
>> +	long i;
>> +
>> +	tt = kvmppc_find_tce_table(vcpu, liobn);
>> +	/* Didn't find the liobn, put it to userspace */
>> +	if (!tt)
>> +		return H_TOO_HARD;
>> +
>> +	/* Emulated IO */
>> +	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
>> +		return H_PARAMETER;
>> +
>> +	for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
>> +		kvmppc_emulated_h_put_tce(tt, ioba, tce_value);
>> +
>> +	return H_SUCCESS;
>> +}
>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
>> index 30c2f3b..55fdf7a 100644
>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
>> @@ -14,6 +14,7 @@
>>   *
>>   * Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
>>   * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com>
>> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com>
>>   */
>>  
>>  #include <linux/types.h>
>> @@ -35,42 +36,214 @@
>>  #include <asm/ppc-opcode.h>
>>  #include <asm/kvm_host.h>
>>  #include <asm/udbg.h>
>> +#include <asm/iommu.h>
>> +#include <asm/tce.h>
>>  
>>  #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
>> +#define ERROR_ADDR      (~(unsigned long)0x0)
>>  
>> -/* WARNING: This will be called in real-mode on HV KVM and virtual
>> - *          mode on PR KVM
>> +/*
>> + * Finds a TCE table descriptor by LIOBN.
>>   */
>> +struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm_vcpu *vcpu,
>> +		unsigned long liobn)
>> +{
>> +	struct kvmppc_spapr_tce_table *tt;
>> +
>> +	list_for_each_entry(tt, &vcpu->kvm->arch.spapr_tce_tables, list) {
>> +		if (tt->liobn == liobn)
>> +			return tt;
>> +	}
>> +
>> +	return NULL;
>> +}
>> +EXPORT_SYMBOL_GPL(kvmppc_find_tce_table);
>> +
>> +/*
>> + * kvmppc_emulated_h_put_tce() handles TCE requests for devices emulated
>> + * by QEMU. It puts guest TCE values into the table and expects
>> + * the QEMU to convert them later in the QEMU device implementation.
>> + * Works in both real and virtual modes.
>> + */
>> +long kvmppc_emulated_h_put_tce(struct kvmppc_spapr_tce_table *tt,
>> +		unsigned long ioba, unsigned long tce)
>> +{
>> +	unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
>> +	struct page *page;
>> +	u64 *tbl;
>> +
>> +	/* udbg_printf("H_PUT_TCE: liobn 0x%lx => tt=%p  window_size=0x%x\n", */
>> +	/*	    liobn, tt, tt->window_size); */
>> +	if (ioba >= tt->window_size) {
>> +		/* pr_err("%s failed on ioba=%lx\n", __func__, ioba); */
>> +		return H_PARAMETER;
>> +	}
>> +	/*
>> +	 * Note on the use of page_address() in real mode,
>> +	 *
>> +	 * It is safe to use page_address() in real mode on ppc64 because
>> +	 * page_address() is always defined as lowmem_page_address()
>> +	 * which returns __va(PFN_PHYS(page_to_pfn(page))) which is arithmetial
>> +	 * operation and does not access page struct.
>> +	 *
>> +	 * Theoretically page_address() could be defined different
>> +	 * but either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL
>> +	 * should be enabled.
>> +	 * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64,
>> +	 * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only
>> +	 * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP
>> +	 * is not expected to be enabled on ppc32, page_address()
>> +	 * is safe for ppc32 as well.
>> +	 */
>> +#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
>> +#error TODO: fix to avoid page_address() here
>> +#endif
>> +	page = tt->pages[idx / TCES_PER_PAGE];
>> +	tbl = (u64 *)page_address(page);
>> +
>> +	/*
>> +	 * Validate TCE address.
>> +	 * At the moment only flags are validated
>> +	 * as other check will significantly slow down
>> +	 * or can make it even impossible to handle TCE requests
>> +	 * in real mode.
>> +	 */
>> +	if (tce & ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ))
>> +		return H_PARAMETER;
>> +
>> +	/* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
>> +	tbl[idx % TCES_PER_PAGE] = tce;
>> +
>> +	return H_SUCCESS;
>> +}
>> +EXPORT_SYMBOL_GPL(kvmppc_emulated_h_put_tce);
>> +
>> +#ifdef CONFIG_KVM_BOOK3S_64_HV
>> +/*
>> + * Converts guest physical address into host real address.
>> + * Also returns pte and page size if the page is present in page table.
>> + */
>> +static unsigned long get_real_address(struct kvm_vcpu *vcpu,
>> +		unsigned long gpa, bool writing,
>> +		pte_t *ptep, unsigned long *pg_sizep)
> 
> The only caller doesn't use the ptep and pg_sizep pointers, so there's
> no point implementing them.


"KVM: PPC: Add support for IOMMU in-kernel handling" will. Is there much
sense in splitting this quite small function between patches?




-- 
Alexey

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/6] KVM: PPC: Add support for multiple-TCE hcalls
@ 2013-05-10  7:53       ` Alexey Kardashevskiy
  0 siblings, 0 replies; 44+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-10  7:53 UTC (permalink / raw)
  To: David Gibson
  Cc: linuxppc-dev, kvm, Alexander Graf, kvm-ppc, Alex Williamson,
	Paul Mackerras, Joerg Roedel

On 05/10/2013 04:51 PM, David Gibson wrote:
> On Mon, May 06, 2013 at 05:25:53PM +1000, Alexey Kardashevskiy wrote:
>> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
>> H_STUFF_TCE hypercalls for QEMU emulated devices such as virtio
>> devices or emulated PCI.  These calls allow adding multiple entries
>> (up to 512) into the TCE table in one call which saves time on
>> transition to/from real mode.
>>
>> This adds a guest physical to host real address converter
>> and calls the existing H_PUT_TCE handler. The converting function
>> is going to be fully utilized by upcoming VFIO supporting patches.
>>
>> This also implements the KVM_CAP_PPC_MULTITCE capability,
>> so in order to support the functionality of this patch, QEMU
>> needs to query for this capability and set the "hcall-multi-tce"
>> hypertas property only if the capability is present, otherwise
>> there will be serious performance degradation.
> 
> 
> Hrm.  Clearly I didn't read this carefully enough before.  There are
> some problems here.

?


> [snip]
>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
>> index 72ffc89..643ac1e 100644
>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>> @@ -14,6 +14,7 @@
>>   *
>>   * Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
>>   * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com>
>> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com>
>>   */
>>  
>>  #include <linux/types.h>
>> @@ -36,9 +37,14 @@
>>  #include <asm/ppc-opcode.h>
>>  #include <asm/kvm_host.h>
>>  #include <asm/udbg.h>
>> +#include <asm/iommu.h>
>>  
>>  #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
>> +#define ERROR_ADDR      (~(unsigned long)0x0)
>>  
>> +/*
>> + * TCE tables handlers.
>> + */
>>  static long kvmppc_stt_npages(unsigned long window_size)
>>  {
>>  	return ALIGN((window_size >> SPAPR_TCE_SHIFT)
>> @@ -148,3 +154,111 @@ fail:
>>  	}
>>  	return ret;
>>  }
>> +
>> +/*
>> + * Virtual mode handling of IOMMU map/unmap.
>> + */
>> +/* Converts guest physical address into host virtual */
>> +static unsigned long get_virt_address(struct kvm_vcpu *vcpu,
>> +		unsigned long gpa)
> 
> This should probably return a void * rather than an unsigned long.
> Well, actually a void __user *.
> 
>> +{
>> +	unsigned long hva, gfn = gpa >> PAGE_SHIFT;
>> +	struct kvm_memory_slot *memslot;
>> +
>> +	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
>> +	if (!memslot)
>> +		return ERROR_ADDR;
>> +
>> +	/*
>> +	 * Convert gfn to hva preserving flags and an offset
>> +	 * within a system page
>> +	 */
>> +	hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa & ~PAGE_MASK);
>> +	return hva;
>> +}
>> +
>> +long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
>> +		unsigned long liobn, unsigned long ioba,
>> +		unsigned long tce)
>> +{
>> +	struct kvmppc_spapr_tce_table *tt;
>> +
>> +	tt = kvmppc_find_tce_table(vcpu, liobn);
>> +	/* Didn't find the liobn, put it to userspace */
>> +	if (!tt)
>> +		return H_TOO_HARD;
>> +
>> +	/* Emulated IO */
>> +	return kvmppc_emulated_h_put_tce(tt, ioba, tce);
>> +}
>> +
>> +long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>> +		unsigned long liobn, unsigned long ioba,
>> +		unsigned long tce_list, unsigned long npages)
>> +{
>> +	struct kvmppc_spapr_tce_table *tt;
>> +	long i;
>> +	unsigned long tces;
>> +
>> +	/* The whole table addressed by tce_list resides in 4K page */
>> +	if (npages > 512)
>> +		return H_PARAMETER;
> 
> So, that doesn't actually verify what the comment says it does - only
> that the list is < 4K in total.  You need to check the alignment of
> tce_list as well.



The spec says to return H_PARAMETER if >512. I.e. it takes just 1 page and
I do not need to bother if pages may not lay continuously in RAM (matters
for real mode).

/*
 * As the spec is saying that maximum possible number of TCEs is 512,
 * the whole TCE page is no more than 4K. Therefore we do not have to
 * worry if pages do not lie continuously in the RAM
 */
Any better?...


>> +
>> +	tt = kvmppc_find_tce_table(vcpu, liobn);
>> +	/* Didn't find the liobn, put it to userspace */
>> +	if (!tt)
>> +		return H_TOO_HARD;
>> +
>> +	tces = get_virt_address(vcpu, tce_list);
>> +	if (tces = ERROR_ADDR)
>> +		return H_TOO_HARD;
>> +
>> +	/* Emulated IO */
> 
> This comment doesn't seem to have any bearing on the test which
> follows it.
> 
>> +	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
>> +		return H_PARAMETER;
>> +
>> +	for (i = 0; i < npages; ++i) {
>> +		unsigned long tce;
>> +		unsigned long ptce = tces + i * sizeof(unsigned long);
>> +
>> +		if (get_user(tce, (unsigned long __user *)ptce))
>> +			break;
>> +
>> +		if (kvmppc_emulated_h_put_tce(tt,
>> +				ioba + (i << IOMMU_PAGE_SHIFT),	tce))
>> +			break;
>> +	}
>> +	if (i = npages)
>> +		return H_SUCCESS;
>> +
>> +	/* Failed, do cleanup */
>> +	do {
>> +		--i;
>> +		kvmppc_emulated_h_put_tce(tt, ioba + (i << IOMMU_PAGE_SHIFT),
>> +				0);
>> +	} while (i);
> 
> Hrm, so, actually PAPR specifies that this hcall is supposed to first
> copy the given tces to hypervisor memory, then translate (and
> validate) them all, and only then touch the actual TCE table.  Rather
> more complicated to do, but I guess we should - that would get rid of
> the need for this partial cleanup in the failure case.


So we have to kmalloc(4K) on every PUT_INDIRECT. Or we can put tces on the
stack (4K is quire a lot for the kernel, no)?



>> +
>> +	return H_PARAMETER;
>> +}
>> +
>> +long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
>> +		unsigned long liobn, unsigned long ioba,
>> +		unsigned long tce_value, unsigned long npages)
>> +{
>> +	struct kvmppc_spapr_tce_table *tt;
>> +	long i;
>> +
>> +	tt = kvmppc_find_tce_table(vcpu, liobn);
>> +	/* Didn't find the liobn, put it to userspace */
>> +	if (!tt)
>> +		return H_TOO_HARD;
>> +
>> +	/* Emulated IO */
>> +	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
>> +		return H_PARAMETER;
>> +
>> +	for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
>> +		kvmppc_emulated_h_put_tce(tt, ioba, tce_value);
>> +
>> +	return H_SUCCESS;
>> +}
>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
>> index 30c2f3b..55fdf7a 100644
>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
>> @@ -14,6 +14,7 @@
>>   *
>>   * Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
>>   * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com>
>> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com>
>>   */
>>  
>>  #include <linux/types.h>
>> @@ -35,42 +36,214 @@
>>  #include <asm/ppc-opcode.h>
>>  #include <asm/kvm_host.h>
>>  #include <asm/udbg.h>
>> +#include <asm/iommu.h>
>> +#include <asm/tce.h>
>>  
>>  #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
>> +#define ERROR_ADDR      (~(unsigned long)0x0)
>>  
>> -/* WARNING: This will be called in real-mode on HV KVM and virtual
>> - *          mode on PR KVM
>> +/*
>> + * Finds a TCE table descriptor by LIOBN.
>>   */
>> +struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm_vcpu *vcpu,
>> +		unsigned long liobn)
>> +{
>> +	struct kvmppc_spapr_tce_table *tt;
>> +
>> +	list_for_each_entry(tt, &vcpu->kvm->arch.spapr_tce_tables, list) {
>> +		if (tt->liobn = liobn)
>> +			return tt;
>> +	}
>> +
>> +	return NULL;
>> +}
>> +EXPORT_SYMBOL_GPL(kvmppc_find_tce_table);
>> +
>> +/*
>> + * kvmppc_emulated_h_put_tce() handles TCE requests for devices emulated
>> + * by QEMU. It puts guest TCE values into the table and expects
>> + * the QEMU to convert them later in the QEMU device implementation.
>> + * Works in both real and virtual modes.
>> + */
>> +long kvmppc_emulated_h_put_tce(struct kvmppc_spapr_tce_table *tt,
>> +		unsigned long ioba, unsigned long tce)
>> +{
>> +	unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
>> +	struct page *page;
>> +	u64 *tbl;
>> +
>> +	/* udbg_printf("H_PUT_TCE: liobn 0x%lx => tt=%p  window_size=0x%x\n", */
>> +	/*	    liobn, tt, tt->window_size); */
>> +	if (ioba >= tt->window_size) {
>> +		/* pr_err("%s failed on ioba=%lx\n", __func__, ioba); */
>> +		return H_PARAMETER;
>> +	}
>> +	/*
>> +	 * Note on the use of page_address() in real mode,
>> +	 *
>> +	 * It is safe to use page_address() in real mode on ppc64 because
>> +	 * page_address() is always defined as lowmem_page_address()
>> +	 * which returns __va(PFN_PHYS(page_to_pfn(page))) which is arithmetial
>> +	 * operation and does not access page struct.
>> +	 *
>> +	 * Theoretically page_address() could be defined different
>> +	 * but either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL
>> +	 * should be enabled.
>> +	 * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64,
>> +	 * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only
>> +	 * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP
>> +	 * is not expected to be enabled on ppc32, page_address()
>> +	 * is safe for ppc32 as well.
>> +	 */
>> +#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
>> +#error TODO: fix to avoid page_address() here
>> +#endif
>> +	page = tt->pages[idx / TCES_PER_PAGE];
>> +	tbl = (u64 *)page_address(page);
>> +
>> +	/*
>> +	 * Validate TCE address.
>> +	 * At the moment only flags are validated
>> +	 * as other check will significantly slow down
>> +	 * or can make it even impossible to handle TCE requests
>> +	 * in real mode.
>> +	 */
>> +	if (tce & ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ))
>> +		return H_PARAMETER;
>> +
>> +	/* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
>> +	tbl[idx % TCES_PER_PAGE] = tce;
>> +
>> +	return H_SUCCESS;
>> +}
>> +EXPORT_SYMBOL_GPL(kvmppc_emulated_h_put_tce);
>> +
>> +#ifdef CONFIG_KVM_BOOK3S_64_HV
>> +/*
>> + * Converts guest physical address into host real address.
>> + * Also returns pte and page size if the page is present in page table.
>> + */
>> +static unsigned long get_real_address(struct kvm_vcpu *vcpu,
>> +		unsigned long gpa, bool writing,
>> +		pte_t *ptep, unsigned long *pg_sizep)
> 
> The only caller doesn't use the ptep and pg_sizep pointers, so there's
> no point implementing them.


"KVM: PPC: Add support for IOMMU in-kernel handling" will. Is there much
sense in splitting this quite small function between patches?




-- 
Alexey

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2013-05-10  7:54 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-05-06  7:25 [PATCH 0/6] KVM: PPC: IOMMU in-kernel handling Alexey Kardashevskiy
2013-05-06  7:25 ` Alexey Kardashevskiy
2013-05-06  7:25 ` Alexey Kardashevskiy
2013-05-06  7:25 ` [PATCH 1/6] KVM: PPC: Make lookup_linux_pte public Alexey Kardashevskiy
2013-05-06  7:25   ` Alexey Kardashevskiy
2013-05-06  7:25   ` Alexey Kardashevskiy
2013-05-06  7:25 ` [PATCH 2/6] KVM: PPC: Add support for multiple-TCE hcalls Alexey Kardashevskiy
2013-05-06  7:25   ` Alexey Kardashevskiy
2013-05-06  7:25   ` Alexey Kardashevskiy
2013-05-07  5:05   ` David Gibson
2013-05-07  5:05     ` David Gibson
2013-05-07  5:05     ` David Gibson
2013-05-10  6:51   ` David Gibson
2013-05-10  6:51     ` David Gibson
2013-05-10  6:51     ` David Gibson
2013-05-10  7:53     ` Alexey Kardashevskiy
2013-05-10  7:53       ` Alexey Kardashevskiy
2013-05-10  7:53       ` Alexey Kardashevskiy
2013-05-06  7:25 ` [PATCH 3/6] powerpc: Prepare to support kernel handling of IOMMU map/unmap Alexey Kardashevskiy
2013-05-06  7:25   ` Alexey Kardashevskiy
2013-05-06  7:25   ` Alexey Kardashevskiy
2013-05-06  7:25 ` [PATCH 4/6] iommu: Add a function to find an iommu group by id Alexey Kardashevskiy
2013-05-06  7:25   ` Alexey Kardashevskiy
2013-05-06  7:25   ` Alexey Kardashevskiy
2013-05-06  7:25 ` [PATCH 5/6] KVM: PPC: Add support for IOMMU in-kernel handling Alexey Kardashevskiy
2013-05-06  7:25   ` Alexey Kardashevskiy
2013-05-06  7:25   ` Alexey Kardashevskiy
2013-05-07  5:29   ` David Gibson
2013-05-07  5:29     ` David Gibson
2013-05-07  5:29     ` David Gibson
2013-05-07  5:51     ` Alexey Kardashevskiy
2013-05-07  5:51       ` Alexey Kardashevskiy
2013-05-07  5:51       ` Alexey Kardashevskiy
2013-05-07  6:02       ` David Gibson
2013-05-07  6:02         ` David Gibson
2013-05-07  6:02         ` David Gibson
2013-05-07  6:27         ` Alexey Kardashevskiy
2013-05-07  6:27           ` Alexey Kardashevskiy
2013-05-07  6:27           ` Alexey Kardashevskiy
2013-05-07  6:54           ` David Gibson
2013-05-07  6:54             ` David Gibson
2013-05-06  7:25 ` [PATCH 6/6] KVM: PPC: Add hugepage " Alexey Kardashevskiy
2013-05-06  7:25   ` Alexey Kardashevskiy
2013-05-06  7:25   ` Alexey Kardashevskiy

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.