* [PATCH 0/3] KVM/ARM Huge pages support
@ 2013-08-09  4:07 ` Christoffer Dall
  0 siblings, 0 replies; 36+ messages in thread
From: Christoffer Dall @ 2013-08-09  4:07 UTC (permalink / raw)
  To: kvmarm; +Cc: kvm, linux-arm-kernel, linaro-kernel, patches, Christoffer Dall

This small series adds support for Transparent Huge Pages and hugetlbfs
pages for KVM on the arm and arm64 architectures.

Measurements have shown that using huge pages for stage-2 mappings
provides a performance increase of more than 15% for some workloads.
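
For illustration only (not part of this series): user space can take
advantage of the hugetlbfs path by backing guest memory with huge pages,
for example anonymous MAP_HUGETLB memory registered as a memslot.  The
helper below is a hypothetical sketch; error handling and the rest of the
VM setup are omitted.

  #include <sys/mman.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /* Back one guest memory slot with huge pages so that stage-2 faults
   * can be served with PMD-sized mappings. */
  static int back_slot_with_hugepages(int vm_fd, __u32 slot,
                                      __u64 gpa, __u64 size)
  {
          struct kvm_userspace_memory_region region;
          void *mem;

          mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
          if (mem == MAP_FAILED)
                  return -1;

          region = (struct kvm_userspace_memory_region) {
                  .slot            = slot,
                  .guest_phys_addr = gpa,
                  .memory_size     = size,
                  .userspace_addr  = (unsigned long)mem,
          };
          return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
  }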

The patch series applies to kvm/next, but depends on the patch sent
earlier:

"ARM: KVM: Fix unaligned unmap_range leak"

Christoffer Dall (3):
  KVM: Move gfn_to_index to x86 specific code
  KVM: ARM: Get rid of KVM_HPAGE_XXX defines
  KVM: ARM: Transparent huge pages and hugetlbfs support

 arch/arm/include/asm/kvm_host.h  |    5 -
 arch/arm/include/asm/kvm_mmu.h   |   17 +++-
 arch/arm/kvm/mmu.c               |  200 ++++++++++++++++++++++++++++++++------
 arch/arm64/include/asm/kvm_mmu.h |   12 ++-
 arch/x86/include/asm/kvm_host.h  |    7 ++
 include/linux/kvm_host.h         |    7 --
 6 files changed, 201 insertions(+), 47 deletions(-)

-- 
1.7.10.4


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 1/3] KVM: Move gfn_to_index to x86 specific code
  2013-08-09  4:07 ` Christoffer Dall
@ 2013-08-09  4:07   ` Christoffer Dall
  -1 siblings, 0 replies; 36+ messages in thread
From: Christoffer Dall @ 2013-08-09  4:07 UTC (permalink / raw)
  To: kvmarm; +Cc: kvm, linux-arm-kernel, linaro-kernel, patches, Christoffer Dall

The gfn_to_index function relies on huge page defines which may either
not make sense on systems that don't support huge pages or be defined
in an inconvenient way for other architectures.  Since this is
x86-specific, move the function to arch/x86/include/asm/kvm_host.h.
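
As an illustrative, hedged example (assuming x86's values, where
KVM_HPAGE_GFN_SHIFT(level) is 0, 9 and 18 for the 4K, 2M and 1G levels):

  /*
   * For a memslot starting at base_gfn = 0x1000:
   *   gfn_to_index(0x1234, 0x1000, 1) == 0x234   4K granule
   *   gfn_to_index(0x1234, 0x1000, 2) == 0x1     2M granule (0x9 - 0x8)
   *   gfn_to_index(0x1234, 0x1000, 3) == 0x0     1G granule
   */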

Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
---
 arch/x86/include/asm/kvm_host.h |    7 +++++++
 include/linux/kvm_host.h        |    7 -------
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f87f7fc..fca7f2e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -79,6 +79,13 @@
 #define KVM_HPAGE_MASK(x)	(~(KVM_HPAGE_SIZE(x) - 1))
 #define KVM_PAGES_PER_HPAGE(x)	(KVM_HPAGE_SIZE(x) / PAGE_SIZE)
 
+static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level)
+{
+	/* KVM_HPAGE_GFN_SHIFT(PT_PAGE_TABLE_LEVEL) must be 0. */
+	return (gfn >> KVM_HPAGE_GFN_SHIFT(level)) -
+		(base_gfn >> KVM_HPAGE_GFN_SHIFT(level));
+}
+
 #define SELECTOR_TI_MASK (1 << 2)
 #define SELECTOR_RPL_MASK 0x03
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a63d83e..635e098 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -828,13 +828,6 @@ static inline int memslot_id(struct kvm *kvm, gfn_t gfn)
 	return gfn_to_memslot(kvm, gfn)->id;
 }
 
-static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level)
-{
-	/* KVM_HPAGE_GFN_SHIFT(PT_PAGE_TABLE_LEVEL) must be 0. */
-	return (gfn >> KVM_HPAGE_GFN_SHIFT(level)) -
-		(base_gfn >> KVM_HPAGE_GFN_SHIFT(level));
-}
-
 static inline gfn_t
 hva_to_gfn_memslot(unsigned long hva, struct kvm_memory_slot *slot)
 {
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 2/3] KVM: ARM: Get rid of KVM_HPAGE_XXX defines
  2013-08-09  4:07 ` Christoffer Dall
@ 2013-08-09  4:07   ` Christoffer Dall
  -1 siblings, 0 replies; 36+ messages in thread
From: Christoffer Dall @ 2013-08-09  4:07 UTC (permalink / raw)
  To: kvmarm; +Cc: Christoffer Dall, linaro-kernel, linux-arm-kernel, kvm, patches

The KVM_HPAGE_* defines are a little artificial on ARM, since the huge
page size is statically defined at compile time and there is only a
single huge page size.

Now that the main kvm code relying on these defines has been moved to
the x86-specific part of the world, we can get rid of them.

Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
---
 arch/arm/include/asm/kvm_host.h |    5 -----
 1 file changed, 5 deletions(-)

diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index 7d22517..e45a74b 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -38,11 +38,6 @@
 
 #define KVM_VCPU_MAX_FEATURES 1
 
-/* We don't currently support large pages. */
-#define KVM_HPAGE_GFN_SHIFT(x)	0
-#define KVM_NR_PAGE_SIZES	1
-#define KVM_PAGES_PER_HPAGE(x)	(1UL<<31)
-
 #include <kvm/arm_vgic.h>
 
 struct kvm_vcpu;
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 3/3] KVM: ARM: Transparent huge pages and hugetlbfs support
  2013-08-09  4:07 ` Christoffer Dall
@ 2013-08-09  4:07   ` Christoffer Dall
  -1 siblings, 0 replies; 36+ messages in thread
From: Christoffer Dall @ 2013-08-09  4:07 UTC (permalink / raw)
  To: kvmarm
  Cc: kvm, linux-arm-kernel, linaro-kernel, patches, Christoffer Dall,
	Christoffer Dall

From: Christoffer Dall <cdall@cs.columbia.edu>

Support transparent huge pages in KVM/ARM 32-bit and 64-bit.  The whole
transparent_hugepage_adjust stuff is far from pretty, but this is how
it's solved on x86 so we duplicate their logic.  This should be shared
across architectures if possible (like many other things), but can
always be changed down the road.
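
A worked example of the adjustment, with illustrative numbers only and
assuming 4K pages (so PMD_SIZE is 2M and the pfn/gfn mask is 0x1ff):

  /*
   * fault_ipa = 0x40123000 (gfn 0x40123) backed by a THP tail page with
   * pfn 0x89323: (gfn & 0x1ff) == (pfn & 0x1ff) == 0x123, so both are
   * rounded down and the fault is served by one PMD mapping
   * IPA 0x40000000 -> pfn 0x89200.
   */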

The pud_huge checking on the unmap path may feel a bit silly, as
pud_huge() is always defined to return false here, but the compiler
should be smart enough to optimize the check away.

Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
---
 arch/arm/include/asm/kvm_mmu.h   |   17 +++-
 arch/arm/kvm/mmu.c               |  200 ++++++++++++++++++++++++++++++++------
 arch/arm64/include/asm/kvm_mmu.h |   12 ++-
 3 files changed, 194 insertions(+), 35 deletions(-)

diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index 472ac70..ff3ff67 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -62,6 +62,12 @@ phys_addr_t kvm_get_idmap_vector(void);
 int kvm_mmu_init(void);
 void kvm_clear_hyp_idmap(void);
 
+static inline void kvm_set_pmd(pmd_t *pmd, pmd_t new_pmd)
+{
+	*pmd = new_pmd;
+	flush_pmd_entry(pmd);
+}
+
 static inline void kvm_set_pte(pte_t *pte, pte_t new_pte)
 {
 	pte_val(*pte) = new_pte;
@@ -103,9 +109,15 @@ static inline void kvm_set_s2pte_writable(pte_t *pte)
 	pte_val(*pte) |= L_PTE_S2_RDWR;
 }
 
+static inline void kvm_set_s2pmd_writable(pmd_t *pmd)
+{
+	pmd_val(*pmd) |= L_PTE_S2_RDWR;
+}
+
 struct kvm;
 
-static inline void coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
+static inline void coherent_icache_guest_page(struct kvm *kvm, hva_t hva,
+					      unsigned long size)
 {
 	/*
 	 * If we are going to insert an instruction page and the icache is
@@ -120,8 +132,7 @@ static inline void coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
 	 * need any kind of flushing (DDI 0406C.b - Page B3-1392).
 	 */
 	if (icache_is_pipt()) {
-		unsigned long hva = gfn_to_hva(kvm, gfn);
-		__cpuc_coherent_user_range(hva, hva + PAGE_SIZE);
+		__cpuc_coherent_user_range(hva, hva + size);
 	} else if (!icache_is_vivt_asid_tagged()) {
 		/* any kind of VIPT cache */
 		__flush_icache_all();
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 0988d9e..26ced77 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -19,6 +19,7 @@
 #include <linux/mman.h>
 #include <linux/kvm_host.h>
 #include <linux/io.h>
+#include <linux/hugetlb.h>
 #include <trace/events/kvm.h>
 #include <asm/pgalloc.h>
 #include <asm/cacheflush.h>
@@ -41,6 +42,8 @@ static unsigned long hyp_idmap_start;
 static unsigned long hyp_idmap_end;
 static phys_addr_t hyp_idmap_vector;
 
+#define kvm_pmd_huge(_x)	(pmd_huge(_x) || pmd_trans_huge(_x))
+
 static void kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa)
 {
 	/*
@@ -93,19 +96,27 @@ static bool page_empty(void *ptr)
 
 static void clear_pud_entry(struct kvm *kvm, pud_t *pud, phys_addr_t addr)
 {
-	pmd_t *pmd_table = pmd_offset(pud, 0);
-	pud_clear(pud);
-	kvm_tlb_flush_vmid_ipa(kvm, addr);
-	pmd_free(NULL, pmd_table);
+	if (pud_huge(*pud)) {
+		pud_clear(pud);
+	} else {
+		pmd_t *pmd_table = pmd_offset(pud, 0);
+		pud_clear(pud);
+		kvm_tlb_flush_vmid_ipa(kvm, addr);
+		pmd_free(NULL, pmd_table);
+	}
 	put_page(virt_to_page(pud));
 }
 
 static void clear_pmd_entry(struct kvm *kvm, pmd_t *pmd, phys_addr_t addr)
 {
-	pte_t *pte_table = pte_offset_kernel(pmd, 0);
-	pmd_clear(pmd);
-	kvm_tlb_flush_vmid_ipa(kvm, addr);
-	pte_free_kernel(NULL, pte_table);
+	if (kvm_pmd_huge(*pmd)) {
+		pmd_clear(pmd);
+	} else {
+		pte_t *pte_table = pte_offset_kernel(pmd, 0);
+		pmd_clear(pmd);
+		kvm_tlb_flush_vmid_ipa(kvm, addr);
+		pte_free_kernel(NULL, pte_table);
+	}
 	put_page(virt_to_page(pmd));
 }
 
@@ -136,12 +147,37 @@ static void unmap_range(struct kvm *kvm, pgd_t *pgdp,
 			continue;
 		}
 
+		if (pud_huge(*pud)) {
+			/*
+			 * If we are dealing with a huge pud, just clear it and
+			 * move on.
+			 */
+			clear_pud_entry(kvm, pud, addr);
+			addr = pud_addr_end(addr, end);
+			continue;
+		}
+
 		pmd = pmd_offset(pud, addr);
 		if (pmd_none(*pmd)) {
 			addr = pmd_addr_end(addr, end);
 			continue;
 		}
 
+		if (kvm_pmd_huge(*pmd)) {
+			/*
+			 * If we are dealing with a huge pmd, just clear it and
+			 * walk back up the ladder.
+			 */
+			clear_pmd_entry(kvm, pmd, addr);
+			next = pmd_addr_end(addr, end);
+			if (page_empty(pmd) && !page_empty(pud)) {
+				clear_pud_entry(kvm, pud, addr);
+				next = pud_addr_end(addr, end);
+			}
+			addr = next;
+			continue;
+		}
+
 		pte = pte_offset_kernel(pmd, addr);
 		clear_pte_entry(kvm, pte, addr);
 		next = addr + PAGE_SIZE;
@@ -420,6 +456,46 @@ void kvm_free_stage2_pgd(struct kvm *kvm)
 	kvm->arch.pgd = NULL;
 }
 
+/* Set a huge page pmd entry */
+static int stage2_set_pmd(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
+			  phys_addr_t addr, const pmd_t *new_pmd)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd, old_pmd;
+
+	/* Create stage-2 page table mapping - Level 1 */
+	pgd = kvm->arch.pgd + pgd_index(addr);
+	pud = pud_offset(pgd, addr);
+	if (pud_none(*pud)) {
+		if (!cache)
+			return 0; /* ignore calls from kvm_set_spte_hva */
+		pmd = mmu_memory_cache_alloc(cache);
+		pud_populate(NULL, pud, pmd);
+		get_page(virt_to_page(pud));
+	}
+
+	pmd = pmd_offset(pud, addr);
+
+	/*
+	 * Mapping in huge pages should only happen through a fault.  If a
+	 * page is merged into a transparent huge page, the individual
+	 * subpages of that huge page should be unmapped through MMU
+	 * notifiers before we get here.
+	 *
+	 * Merging of CompoundPages is not supported; instead, they should
+	 * be split first, unmapped, merged, and mapped back in on demand.
+	 */
+	VM_BUG_ON(pmd_present(*pmd) && pmd_pfn(*pmd) != pmd_pfn(*new_pmd));
+
+	old_pmd = *pmd;
+	kvm_set_pmd(pmd, *new_pmd);
+	if (pmd_present(old_pmd))
+		kvm_tlb_flush_vmid_ipa(kvm, addr);
+	else
+		get_page(virt_to_page(pmd));
+	return 0;
+}
 
 static int stage2_set_pte(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
 			  phys_addr_t addr, const pte_t *new_pte, bool iomap)
@@ -442,7 +518,7 @@ static int stage2_set_pte(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
 
 	pmd = pmd_offset(pud, addr);
 
-	/* Create 2nd stage page table mapping - Level 2 */
+	/* Create 2nd stage page mappings - Level 2 */
 	if (pmd_none(*pmd)) {
 		if (!cache)
 			return 0; /* ignore calls from kvm_set_spte_hva */
@@ -508,16 +584,53 @@ out:
 	return ret;
 }
 
+static bool transparent_hugepage_adjust(pfn_t *pfnp, phys_addr_t *ipap)
+{
+	pfn_t pfn = *pfnp;
+	gfn_t gfn = *ipap >> PAGE_SHIFT;
+
+	if (PageTransCompound(pfn_to_page(pfn))) {
+		unsigned long mask;
+		/*
+		 * mmu_notifier_retry was successful and we hold the
+		 * mmu_lock here, so the pmd can't become splitting
+		 * from under us, and in turn
+		 * __split_huge_page_refcount() can't run from under
+		 * us and we can safely transfer the refcount from
+		 * PG_tail to PG_head as we switch the pfn from tail to
+		 * head.
+		 */
+		mask = (PMD_SIZE / PAGE_SIZE) - 1;
+		VM_BUG_ON((gfn & mask) != (pfn & mask));
+		if (pfn & mask) {
+			gfn &= ~mask;
+			*ipap &= ~(PMD_SIZE - 1);
+			kvm_release_pfn_clean(pfn);
+			pfn &= ~mask;
+			kvm_get_pfn(pfn);
+			*pfnp = pfn;
+		}
+
+		return true;
+	}
+
+	return false;
+}
+
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
-			  gfn_t gfn, struct kvm_memory_slot *memslot,
+			  struct kvm_memory_slot *memslot,
 			  unsigned long fault_status)
 {
-	pte_t new_pte;
-	pfn_t pfn;
 	int ret;
-	bool write_fault, writable;
+	bool write_fault, writable, hugetlb = false, force_pte = false;
 	unsigned long mmu_seq;
+	gfn_t gfn = fault_ipa >> PAGE_SHIFT;
+	unsigned long hva = gfn_to_hva(vcpu->kvm, gfn);
+	struct kvm *kvm = vcpu->kvm;
 	struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
+	struct vm_area_struct *vma;
+	pfn_t pfn;
+	unsigned long psize;
 
 	write_fault = kvm_is_write_fault(kvm_vcpu_get_hsr(vcpu));
 	if (fault_status == FSC_PERM && !write_fault) {
@@ -525,6 +638,27 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		return -EFAULT;
 	}
 
+	/* Let's check if we will get back a huge page */
+	down_read(&current->mm->mmap_sem);
+	vma = find_vma_intersection(current->mm, hva, hva + 1);
+	if (is_vm_hugetlb_page(vma)) {
+		hugetlb = true;
+		hva &= PMD_MASK;
+		gfn = (fault_ipa & PMD_MASK) >> PAGE_SHIFT;
+		psize = PMD_SIZE;
+	} else {
+		psize = PAGE_SIZE;
+		if (vma->vm_start & ~PMD_MASK)
+			force_pte = true;
+	}
+	up_read(&current->mm->mmap_sem);
+
+	pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
+	if (is_error_pfn(pfn))
+		return -EFAULT;
+
+	coherent_icache_guest_page(kvm, hva, psize);
+
 	/* We need minimum second+third level pages */
 	ret = mmu_topup_memory_cache(memcache, 2, KVM_NR_MEM_OBJS);
 	if (ret)
@@ -542,26 +676,34 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	 */
 	smp_rmb();
 
-	pfn = gfn_to_pfn_prot(vcpu->kvm, gfn, write_fault, &writable);
-	if (is_error_pfn(pfn))
-		return -EFAULT;
-
-	new_pte = pfn_pte(pfn, PAGE_S2);
-	coherent_icache_guest_page(vcpu->kvm, gfn);
-
-	spin_lock(&vcpu->kvm->mmu_lock);
-	if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
+	spin_lock(&kvm->mmu_lock);
+	if (mmu_notifier_retry(kvm, mmu_seq))
 		goto out_unlock;
-	if (writable) {
-		kvm_set_s2pte_writable(&new_pte);
-		kvm_set_pfn_dirty(pfn);
+	if (!hugetlb && !force_pte)
+		hugetlb = transparent_hugepage_adjust(&pfn, &fault_ipa);
+
+	if (hugetlb) {
+		pmd_t new_pmd = pfn_pmd(pfn, PAGE_S2);
+		new_pmd = pmd_mkhuge(new_pmd);
+		if (writable) {
+			kvm_set_s2pmd_writable(&new_pmd);
+			kvm_set_pfn_dirty(pfn);
+		}
+		ret = stage2_set_pmd(kvm, memcache, fault_ipa, &new_pmd);
+	} else {
+		pte_t new_pte = pfn_pte(pfn, PAGE_S2);
+		if (writable) {
+			kvm_set_s2pte_writable(&new_pte);
+			kvm_set_pfn_dirty(pfn);
+		}
+		ret = stage2_set_pte(kvm, memcache, fault_ipa, &new_pte, false);
 	}
-	stage2_set_pte(vcpu->kvm, memcache, fault_ipa, &new_pte, false);
+
 
 out_unlock:
-	spin_unlock(&vcpu->kvm->mmu_lock);
+	spin_unlock(&kvm->mmu_lock);
 	kvm_release_pfn_clean(pfn);
-	return 0;
+	return ret;
 }
 
 /**
@@ -630,7 +772,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
 
 	memslot = gfn_to_memslot(vcpu->kvm, gfn);
 
-	ret = user_mem_abort(vcpu, fault_ipa, gfn, memslot, fault_status);
+	ret = user_mem_abort(vcpu, fault_ipa, memslot, fault_status);
 	if (ret == 0)
 		ret = 1;
 out_unlock:
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index efe609c..e7f3852 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -91,6 +91,7 @@ int kvm_mmu_init(void);
 void kvm_clear_hyp_idmap(void);
 
 #define	kvm_set_pte(ptep, pte)		set_pte(ptep, pte)
+#define	kvm_set_pmd(pmdp, pmd)		set_pmd(pmdp, pmd)
 
 static inline bool kvm_is_write_fault(unsigned long esr)
 {
@@ -116,13 +117,18 @@ static inline void kvm_set_s2pte_writable(pte_t *pte)
 	pte_val(*pte) |= PTE_S2_RDWR;
 }
 
+static inline void kvm_set_s2pmd_writable(pmd_t *pmd)
+{
+	pmd_val(*pmd) |= PTE_S2_RDWR;
+}
+
 struct kvm;
 
-static inline void coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
+static inline void coherent_icache_guest_page(struct kvm *kvm, hva_t hva,
+					      unsigned long size)
 {
 	if (!icache_is_aliasing()) {		/* PIPT */
-		unsigned long hva = gfn_to_hva(kvm, gfn);
-		flush_icache_range(hva, hva + PAGE_SIZE);
+		flush_icache_range(hva, hva + size);
 	} else if (!icache_is_aivivt()) {	/* non ASID-tagged VIVT */
 		/* any kind of VIPT cache */
 		__flush_icache_all();
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 2/3] KVM: ARM: Get rid of KVM_HPAGE_ defines
  2013-08-09  4:07 ` Christoffer Dall
@ 2013-08-09 14:30   ` Christoffer Dall
  -1 siblings, 0 replies; 36+ messages in thread
From: Christoffer Dall @ 2013-08-09 14:30 UTC (permalink / raw)
  To: kvmarm; +Cc: kvm, linux-arm-kernel, linaro-kernel, patches, Christoffer Dall

The KVM_HPAGE_* defines are a little artificial on ARM, since the huge
page size is statically defined at compile time and there is only a
single huge page size.

Now that the main kvm code relying on these defines has been moved to
the x86-specific part of the world, we can get rid of them.

Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
---

[Resending this because the KVM list filter caught the subject of the
previous message due to the letter X]

 arch/arm/include/asm/kvm_host.h |    5 -----
 1 file changed, 5 deletions(-)

diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index 7d22517..e45a74b 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -38,11 +38,6 @@
 
 #define KVM_VCPU_MAX_FEATURES 1
 
-/* We don't currently support large pages. */
-#define KVM_HPAGE_GFN_SHIFT(x)	0
-#define KVM_NR_PAGE_SIZES	1
-#define KVM_PAGES_PER_HPAGE(x)	(1UL<<31)
-
 #include <kvm/arm_vgic.h>
 
 struct kvm_vcpu;
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/3] KVM: ARM: Get rid of KVM_HPAGE_ defines
  2013-08-09 14:30   ` Christoffer Dall
@ 2013-08-25 14:05     ` Gleb Natapov
  -1 siblings, 0 replies; 36+ messages in thread
From: Gleb Natapov @ 2013-08-25 14:05 UTC (permalink / raw)
  To: Christoffer Dall; +Cc: patches, linaro-kernel, kvmarm, linux-arm-kernel, kvm

On Fri, Aug 09, 2013 at 07:30:13AM -0700, Christoffer Dall wrote:
> The KVM_HPAGE_DEFINES are a little artificial on ARM, since the huge
> page size is statically defined at compile time and there is only a
> single huge page size.
> 
> Now when the main kvm code relying on these defines has been moved to
> the x86 specific part of the world, we can get rid of these.
> 
> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
> ---
> 
> [Resending this because the KVM list filter caught the subject of the
> previous message due to the letter X]
> 
>  arch/arm/include/asm/kvm_host.h |    5 -----
>  1 file changed, 5 deletions(-)
> 
> diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
> index 7d22517..e45a74b 100644
> --- a/arch/arm/include/asm/kvm_host.h
> +++ b/arch/arm/include/asm/kvm_host.h
> @@ -38,11 +38,6 @@
>  
>  #define KVM_VCPU_MAX_FEATURES 1
>  
> -/* We don't currently support large pages. */
> -#define KVM_HPAGE_GFN_SHIFT(x)	0
> -#define KVM_NR_PAGE_SIZES	1
With huge page support you do have two different page sizes, but the only
code that currently uses those defines is in the x86 shadow mmu code,
so I am fine with moving gfn_to_index() into x86-specific code and getting rid
of those defines (you will probably need them when you implement
nested HYP mode :)). But can you do it as a separate patch series and
remove the defines for all arches?

> -#define KVM_PAGES_PER_HPAGE(x)	(1UL<<31)
> -
>  #include <kvm/arm_vgic.h>
>  
>  struct kvm_vcpu;
> -- 
> 1.7.10.4
> 
> _______________________________________________
> kvmarm mailing list
> kvmarm@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm

--
			Gleb.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/3] KVM: ARM: Get rid of KVM_HPAGE_ defines
  2013-08-25 14:05     ` Gleb Natapov
@ 2013-08-25 14:29       ` Peter Maydell
  -1 siblings, 0 replies; 36+ messages in thread
From: Peter Maydell @ 2013-08-25 14:29 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Christoffer Dall, Patch Tracking, linaro-kernel, kvmarm,
	arm-mail-list, kvm-devel

On 25 August 2013 15:05, Gleb Natapov <gleb@redhat.com> wrote:
> (you will probably need them when you will implement
> nested HYP mode :)).

Smiley noted, but this is pretty unlikely since it's not possible
to lie to the guest about which mode it's in, so you can't make
a guest think it's in Hyp mode.

-- PMM

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/3] KVM: ARM: Get rid of KVM_HPAGE_ defines
  2013-08-25 14:29       ` Peter Maydell
@ 2013-08-25 14:48         ` Gleb Natapov
  -1 siblings, 0 replies; 36+ messages in thread
From: Gleb Natapov @ 2013-08-25 14:48 UTC (permalink / raw)
  To: Peter Maydell
  Cc: linaro-kernel, kvm-devel, Patch Tracking, Christoffer Dall,
	kvmarm, arm-mail-list

On Sun, Aug 25, 2013 at 03:29:17PM +0100, Peter Maydell wrote:
> On 25 August 2013 15:05, Gleb Natapov <gleb@redhat.com> wrote:
> > (you will probably need them when you will implement
> > nested HYP mode :)).
> 
> Smiley noted, but this is pretty unlikely since it's not possible
> to lie to the guest about which mode it's in, so you can't make
> a guest think it's in Hyp mode.
> 
I suspected this, but I have forgotten most of what I read about Hyp mode
by now and need to refresh my memory ASAP. Is it impossible even with a
lot of emulation? Can a guest detect that it is not in Hyp mode without
trapping into the hypervisor?

--
			Gleb.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/3] KVM: ARM: Get rid of KVM_HPAGE_ defines
  2013-08-25 14:48         ` Gleb Natapov
@ 2013-08-25 15:18           ` Peter Maydell
  -1 siblings, 0 replies; 36+ messages in thread
From: Peter Maydell @ 2013-08-25 15:18 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: linaro-kernel, kvm-devel, Patch Tracking, Christoffer Dall,
	kvmarm, arm-mail-list

On 25 August 2013 15:48, Gleb Natapov <gleb@redhat.com> wrote:
> On Sun, Aug 25, 2013 at 03:29:17PM +0100, Peter Maydell wrote:
>> Smiley noted, but this is pretty unlikely since it's not possible
>> to lie to the guest about which mode it's in, so you can't make
>> a guest think it's in Hyp mode.
>>
> I suspected this, but forgot most that I read about Hyp mode by now.
> Need to refresh my memory ASAP. Is it impossible even with a lot of
> emulation? Can guest detect that it is not in a Hyp mode without
> trapping into hypervisor?

Yes. The current mode is in the low bits of the CPSR, which is
readable without causing a trap. This is just the most obvious
roadblock; I bet there are more. If you really had to run Hyp mode
code in a VM you'd probably have to do it by having it all emulated
via TCG.
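
For reference, a minimal illustrative sketch (not from the architecture
manual or this series) of the kind of non-trapping mode check a guest can
do, where 0x13 is SVC and 0x1a is Hyp:

  static inline unsigned long current_mode(void)
  {
          unsigned long cpsr;

          /* CPSR[4:0] holds the current mode and reads without trapping */
          asm volatile("mrs %0, cpsr" : "=r" (cpsr));
          return cpsr & 0x1f;
  }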

-- PMM

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/3] KVM: ARM: Get rid of KVM_HPAGE_ defines
  2013-08-25 15:18           ` Peter Maydell
@ 2013-08-25 15:27             ` Alexander Graf
  -1 siblings, 0 replies; 36+ messages in thread
From: Alexander Graf @ 2013-08-25 15:27 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Gleb Natapov, linaro-kernel, kvm-devel, Patch Tracking, kvmarm,
	arm-mail-list


On 25.08.2013, at 16:18, Peter Maydell wrote:

> On 25 August 2013 15:48, Gleb Natapov <gleb@redhat.com> wrote:
>> On Sun, Aug 25, 2013 at 03:29:17PM +0100, Peter Maydell wrote:
>>> Smiley noted, but this is pretty unlikely since it's not possible
>>> to lie to the guest about which mode it's in, so you can't make
>>> a guest think it's in Hyp mode.
>>> 
>> I suspected this, but forgot most that I read about Hyp mode by now.
>> Need to refresh my memory ASAP. Is it impossible even with a lot of
>> emulation? Can guest detect that it is not in a Hyp mode without
>> trapping into hypervisor?
> 
> Yes. The current mode is in the the low bits of the CPSR, which
> is readable without causing a trap. This is just the most obvious
> roadblock; I bet there are more. If you really had to run Hyp mode
> code in a VM you probably have to do it by having it all emulated
> via TCG.

Or in an in-kernel instruction emulator that we have lying around anyway. For kvm-in-kvm that should be good enough, as we only need to execute a few instructions in HYP mode.


Alex


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/3] KVM: ARM: Get rid of KVM_HPAGE_ defines
  2013-08-25 15:18           ` Peter Maydell
@ 2013-08-26  0:46             ` Christoffer Dall
  -1 siblings, 0 replies; 36+ messages in thread
From: Christoffer Dall @ 2013-08-26  0:46 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Gleb Natapov, Patch Tracking, linaro-kernel, kvmarm,
	arm-mail-list, kvm-devel

On Sun, Aug 25, 2013 at 04:18:02PM +0100, Peter Maydell wrote:
> On 25 August 2013 15:48, Gleb Natapov <gleb@redhat.com> wrote:
> > On Sun, Aug 25, 2013 at 03:29:17PM +0100, Peter Maydell wrote:
> >> Smiley noted, but this is pretty unlikely since it's not possible
> >> to lie to the guest about which mode it's in, so you can't make
> >> a guest think it's in Hyp mode.
> >>
> > I suspected this, but forgot most that I read about Hyp mode by now.
> > Need to refresh my memory ASAP. Is it impossible even with a lot of
> > emulation? Can guest detect that it is not in a Hyp mode without
> > trapping into hypervisor?
> 
> Yes. The current mode is in the the low bits of the CPSR, which
> is readable without causing a trap. This is just the most obvious
> roadblock; I bet there are more. If you really had to run Hyp mode
> code in a VM you probably have to do it by having it all emulated
> via TCG.
> 
Or through some highly paravirtualized nested virtualization solution if
that is ever relevant.

-Christoffer

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/3] KVM: ARM: Get rid of KVM_HPAGE_ defines
  2013-08-25 15:27             ` Alexander Graf
@ 2013-08-26 10:55               ` Gleb Natapov
  -1 siblings, 0 replies; 36+ messages in thread
From: Gleb Natapov @ 2013-08-26 10:55 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Peter Maydell, linaro-kernel, kvm-devel, Patch Tracking, kvmarm,
	arm-mail-list

On Sun, Aug 25, 2013 at 04:27:14PM +0100, Alexander Graf wrote:
> 
> On 25.08.2013, at 16:18, Peter Maydell wrote:
> 
> > On 25 August 2013 15:48, Gleb Natapov <gleb@redhat.com> wrote:
> >> On Sun, Aug 25, 2013 at 03:29:17PM +0100, Peter Maydell wrote:
> >>> Smiley noted, but this is pretty unlikely since it's not possible
> >>> to lie to the guest about which mode it's in, so you can't make
> >>> a guest think it's in Hyp mode.
> >>> 
> >> I suspected this, but forgot most that I read about Hyp mode by now.
> >> Need to refresh my memory ASAP. Is it impossible even with a lot of
> >> emulation? Can guest detect that it is not in a Hyp mode without
> >> trapping into hypervisor?
> > 
> > Yes. The current mode is in the the low bits of the CPSR, which
> > is readable without causing a trap. This is just the most obvious
> > roadblock; I bet there are more. If you really had to run Hyp mode
> > code in a VM you probably have to do it by having it all emulated
> > via TCG.
> 
> Or in an in-kernel instruction emulator that we have lying around anyways. For kvm-in-kvm that should be good enough, as we only need to execute a few instructions in HYP mode.
> 
Will require emulation on each trap to Hyp mode though. But since you
already have ideas about nested Hyp I consider it done :)

--
			Gleb.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/3] KVM: ARM: Transparent huge pages and hugetlbfs support
  2013-08-09  4:07   ` Christoffer Dall
@ 2013-09-23 10:11     ` Marc Zyngier
  -1 siblings, 0 replies; 36+ messages in thread
From: Marc Zyngier @ 2013-09-23 10:11 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: kvmarm, kvm, linux-arm-kernel, linaro-kernel, patches, Christoffer Dall

Hi Christoffer,

Finally taking some time to review this patch. Sorry for the delay...

On 09/08/13 05:07, Christoffer Dall wrote:
> From: Christoffer Dall <cdall@cs.columbia.edu>
> 
> Support transparent huge pages in KVM/ARM 32-bit and 64-bit.  The whole
> transparent_hugepage_adjust stuff is far from pretty, but this is how
> it's solved on x86 so we duplicate their logic.  This should be shared
> across architectures if possible (like many other things), but can
> always be changed down the road.
> 
> The pud_huge checking on the unmap path may feel a bit silly as the
> pud_huge check is always defined to false, but the compiler should be
> smart about this.
> 
> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
> ---
>  arch/arm/include/asm/kvm_mmu.h   |   17 +++-
>  arch/arm/kvm/mmu.c               |  200 ++++++++++++++++++++++++++++++++------
>  arch/arm64/include/asm/kvm_mmu.h |   12 ++-
>  3 files changed, 194 insertions(+), 35 deletions(-)
> 
> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
> index 472ac70..ff3ff67 100644
> --- a/arch/arm/include/asm/kvm_mmu.h
> +++ b/arch/arm/include/asm/kvm_mmu.h
> @@ -62,6 +62,12 @@ phys_addr_t kvm_get_idmap_vector(void);
>  int kvm_mmu_init(void);
>  void kvm_clear_hyp_idmap(void);
> 
> +static inline void kvm_set_pmd(pmd_t *pmd, pmd_t new_pmd)
> +{
> +       *pmd = new_pmd;
> +       flush_pmd_entry(pmd);
> +}
> +
>  static inline void kvm_set_pte(pte_t *pte, pte_t new_pte)
>  {
>         pte_val(*pte) = new_pte;
> @@ -103,9 +109,15 @@ static inline void kvm_set_s2pte_writable(pte_t *pte)
>         pte_val(*pte) |= L_PTE_S2_RDWR;
>  }
> 
> +static inline void kvm_set_s2pmd_writable(pmd_t *pmd)
> +{
> +       pmd_val(*pmd) |= L_PTE_S2_RDWR;
> +}
> +
>  struct kvm;
> 
> -static inline void coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
> +static inline void coherent_icache_guest_page(struct kvm *kvm, hva_t hva,
> +                                             unsigned long size)
>  {
>         /*
>          * If we are going to insert an instruction page and the icache is
> @@ -120,8 +132,7 @@ static inline void coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
>          * need any kind of flushing (DDI 0406C.b - Page B3-1392).
>          */
>         if (icache_is_pipt()) {
> -               unsigned long hva = gfn_to_hva(kvm, gfn);
> -               __cpuc_coherent_user_range(hva, hva + PAGE_SIZE);
> +               __cpuc_coherent_user_range(hva, hva + size);
>         } else if (!icache_is_vivt_asid_tagged()) {
>                 /* any kind of VIPT cache */
>                 __flush_icache_all();
> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> index 0988d9e..26ced77 100644
> --- a/arch/arm/kvm/mmu.c
> +++ b/arch/arm/kvm/mmu.c
> @@ -19,6 +19,7 @@
>  #include <linux/mman.h>
>  #include <linux/kvm_host.h>
>  #include <linux/io.h>
> +#include <linux/hugetlb.h>
>  #include <trace/events/kvm.h>
>  #include <asm/pgalloc.h>
>  #include <asm/cacheflush.h>
> @@ -41,6 +42,8 @@ static unsigned long hyp_idmap_start;
>  static unsigned long hyp_idmap_end;
>  static phys_addr_t hyp_idmap_vector;
> 
> +#define kvm_pmd_huge(_x)       (pmd_huge(_x) || pmd_trans_huge(_x))
> +
>  static void kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa)
>  {
>         /*
> @@ -93,19 +96,27 @@ static bool page_empty(void *ptr)
> 
>  static void clear_pud_entry(struct kvm *kvm, pud_t *pud, phys_addr_t addr)
>  {
> -       pmd_t *pmd_table = pmd_offset(pud, 0);
> -       pud_clear(pud);
> -       kvm_tlb_flush_vmid_ipa(kvm, addr);
> -       pmd_free(NULL, pmd_table);
> +       if (pud_huge(*pud)) {
> +               pud_clear(pud);
> +       } else {
> +               pmd_t *pmd_table = pmd_offset(pud, 0);
> +               pud_clear(pud);
> +               kvm_tlb_flush_vmid_ipa(kvm, addr);
> +               pmd_free(NULL, pmd_table);
> +       }

Why don't we have any TLB invalidation on the huge path?

>         put_page(virt_to_page(pud));
>  }
> 
>  static void clear_pmd_entry(struct kvm *kvm, pmd_t *pmd, phys_addr_t addr)
>  {
> -       pte_t *pte_table = pte_offset_kernel(pmd, 0);
> -       pmd_clear(pmd);
> -       kvm_tlb_flush_vmid_ipa(kvm, addr);
> -       pte_free_kernel(NULL, pte_table);
> +       if (kvm_pmd_huge(*pmd)) {
> +               pmd_clear(pmd);
> +       } else {
> +               pte_t *pte_table = pte_offset_kernel(pmd, 0);
> +               pmd_clear(pmd);
> +               kvm_tlb_flush_vmid_ipa(kvm, addr);
> +               pte_free_kernel(NULL, pte_table);
> +       }

Same here.

>         put_page(virt_to_page(pmd));
>  }
> 
> @@ -136,12 +147,37 @@ static void unmap_range(struct kvm *kvm, pgd_t *pgdp,
>                         continue;
>                 }
> 
> +               if (pud_huge(*pud)) {
> +                       /*
> +                        * If we are dealing with a huge pud, just clear it and
> +                        * move on.
> +                        */
> +                       clear_pud_entry(kvm, pud, addr);
> +                       addr = pud_addr_end(addr, end);
> +                       continue;
> +               }
> +
>                 pmd = pmd_offset(pud, addr);
>                 if (pmd_none(*pmd)) {
>                         addr = pmd_addr_end(addr, end);
>                         continue;
>                 }
> 
> +               if (kvm_pmd_huge(*pmd)) {
> +                       /*
> +                        * If we are dealing with a huge pmd, just clear it and
> +                        * walk back up the ladder.
> +                        */
> +                       clear_pmd_entry(kvm, pmd, addr);
> +                       next = pmd_addr_end(addr, end);
> +                       if (page_empty(pmd) && !page_empty(pud)) {
> +                               clear_pud_entry(kvm, pud, addr);
> +                               next = pud_addr_end(addr, end);
> +                       }
> +                       addr = next;
> +                       continue;
> +               }

Looks like we already have the exact same sequence a few lines below.
Surely we can factor it out.

> +
>                 pte = pte_offset_kernel(pmd, addr);
>                 clear_pte_entry(kvm, pte, addr);
>                 next = addr + PAGE_SIZE;
> @@ -420,6 +456,46 @@ void kvm_free_stage2_pgd(struct kvm *kvm)
>         kvm->arch.pgd = NULL;
>  }
> 
> +/* Set a huge page pmd entry */

Is it only for huge PMDs? Or can we try to share that code with what is
in stage2_set_pte?

> +static int stage2_set_pmd(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
> +                         phys_addr_t addr, const pmd_t *new_pmd)
> +{
> +       pgd_t *pgd;
> +       pud_t *pud;
> +       pmd_t *pmd, old_pmd;
> +
> +       /* Create stage-2 page table mapping - Level 1 */
> +       pgd = kvm->arch.pgd + pgd_index(addr);
> +       pud = pud_offset(pgd, addr);
> +       if (pud_none(*pud)) {
> +               if (!cache)
> +                       return 0; /* ignore calls from kvm_set_spte_hva */

Do we still need this?

> +               pmd = mmu_memory_cache_alloc(cache);
> +               pud_populate(NULL, pud, pmd);
> +               get_page(virt_to_page(pud));
> +       }
> +
> +       pmd = pmd_offset(pud, addr);
> +
> +       /*
> +        * Mapping in huge pages should only happen through a fault.  If a
> +        * page is merged into a transparent huge page, the individual
> +        * subpages of that huge page should be unmapped through MMU
> +        * notifiers before we get here.
> +        *
> +        * Merging of CompoundPages is not supported; they should become
> +        * splitting first, unmapped, merged, and mapped back in on-demand.
> +        */
> +       VM_BUG_ON(pmd_present(*pmd) && pmd_pfn(*pmd) != pmd_pfn(*new_pmd));
> +
> +       old_pmd = *pmd;
> +       kvm_set_pmd(pmd, *new_pmd);
> +       if (pmd_present(old_pmd))
> +               kvm_tlb_flush_vmid_ipa(kvm, addr);
> +       else
> +               get_page(virt_to_page(pmd));
> +       return 0;
> +}
> 
>  static int stage2_set_pte(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
>                           phys_addr_t addr, const pte_t *new_pte, bool iomap)
> @@ -442,7 +518,7 @@ static int stage2_set_pte(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
> 
>         pmd = pmd_offset(pud, addr);
> 
> -       /* Create 2nd stage page table mapping - Level 2 */
> +       /* Create 2nd stage page mappings - Level 2 */
>         if (pmd_none(*pmd)) {
>                 if (!cache)
>                         return 0; /* ignore calls from kvm_set_spte_hva */
> @@ -508,16 +584,53 @@ out:
>         return ret;
>  }
> 
> +static bool transparent_hugepage_adjust(pfn_t *pfnp, phys_addr_t *ipap)
> +{
> +       pfn_t pfn = *pfnp;
> +       gfn_t gfn = *ipap >> PAGE_SHIFT;
> +
> +       if (PageTransCompound(pfn_to_page(pfn))) {
> +               unsigned long mask;
> +               /*
> +                * mmu_notifier_retry was successful and we hold the
> +                * mmu_lock here, so the pmd can't become splitting
> +                * from under us, and in turn
> +                * __split_huge_page_refcount() can't run from under
> +                * us and we can safely transfer the refcount from
> +                * PG_tail to PG_head as we switch the pfn from tail to
> +                * head.
> +                */

-ECANTPARSE. Well, I sort of can, but this deserves a clearer explanation.

> +               mask = (PMD_SIZE / PAGE_SIZE) - 1;

		mask = PTRS_PER_PMD - 1;

> +               VM_BUG_ON((gfn & mask) != (pfn & mask));
> +               if (pfn & mask) {
> +                       gfn &= ~mask;

This doesn't seem to be used later on.

> +                       *ipap &= ~(PMD_SIZE - 1);

			*ipap &= PMD_MASK;

> +                       kvm_release_pfn_clean(pfn);
> +                       pfn &= ~mask;
> +                       kvm_get_pfn(pfn);
> +                       *pfnp = pfn;
> +               }
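
A quick worked example of the arithmetic, assuming 4K pages and 2MB
PMDs (so mask == PTRS_PER_PMD - 1 == 511), purely for illustration:

	/*
	 * Fault on a tail page of a THP:
	 *   pfn  = 0x12345     ->  pfn & ~mask      = 0x12200     (head page)
	 *   ipap = 0x40345000  ->  ipap & PMD_MASK  = 0x40200000  (block base)
	 *
	 * The VM_BUG_ON above guarantees gfn and pfn have the same offset
	 * within the 512-page block, so rounding both down keeps the
	 * stage-2 block mapping pointing at the right physical memory.
	 */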



> +               return true;
> +       }
> +
> +       return false;
> +}
> +
>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> -                         gfn_t gfn, struct kvm_memory_slot *memslot,
> +                         struct kvm_memory_slot *memslot,
>                           unsigned long fault_status)
>  {
> -       pte_t new_pte;
> -       pfn_t pfn;
>         int ret;
> -       bool write_fault, writable;
> +       bool write_fault, writable, hugetlb = false, force_pte = false;
>         unsigned long mmu_seq;
> +       gfn_t gfn = fault_ipa >> PAGE_SHIFT;
> +       unsigned long hva = gfn_to_hva(vcpu->kvm, gfn);
> +       struct kvm *kvm = vcpu->kvm;
>         struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
> +       struct vm_area_struct *vma;
> +       pfn_t pfn;
> +       unsigned long psize;
> 
>         write_fault = kvm_is_write_fault(kvm_vcpu_get_hsr(vcpu));
>         if (fault_status == FSC_PERM && !write_fault) {
> @@ -525,6 +638,27 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>                 return -EFAULT;
>         }
> 
> +       /* Let's check if we will get back a huge page */
> +       down_read(&current->mm->mmap_sem);
> +       vma = find_vma_intersection(current->mm, hva, hva + 1);
> +       if (is_vm_hugetlb_page(vma)) {
> +               hugetlb = true;
> +               hva &= PMD_MASK;
> +               gfn = (fault_ipa & PMD_MASK) >> PAGE_SHIFT;
> +               psize = PMD_SIZE;
> +       } else {
> +               psize = PAGE_SIZE;
> +               if (vma->vm_start & ~PMD_MASK)
> +                       force_pte = true;
> +       }
> +       up_read(&current->mm->mmap_sem);
> +
> +       pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
> +       if (is_error_pfn(pfn))
> +               return -EFAULT;

How does this work with respect to the comment that talks about reading
mmu_seq before calling gfn_to_pfn_prot? Either the comment is wrong, or
we read this too early.

> +       coherent_icache_guest_page(kvm, hva, psize);
> +
>         /* We need minimum second+third level pages */
>         ret = mmu_topup_memory_cache(memcache, 2, KVM_NR_MEM_OBJS);
>         if (ret)
> @@ -542,26 +676,34 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>          */
>         smp_rmb();
> 
> -       pfn = gfn_to_pfn_prot(vcpu->kvm, gfn, write_fault, &writable);
> -       if (is_error_pfn(pfn))
> -               return -EFAULT;
> -
> -       new_pte = pfn_pte(pfn, PAGE_S2);
> -       coherent_icache_guest_page(vcpu->kvm, gfn);
> -
> -       spin_lock(&vcpu->kvm->mmu_lock);
> -       if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
> +       spin_lock(&kvm->mmu_lock);
> +       if (mmu_notifier_retry(kvm, mmu_seq))
>                 goto out_unlock;
> -       if (writable) {
> -               kvm_set_s2pte_writable(&new_pte);
> -               kvm_set_pfn_dirty(pfn);
> +       if (!hugetlb && !force_pte)
> +               hugetlb = transparent_hugepage_adjust(&pfn, &fault_ipa);

How do we ensure that there is no race between this pte being promoted
to a pmd and another page allocation in the same pmd somewhere else in
the system? We only hold the kvm lock here, so there must be some extra
implicit guarantee somewhere...

> +       if (hugetlb) {
> +               pmd_t new_pmd = pfn_pmd(pfn, PAGE_S2);
> +               new_pmd = pmd_mkhuge(new_pmd);
> +               if (writable) {
> +                       kvm_set_s2pmd_writable(&new_pmd);
> +                       kvm_set_pfn_dirty(pfn);
> +               }
> +               ret = stage2_set_pmd(kvm, memcache, fault_ipa, &new_pmd);
> +       } else {
> +               pte_t new_pte = pfn_pte(pfn, PAGE_S2);
> +               if (writable) {
> +                       kvm_set_s2pte_writable(&new_pte);
> +                       kvm_set_pfn_dirty(pfn);
> +               }
> +               ret = stage2_set_pte(kvm, memcache, fault_ipa, &new_pte, false);
>         }
> -       stage2_set_pte(vcpu->kvm, memcache, fault_ipa, &new_pte, false);
> +
> 
>  out_unlock:
> -       spin_unlock(&vcpu->kvm->mmu_lock);
> +       spin_unlock(&kvm->mmu_lock);
>         kvm_release_pfn_clean(pfn);
> -       return 0;
> +       return ret;
>  }
> 
>  /**
> @@ -630,7 +772,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
> 
>         memslot = gfn_to_memslot(vcpu->kvm, gfn);
> 
> -       ret = user_mem_abort(vcpu, fault_ipa, gfn, memslot, fault_status);
> +       ret = user_mem_abort(vcpu, fault_ipa, memslot, fault_status);
>         if (ret == 0)
>                 ret = 1;
>  out_unlock:
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index efe609c..e7f3852 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -91,6 +91,7 @@ int kvm_mmu_init(void);
>  void kvm_clear_hyp_idmap(void);
> 
>  #define        kvm_set_pte(ptep, pte)          set_pte(ptep, pte)
> +#define        kvm_set_pmd(pmdp, pmd)          set_pmd(pmdp, pmd)
> 
>  static inline bool kvm_is_write_fault(unsigned long esr)
>  {
> @@ -116,13 +117,18 @@ static inline void kvm_set_s2pte_writable(pte_t *pte)
>         pte_val(*pte) |= PTE_S2_RDWR;
>  }
> 
> +static inline void kvm_set_s2pmd_writable(pmd_t *pmd)
> +{
> +       pmd_val(*pmd) |= PTE_S2_RDWR;
> +}
> +
>  struct kvm;
> 
> -static inline void coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
> +static inline void coherent_icache_guest_page(struct kvm *kvm, hva_t hva,
> +                                             unsigned long size)
>  {
>         if (!icache_is_aliasing()) {            /* PIPT */
> -               unsigned long hva = gfn_to_hva(kvm, gfn);
> -               flush_icache_range(hva, hva + PAGE_SIZE);
> +               flush_icache_range(hva, hva + size);
>         } else if (!icache_is_aivivt()) {       /* non ASID-tagged VIVT */
>                 /* any kind of VIPT cache */
>                 __flush_icache_all();

As a general comment about this patch, I think having both transparent
pages *and* hugetlbfs support in the same patch makes it fairly hard to
read.

I understand that it is actually convenient as most of the code is
shared, but I think it would make it easier to read if the patch was
split in two: first the basic hugetlb support, and then the THP madness.

What do you think?

Thanks,

	M.
-- 
Jazz is not dead. It just smells funny...


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/3] KVM: ARM: Transparent huge pages and hugetlbfs support
  2013-09-23 10:11     ` Marc Zyngier
@ 2013-09-23 14:46       ` Marc Zyngier
  -1 siblings, 0 replies; 36+ messages in thread
From: Marc Zyngier @ 2013-09-23 14:46 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: Christoffer Dall, linaro-kernel, kvm, patches, kvmarm, linux-arm-kernel

On 23/09/13 11:11, Marc Zyngier wrote:
> Hi Christoffer,
> 
> Finally taking some time to review this patch. Sorry for the delay...
> 
> On 09/08/13 05:07, Christoffer Dall wrote:
>> From: Christoffer Dall <cdall@cs.columbia.edu>
>>
>> Support transparent huge pages in KVM/ARM 32-bit and 64-bit.  The whole
>> transparent_hugepage_adjust stuff is far from pretty, but this is how
>> it's solved on x86 so we duplicate their logic.  This should be shared
>> across architectures if possible (like many other things), but can
>> always be changed down the road.
>>
>> The pud_huge checking on the unmap path may feel a bit silly as the
>> pud_huge check is always defined to false, but the compiler should be
>> smart about this.
>>
>> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
>> ---
>>  arch/arm/include/asm/kvm_mmu.h   |   17 +++-
>>  arch/arm/kvm/mmu.c               |  200 ++++++++++++++++++++++++++++++++------
>>  arch/arm64/include/asm/kvm_mmu.h |   12 ++-
>>  3 files changed, 194 insertions(+), 35 deletions(-)
>>
>> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
>> index 472ac70..ff3ff67 100644
>> --- a/arch/arm/include/asm/kvm_mmu.h
>> +++ b/arch/arm/include/asm/kvm_mmu.h
>> @@ -62,6 +62,12 @@ phys_addr_t kvm_get_idmap_vector(void);
>>  int kvm_mmu_init(void);
>>  void kvm_clear_hyp_idmap(void);
>>
>> +static inline void kvm_set_pmd(pmd_t *pmd, pmd_t new_pmd)
>> +{
>> +       *pmd = new_pmd;
>> +       flush_pmd_entry(pmd);
>> +}
>> +
>>  static inline void kvm_set_pte(pte_t *pte, pte_t new_pte)
>>  {
>>         pte_val(*pte) = new_pte;
>> @@ -103,9 +109,15 @@ static inline void kvm_set_s2pte_writable(pte_t *pte)
>>         pte_val(*pte) |= L_PTE_S2_RDWR;
>>  }
>>
>> +static inline void kvm_set_s2pmd_writable(pmd_t *pmd)
>> +{
>> +       pmd_val(*pmd) |= L_PTE_S2_RDWR;
>> +}
>> +
>>  struct kvm;
>>
>> -static inline void coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
>> +static inline void coherent_icache_guest_page(struct kvm *kvm, hva_t hva,
>> +                                             unsigned long size)
>>  {
>>         /*
>>          * If we are going to insert an instruction page and the icache is
>> @@ -120,8 +132,7 @@ static inline void coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
>>          * need any kind of flushing (DDI 0406C.b - Page B3-1392).
>>          */
>>         if (icache_is_pipt()) {
>> -               unsigned long hva = gfn_to_hva(kvm, gfn);
>> -               __cpuc_coherent_user_range(hva, hva + PAGE_SIZE);
>> +               __cpuc_coherent_user_range(hva, hva + size);
>>         } else if (!icache_is_vivt_asid_tagged()) {
>>                 /* any kind of VIPT cache */
>>                 __flush_icache_all();
>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>> index 0988d9e..26ced77 100644
>> --- a/arch/arm/kvm/mmu.c
>> +++ b/arch/arm/kvm/mmu.c
>> @@ -19,6 +19,7 @@
>>  #include <linux/mman.h>
>>  #include <linux/kvm_host.h>
>>  #include <linux/io.h>
>> +#include <linux/hugetlb.h>
>>  #include <trace/events/kvm.h>
>>  #include <asm/pgalloc.h>
>>  #include <asm/cacheflush.h>
>> @@ -41,6 +42,8 @@ static unsigned long hyp_idmap_start;
>>  static unsigned long hyp_idmap_end;
>>  static phys_addr_t hyp_idmap_vector;
>>
>> +#define kvm_pmd_huge(_x)       (pmd_huge(_x) || pmd_trans_huge(_x))

Also, have a look at Steve's latest patch:
http://lists.infradead.org/pipermail/linux-arm-kernel/2013-September/199812.html

He has the exact same define, so there's room for consolidation there as
well. Not a big issue though, and we can keep that for a later cleanup.

Cheers,

	M.
-- 
Jazz is not dead. It just smells funny...


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/3] KVM: ARM: Transparent huge pages and hugetlbfs support
  2013-09-23 10:11     ` Marc Zyngier
@ 2013-09-24 15:41       ` Steve Capper
  -1 siblings, 0 replies; 36+ messages in thread
From: Steve Capper @ 2013-09-24 15:41 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Christoffer Dall, Christoffer Dall, linaro-kernel, kvm, patches,
	kvmarm, linux-arm-kernel

On Mon, Sep 23, 2013 at 11:11:07AM +0100, Marc Zyngier wrote:
> Hi Christoffer,
> 
> Finally taking some time to review this patch. Sorry for the delay...
> 
> On 09/08/13 05:07, Christoffer Dall wrote:
> > From: Christoffer Dall <cdall@cs.columbia.edu>
> > 

[ snip ]

> >  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > -                         gfn_t gfn, struct kvm_memory_slot *memslot,
> > +                         struct kvm_memory_slot *memslot,
> >                           unsigned long fault_status)
> >  {
> > -       pte_t new_pte;
> > -       pfn_t pfn;
> >         int ret;
> > -       bool write_fault, writable;
> > +       bool write_fault, writable, hugetlb = false, force_pte = false;
> >         unsigned long mmu_seq;
> > +       gfn_t gfn = fault_ipa >> PAGE_SHIFT;
> > +       unsigned long hva = gfn_to_hva(vcpu->kvm, gfn);
> > +       struct kvm *kvm = vcpu->kvm;
> >         struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
> > +       struct vm_area_struct *vma;
> > +       pfn_t pfn;
> > +       unsigned long psize;
> > 
> >         write_fault = kvm_is_write_fault(kvm_vcpu_get_hsr(vcpu));
> >         if (fault_status == FSC_PERM && !write_fault) {
> > @@ -525,6 +638,27 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >                 return -EFAULT;
> >         }
> > 
> > +       /* Let's check if we will get back a huge page */
> > +       down_read(&current->mm->mmap_sem);
> > +       vma = find_vma_intersection(current->mm, hva, hva + 1);
> > +       if (is_vm_hugetlb_page(vma)) {
> > +               hugetlb = true;
> > +               hva &= PMD_MASK;
> > +               gfn = (fault_ipa & PMD_MASK) >> PAGE_SHIFT;
> > +               psize = PMD_SIZE;
> > +       } else {
> > +               psize = PAGE_SIZE;
> > +               if (vma->vm_start & ~PMD_MASK)
> > +                       force_pte = true;
> > +       }
> > +       up_read(&current->mm->mmap_sem);
> > +
> > +       pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
> > +       if (is_error_pfn(pfn))
> > +               return -EFAULT;
> 
> How does this work with respect to the comment that talks about reading
> mmu_seq before calling gfn_to_pfn_prot? Either the comment is wrong, or
> we read this too early.
> 
> > +       coherent_icache_guest_page(kvm, hva, psize);
> > +
> >         /* We need minimum second+third level pages */
> >         ret = mmu_topup_memory_cache(memcache, 2, KVM_NR_MEM_OBJS);
> >         if (ret)
> > @@ -542,26 +676,34 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >          */
> >         smp_rmb();
> > 
> > -       pfn = gfn_to_pfn_prot(vcpu->kvm, gfn, write_fault, &writable);
> > -       if (is_error_pfn(pfn))
> > -               return -EFAULT;
> > -
> > -       new_pte = pfn_pte(pfn, PAGE_S2);
> > -       coherent_icache_guest_page(vcpu->kvm, gfn);
> > -
> > -       spin_lock(&vcpu->kvm->mmu_lock);
> > -       if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
> > +       spin_lock(&kvm->mmu_lock);
> > +       if (mmu_notifier_retry(kvm, mmu_seq))
> >                 goto out_unlock;
> > -       if (writable) {
> > -               kvm_set_s2pte_writable(&new_pte);
> > -               kvm_set_pfn_dirty(pfn);
> > +       if (!hugetlb && !force_pte)
> > +               hugetlb = transparent_hugepage_adjust(&pfn, &fault_ipa);
> 
> How do we ensure that there is no race between this pte being promoted
> to a pmd and another page allocation in the same pmd somewhere else in
> the system? We only hold the kvm lock here, so there must be some extra
> implicit guarantee somewhere...
> 

This isn't a promotion to a huge page, it is a mechanism to ensure that
pfn corresponds with the head page of a THP as that's where refcount
information is stored. I think this is safe.
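
Put differently (a sketch only, it is the same thing the patch already
does at the pfn level): the adjustment is the pfn equivalent of
compound_head(), moving our reference from the tail page to the head
page before the huge mapping is installed.

	struct page *head = compound_head(pfn_to_page(pfn));

	kvm_release_pfn_clean(pfn);	/* drop the ref from gfn_to_pfn_prot() */
	pfn = page_to_pfn(head);	/* == pfn & ~(PTRS_PER_PMD - 1) */
	kvm_get_pfn(pfn);		/* refcount now lives on the head page */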

I'm still getting my brain working with kvm, so sorry, I don't have any
other feedback yet :-).

Cheers,
-- 
Steve


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/3] KVM: ARM: Transparent huge pages and hugetlbfs support
  2013-09-23 10:11     ` Marc Zyngier
@ 2013-10-02 22:36       ` Christoffer Dall
  -1 siblings, 0 replies; 36+ messages in thread
From: Christoffer Dall @ 2013-10-02 22:36 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvmarm, kvm, linux-arm-kernel, linaro-kernel, patches, Christoffer Dall

On Mon, Sep 23, 2013 at 11:11:07AM +0100, Marc Zyngier wrote:
> Hi Christoffer,
> 
> Finally taking some time to review this patch. Sorry for the delay...
> 
> On 09/08/13 05:07, Christoffer Dall wrote:
> > From: Christoffer Dall <cdall@cs.columbia.edu>
> > 
> > Support transparent huge pages in KVM/ARM 32-bit and 64-bit.  The whole
> > transparent_hugepage_adjust stuff is far from pretty, but this is how
> > it's solved on x86 so we duplicate their logic.  This should be shared
> > across architectures if possible (like many other things), but can
> > always be changed down the road.
> > 
> > The pud_huge checking on the unmap path may feel a bit silly as the
> > pud_huge check is always defined to false, but the compiler should be
> > smart about this.
> > 
> > Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
> > ---
> >  arch/arm/include/asm/kvm_mmu.h   |   17 +++-
> >  arch/arm/kvm/mmu.c               |  200 ++++++++++++++++++++++++++++++++------
> >  arch/arm64/include/asm/kvm_mmu.h |   12 ++-
> >  3 files changed, 194 insertions(+), 35 deletions(-)
> > 
> > diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
> > index 472ac70..ff3ff67 100644
> > --- a/arch/arm/include/asm/kvm_mmu.h
> > +++ b/arch/arm/include/asm/kvm_mmu.h
> > @@ -62,6 +62,12 @@ phys_addr_t kvm_get_idmap_vector(void);
> >  int kvm_mmu_init(void);
> >  void kvm_clear_hyp_idmap(void);
> > 
> > +static inline void kvm_set_pmd(pmd_t *pmd, pmd_t new_pmd)
> > +{
> > +       *pmd = new_pmd;
> > +       flush_pmd_entry(pmd);
> > +}
> > +
> >  static inline void kvm_set_pte(pte_t *pte, pte_t new_pte)
> >  {
> >         pte_val(*pte) = new_pte;
> > @@ -103,9 +109,15 @@ static inline void kvm_set_s2pte_writable(pte_t *pte)
> >         pte_val(*pte) |= L_PTE_S2_RDWR;
> >  }
> > 
> > +static inline void kvm_set_s2pmd_writable(pmd_t *pmd)
> > +{
> > +       pmd_val(*pmd) |= L_PTE_S2_RDWR;
> > +}
> > +
> >  struct kvm;
> > 
> > -static inline void coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
> > +static inline void coherent_icache_guest_page(struct kvm *kvm, hva_t hva,
> > +                                             unsigned long size)
> >  {
> >         /*
> >          * If we are going to insert an instruction page and the icache is
> > @@ -120,8 +132,7 @@ static inline void coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
> >          * need any kind of flushing (DDI 0406C.b - Page B3-1392).
> >          */
> >         if (icache_is_pipt()) {
> > -               unsigned long hva = gfn_to_hva(kvm, gfn);
> > -               __cpuc_coherent_user_range(hva, hva + PAGE_SIZE);
> > +               __cpuc_coherent_user_range(hva, hva + size);
> >         } else if (!icache_is_vivt_asid_tagged()) {
> >                 /* any kind of VIPT cache */
> >                 __flush_icache_all();
> > diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> > index 0988d9e..26ced77 100644
> > --- a/arch/arm/kvm/mmu.c
> > +++ b/arch/arm/kvm/mmu.c
> > @@ -19,6 +19,7 @@
> >  #include <linux/mman.h>
> >  #include <linux/kvm_host.h>
> >  #include <linux/io.h>
> > +#include <linux/hugetlb.h>
> >  #include <trace/events/kvm.h>
> >  #include <asm/pgalloc.h>
> >  #include <asm/cacheflush.h>
> > @@ -41,6 +42,8 @@ static unsigned long hyp_idmap_start;
> >  static unsigned long hyp_idmap_end;
> >  static phys_addr_t hyp_idmap_vector;
> > 
> > +#define kvm_pmd_huge(_x)       (pmd_huge(_x) || pmd_trans_huge(_x))
> > +
> >  static void kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa)
> >  {
> >         /*
> > @@ -93,19 +96,27 @@ static bool page_empty(void *ptr)
> > 
> >  static void clear_pud_entry(struct kvm *kvm, pud_t *pud, phys_addr_t addr)
> >  {
> > -       pmd_t *pmd_table = pmd_offset(pud, 0);
> > -       pud_clear(pud);
> > -       kvm_tlb_flush_vmid_ipa(kvm, addr);
> > -       pmd_free(NULL, pmd_table);
> > +       if (pud_huge(*pud)) {
> > +               pud_clear(pud);
> > +       } else {
> > +               pmd_t *pmd_table = pmd_offset(pud, 0);
> > +               pud_clear(pud);
> > +               kvm_tlb_flush_vmid_ipa(kvm, addr);
> > +               pmd_free(NULL, pmd_table);
> > +       }
> 
> Why don't we have any TLB invalidation on the huge path?
> 

Because this code was written before we fixed the TLB invalidation race.
Good that you spotted this.
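
Presumably something like this for the next version (just a sketch,
following the same clear-then-invalidate pattern as the table case):

	if (pud_huge(*pud)) {
		pud_clear(pud);
		kvm_tlb_flush_vmid_ipa(kvm, addr);
	} else {
		...
	}

and the equivalent for the kvm_pmd_huge() path in clear_pmd_entry().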

> >         put_page(virt_to_page(pud));
> >  }
> > 
> >  static void clear_pmd_entry(struct kvm *kvm, pmd_t *pmd, phys_addr_t addr)
> >  {
> > -       pte_t *pte_table = pte_offset_kernel(pmd, 0);
> > -       pmd_clear(pmd);
> > -       kvm_tlb_flush_vmid_ipa(kvm, addr);
> > -       pte_free_kernel(NULL, pte_table);
> > +       if (kvm_pmd_huge(*pmd)) {
> > +               pmd_clear(pmd);
> > +       } else {
> > +               pte_t *pte_table = pte_offset_kernel(pmd, 0);
> > +               pmd_clear(pmd);
> > +               kvm_tlb_flush_vmid_ipa(kvm, addr);
> > +               pte_free_kernel(NULL, pte_table);
> > +       }
> 
> Same here.
> 

Same here.

> >         put_page(virt_to_page(pmd));
> >  }
> > 
> > @@ -136,12 +147,37 @@ static void unmap_range(struct kvm *kvm, pgd_t *pgdp,
> >                         continue;
> >                 }
> > 
> > +               if (pud_huge(*pud)) {
> > +                       /*
> > +                        * If we are dealing with a huge pud, just clear it and
> > +                        * move on.
> > +                        */
> > +                       clear_pud_entry(kvm, pud, addr);
> > +                       addr = pud_addr_end(addr, end);
> > +                       continue;
> > +               }
> > +
> >                 pmd = pmd_offset(pud, addr);
> >                 if (pmd_none(*pmd)) {
> >                         addr = pmd_addr_end(addr, end);
> >                         continue;
> >                 }
> > 
> > +               if (kvm_pmd_huge(*pmd)) {
> > +                       /*
> > +                        * If we are dealing with a huge pmd, just clear it and
> > +                        * walk back up the ladder.
> > +                        */
> > +                       clear_pmd_entry(kvm, pmd, addr);
> > +                       next = pmd_addr_end(addr, end);
> > +                       if (page_empty(pmd) && !page_empty(pud)) {
> > +                               clear_pud_entry(kvm, pud, addr);
> > +                               next = pud_addr_end(addr, end);
> > +                       }
> > +                       addr = next;
> > +                       continue;
> > +               }
> 
> Looks like we already have the exact same sequence a few lines below.
> Surely we can factor it out.
> 

Yeah, the thing is having a function here requires us to pass in a bunch
of things, and either return the next addr or pass that by reference, and
while it does save a few lines, I found the code harder to read.  This
is another option:


static void unmap_range(struct kvm *kvm, pgd_t *pgdp,
			unsigned long long start, u64 size)
{
	pgd_t *pgd;
	pud_t *pud;
	pmd_t *pmd;
	pte_t *pte;
	unsigned long long addr = start, end = start + size;
	u64 next;

	while (addr < end) {
		pgd = pgdp + pgd_index(addr);
		pud = pud_offset(pgd, addr);
		if (pud_none(*pud)) {
			addr = pud_addr_end(addr, end);
			continue;
		}

		if (pud_huge(*pud)) {
			/*
			 * If we are dealing with a huge pud, just clear it and
			 * move on.
			 */
			clear_pud_entry(kvm, pud, addr);
			addr = pud_addr_end(addr, end);
			continue;
		}

		pmd = pmd_offset(pud, addr);
		if (pmd_none(*pmd)) {
			addr = pmd_addr_end(addr, end);
			continue;
		}

		if (!kvm_pmd_huge(*pmd)) {
			pte = pte_offset_kernel(pmd, addr);
			clear_pte_entry(kvm, pte, addr);
			next = addr + PAGE_SIZE;
		}

		/*
		 * If the pmd entry is to be cleared, walk back up the ladder
		 */
		if (kvm_pmd_huge(*pmd) || page_empty(pte)) {
			clear_pmd_entry(kvm, pmd, addr);
			next = pmd_addr_end(addr, end);
			if (page_empty(pmd) && !page_empty(pud)) {
				clear_pud_entry(kvm, pud, addr);
				next = pud_addr_end(addr, end);
			}
		}

		addr = next;
	}
}

> > +
> >                 pte = pte_offset_kernel(pmd, addr);
> >                 clear_pte_entry(kvm, pte, addr);
> >                 next = addr + PAGE_SIZE;
> > @@ -420,6 +456,46 @@ void kvm_free_stage2_pgd(struct kvm *kvm)
> >         kvm->arch.pgd = NULL;
> >  }
> > 
> > +/* Set a huge page pmd entry */
> 
> Is it only for huge PMDs? Or can we try to share that code with what is
> in stage2_set_pte?
> 

We can share it, much nicer actually:


diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index e2c58f5..345eefc 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -456,26 +447,33 @@ void kvm_free_stage2_pgd(struct kvm *kvm)
 	kvm->arch.pgd = NULL;
 }
 
-/* Set a huge page pmd entry */
-static int stage2_set_pmd(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
-			  phys_addr_t addr, const pmd_t *new_pmd)
+static pmd_t *stage2_get_pmd(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
+			     phys_addr_t addr)
 {
 	pgd_t *pgd;
 	pud_t *pud;
-	pmd_t *pmd, old_pmd;
+	pmd_t *pmd;
 
-	/* Create stage-2 page table mapping - Level 1 */
 	pgd = kvm->arch.pgd + pgd_index(addr);
 	pud = pud_offset(pgd, addr);
 	if (pud_none(*pud)) {
 		if (!cache)
-			return 0; /* ignore calls from kvm_set_spte_hva */
+			return NULL;
 		pmd = mmu_memory_cache_alloc(cache);
 		pud_populate(NULL, pud, pmd);
 		get_page(virt_to_page(pud));
 	}
 
-	pmd = pmd_offset(pud, addr);
+	return pmd_offset(pud, addr);
+}
+
+static int stage2_set_pmd_huge(struct kvm *kvm, struct kvm_mmu_memory_cache
+			       *cache, phys_addr_t addr, const pmd_t *new_pmd)
+{
+	pmd_t *pmd, old_pmd;
+
+	pmd = stage2_get_pmd(kvm, cache, addr);
+	VM_BUG_ON(!pmd);
 
 	/*
 	 * Mapping in huge pages should only happen through a fault.  If a
@@ -500,25 +498,20 @@ static int stage2_set_pmd(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
 static int stage2_set_pte(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
 			  phys_addr_t addr, const pte_t *new_pte, bool iomap)
 {
-	pgd_t *pgd;
-	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte, old_pte;
 
-	/* Create 2nd stage page table mapping - Level 1 */
-	pgd = kvm->arch.pgd + pgd_index(addr);
-	pud = pud_offset(pgd, addr);
-	if (pud_none(*pud)) {
-		if (!cache)
-			return 0; /* ignore calls from kvm_set_spte_hva */
-		pmd = mmu_memory_cache_alloc(cache);
-		pud_populate(NULL, pud, pmd);
-		get_page(virt_to_page(pud));
+	/* Create stage-2 page table mapping - Level 1 */
+	pmd = stage2_get_pmd(kvm, cache, addr);
+	if (!pmd) {
+		/*
+		 * Ignore calls from kvm_set_spte_hva for unallocated
+		 * address ranges.
+		 */
+		return 0;
 	}
 
-	pmd = pmd_offset(pud, addr);
-
-	/* Create 2nd stage page mappings - Level 2 */
+	/* Create stage-2 page mappings - Level 2 */
 	if (pmd_none(*pmd)) {
 		if (!cache)
 			return 0; /* ignore calls from kvm_set_spte_hva */
@@ -688,7 +681,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			kvm_set_s2pmd_writable(&new_pmd);
 			kvm_set_pfn_dirty(pfn);
 		}
-		ret = stage2_set_pmd(kvm, memcache, fault_ipa, &new_pmd);
+		ret = stage2_set_pmd_huge(kvm, memcache, fault_ipa, &new_pmd);
 	} else {
 		pte_t new_pte = pfn_pte(pfn, PAGE_S2);
 		if (writable) {

> > +static int stage2_set_pmd(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
> > +                         phys_addr_t addr, const pmd_t *new_pmd)
> > +{
> > +       pgd_t *pgd;
> > +       pud_t *pud;
> > +       pmd_t *pmd, old_pmd;
> > +
> > +       /* Create stage-2 page table mapping - Level 1 */
> > +       pgd = kvm->arch.pgd + pgd_index(addr);
> > +       pud = pud_offset(pgd, addr);
> > +       if (pud_none(*pud)) {
> > +               if (!cache)
> > +                       return 0; /* ignore calls from kvm_set_spte_hva */
> 
> Do we still need this?
> 

We didn't need it the way the code was structured in the initial
version, but we do now that the code is shared with stage2_set_pte (see
above).

> > +               pmd = mmu_memory_cache_alloc(cache);
> > +               pud_populate(NULL, pud, pmd);
> > +               get_page(virt_to_page(pud));
> > +       }
> > +
> > +       pmd = pmd_offset(pud, addr);
> > +
> > +       /*
> > +        * Mapping in huge pages should only happen through a fault.  If a
> > +        * page is merged into a transparent huge page, the individual
> > +        * subpages of that huge page should be unmapped through MMU
> > +        * notifiers before we get here.
> > +        *
> > +        * Merging of CompoundPages is not supported; they should become
> > +        * splitting first, unmapped, merged, and mapped back in on-demand.
> > +        */
> > +       VM_BUG_ON(pmd_present(*pmd) && pmd_pfn(*pmd) != pmd_pfn(*new_pmd));
> > +
> > +       old_pmd = *pmd;
> > +       kvm_set_pmd(pmd, *new_pmd);
> > +       if (pmd_present(old_pmd))
> > +               kvm_tlb_flush_vmid_ipa(kvm, addr);
> > +       else
> > +               get_page(virt_to_page(pmd));
> > +       return 0;
> > +}
> > 
> >  static int stage2_set_pte(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
> >                           phys_addr_t addr, const pte_t *new_pte, bool iomap)
> > @@ -442,7 +518,7 @@ static int stage2_set_pte(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
> > 
> >         pmd = pmd_offset(pud, addr);
> > 
> > -       /* Create 2nd stage page table mapping - Level 2 */
> > +       /* Create 2nd stage page mappings - Level 2 */
> >         if (pmd_none(*pmd)) {
> >                 if (!cache)
> >                         return 0; /* ignore calls from kvm_set_spte_hva */
> > @@ -508,16 +584,53 @@ out:
> >         return ret;
> >  }
> > 
> > +static bool transparent_hugepage_adjust(pfn_t *pfnp, phys_addr_t *ipap)
> > +{
> > +       pfn_t pfn = *pfnp;
> > +       gfn_t gfn = *ipap >> PAGE_SHIFT;
> > +
> > +       if (PageTransCompound(pfn_to_page(pfn))) {
> > +               unsigned long mask;
> > +               /*
> > +                * mmu_notifier_retry was successful and we hold the
> > +                * mmu_lock here, so the pmd can't become splitting
> > +                * from under us, and in turn
> > +                * __split_huge_page_refcount() can't run from under
> > +                * us and we can safely transfer the refcount from
> > +                * PG_tail to PG_head as we switch the pfn from tail to
> > +                * head.
> > +                */
> 
> -ECANTPARSE. Well, I sort of can, but this deserves a clearer explanation.
> 

I'll try to clarify.  (BTW. I stole this comment from Avi, blame the
both of us :))
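
Something along these lines might read better (just a rough sketch of a
reworded comment, to be polished for v2):

	/*
	 * mmu_notifier_retry() was successful and we still hold the
	 * mmu_lock, so the huge PMD backing this page cannot start
	 * splitting underneath us.  That in turn means
	 * __split_huge_page_refcount() cannot run concurrently, so it
	 * is safe to transfer the refcount we hold from the tail page
	 * (PG_tail) to the head page (PG_head) when we adjust the pfn
	 * from tail to head below.
	 */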


> > +               mask = (PMD_SIZE / PAGE_SIZE) - 1;
> 
> 		mask = PTRS_PER_PMD -1;
> 
> > +               VM_BUG_ON((gfn & mask) != (pfn & mask));
> > +               if (pfn & mask) {
> > +                       gfn &= ~mask;
> 
> This doesn't seem to be used later on.
> 
> > +                       *ipap &= ~(PMD_SIZE - 1);
> 
> 			*ipap &= ~PMD_MASK;
> 

ack to all three
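
With all three folded in, the hunk would become roughly the following
(sketch only; note that '*ipap &= ~PMD_MASK' taken literally would keep
only the offset within the PMD rather than aligning the IPA down, so
PMD_MASK is what is actually meant -- this gets sorted out further down
the thread):

	mask = PTRS_PER_PMD - 1;
	VM_BUG_ON((gfn & mask) != (pfn & mask));
	if (pfn & mask) {
		*ipap &= PMD_MASK;	/* align the IPA down to the PMD boundary */
		kvm_release_pfn_clean(pfn);
		pfn &= ~mask;
		kvm_get_pfn(pfn);
		*pfnp = pfn;
	}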

> > +                       kvm_release_pfn_clean(pfn);
> > +                       pfn &= ~mask;
> > +                       kvm_get_pfn(pfn);
> > +                       *pfnp = pfn;
> > +               }
> 
> 
> 
> > +               return true;
> > +       }
> > +
> > +       return false;
> > +}
> > +
> >  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > -                         gfn_t gfn, struct kvm_memory_slot *memslot,
> > +                         struct kvm_memory_slot *memslot,
> >                           unsigned long fault_status)
> >  {
> > -       pte_t new_pte;
> > -       pfn_t pfn;
> >         int ret;
> > -       bool write_fault, writable;
> > +       bool write_fault, writable, hugetlb = false, force_pte = false;
> >         unsigned long mmu_seq;
> > +       gfn_t gfn = fault_ipa >> PAGE_SHIFT;
> > +       unsigned long hva = gfn_to_hva(vcpu->kvm, gfn);
> > +       struct kvm *kvm = vcpu->kvm;
> >         struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
> > +       struct vm_area_struct *vma;
> > +       pfn_t pfn;
> > +       unsigned long psize;
> > 
> >         write_fault = kvm_is_write_fault(kvm_vcpu_get_hsr(vcpu));
> >         if (fault_status == FSC_PERM && !write_fault) {
> > @@ -525,6 +638,27 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >                 return -EFAULT;
> >         }
> > 
> > +       /* Let's check if we will get back a huge page */
> > +       down_read(&current->mm->mmap_sem);
> > +       vma = find_vma_intersection(current->mm, hva, hva + 1);
> > +       if (is_vm_hugetlb_page(vma)) {
> > +               hugetlb = true;
> > +               hva &= PMD_MASK;
> > +               gfn = (fault_ipa & PMD_MASK) >> PAGE_SHIFT;
> > +               psize = PMD_SIZE;
> > +       } else {
> > +               psize = PAGE_SIZE;
> > +               if (vma->vm_start & ~PMD_MASK)
> > +                       force_pte = true;
> > +       }
> > +       up_read(&current->mm->mmap_sem);
> > +
> > +       pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
> > +       if (is_error_pfn(pfn))
> > +               return -EFAULT;
> 
> How does this work with respect to the comment that talks about reading
> mmu_seq before calling gfp_to_pfn_prot? Either the comment is wrong, or
> we read this too early.
> 

It doesn't; the gfn_to_pfn_prot call should be moved below the smp_rmb().
I must have simply messed this up when rebasing the patch.  Nice catch!
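
The corrected ordering would roughly be (sketch):

	mmu_seq = vcpu->kvm->mmu_notifier_seq;
	/*
	 * Read mmu_notifier_seq before grabbing the pfn, so that a
	 * concurrent MMU notifier invalidation of this hva is caught
	 * by mmu_notifier_retry() below and we retry the fault.
	 */
	smp_rmb();

	pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
	if (is_error_pfn(pfn))
		return -EFAULT;

	spin_lock(&kvm->mmu_lock);
	if (mmu_notifier_retry(kvm, mmu_seq))
		goto out_unlock;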

> > +       coherent_icache_guest_page(kvm, hva, psize);
> > +
> >         /* We need minimum second+third level pages */
> >         ret = mmu_topup_memory_cache(memcache, 2, KVM_NR_MEM_OBJS);
> >         if (ret)
> > @@ -542,26 +676,34 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >          */
> >         smp_rmb();
> > 
> > -       pfn = gfn_to_pfn_prot(vcpu->kvm, gfn, write_fault, &writable);
> > -       if (is_error_pfn(pfn))
> > -               return -EFAULT;
> > -
> > -       new_pte = pfn_pte(pfn, PAGE_S2);
> > -       coherent_icache_guest_page(vcpu->kvm, gfn);
> > -
> > -       spin_lock(&vcpu->kvm->mmu_lock);
> > -       if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
> > +       spin_lock(&kvm->mmu_lock);
> > +       if (mmu_notifier_retry(kvm, mmu_seq))
> >                 goto out_unlock;
> > -       if (writable) {
> > -               kvm_set_s2pte_writable(&new_pte);
> > -               kvm_set_pfn_dirty(pfn);
> > +       if (!hugetlb && !force_pte)
> > +               hugetlb = transparent_hugepage_adjust(&pfn, &fault_ipa);
> 
> How do we ensure that there is no race between this pte being promoted
> to a pmd and another page allocation in the same pmd somewhere else in
> the system? We only hold the kvm lock here, so there must be some extra
> implicit guarantee somewhere...
> 

Steve answered this one, but the point is that no page covered by this
pmd can be allocated by anyone else, because those pages are not free.  At
this point, if the anon user page is mapped as a THP, it will stay mapped
that way until we unlock the mmu_lock.

> > +       if (hugetlb) {
> > +               pmd_t new_pmd = pfn_pmd(pfn, PAGE_S2);
> > +               new_pmd = pmd_mkhuge(new_pmd);
> > +               if (writable) {
> > +                       kvm_set_s2pmd_writable(&new_pmd);
> > +                       kvm_set_pfn_dirty(pfn);
> > +               }
> > +               ret = stage2_set_pmd(kvm, memcache, fault_ipa, &new_pmd);
> > +       } else {
> > +               pte_t new_pte = pfn_pte(pfn, PAGE_S2);
> > +               if (writable) {
> > +                       kvm_set_s2pte_writable(&new_pte);
> > +                       kvm_set_pfn_dirty(pfn);
> > +               }
> > +               ret = stage2_set_pte(kvm, memcache, fault_ipa, &new_pte, false);
> >         }
> > -       stage2_set_pte(vcpu->kvm, memcache, fault_ipa, &new_pte, false);
> > +
> > 
> >  out_unlock:
> > -       spin_unlock(&vcpu->kvm->mmu_lock);
> > +       spin_unlock(&kvm->mmu_lock);
> >         kvm_release_pfn_clean(pfn);
> > -       return 0;
> > +       return ret;
> >  }
> > 
> >  /**
> > @@ -630,7 +772,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
> > 
> >         memslot = gfn_to_memslot(vcpu->kvm, gfn);
> > 
> > -       ret = user_mem_abort(vcpu, fault_ipa, gfn, memslot, fault_status);
> > +       ret = user_mem_abort(vcpu, fault_ipa, memslot, fault_status);
> >         if (ret == 0)
> >                 ret = 1;
> >  out_unlock:
> > diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> > index efe609c..e7f3852 100644
> > --- a/arch/arm64/include/asm/kvm_mmu.h
> > +++ b/arch/arm64/include/asm/kvm_mmu.h
> > @@ -91,6 +91,7 @@ int kvm_mmu_init(void);
> >  void kvm_clear_hyp_idmap(void);
> > 
> >  #define        kvm_set_pte(ptep, pte)          set_pte(ptep, pte)
> > +#define        kvm_set_pmd(pmdp, pmd)          set_pmd(pmdp, pmd)
> > 
> >  static inline bool kvm_is_write_fault(unsigned long esr)
> >  {
> > @@ -116,13 +117,18 @@ static inline void kvm_set_s2pte_writable(pte_t *pte)
> >         pte_val(*pte) |= PTE_S2_RDWR;
> >  }
> > 
> > +static inline void kvm_set_s2pmd_writable(pmd_t *pmd)
> > +{
> > +       pmd_val(*pmd) |= PTE_S2_RDWR;
> > +}
> > +
> >  struct kvm;
> > 
> > -static inline void coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
> > +static inline void coherent_icache_guest_page(struct kvm *kvm, hva_t hva,
> > +                                             unsigned long size)
> >  {
> >         if (!icache_is_aliasing()) {            /* PIPT */
> > -               unsigned long hva = gfn_to_hva(kvm, gfn);
> > -               flush_icache_range(hva, hva + PAGE_SIZE);
> > +               flush_icache_range(hva, hva + size);
> >         } else if (!icache_is_aivivt()) {       /* non ASID-tagged VIVT */
> >                 /* any kind of VIPT cache */
> >                 __flush_icache_all();
> 
> As a general comment about this patch, I think having both transparent
> pages *and* hugetlbfs support in the same patch makes it fairly hard to
> read.
> 
> I understand that is is actually convenient as most of the code is
> shared, but I think it would make it easier to read if the patch was
> split in two: first the basic hugetlb support, and then the THP madness.
> 
> What do you think?
> 
Sure, I can split them up.  I'm going to send out a v2 with the split
patches so you can read that in its completeness instead of the
fragments I posted above.  I can always spin a v3 if you have more
comments, though I still hope to get this in for the next merge window.

Thanks for the review!
-Christoffer

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 3/3] KVM: ARM: Transparent huge pages and hugetlbfs support
@ 2013-10-02 22:36       ` Christoffer Dall
  0 siblings, 0 replies; 36+ messages in thread
From: Christoffer Dall @ 2013-10-02 22:36 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Sep 23, 2013 at 11:11:07AM +0100, Marc Zyngier wrote:
> Hi Christoffer,
> 
> Finally taking some time to review this patch. Sorry for the delay...
> 
> On 09/08/13 05:07, Christoffer Dall wrote:
> > From: Christoffer Dall <cdall@cs.columbia.edu>
> > 
> > Support transparent huge pages in KVM/ARM 32-bit and 64-bit.  The whole
> > transparent_hugepage_adjust stuff is far from pretty, but this is how
> > it's solved on x86 so we duplicate their logic.  This should be shared
> > across architectures if possible (like many other things), but can
> > always be changed down the road.
> > 
> > The pud_huge checking on the unmap path may feel a bit silly as the
> > pud_huge check is always defined to false, but the compiler should be
> > smart about this.
> > 
> > Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
> > ---
> >  arch/arm/include/asm/kvm_mmu.h   |   17 +++-
> >  arch/arm/kvm/mmu.c               |  200 ++++++++++++++++++++++++++++++++------
> >  arch/arm64/include/asm/kvm_mmu.h |   12 ++-
> >  3 files changed, 194 insertions(+), 35 deletions(-)
> > 
> > diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
> > index 472ac70..ff3ff67 100644
> > --- a/arch/arm/include/asm/kvm_mmu.h
> > +++ b/arch/arm/include/asm/kvm_mmu.h
> > @@ -62,6 +62,12 @@ phys_addr_t kvm_get_idmap_vector(void);
> >  int kvm_mmu_init(void);
> >  void kvm_clear_hyp_idmap(void);
> > 
> > +static inline void kvm_set_pmd(pmd_t *pmd, pmd_t new_pmd)
> > +{
> > +       *pmd = new_pmd;
> > +       flush_pmd_entry(pmd);
> > +}
> > +
> >  static inline void kvm_set_pte(pte_t *pte, pte_t new_pte)
> >  {
> >         pte_val(*pte) = new_pte;
> > @@ -103,9 +109,15 @@ static inline void kvm_set_s2pte_writable(pte_t *pte)
> >         pte_val(*pte) |= L_PTE_S2_RDWR;
> >  }
> > 
> > +static inline void kvm_set_s2pmd_writable(pmd_t *pmd)
> > +{
> > +       pmd_val(*pmd) |= L_PTE_S2_RDWR;
> > +}
> > +
> >  struct kvm;
> > 
> > -static inline void coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
> > +static inline void coherent_icache_guest_page(struct kvm *kvm, hva_t hva,
> > +                                             unsigned long size)
> >  {
> >         /*
> >          * If we are going to insert an instruction page and the icache is
> > @@ -120,8 +132,7 @@ static inline void coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
> >          * need any kind of flushing (DDI 0406C.b - Page B3-1392).
> >          */
> >         if (icache_is_pipt()) {
> > -               unsigned long hva = gfn_to_hva(kvm, gfn);
> > -               __cpuc_coherent_user_range(hva, hva + PAGE_SIZE);
> > +               __cpuc_coherent_user_range(hva, hva + size);
> >         } else if (!icache_is_vivt_asid_tagged()) {
> >                 /* any kind of VIPT cache */
> >                 __flush_icache_all();
> > diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> > index 0988d9e..26ced77 100644
> > --- a/arch/arm/kvm/mmu.c
> > +++ b/arch/arm/kvm/mmu.c
> > @@ -19,6 +19,7 @@
> >  #include <linux/mman.h>
> >  #include <linux/kvm_host.h>
> >  #include <linux/io.h>
> > +#include <linux/hugetlb.h>
> >  #include <trace/events/kvm.h>
> >  #include <asm/pgalloc.h>
> >  #include <asm/cacheflush.h>
> > @@ -41,6 +42,8 @@ static unsigned long hyp_idmap_start;
> >  static unsigned long hyp_idmap_end;
> >  static phys_addr_t hyp_idmap_vector;
> > 
> > +#define kvm_pmd_huge(_x)       (pmd_huge(_x) || pmd_trans_huge(_x))
> > +
> >  static void kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa)
> >  {
> >         /*
> > @@ -93,19 +96,27 @@ static bool page_empty(void *ptr)
> > 
> >  static void clear_pud_entry(struct kvm *kvm, pud_t *pud, phys_addr_t addr)
> >  {
> > -       pmd_t *pmd_table = pmd_offset(pud, 0);
> > -       pud_clear(pud);
> > -       kvm_tlb_flush_vmid_ipa(kvm, addr);
> > -       pmd_free(NULL, pmd_table);
> > +       if (pud_huge(*pud)) {
> > +               pud_clear(pud);
> > +       } else {
> > +               pmd_t *pmd_table = pmd_offset(pud, 0);
> > +               pud_clear(pud);
> > +               kvm_tlb_flush_vmid_ipa(kvm, addr);
> > +               pmd_free(NULL, pmd_table);
> > +       }
> 
> Why don't we have any TLB invalidation on the huge path?
> 

Because this code was written before we fixed the TLB invalidation race.
Good that you spotted this.
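
Roughly, the fix would be to flush on the huge path as well, along these
lines (sketch; clear_pmd_entry() would get the same treatment for the
kvm_pmd_huge() case):

static void clear_pud_entry(struct kvm *kvm, pud_t *pud, phys_addr_t addr)
{
	if (pud_huge(*pud)) {
		pud_clear(pud);
		kvm_tlb_flush_vmid_ipa(kvm, addr);
	} else {
		pmd_t *pmd_table = pmd_offset(pud, 0);
		pud_clear(pud);
		kvm_tlb_flush_vmid_ipa(kvm, addr);
		pmd_free(NULL, pmd_table);
	}
	put_page(virt_to_page(pud));
}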

> >         put_page(virt_to_page(pud));
> >  }
> > 
> >  static void clear_pmd_entry(struct kvm *kvm, pmd_t *pmd, phys_addr_t addr)
> >  {
> > -       pte_t *pte_table = pte_offset_kernel(pmd, 0);
> > -       pmd_clear(pmd);
> > -       kvm_tlb_flush_vmid_ipa(kvm, addr);
> > -       pte_free_kernel(NULL, pte_table);
> > +       if (kvm_pmd_huge(*pmd)) {
> > +               pmd_clear(pmd);
> > +       } else {
> > +               pte_t *pte_table = pte_offset_kernel(pmd, 0);
> > +               pmd_clear(pmd);
> > +               kvm_tlb_flush_vmid_ipa(kvm, addr);
> > +               pte_free_kernel(NULL, pte_table);
> > +       }
> 
> Same here.
> 

Same here.

> >         put_page(virt_to_page(pmd));
> >  }
> > 
> > @@ -136,12 +147,37 @@ static void unmap_range(struct kvm *kvm, pgd_t *pgdp,
> >                         continue;
> >                 }
> > 
> > +               if (pud_huge(*pud)) {
> > +                       /*
> > +                        * If we are dealing with a huge pud, just clear it and
> > +                        * move on.
> > +                        */
> > +                       clear_pud_entry(kvm, pud, addr);
> > +                       addr = pud_addr_end(addr, end);
> > +                       continue;
> > +               }
> > +
> >                 pmd = pmd_offset(pud, addr);
> >                 if (pmd_none(*pmd)) {
> >                         addr = pmd_addr_end(addr, end);
> >                         continue;
> >                 }
> > 
> > +               if (kvm_pmd_huge(*pmd)) {
> > +                       /*
> > +                        * If we are dealing with a huge pmd, just clear it and
> > +                        * walk back up the ladder.
> > +                        */
> > +                       clear_pmd_entry(kvm, pmd, addr);
> > +                       next = pmd_addr_end(addr, end);
> > +                       if (page_empty(pmd) && !page_empty(pud)) {
> > +                               clear_pud_entry(kvm, pud, addr);
> > +                               next = pud_addr_end(addr, end);
> > +                       }
> > +                       addr = next;
> > +                       continue;
> > +               }
> 
> Looks like we already have the exact same sequence a few lines below.
> Surely we can factor it out.
> 

Yeah, the thing is that having a function here requires us to pass in a
bunch of things, and either return the next addr or pass it by reference,
and while it does save a few lines, I found the code harder to read.  This
is another option:


static void unmap_range(struct kvm *kvm, pgd_t *pgdp,
			unsigned long long start, u64 size)
{
	pgd_t *pgd;
	pud_t *pud;
	pmd_t *pmd;
	pte_t *pte;
	unsigned long long addr = start, end = start + size;
	u64 next;

	while (addr < end) {
		pgd = pgdp + pgd_index(addr);
		pud = pud_offset(pgd, addr);
		if (pud_none(*pud)) {
			addr = pud_addr_end(addr, end);
			continue;
		}

		if (pud_huge(*pud)) {
			/*
			 * If we are dealing with a huge pud, just clear it and
			 * move on.
			 */
			clear_pud_entry(kvm, pud, addr);
			addr = pud_addr_end(addr, end);
			continue;
		}

		pmd = pmd_offset(pud, addr);
		if (pmd_none(*pmd)) {
			addr = pmd_addr_end(addr, end);
			continue;
		}

		if (!kvm_pmd_huge(*pmd)) {
			pte = pte_offset_kernel(pmd, addr);
			clear_pte_entry(kvm, pte, addr);
			next = addr + PAGE_SIZE;
		}

		/*
		 * If the pmd entry is to be cleared, walk back up the ladder
		 */
		if (kvm_pmd_huge(*pmd) || page_empty(pte)) {
			clear_pmd_entry(kvm, pmd, addr);
			next = pmd_addr_end(addr, end);
			if (page_empty(pmd) && !page_empty(pud)) {
				clear_pud_entry(kvm, pud, addr);
				next = pud_addr_end(addr, end);
			}
		}

		addr = next;
	}
}

> > +
> >                 pte = pte_offset_kernel(pmd, addr);
> >                 clear_pte_entry(kvm, pte, addr);
> >                 next = addr + PAGE_SIZE;
> > @@ -420,6 +456,46 @@ void kvm_free_stage2_pgd(struct kvm *kvm)
> >         kvm->arch.pgd = NULL;
> >  }
> > 
> > +/* Set a huge page pmd entry */
> 
> Is it only for huge PMDs? Or can we try to share that code with what is
> in stage2_set_pte?
> 

We can share it, much nicer actually:


diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index e2c58f5..345eefc 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -456,26 +447,33 @@ void kvm_free_stage2_pgd(struct kvm *kvm)
 	kvm->arch.pgd = NULL;
 }
 
-/* Set a huge page pmd entry */
-static int stage2_set_pmd(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
-			  phys_addr_t addr, const pmd_t *new_pmd)
+static pmd_t *stage2_get_pmd(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
+			     phys_addr_t addr)
 {
 	pgd_t *pgd;
 	pud_t *pud;
-	pmd_t *pmd, old_pmd;
+	pmd_t *pmd;
 
-	/* Create stage-2 page table mapping - Level 1 */
 	pgd = kvm->arch.pgd + pgd_index(addr);
 	pud = pud_offset(pgd, addr);
 	if (pud_none(*pud)) {
 		if (!cache)
-			return 0; /* ignore calls from kvm_set_spte_hva */
+			return NULL;
 		pmd = mmu_memory_cache_alloc(cache);
 		pud_populate(NULL, pud, pmd);
 		get_page(virt_to_page(pud));
 	}
 
-	pmd = pmd_offset(pud, addr);
+	return pmd_offset(pud, addr);
+}
+
+static int stage2_set_pmd_huge(struct kvm *kvm, struct kvm_mmu_memory_cache
+			       *cache, phys_addr_t addr, const pmd_t *new_pmd)
+{
+	pmd_t *pmd, old_pmd;
+
+	pmd = stage2_get_pmd(kvm, cache, addr);
+	VM_BUG_ON(!pmd);
 
 	/*
 	 * Mapping in huge pages should only happen through a fault.  If a
@@ -500,25 +498,20 @@ static int stage2_set_pmd(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
 static int stage2_set_pte(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
 			  phys_addr_t addr, const pte_t *new_pte, bool iomap)
 {
-	pgd_t *pgd;
-	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte, old_pte;
 
-	/* Create 2nd stage page table mapping - Level 1 */
-	pgd = kvm->arch.pgd + pgd_index(addr);
-	pud = pud_offset(pgd, addr);
-	if (pud_none(*pud)) {
-		if (!cache)
-			return 0; /* ignore calls from kvm_set_spte_hva */
-		pmd = mmu_memory_cache_alloc(cache);
-		pud_populate(NULL, pud, pmd);
-		get_page(virt_to_page(pud));
+	/* Create stage-2 page table mapping - Level 1 */
+	pmd = stage2_get_pmd(kvm, cache, addr);
+	if (!pmd) {
+		/*
+		 * Ignore calls from kvm_set_spte_hva for unallocated
+		 * address ranges.
+		 */
+		return 0;
 	}
 
-	pmd = pmd_offset(pud, addr);
-
-	/* Create 2nd stage page mappings - Level 2 */
+	/* Create stage-2 page mappings - Level 2 */
 	if (pmd_none(*pmd)) {
 		if (!cache)
 			return 0; /* ignore calls from kvm_set_spte_hva */
@@ -688,7 +681,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			kvm_set_s2pmd_writable(&new_pmd);
 			kvm_set_pfn_dirty(pfn);
 		}
-		ret = stage2_set_pmd(kvm, memcache, fault_ipa, &new_pmd);
+		ret = stage2_set_pmd_huge(kvm, memcache, fault_ipa, &new_pmd);
 	} else {
 		pte_t new_pte = pfn_pte(pfn, PAGE_S2);
 		if (writable) {

> > +static int stage2_set_pmd(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
> > +                         phys_addr_t addr, const pmd_t *new_pmd)
> > +{
> > +       pgd_t *pgd;
> > +       pud_t *pud;
> > +       pmd_t *pmd, old_pmd;
> > +
> > +       /* Create stage-2 page table mapping - Level 1 */
> > +       pgd = kvm->arch.pgd + pgd_index(addr);
> > +       pud = pud_offset(pgd, addr);
> > +       if (pud_none(*pud)) {
> > +               if (!cache)
> > +                       return 0; /* ignore calls from kvm_set_spte_hva */
> 
> Do we still need this?
> 

We didn't need it the way the code was structured in the initial
version, but we do now that the code is shared with stage2_set_pte (see
above).

> > +               pmd = mmu_memory_cache_alloc(cache);
> > +               pud_populate(NULL, pud, pmd);
> > +               get_page(virt_to_page(pud));
> > +       }
> > +
> > +       pmd = pmd_offset(pud, addr);
> > +
> > +       /*
> > +        * Mapping in huge pages should only happen through a fault.  If a
> > +        * page is merged into a transparent huge page, the individual
> > +        * subpages of that huge page should be unmapped through MMU
> > +        * notifiers before we get here.
> > +        *
> > +        * Merging of CompoundPages is not supported; they should become
> > +        * splitting first, unmapped, merged, and mapped back in on-demand.
> > +        */
> > +       VM_BUG_ON(pmd_present(*pmd) && pmd_pfn(*pmd) != pmd_pfn(*new_pmd));
> > +
> > +       old_pmd = *pmd;
> > +       kvm_set_pmd(pmd, *new_pmd);
> > +       if (pmd_present(old_pmd))
> > +               kvm_tlb_flush_vmid_ipa(kvm, addr);
> > +       else
> > +               get_page(virt_to_page(pmd));
> > +       return 0;
> > +}
> > 
> >  static int stage2_set_pte(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
> >                           phys_addr_t addr, const pte_t *new_pte, bool iomap)
> > @@ -442,7 +518,7 @@ static int stage2_set_pte(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
> > 
> >         pmd = pmd_offset(pud, addr);
> > 
> > -       /* Create 2nd stage page table mapping - Level 2 */
> > +       /* Create 2nd stage page mappings - Level 2 */
> >         if (pmd_none(*pmd)) {
> >                 if (!cache)
> >                         return 0; /* ignore calls from kvm_set_spte_hva */
> > @@ -508,16 +584,53 @@ out:
> >         return ret;
> >  }
> > 
> > +static bool transparent_hugepage_adjust(pfn_t *pfnp, phys_addr_t *ipap)
> > +{
> > +       pfn_t pfn = *pfnp;
> > +       gfn_t gfn = *ipap >> PAGE_SHIFT;
> > +
> > +       if (PageTransCompound(pfn_to_page(pfn))) {
> > +               unsigned long mask;
> > +               /*
> > +                * mmu_notifier_retry was successful and we hold the
> > +                * mmu_lock here, so the pmd can't become splitting
> > +                * from under us, and in turn
> > +                * __split_huge_page_refcount() can't run from under
> > +                * us and we can safely transfer the refcount from
> > +                * PG_tail to PG_head as we switch the pfn from tail to
> > +                * head.
> > +                */
> 
> -ECANTPARSE. Well, I sort of can, but this deserves a clearer explanation.
> 

I'll try to clarify.  (BTW. I stole this comment from Avi, blame the
both of us :))


> > +               mask = (PMD_SIZE / PAGE_SIZE) - 1;
> 
> 		mask = PTRS_PER_PMD -1;
> 
> > +               VM_BUG_ON((gfn & mask) != (pfn & mask));
> > +               if (pfn & mask) {
> > +                       gfn &= ~mask;
> 
> This doesn't seem to be used later on.
> 
> > +                       *ipap &= ~(PMD_SIZE - 1);
> 
> 			*ipap &= ~PMD_MASK;
> 

ack to all three

> > +                       kvm_release_pfn_clean(pfn);
> > +                       pfn &= ~mask;
> > +                       kvm_get_pfn(pfn);
> > +                       *pfnp = pfn;
> > +               }
> 
> 
> 
> > +               return true;
> > +       }
> > +
> > +       return false;
> > +}
> > +
> >  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > -                         gfn_t gfn, struct kvm_memory_slot *memslot,
> > +                         struct kvm_memory_slot *memslot,
> >                           unsigned long fault_status)
> >  {
> > -       pte_t new_pte;
> > -       pfn_t pfn;
> >         int ret;
> > -       bool write_fault, writable;
> > +       bool write_fault, writable, hugetlb = false, force_pte = false;
> >         unsigned long mmu_seq;
> > +       gfn_t gfn = fault_ipa >> PAGE_SHIFT;
> > +       unsigned long hva = gfn_to_hva(vcpu->kvm, gfn);
> > +       struct kvm *kvm = vcpu->kvm;
> >         struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
> > +       struct vm_area_struct *vma;
> > +       pfn_t pfn;
> > +       unsigned long psize;
> > 
> >         write_fault = kvm_is_write_fault(kvm_vcpu_get_hsr(vcpu));
> >         if (fault_status == FSC_PERM && !write_fault) {
> > @@ -525,6 +638,27 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >                 return -EFAULT;
> >         }
> > 
> > +       /* Let's check if we will get back a huge page */
> > +       down_read(&current->mm->mmap_sem);
> > +       vma = find_vma_intersection(current->mm, hva, hva + 1);
> > +       if (is_vm_hugetlb_page(vma)) {
> > +               hugetlb = true;
> > +               hva &= PMD_MASK;
> > +               gfn = (fault_ipa & PMD_MASK) >> PAGE_SHIFT;
> > +               psize = PMD_SIZE;
> > +       } else {
> > +               psize = PAGE_SIZE;
> > +               if (vma->vm_start & ~PMD_MASK)
> > +                       force_pte = true;
> > +       }
> > +       up_read(&current->mm->mmap_sem);
> > +
> > +       pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
> > +       if (is_error_pfn(pfn))
> > +               return -EFAULT;
> 
> How does this work with respect to the comment that talks about reading
> mmu_seq before calling gfp_to_pfn_prot? Either the comment is wrong, or
> we read this too early.
> 

It doesn't; the gfn_to_pfn_prot call should be moved below the smp_rmb().
I must have simply messed this up when rebasing the patch.  Nice catch!

> > +       coherent_icache_guest_page(kvm, hva, psize);
> > +
> >         /* We need minimum second+third level pages */
> >         ret = mmu_topup_memory_cache(memcache, 2, KVM_NR_MEM_OBJS);
> >         if (ret)
> > @@ -542,26 +676,34 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >          */
> >         smp_rmb();
> > 
> > -       pfn = gfn_to_pfn_prot(vcpu->kvm, gfn, write_fault, &writable);
> > -       if (is_error_pfn(pfn))
> > -               return -EFAULT;
> > -
> > -       new_pte = pfn_pte(pfn, PAGE_S2);
> > -       coherent_icache_guest_page(vcpu->kvm, gfn);
> > -
> > -       spin_lock(&vcpu->kvm->mmu_lock);
> > -       if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
> > +       spin_lock(&kvm->mmu_lock);
> > +       if (mmu_notifier_retry(kvm, mmu_seq))
> >                 goto out_unlock;
> > -       if (writable) {
> > -               kvm_set_s2pte_writable(&new_pte);
> > -               kvm_set_pfn_dirty(pfn);
> > +       if (!hugetlb && !force_pte)
> > +               hugetlb = transparent_hugepage_adjust(&pfn, &fault_ipa);
> 
> How do we ensure that there is no race between this pte being promoted
> to a pmd and another page allocation in the same pmd somewhere else in
> the system? We only hold the kvm lock here, so there must be some extra
> implicit guarantee somewhere...
> 

Steve answered this one, but the point is that no page covered by this
pmd can be allocated by anyone else, because those pages are not free.  At
this point, if the anon user page is mapped as a THP, it will stay mapped
that way until we unlock the mmu_lock.

> > +       if (hugetlb) {
> > +               pmd_t new_pmd = pfn_pmd(pfn, PAGE_S2);
> > +               new_pmd = pmd_mkhuge(new_pmd);
> > +               if (writable) {
> > +                       kvm_set_s2pmd_writable(&new_pmd);
> > +                       kvm_set_pfn_dirty(pfn);
> > +               }
> > +               ret = stage2_set_pmd(kvm, memcache, fault_ipa, &new_pmd);
> > +       } else {
> > +               pte_t new_pte = pfn_pte(pfn, PAGE_S2);
> > +               if (writable) {
> > +                       kvm_set_s2pte_writable(&new_pte);
> > +                       kvm_set_pfn_dirty(pfn);
> > +               }
> > +               ret = stage2_set_pte(kvm, memcache, fault_ipa, &new_pte, false);
> >         }
> > -       stage2_set_pte(vcpu->kvm, memcache, fault_ipa, &new_pte, false);
> > +
> > 
> >  out_unlock:
> > -       spin_unlock(&vcpu->kvm->mmu_lock);
> > +       spin_unlock(&kvm->mmu_lock);
> >         kvm_release_pfn_clean(pfn);
> > -       return 0;
> > +       return ret;
> >  }
> > 
> >  /**
> > @@ -630,7 +772,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
> > 
> >         memslot = gfn_to_memslot(vcpu->kvm, gfn);
> > 
> > -       ret = user_mem_abort(vcpu, fault_ipa, gfn, memslot, fault_status);
> > +       ret = user_mem_abort(vcpu, fault_ipa, memslot, fault_status);
> >         if (ret == 0)
> >                 ret = 1;
> >  out_unlock:
> > diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> > index efe609c..e7f3852 100644
> > --- a/arch/arm64/include/asm/kvm_mmu.h
> > +++ b/arch/arm64/include/asm/kvm_mmu.h
> > @@ -91,6 +91,7 @@ int kvm_mmu_init(void);
> >  void kvm_clear_hyp_idmap(void);
> > 
> >  #define        kvm_set_pte(ptep, pte)          set_pte(ptep, pte)
> > +#define        kvm_set_pmd(pmdp, pmd)          set_pmd(pmdp, pmd)
> > 
> >  static inline bool kvm_is_write_fault(unsigned long esr)
> >  {
> > @@ -116,13 +117,18 @@ static inline void kvm_set_s2pte_writable(pte_t *pte)
> >         pte_val(*pte) |= PTE_S2_RDWR;
> >  }
> > 
> > +static inline void kvm_set_s2pmd_writable(pmd_t *pmd)
> > +{
> > +       pmd_val(*pmd) |= PTE_S2_RDWR;
> > +}
> > +
> >  struct kvm;
> > 
> > -static inline void coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
> > +static inline void coherent_icache_guest_page(struct kvm *kvm, hva_t hva,
> > +                                             unsigned long size)
> >  {
> >         if (!icache_is_aliasing()) {            /* PIPT */
> > -               unsigned long hva = gfn_to_hva(kvm, gfn);
> > -               flush_icache_range(hva, hva + PAGE_SIZE);
> > +               flush_icache_range(hva, hva + size);
> >         } else if (!icache_is_aivivt()) {       /* non ASID-tagged VIVT */
> >                 /* any kind of VIPT cache */
> >                 __flush_icache_all();
> 
> As a general comment about this patch, I think having both transparent
> pages *and* hugetlbfs support in the same patch makes it fairly hard to
> read.
> 
> I understand that is is actually convenient as most of the code is
> shared, but I think it would make it easier to read if the patch was
> split in two: first the basic hugetlb support, and then the THP madness.
> 
> What do you think?
> 
Sure, I can split them up.  I'm going to send out a v2 with the split
patches so you can read that in its completeness instead of the
fragments I posted above.  I can always spin a v3 if you have more
comments, though I still hope to get this in for the next merge window.

Thanks for the review!
-Christoffer

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/3] KVM: ARM: Transparent huge pages and hugetlbfs support
  2013-09-23 10:11     ` Marc Zyngier
@ 2013-10-03 20:33       ` Christoffer Dall
  -1 siblings, 0 replies; 36+ messages in thread
From: Christoffer Dall @ 2013-10-03 20:33 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvmarm, kvm, linux-arm-kernel, linaro-kernel, patches, Christoffer Dall

On Mon, Sep 23, 2013 at 11:11:07AM +0100, Marc Zyngier wrote:

[...]

> > 
> > +static bool transparent_hugepage_adjust(pfn_t *pfnp, phys_addr_t *ipap)
> > +{
> > +       pfn_t pfn = *pfnp;
> > +       gfn_t gfn = *ipap >> PAGE_SHIFT;
> > +
> > +       if (PageTransCompound(pfn_to_page(pfn))) {
> > +               unsigned long mask;
> > +               /*
> > +                * mmu_notifier_retry was successful and we hold the
> > +                * mmu_lock here, so the pmd can't become splitting
> > +                * from under us, and in turn
> > +                * __split_huge_page_refcount() can't run from under
> > +                * us and we can safely transfer the refcount from
> > +                * PG_tail to PG_head as we switch the pfn from tail to
> > +                * head.
> > +                */
> 
> -ECANTPARSE. Well, I sort of can, but this deserves a clearer explanation.
> 
> > +               mask = (PMD_SIZE / PAGE_SIZE) - 1;
> 
> 		mask = PTRS_PER_PMD -1;
> 
> > +               VM_BUG_ON((gfn & mask) != (pfn & mask));
> > +               if (pfn & mask) {
> > +                       gfn &= ~mask;
> 
> This doesn't seem to be used later on.
> 
> > +                       *ipap &= ~(PMD_SIZE - 1);
> 
> 			*ipap &= ~PMD_MASK;
> 
damn, I trust you too much, you surely meant
			*ipap &= PMD_MASK;

right?
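
For reference, PMD_MASK is ~(PMD_SIZE - 1), so as a quick sketch of the
difference:

	addr & PMD_MASK;	/* rounds addr down to the PMD boundary -- what we want here */
	addr & ~PMD_MASK;	/* keeps only the offset within the PMD */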

> > +                       kvm_release_pfn_clean(pfn);
> > +                       pfn &= ~mask;
> > +                       kvm_get_pfn(pfn);
> > +                       *pfnp = pfn;
> > +               }
> 
> 
> 
> > +               return true;
> > +       }
> > +
> > +       return false;
> > +}
> > +

-Christoffer

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 3/3] KVM: ARM: Transparent huge pages and hugetlbfs support
@ 2013-10-03 20:33       ` Christoffer Dall
  0 siblings, 0 replies; 36+ messages in thread
From: Christoffer Dall @ 2013-10-03 20:33 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Sep 23, 2013 at 11:11:07AM +0100, Marc Zyngier wrote:

[...]

> > 
> > +static bool transparent_hugepage_adjust(pfn_t *pfnp, phys_addr_t *ipap)
> > +{
> > +       pfn_t pfn = *pfnp;
> > +       gfn_t gfn = *ipap >> PAGE_SHIFT;
> > +
> > +       if (PageTransCompound(pfn_to_page(pfn))) {
> > +               unsigned long mask;
> > +               /*
> > +                * mmu_notifier_retry was successful and we hold the
> > +                * mmu_lock here, so the pmd can't become splitting
> > +                * from under us, and in turn
> > +                * __split_huge_page_refcount() can't run from under
> > +                * us and we can safely transfer the refcount from
> > +                * PG_tail to PG_head as we switch the pfn from tail to
> > +                * head.
> > +                */
> 
> -ECANTPARSE. Well, I sort of can, but this deserves a clearer explanation.
> 
> > +               mask = (PMD_SIZE / PAGE_SIZE) - 1;
> 
> 		mask = PTRS_PER_PMD -1;
> 
> > +               VM_BUG_ON((gfn & mask) != (pfn & mask));
> > +               if (pfn & mask) {
> > +                       gfn &= ~mask;
> 
> This doesn't seem to be used later on.
> 
> > +                       *ipap &= ~(PMD_SIZE - 1);
> 
> 			*ipap &= ~PMD_MASK;
> 
damn, I trust you too much, you surely meant
			*ipap &= PMD_MASK;

right?

> > +                       kvm_release_pfn_clean(pfn);
> > +                       pfn &= ~mask;
> > +                       kvm_get_pfn(pfn);
> > +                       *pfnp = pfn;
> > +               }
> 
> 
> 
> > +               return true;
> > +       }
> > +
> > +       return false;
> > +}
> > +

-Christoffer

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/3] KVM: ARM: Transparent huge pages and hugetlbfs  support
  2013-10-03 20:33       ` Christoffer Dall
@ 2013-10-04  9:23         ` Marc Zyngier
  -1 siblings, 0 replies; 36+ messages in thread
From: Marc Zyngier @ 2013-10-04  9:23 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: Christoffer Dall, linaro-kernel, kvm, patches, kvmarm, linux-arm-kernel

On 2013-10-03 21:33, Christoffer Dall wrote:
> On Mon, Sep 23, 2013 at 11:11:07AM +0100, Marc Zyngier wrote:
>
> [...]
>
>> >
>> > +static bool transparent_hugepage_adjust(pfn_t *pfnp, phys_addr_t 
>> *ipap)
>> > +{
>> > +       pfn_t pfn = *pfnp;
>> > +       gfn_t gfn = *ipap >> PAGE_SHIFT;
>> > +
>> > +       if (PageTransCompound(pfn_to_page(pfn))) {
>> > +               unsigned long mask;
>> > +               /*
>> > +                * mmu_notifier_retry was successful and we hold 
>> the
>> > +                * mmu_lock here, so the pmd can't become 
>> splitting
>> > +                * from under us, and in turn
>> > +                * __split_huge_page_refcount() can't run from 
>> under
>> > +                * us and we can safely transfer the refcount from
>> > +                * PG_tail to PG_head as we switch the pfn from 
>> tail to
>> > +                * head.
>> > +                */
>>
>> -ECANTPARSE. Well, I sort of can, but this deserves a clearer 
>> explanation.
>>
>> > +               mask = (PMD_SIZE / PAGE_SIZE) - 1;
>>
>> 		mask = PTRS_PER_PMD -1;
>>
>> > +               VM_BUG_ON((gfn & mask) != (pfn & mask));
>> > +               if (pfn & mask) {
>> > +                       gfn &= ~mask;
>>
>> This doesn't seem to be used later on.
>>
>> > +                       *ipap &= ~(PMD_SIZE - 1);
>>
>> 			*ipap &= ~PMD_MASK;
>>
> damn, I trust you too much, you surely meant
> 			*ipap &= PMD_MASK;
>
> right?

Indeed. Just keeping you on your toes ;-)

         M.
-- 
Fast, cheap, reliable. Pick two.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 3/3] KVM: ARM: Transparent huge pages and hugetlbfs support
@ 2013-10-04  9:23         ` Marc Zyngier
  0 siblings, 0 replies; 36+ messages in thread
From: Marc Zyngier @ 2013-10-04  9:23 UTC (permalink / raw)
  To: linux-arm-kernel

On 2013-10-03 21:33, Christoffer Dall wrote:
> On Mon, Sep 23, 2013 at 11:11:07AM +0100, Marc Zyngier wrote:
>
> [...]
>
>> >
>> > +static bool transparent_hugepage_adjust(pfn_t *pfnp, phys_addr_t 
>> *ipap)
>> > +{
>> > +       pfn_t pfn = *pfnp;
>> > +       gfn_t gfn = *ipap >> PAGE_SHIFT;
>> > +
>> > +       if (PageTransCompound(pfn_to_page(pfn))) {
>> > +               unsigned long mask;
>> > +               /*
>> > +                * mmu_notifier_retry was successful and we hold 
>> the
>> > +                * mmu_lock here, so the pmd can't become 
>> splitting
>> > +                * from under us, and in turn
>> > +                * __split_huge_page_refcount() can't run from 
>> under
>> > +                * us and we can safely transfer the refcount from
>> > +                * PG_tail to PG_head as we switch the pfn from 
>> tail to
>> > +                * head.
>> > +                */
>>
>> -ECANTPARSE. Well, I sort of can, but this deserves a clearer 
>> explanation.
>>
>> > +               mask = (PMD_SIZE / PAGE_SIZE) - 1;
>>
>> 		mask = PTRS_PER_PMD -1;
>>
>> > +               VM_BUG_ON((gfn & mask) != (pfn & mask));
>> > +               if (pfn & mask) {
>> > +                       gfn &= ~mask;
>>
>> This doesn't seem to be used later on.
>>
>> > +                       *ipap &= ~(PMD_SIZE - 1);
>>
>> 			*ipap &= ~PMD_MASK;
>>
> damn, I trust you too much, you surely meant
> 			*ipap &= PMD_MASK;
>
> right?

Indeed. Just keeping you on your toes ;-)

         M.
-- 
Fast, cheap, reliable. Pick two.

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2013-10-04  9:23 UTC | newest]

Thread overview: 36+ messages
2013-08-09  4:07 [PATCH 0/3] KVM/ARM Huge pages support Christoffer Dall
2013-08-09  4:07 ` Christoffer Dall
2013-08-09  4:07 ` [PATCH 1/3] KVM: Move gfn_to_index to x86 specific code Christoffer Dall
2013-08-09  4:07   ` Christoffer Dall
2013-08-09  4:07 ` [PATCH 2/3] KVM: ARM: Get rid of KVM_HPAGE_XXX defines Christoffer Dall
2013-08-09  4:07   ` Christoffer Dall
2013-08-09  4:07 ` [PATCH 3/3] KVM: ARM: Transparent huge pages and hugetlbfs support Christoffer Dall
2013-08-09  4:07   ` Christoffer Dall
2013-09-23 10:11   ` Marc Zyngier
2013-09-23 10:11     ` Marc Zyngier
2013-09-23 14:46     ` Marc Zyngier
2013-09-23 14:46       ` Marc Zyngier
2013-09-24 15:41     ` Steve Capper
2013-09-24 15:41       ` Steve Capper
2013-10-02 22:36     ` Christoffer Dall
2013-10-02 22:36       ` Christoffer Dall
2013-10-03 20:33     ` Christoffer Dall
2013-10-03 20:33       ` Christoffer Dall
2013-10-04  9:23       ` Marc Zyngier
2013-10-04  9:23         ` Marc Zyngier
2013-08-09 14:30 ` [PATCH 2/3] KVM: ARM: Get rid of KVM_HPAGE_ defines Christoffer Dall
2013-08-09 14:30   ` Christoffer Dall
2013-08-25 14:05   ` Gleb Natapov
2013-08-25 14:05     ` Gleb Natapov
2013-08-25 14:29     ` Peter Maydell
2013-08-25 14:29       ` Peter Maydell
2013-08-25 14:48       ` Gleb Natapov
2013-08-25 14:48         ` Gleb Natapov
2013-08-25 15:18         ` Peter Maydell
2013-08-25 15:18           ` Peter Maydell
2013-08-25 15:27           ` Alexander Graf
2013-08-25 15:27             ` Alexander Graf
2013-08-26 10:55             ` Gleb Natapov
2013-08-26 10:55               ` Gleb Natapov
2013-08-26  0:46           ` Christoffer Dall
2013-08-26  0:46             ` Christoffer Dall
