* [PATCH 00/14] KVM: x86/mmu: Huge page fixes, cleanup, and DAX
@ 2020-01-08 20:24 Sean Christopherson
  2020-01-08 20:24 ` [PATCH 01/14] KVM: x86/mmu: Enforce max_level on HugeTLB mappings Sean Christopherson
                   ` (14 more replies)
  0 siblings, 15 replies; 22+ messages in thread
From: Sean Christopherson @ 2020-01-08 20:24 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Paul Mackerras, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Andrew Morton, Marc Zyngier,
	James Morse, Julien Thierry, Suzuki K Poulose, kvm-ppc, kvm,
	linux-kernel, linux-mm, linux-arm-kernel, kvmarm,
	syzbot+c9d1fb51ac9d0d10c39d, Andrea Arcangeli, Dan Williams,
	Barret Rhoden, David Hildenbrand, Jason Zeng, Dave Jiang,
	Liran Alon, linux-nvdimm

This series is a mix of bug fixes, cleanup and new support in KVM's
handling of huge pages.  The series initially stemmed from a syzkaller
bug report[1], which is fixed by patch 02, "mm: thp: KVM: Explicitly
check for THP when populating secondary MMU".

While investigating options for fixing the syzkaller bug, I realized KVM
could reuse the approach from Barret's series to enable huge pages for DAX
mappings in KVM[2] for all types of huge mappings, i.e. walk the host page
tables instead of querying metadata (patches 05 - 09).

Walking the host page tables sidesteps the issues with refcounting and
identifying THP mappings (in theory), and using a common method for
identifying huge mappings should improve (haven't actually measured) KVM's
overall page fault latency by eliminating the vma lookup that is currently
used to identify HugeTLB mappings.  Eliminating the HugeTLB specific code
also allows for additional cleanup (patches 10 - 13).
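
For illustration, the common path boils down to roughly the following
(a sketch only, not code from the patches; sketch_host_mapping_level()
is a made-up name and the slot/hva error handling is elided, but the
helpers it calls are real ones used or added in this series):

	/* Minimal sketch of the common "walk the host page tables" idea. */
	static int sketch_host_mapping_level(struct kvm_vcpu *vcpu, gfn_t gfn,
					     int max_level)
	{
		struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
		unsigned long hva = __gfn_to_hva_memslot(slot, gfn);
		unsigned int level;
		pte_t *pte;

		/* Works for THP, HugeTLB and DAX alike, no struct page metadata. */
		pte = lookup_address_in_mm(vcpu->kvm->mm, hva, &level);
		if (!pte)
			return PT_PAGE_TABLE_LEVEL;

		/* PT_*_LEVEL and PG_LEVEL_* share the same numbering on x86. */
		return min_t(int, level, max_level);
	}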

Testing the page walk approach revealed several pre-existing bugs that
are included here (patches 01, 03 and 04) because the changes interact
with the rest of the series, e.g. without the read-only memslots fix,
walking the host page tables without explicitly filtering out HugeTLB
mappings would pick up read-only memslots and introduce a completely
unintended functional change.

Lastly, with the page walk infrastructure in place, supporting DAX-based
huge mappings becomes a trivial change (patch 14).

Based on kvm/queue, commit e41a90be9659 ("KVM: x86/mmu: WARN if root_hpa
is invalid when handling a page fault")

Paolo, assuming I understand your workflow, patch 01 can be squashed with
the buggy commit as it's still sitting in kvm/queue.

[1] https://lkml.kernel.org/r/0000000000003cffc30599d3d1a0@google.com
[2] https://lkml.kernel.org/r/20191212182238.46535-1-brho@google.com

Sean Christopherson (14):
  KVM: x86/mmu: Enforce max_level on HugeTLB mappings
  mm: thp: KVM: Explicitly check for THP when populating secondary MMU
  KVM: Use vcpu-specific gva->hva translation when querying host page
    size
  KVM: Play nice with read-only memslots when querying host page size
  x86/mm: Introduce lookup_address_in_mm()
  KVM: x86/mmu: Refactor THP adjust to prep for changing query
  KVM: x86/mmu: Walk host page tables to find THP mappings
  KVM: x86/mmu: Drop level optimization from fast_page_fault()
  KVM: x86/mmu: Rely on host page tables to find HugeTLB mappings
  KVM: x86/mmu: Remove obsolete gfn restoration in FNAME(fetch)
  KVM: x86/mmu: Zap any compound page when collapsing sptes
  KVM: x86/mmu: Fold max_mapping_level() into kvm_mmu_hugepage_adjust()
  KVM: x86/mmu: Remove lpage_is_disallowed() check from set_spte()
  KVM: x86/mmu: Use huge pages for DAX-backed files

 arch/powerpc/kvm/book3s_xive_native.c |   2 +-
 arch/x86/include/asm/pgtable_types.h  |   4 +
 arch/x86/kvm/mmu/mmu.c                | 208 ++++++++++----------------
 arch/x86/kvm/mmu/paging_tmpl.h        |  29 +---
 arch/x86/mm/pageattr.c                |  11 ++
 include/linux/huge_mm.h               |   6 +
 include/linux/kvm_host.h              |   3 +-
 mm/huge_memory.c                      |  11 ++
 virt/kvm/arm/mmu.c                    |   8 +-
 virt/kvm/kvm_main.c                   |  24 ++-
 10 files changed, 145 insertions(+), 161 deletions(-)

-- 
2.24.1



* [PATCH 01/14] KVM: x86/mmu: Enforce max_level on HugeTLB mappings
  2020-01-08 20:24 [PATCH 00/14] KVM: x86/mmu: Huge page fixes, cleanup, and DAX Sean Christopherson
@ 2020-01-08 20:24 ` Sean Christopherson
  2020-01-08 20:24 ` [PATCH 02/14] mm: thp: KVM: Explicitly check for THP when populating secondary MMU Sean Christopherson
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Sean Christopherson @ 2020-01-08 20:24 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Paul Mackerras, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Andrew Morton, Marc Zyngier,
	James Morse, Julien Thierry, Suzuki K Poulose, kvm-ppc, kvm,
	linux-kernel, linux-mm, linux-arm-kernel, kvmarm,
	syzbot+c9d1fb51ac9d0d10c39d, Andrea Arcangeli, Dan Williams,
	Barret Rhoden, David Hildenbrand, Jason Zeng, Dave Jiang,
	Liran Alon, linux-nvdimm

Limit KVM's mapping level for HugeTLB based on its calculated max_level.
The max_level check prior to invoking host_mapping_level() only filters
out the case where KVM cannot create a 2mb mapping; it doesn't handle
the scenario where KVM can create a 2mb but not 1gb mapping, and the
host is using a 1gb HugeTLB mapping.

Fixes: ad163aa8903d ("KVM: x86/mmu: Persist gfn_lpage_is_disallowed() to max_level")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/mmu/mmu.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 7269130ea5e2..8e822c09170d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1330,7 +1330,7 @@ gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu, gfn_t gfn,
 static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn,
 			 int *max_levelp)
 {
-	int max_level = *max_levelp;
+	int host_level, max_level = *max_levelp;
 	struct kvm_memory_slot *slot;
 
 	if (unlikely(max_level == PT_PAGE_TABLE_LEVEL))
@@ -1362,7 +1362,8 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn,
 	 * So, do not propagate host_mapping_level() to max_level as KVM can
 	 * still promote the guest mapping to a huge page in the THP case.
 	 */
-	return host_mapping_level(vcpu->kvm, large_gfn);
+	host_level = host_mapping_level(vcpu->kvm, large_gfn);
+	return min(host_level, max_level);
 }
 
 /*
-- 
2.24.1



* [PATCH 02/14] mm: thp: KVM: Explicitly check for THP when populating secondary MMU
  2020-01-08 20:24 [PATCH 00/14] KVM: x86/mmu: Huge page fixes, cleanup, and DAX Sean Christopherson
  2020-01-08 20:24 ` [PATCH 01/14] KVM: x86/mmu: Enforce max_level on HugeTLB mappings Sean Christopherson
@ 2020-01-08 20:24 ` Sean Christopherson
  2020-01-08 20:24 ` [PATCH 03/14] KVM: Use vcpu-specific gva->hva translation when querying host page size Sean Christopherson
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Sean Christopherson @ 2020-01-08 20:24 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Paul Mackerras, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Andrew Morton, Marc Zyngier,
	James Morse, Julien Thierry, Suzuki K Poulose, kvm-ppc, kvm,
	linux-kernel, linux-mm, linux-arm-kernel, kvmarm,
	syzbot+c9d1fb51ac9d0d10c39d, Andrea Arcangeli, Dan Williams,
	Barret Rhoden, David Hildenbrand, Jason Zeng, Dave Jiang,
	Liran Alon, linux-nvdimm

Add a helper, is_transparent_hugepage(), to explicitly check whether a
compound page is a THP and use it when populating KVM's secondary MMU.
The explicit check fixes a bug where a remapped compound page, e.g. for
an XDP Rx socket, is mapped into a KVM guest and is mistaken for a THP,
which results in KVM incorrectly creating a huge page in its secondary
MMU.

Fixes: 936a5fe6e6148 ("thp: kvm mmu transparent hugepage support")
Reported-by: syzbot+c9d1fb51ac9d0d10c39d@syzkaller.appspotmail.com
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/mmu/mmu.c   |  4 ++--
 include/linux/huge_mm.h  |  6 ++++++
 include/linux/kvm_host.h |  1 +
 mm/huge_memory.c         | 11 +++++++++++
 virt/kvm/arm/mmu.c       |  8 +-------
 virt/kvm/kvm_main.c      | 10 ++++++++++
 6 files changed, 31 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8e822c09170d..ca14c84c4f4b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3344,7 +3344,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
 	 */
 	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
 	    !kvm_is_zone_device_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL &&
-	    PageTransCompoundMap(pfn_to_page(pfn))) {
+	    kvm_is_transparent_hugepage(pfn)) {
 		unsigned long mask;
 
 		/*
@@ -5959,7 +5959,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
 		 */
 		if (sp->role.direct && !kvm_is_reserved_pfn(pfn) &&
 		    !kvm_is_zone_device_pfn(pfn) &&
-		    PageTransCompoundMap(pfn_to_page(pfn))) {
+		    kvm_is_transparent_hugepage(pfn)) {
 			pte_list_remove(rmap_head, sptep);
 
 			if (kvm_available_flush_tlb_with_range())
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 93d5cf0bc716..5e154fad2f98 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -160,6 +160,7 @@ extern unsigned long thp_get_unmapped_area(struct file *filp,
 
 extern void prep_transhuge_page(struct page *page);
 extern void free_transhuge_page(struct page *page);
+bool is_transparent_hugepage(struct page *page);
 
 bool can_split_huge_page(struct page *page, int *pextra_pins);
 int split_huge_page_to_list(struct page *page, struct list_head *list);
@@ -310,6 +311,11 @@ static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
 
 static inline void prep_transhuge_page(struct page *page) {}
 
+static inline bool is_transparent_hugepage(struct page *page)
+{
+	return false;
+}
+
 #define transparent_hugepage_flags 0UL
 
 #define thp_get_unmapped_area	NULL
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 339de08e5fa2..411b71a02f25 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -985,6 +985,7 @@ int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu);
 
 bool kvm_is_reserved_pfn(kvm_pfn_t pfn);
 bool kvm_is_zone_device_pfn(kvm_pfn_t pfn);
+bool kvm_is_transparent_hugepage(kvm_pfn_t pfn);
 
 struct kvm_irq_ack_notifier {
 	struct hlist_node link;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 13cc93785006..94c85a5da041 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -527,6 +527,17 @@ void prep_transhuge_page(struct page *page)
 	set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR);
 }
 
+bool is_transparent_hugepage(struct page *page)
+{
+	if (!PageCompound(page))
+		return false;
+
+	page = compound_head(page);
+	return is_huge_zero_page(page) ||
+	       page[1].compound_dtor == TRANSHUGE_PAGE_DTOR;
+}
+EXPORT_SYMBOL_GPL(is_transparent_hugepage);
+
 static unsigned long __thp_get_unmapped_area(struct file *filp, unsigned long len,
 		loff_t off, unsigned long flags, unsigned long size)
 {
diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 38b4c910b6c3..6e29d0c5062c 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1372,14 +1372,8 @@ static bool transparent_hugepage_adjust(kvm_pfn_t *pfnp, phys_addr_t *ipap)
 {
 	kvm_pfn_t pfn = *pfnp;
 	gfn_t gfn = *ipap >> PAGE_SHIFT;
-	struct page *page = pfn_to_page(pfn);
 
-	/*
-	 * PageTransCompoundMap() returns true for THP and
-	 * hugetlbfs. Make sure the adjustment is done only for THP
-	 * pages.
-	 */
-	if (!PageHuge(page) && PageTransCompoundMap(page)) {
+	if (kvm_is_transparent_hugepage(pfn)) {
 		unsigned long mask;
 		/*
 		 * The address we faulted on is backed by a transparent huge
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 3aa21bec028d..e8ca8bf12320 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -191,6 +191,16 @@ bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
 	return true;
 }
 
+bool kvm_is_transparent_hugepage(kvm_pfn_t pfn)
+{
+	struct page *page = pfn_to_page(pfn);
+
+	if (!PageTransCompoundMap(page))
+		return false;
+
+	return is_transparent_hugepage(compound_head(page));
+}
+
 /*
  * Switches to specified vcpu, until a matching vcpu_put()
  */
-- 
2.24.1



* [PATCH 03/14] KVM: Use vcpu-specific gva->hva translation when querying host page size
  2020-01-08 20:24 [PATCH 00/14] KVM: x86/mmu: Huge page fixes, cleanup, and DAX Sean Christopherson
  2020-01-08 20:24 ` [PATCH 01/14] KVM: x86/mmu: Enforce max_level on HugeTLB mappings Sean Christopherson
  2020-01-08 20:24 ` [PATCH 02/14] mm: thp: KVM: Explicitly check for THP when populating secondary MMU Sean Christopherson
@ 2020-01-08 20:24 ` Sean Christopherson
  2020-01-08 20:24 ` [PATCH 04/14] KVM: Play nice with read-only memslots " Sean Christopherson
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Sean Christopherson @ 2020-01-08 20:24 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Paul Mackerras, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Andrew Morton, Marc Zyngier,
	James Morse, Julien Thierry, Suzuki K Poulose, kvm-ppc, kvm,
	linux-kernel, linux-mm, linux-arm-kernel, kvmarm,
	syzbot+c9d1fb51ac9d0d10c39d, Andrea Arcangeli, Dan Williams,
	Barret Rhoden, David Hildenbrand, Jason Zeng, Dave Jiang,
	Liran Alon, linux-nvdimm

Use kvm_vcpu_gfn_to_hva() when retrieving the host page size so that the
correct set of memslots is used when handling x86 page faults in SMM.

Fixes: 54bf36aac520 ("KVM: x86: use vcpu-specific functions to read/write/translate GFNs")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/powerpc/kvm/book3s_xive_native.c | 2 +-
 arch/x86/kvm/mmu/mmu.c                | 6 +++---
 include/linux/kvm_host.h              | 2 +-
 virt/kvm/kvm_main.c                   | 4 ++--
 4 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index d83adb1e1490..6ef0151ff70a 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -631,7 +631,7 @@ static int kvmppc_xive_native_set_queue_config(struct kvmppc_xive *xive,
 	srcu_idx = srcu_read_lock(&kvm->srcu);
 	gfn = gpa_to_gfn(kvm_eq.qaddr);
 
-	page_size = kvm_host_page_size(kvm, gfn);
+	page_size = kvm_host_page_size(vcpu, gfn);
 	if (1ull << kvm_eq.qshift > page_size) {
 		srcu_read_unlock(&kvm->srcu, srcu_idx);
 		pr_warn("Incompatible host page size %lx!\n", page_size);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index ca14c84c4f4b..8ca6cd04cdf1 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1286,12 +1286,12 @@ static bool mmu_gfn_lpage_is_disallowed(struct kvm_vcpu *vcpu, gfn_t gfn,
 	return __mmu_gfn_lpage_is_disallowed(gfn, level, slot);
 }
 
-static int host_mapping_level(struct kvm *kvm, gfn_t gfn)
+static int host_mapping_level(struct kvm_vcpu *vcpu, gfn_t gfn)
 {
 	unsigned long page_size;
 	int i, ret = 0;
 
-	page_size = kvm_host_page_size(kvm, gfn);
+	page_size = kvm_host_page_size(vcpu, gfn);
 
 	for (i = PT_PAGE_TABLE_LEVEL; i <= PT_MAX_HUGEPAGE_LEVEL; ++i) {
 		if (page_size >= KVM_HPAGE_SIZE(i))
@@ -1362,7 +1362,7 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn,
 	 * So, do not propagate host_mapping_level() to max_level as KVM can
 	 * still promote the guest mapping to a huge page in the THP case.
 	 */
-	host_level = host_mapping_level(vcpu->kvm, large_gfn);
+	host_level = host_mapping_level(vcpu, large_gfn);
 	return min(host_level, max_level);
 }
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 411b71a02f25..3c0ceec45ebd 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -767,7 +767,7 @@ int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len);
 int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len);
 struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn);
 bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn);
-unsigned long kvm_host_page_size(struct kvm *kvm, gfn_t gfn);
+unsigned long kvm_host_page_size(struct kvm_vcpu *vcpu, gfn_t gfn);
 void mark_page_dirty(struct kvm *kvm, gfn_t gfn);
 
 struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e8ca8bf12320..5f7f06824c2b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1416,14 +1416,14 @@ bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn)
 }
 EXPORT_SYMBOL_GPL(kvm_is_visible_gfn);
 
-unsigned long kvm_host_page_size(struct kvm *kvm, gfn_t gfn)
+unsigned long kvm_host_page_size(struct kvm_vcpu *vcpu, gfn_t gfn)
 {
 	struct vm_area_struct *vma;
 	unsigned long addr, size;
 
 	size = PAGE_SIZE;
 
-	addr = gfn_to_hva(kvm, gfn);
+	addr = kvm_vcpu_gfn_to_hva(vcpu, gfn);
 	if (kvm_is_error_hva(addr))
 		return PAGE_SIZE;
 
-- 
2.24.1



* [PATCH 04/14] KVM: Play nice with read-only memslots when querying host page size
  2020-01-08 20:24 [PATCH 00/14] KVM: x86/mmu: Huge page fixes, cleanup, and DAX Sean Christopherson
                   ` (2 preceding siblings ...)
  2020-01-08 20:24 ` [PATCH 03/14] KVM: Use vcpu-specific gva->hva translation when querying host page size Sean Christopherson
@ 2020-01-08 20:24 ` Sean Christopherson
  2020-01-21 14:24   ` Paolo Bonzini
  2020-01-08 20:24 ` [PATCH 05/14] x86/mm: Introduce lookup_address_in_mm() Sean Christopherson
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 22+ messages in thread
From: Sean Christopherson @ 2020-01-08 20:24 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Paul Mackerras, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Andrew Morton, Marc Zyngier,
	James Morse, Julien Thierry, Suzuki K Poulose, kvm-ppc, kvm,
	linux-kernel, linux-mm, linux-arm-kernel, kvmarm,
	syzbot+c9d1fb51ac9d0d10c39d, Andrea Arcangeli, Dan Williams,
	Barret Rhoden, David Hildenbrand, Jason Zeng, Dave Jiang,
	Liran Alon, linux-nvdimm

Open code an equivalent of kvm_vcpu_gfn_to_hva() when querying the
host page size to avoid the "writable" check in __gfn_to_hva_many(),
which will always fail on read-only memslots due to gfn_to_hva()
assuming writes.  Functionally, this allows x86 to create large mappings
for read-only memslots that are backed by HugeTLB mappings.

Note, the changelog for commit 05da45583de9 ("KVM: MMU: large page
support") states "If the largepage contains write-protected pages, a
large pte is not used.", but "write-protected" refers to pages that are
temporarily read-only; read-only memslots didn't even exist at the
time.

Fixes: 4d8b81abc47b ("KVM: introduce readonly memslot")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 virt/kvm/kvm_main.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5f7f06824c2b..d9aced677ddd 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1418,15 +1418,23 @@ EXPORT_SYMBOL_GPL(kvm_is_visible_gfn);
 
 unsigned long kvm_host_page_size(struct kvm_vcpu *vcpu, gfn_t gfn)
 {
+	struct kvm_memory_slot *slot;
 	struct vm_area_struct *vma;
 	unsigned long addr, size;
 
 	size = PAGE_SIZE;
 
-	addr = kvm_vcpu_gfn_to_hva(vcpu, gfn);
-	if (kvm_is_error_hva(addr))
+	/*
+	 * Manually do the equivalent of kvm_vcpu_gfn_to_hva() to avoid the
+	 * "writable" check in __gfn_to_hva_many(), which will always fail on
+	 * read-only memslots due to gfn_to_hva() assuming writes.
+	 */
+	slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
+	if (!slot || slot->flags & KVM_MEMSLOT_INVALID)
 		return PAGE_SIZE;
 
+	addr = __gfn_to_hva_memslot(slot, gfn);
+
 	down_read(&current->mm->mmap_sem);
 	vma = find_vma(current->mm, addr);
 	if (!vma)
-- 
2.24.1



* [PATCH 05/14] x86/mm: Introduce lookup_address_in_mm()
  2020-01-08 20:24 [PATCH 00/14] KVM: x86/mmu: Huge page fixes, cleanup, and DAX Sean Christopherson
                   ` (3 preceding siblings ...)
  2020-01-08 20:24 ` [PATCH 04/14] KVM: Play nice with read-only memslots " Sean Christopherson
@ 2020-01-08 20:24 ` Sean Christopherson
  2020-01-09 21:04   ` Thomas Gleixner
  2020-01-08 20:24 ` [PATCH 06/14] KVM: x86/mmu: Refactor THP adjust to prep for changing query Sean Christopherson
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 22+ messages in thread
From: Sean Christopherson @ 2020-01-08 20:24 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Paul Mackerras, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Andrew Morton, Marc Zyngier,
	James Morse, Julien Thierry, Suzuki K Poulose, kvm-ppc, kvm,
	linux-kernel, linux-mm, linux-arm-kernel, kvmarm,
	syzbot+c9d1fb51ac9d0d10c39d, Andrea Arcangeli, Dan Williams,
	Barret Rhoden, David Hildenbrand, Jason Zeng, Dave Jiang,
	Liran Alon, linux-nvdimm

Add a helper, lookup_address_in_mm(), to traverse the page tables of a
given mm struct.  KVM will use the helper to retrieve the host mapping
level, e.g. 4k vs. 2mb vs. 1gb, of a compound (or DAX-backed) page
without having to resort to implementation specific metadata.  E.g. KVM
currently uses different logic for HugeTLB vs. THP, and would add a
third variant for DAX-backed files.
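
For reference, a hypothetical caller (not part of this patch; the
function name and printout below are made up for illustration) would
use the helper roughly as follows, with the returned level
distinguishing 4k vs. 2mb vs. 1gb host mappings:

	/*
	 * Hypothetical usage sketch; assumes the page tables are stable for
	 * the duration of the walk (KVM's usage runs under mmu_lock with
	 * mmu_notifier protection).
	 */
	static void example_dump_mapping_level(struct mm_struct *mm,
					       unsigned long hva)
	{
		unsigned int level;
		pte_t *pte;

		pte = lookup_address_in_mm(mm, hva, &level);
		if (!pte || pte_none(*pte))
			return;

		pr_info("hva %lx mapped at %s\n", hva,
			level == PG_LEVEL_1G ? "1g" :
			level == PG_LEVEL_2M ? "2m" : "4k");
	}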

Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/include/asm/pgtable_types.h |  4 ++++
 arch/x86/mm/pageattr.c               | 11 +++++++++++
 2 files changed, 15 insertions(+)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index b5e49e6bac63..400ac8da75e8 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -561,6 +561,10 @@ static inline void update_page_count(int level, unsigned long pages) { }
 extern pte_t *lookup_address(unsigned long address, unsigned int *level);
 extern pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address,
 				    unsigned int *level);
+
+struct mm_struct;
+pte_t *lookup_address_in_mm(struct mm_struct *mm, unsigned long address,
+			    unsigned int *level);
 extern pmd_t *lookup_pmd_address(unsigned long address);
 extern phys_addr_t slow_virt_to_phys(void *__address);
 extern int __init kernel_map_pages_in_pgd(pgd_t *pgd, u64 pfn,
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 0d09cc5aad61..8787fec876e4 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -618,6 +618,17 @@ pte_t *lookup_address(unsigned long address, unsigned int *level)
 }
 EXPORT_SYMBOL_GPL(lookup_address);
 
+/*
+ * Lookup the page table entry for a virtual address in a given mm. Return a
+ * pointer to the entry and the level of the mapping.
+ */
+pte_t *lookup_address_in_mm(struct mm_struct *mm, unsigned long address,
+			    unsigned int *level)
+{
+	return lookup_address_in_pgd(pgd_offset(mm, address), address, level);
+}
+EXPORT_SYMBOL_GPL(lookup_address_in_mm);
+
 static pte_t *_lookup_address_cpa(struct cpa_data *cpa, unsigned long address,
 				  unsigned int *level)
 {
-- 
2.24.1



* [PATCH 06/14] KVM: x86/mmu: Refactor THP adjust to prep for changing query
  2020-01-08 20:24 [PATCH 00/14] KVM: x86/mmu: Huge page fixes, cleanup, and DAX Sean Christopherson
                   ` (4 preceding siblings ...)
  2020-01-08 20:24 ` [PATCH 05/14] x86/mm: Introduce lookup_address_in_mm() Sean Christopherson
@ 2020-01-08 20:24 ` Sean Christopherson
  2020-01-08 20:24 ` [PATCH 07/14] KVM: x86/mmu: Walk host page tables to find THP mappings Sean Christopherson
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Sean Christopherson @ 2020-01-08 20:24 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Paul Mackerras, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Andrew Morton, Marc Zyngier,
	James Morse, Julien Thierry, Suzuki K Poulose, kvm-ppc, kvm,
	linux-kernel, linux-mm, linux-arm-kernel, kvmarm,
	syzbot+c9d1fb51ac9d0d10c39d, Andrea Arcangeli, Dan Williams,
	Barret Rhoden, David Hildenbrand, Jason Zeng, Dave Jiang,
	Liran Alon, linux-nvdimm

Refactor transparent_hugepage_adjust() in preparation for walking the
host page tables to identify hugepage mappings, initially for THP pages,
and eventually for HugeTLB and DAX-backed pages as well.  The latter
cases support 1gb pages, i.e. the adjustment logic needs access to the
max allowed level.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/mmu/mmu.c         | 44 +++++++++++++++++-----------------
 arch/x86/kvm/mmu/paging_tmpl.h |  3 +--
 2 files changed, 23 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8ca6cd04cdf1..30836899be73 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3329,33 +3329,34 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep)
 	__direct_pte_prefetch(vcpu, sp, sptep);
 }
 
-static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
-					gfn_t gfn, kvm_pfn_t *pfnp,
+static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu, gfn_t gfn,
+					int max_level, kvm_pfn_t *pfnp,
 					int *levelp)
 {
 	kvm_pfn_t pfn = *pfnp;
 	int level = *levelp;
+	kvm_pfn_t mask;
+
+	if (max_level == PT_PAGE_TABLE_LEVEL || level > PT_PAGE_TABLE_LEVEL)
+		return;
+
+	if (is_error_noslot_pfn(pfn) || kvm_is_reserved_pfn(pfn) ||
+	    kvm_is_zone_device_pfn(pfn))
+		return;
+
+	if (!kvm_is_transparent_hugepage(pfn))
+		return;
+
+	level = PT_DIRECTORY_LEVEL;
 
 	/*
-	 * Check if it's a transparent hugepage. If this would be an
-	 * hugetlbfs page, level wouldn't be set to
-	 * PT_PAGE_TABLE_LEVEL and there would be no adjustment done
-	 * here.
+	 * mmu_notifier_retry() was successful and mmu_lock is held, so
+	 * the pmd can't be split from under us.
 	 */
-	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
-	    !kvm_is_zone_device_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL &&
-	    kvm_is_transparent_hugepage(pfn)) {
-		unsigned long mask;
-
-		/*
-		 * mmu_notifier_retry() was successful and mmu_lock is held, so
-		 * the pmd can't be split from under us.
-		 */
-		*levelp = level = PT_DIRECTORY_LEVEL;
-		mask = KVM_PAGES_PER_HPAGE(level) - 1;
-		VM_BUG_ON((gfn & mask) != (pfn & mask));
-		*pfnp = pfn & ~mask;
-	}
+	*levelp = level;
+	mask = KVM_PAGES_PER_HPAGE(level) - 1;
+	VM_BUG_ON((gfn & mask) != (pfn & mask));
+	*pfnp = pfn & ~mask;
 }
 
 static void disallowed_hugepage_adjust(struct kvm_shadow_walk_iterator it,
@@ -3395,8 +3396,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, int write,
 	if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa)))
 		return RET_PF_RETRY;
 
-	if (likely(max_level > PT_PAGE_TABLE_LEVEL))
-		transparent_hugepage_adjust(vcpu, gfn, &pfn, &level);
+	transparent_hugepage_adjust(vcpu, gfn, max_level, &pfn, &level);
 
 	trace_kvm_mmu_spte_requested(gpa, level, pfn);
 	for_each_shadow_entry(vcpu, gpa, it) {
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index b53bed3c901c..0029f7870865 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -673,8 +673,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gpa_t addr,
 	gfn = gw->gfn | ((addr & PT_LVL_OFFSET_MASK(gw->level)) >> PAGE_SHIFT);
 	base_gfn = gfn;
 
-	if (max_level > PT_PAGE_TABLE_LEVEL)
-		transparent_hugepage_adjust(vcpu, gw->gfn, &pfn, &hlevel);
+	transparent_hugepage_adjust(vcpu, gw->gfn, max_level, &pfn, &hlevel);
 
 	trace_kvm_mmu_spte_requested(addr, gw->level, pfn);
 
-- 
2.24.1



* [PATCH 07/14] KVM: x86/mmu: Walk host page tables to find THP mappings
  2020-01-08 20:24 [PATCH 00/14] KVM: x86/mmu: Huge page fixes, cleanup, and DAX Sean Christopherson
                   ` (5 preceding siblings ...)
  2020-01-08 20:24 ` [PATCH 06/14] KVM: x86/mmu: Refactor THP adjust to prep for changing query Sean Christopherson
@ 2020-01-08 20:24 ` Sean Christopherson
  2020-01-21 14:40   ` Paolo Bonzini
  2020-01-08 20:24 ` [PATCH 08/14] KVM: x86/mmu: Drop level optimization from fast_page_fault() Sean Christopherson
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 22+ messages in thread
From: Sean Christopherson @ 2020-01-08 20:24 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Paul Mackerras, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Andrew Morton, Marc Zyngier,
	James Morse, Julien Thierry, Suzuki K Poulose, kvm-ppc, kvm,
	linux-kernel, linux-mm, linux-arm-kernel, kvmarm,
	syzbot+c9d1fb51ac9d0d10c39d, Andrea Arcangeli, Dan Williams,
	Barret Rhoden, David Hildenbrand, Jason Zeng, Dave Jiang,
	Liran Alon, linux-nvdimm

Explicitly walk the host page tables to identify THP mappings instead
of relying solely on the metadata in struct page.  This sets the stage
for using a common method of identifying huge mappings regardless of the
underlying implementation (HugeTLB vs THP vs DAX), and hopefully avoids
the pitfalls of relying on metadata to identify THP mappings, e.g. see
commit 169226f7e0d2 ("mm: thp: handle page cache THP correctly in
PageTransCompoundMap") and the need for KVM to explicitly check for a
THP compound page.  KVM will also naturally work with 1gb THP pages, if
they are ever supported.

Walking the tables for THP mappings is likely marginally slower than
querying metadata, but a future patch will reuse the walk to identify
HugeTLB mappings, at which point eliminating the existing VMA lookup for
HugeTLB will make this a net positive.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Barret Rhoden <brho@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/mmu/mmu.c | 40 ++++++++++++++++++++++++++++++++++++++--
 1 file changed, 38 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 30836899be73..4bd7f745b56d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3329,6 +3329,41 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep)
 	__direct_pte_prefetch(vcpu, sp, sptep);
 }
 
+static int host_pfn_mapping_level(struct kvm_vcpu *vcpu, gfn_t gfn,
+				  kvm_pfn_t pfn)
+{
+	struct kvm_memory_slot *slot;
+	unsigned long hva;
+	pte_t *pte;
+	int level;
+
+	BUILD_BUG_ON(PT_PAGE_TABLE_LEVEL != (int)PG_LEVEL_4K ||
+		     PT_DIRECTORY_LEVEL != (int)PG_LEVEL_2M ||
+		     PT_PDPE_LEVEL != (int)PG_LEVEL_1G);
+
+	if (!PageCompound(pfn_to_page(pfn)))
+		return PT_PAGE_TABLE_LEVEL;
+
+	/*
+	 * Manually do the equivalent of kvm_vcpu_gfn_to_hva() to avoid the
+	 * "writable" check in __gfn_to_hva_many(), which will always fail on
+	 * read-only memslots due to gfn_to_hva() assuming writes.  Earlier
+	 * page fault steps have already verified the guest isn't writing a
+	 * read-only memslot.
+	 */
+	slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
+	if (!memslot_valid_for_gpte(slot, true))
+		return PT_PAGE_TABLE_LEVEL;
+
+	hva = __gfn_to_hva_memslot(slot, gfn);
+
+	pte = lookup_address_in_mm(vcpu->kvm->mm, hva, &level);
+	if (unlikely(!pte))
+		return PT_PAGE_TABLE_LEVEL;
+
+	return level;
+}
+
 static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu, gfn_t gfn,
 					int max_level, kvm_pfn_t *pfnp,
 					int *levelp)
@@ -3344,10 +3379,11 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu, gfn_t gfn,
 	    kvm_is_zone_device_pfn(pfn))
 		return;
 
-	if (!kvm_is_transparent_hugepage(pfn))
+	level = host_pfn_mapping_level(vcpu, gfn, pfn);
+	if (level == PT_PAGE_TABLE_LEVEL)
 		return;
 
-	level = PT_DIRECTORY_LEVEL;
+	level = min(level, max_level);
 
 	/*
 	 * mmu_notifier_retry() was successful and mmu_lock is held, so
-- 
2.24.1



* [PATCH 08/14] KVM: x86/mmu: Drop level optimization from fast_page_fault()
  2020-01-08 20:24 [PATCH 00/14] KVM: x86/mmu: Huge page fixes, cleanup, and DAX Sean Christopherson
                   ` (6 preceding siblings ...)
  2020-01-08 20:24 ` [PATCH 07/14] KVM: x86/mmu: Walk host page tables to find THP mappings Sean Christopherson
@ 2020-01-08 20:24 ` Sean Christopherson
  2020-01-08 20:24 ` [PATCH 09/14] KVM: x86/mmu: Rely on host page tables to find HugeTLB mappings Sean Christopherson
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Sean Christopherson @ 2020-01-08 20:24 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Paul Mackerras, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Andrew Morton, Marc Zyngier,
	James Morse, Julien Thierry, Suzuki K Poulose, kvm-ppc, kvm,
	linux-kernel, linux-mm, linux-arm-kernel, kvmarm,
	syzbot+c9d1fb51ac9d0d10c39d, Andrea Arcangeli, Dan Williams,
	Barret Rhoden, David Hildenbrand, Jason Zeng, Dave Jiang,
	Liran Alon, linux-nvdimm

Remove fast_page_fault()'s optimization to stop the shadow walk if the
iterator level drops below the intended map level.  The intended map
level is only accurate for HugeTLB mappings (THP mappings are detected
after fast_page_fault()), i.e. it's not required for correctness, and
a future patch will also move HugeTLB mapping detection to after
fast_page_fault().

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/mmu/mmu.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4bd7f745b56d..7d78d1d996ed 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3593,7 +3593,7 @@ static bool is_access_allowed(u32 fault_err_code, u64 spte)
  * - true: let the vcpu to access on the same address again.
  * - false: let the real page fault path to fix it.
  */
-static bool fast_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, int level,
+static bool fast_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 			    u32 error_code)
 {
 	struct kvm_shadow_walk_iterator iterator;
@@ -3611,8 +3611,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, int level,
 		u64 new_spte;
 
 		for_each_shadow_entry_lockless(vcpu, cr2_or_gpa, iterator, spte)
-			if (!is_shadow_present_pte(spte) ||
-			    iterator.level < level)
+			if (!is_shadow_present_pte(spte))
 				break;
 
 		sp = page_header(__pa(iterator.sptep));
@@ -4223,7 +4222,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 	if (level > PT_PAGE_TABLE_LEVEL)
 		gfn &= ~(KVM_PAGES_PER_HPAGE(level) - 1);
 
-	if (fast_page_fault(vcpu, gpa, level, error_code))
+	if (fast_page_fault(vcpu, gpa, error_code))
 		return RET_PF_RETRY;
 
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
-- 
2.24.1



* [PATCH 09/14] KVM: x86/mmu: Rely on host page tables to find HugeTLB mappings
  2020-01-08 20:24 [PATCH 00/14] KVM: x86/mmu: Huge page fixes, cleanup, and DAX Sean Christopherson
                   ` (7 preceding siblings ...)
  2020-01-08 20:24 ` [PATCH 08/14] KVM: x86/mmu: Drop level optimization from fast_page_fault() Sean Christopherson
@ 2020-01-08 20:24 ` Sean Christopherson
  2020-01-08 20:24 ` [PATCH 10/14] KVM: x86/mmu: Remove obsolete gfn restoration in FNAME(fetch) Sean Christopherson
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Sean Christopherson @ 2020-01-08 20:24 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Paul Mackerras, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Andrew Morton, Marc Zyngier,
	James Morse, Julien Thierry, Suzuki K Poulose, kvm-ppc, kvm,
	linux-kernel, linux-mm, linux-arm-kernel, kvmarm,
	syzbot+c9d1fb51ac9d0d10c39d, Andrea Arcangeli, Dan Williams,
	Barret Rhoden, David Hildenbrand, Jason Zeng, Dave Jiang,
	Liran Alon, linux-nvdimm

Remove KVM's HugeTLB specific logic and instead rely on walking the host
page tables (already done for THP) to identify HugeTLB mappings.
Eliminating the HugeTLB-only logic avoids taking mmap_sem and calling
find_vma() for all hugepage compatible page faults, and simplifies KVM's
page fault code by consolidating all hugepage adjustments into a common
helper.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/mmu/mmu.c         | 84 ++++++++++------------------------
 arch/x86/kvm/mmu/paging_tmpl.h | 15 +++---
 2 files changed, 29 insertions(+), 70 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 7d78d1d996ed..68aec984f953 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1286,23 +1286,6 @@ static bool mmu_gfn_lpage_is_disallowed(struct kvm_vcpu *vcpu, gfn_t gfn,
 	return __mmu_gfn_lpage_is_disallowed(gfn, level, slot);
 }
 
-static int host_mapping_level(struct kvm_vcpu *vcpu, gfn_t gfn)
-{
-	unsigned long page_size;
-	int i, ret = 0;
-
-	page_size = kvm_host_page_size(vcpu, gfn);
-
-	for (i = PT_PAGE_TABLE_LEVEL; i <= PT_MAX_HUGEPAGE_LEVEL; ++i) {
-		if (page_size >= KVM_HPAGE_SIZE(i))
-			ret = i;
-		else
-			break;
-	}
-
-	return ret;
-}
-
 static inline bool memslot_valid_for_gpte(struct kvm_memory_slot *slot,
 					  bool no_dirty_log)
 {
@@ -1327,43 +1310,25 @@ gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu, gfn_t gfn,
 	return slot;
 }
 
-static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn,
-			 int *max_levelp)
+static int max_mapping_level(struct kvm_vcpu *vcpu, gfn_t gfn,
+			     int max_level)
 {
-	int host_level, max_level = *max_levelp;
 	struct kvm_memory_slot *slot;
 
 	if (unlikely(max_level == PT_PAGE_TABLE_LEVEL))
 		return PT_PAGE_TABLE_LEVEL;
 
-	slot = kvm_vcpu_gfn_to_memslot(vcpu, large_gfn);
-	if (!memslot_valid_for_gpte(slot, true)) {
-		*max_levelp = PT_PAGE_TABLE_LEVEL;
+	slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
+	if (!memslot_valid_for_gpte(slot, true))
 		return PT_PAGE_TABLE_LEVEL;
-	}
 
 	max_level = min(max_level, kvm_x86_ops->get_lpage_level());
 	for ( ; max_level > PT_PAGE_TABLE_LEVEL; max_level--) {
-		if (!__mmu_gfn_lpage_is_disallowed(large_gfn, max_level, slot))
+		if (!__mmu_gfn_lpage_is_disallowed(gfn, max_level, slot))
 			break;
 	}
 
-	*max_levelp = max_level;
-
-	if (max_level == PT_PAGE_TABLE_LEVEL)
-		return PT_PAGE_TABLE_LEVEL;
-
-	/*
-	 * Note, host_mapping_level() does *not* handle transparent huge pages.
-	 * As suggested by "mapping", it reflects the page size established by
-	 * the associated vma, if there is one, i.e. host_mapping_level() will
-	 * return a huge page level if and only if a vma exists and the backing
-	 * implementation for the vma uses huge pages, e.g. hugetlbfs and dax.
-	 * So, do not propagate host_mapping_level() to max_level as KVM can
-	 * still promote the guest mapping to a huge page in the THP case.
-	 */
-	host_level = host_mapping_level(vcpu, large_gfn);
-	return min(host_level, max_level);
+	return max_level;
 }
 
 /*
@@ -3137,7 +3102,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 
 		/*
 		 * Other vcpu creates new sp in the window between
-		 * mapping_level() and acquiring mmu-lock. We can
+		 * max_mapping_level() and acquiring mmu-lock. We can
 		 * allow guest to retry the access, the mapping can
 		 * be fixed if guest refault.
 		 */
@@ -3364,24 +3329,23 @@ static int host_pfn_mapping_level(struct kvm_vcpu *vcpu, gfn_t gfn,
 	return level;
 }
 
-static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu, gfn_t gfn,
-					int max_level, kvm_pfn_t *pfnp,
-					int *levelp)
+static int kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, gfn_t gfn,
+				   int max_level, kvm_pfn_t *pfnp)
 {
 	kvm_pfn_t pfn = *pfnp;
-	int level = *levelp;
 	kvm_pfn_t mask;
+	int level;
 
-	if (max_level == PT_PAGE_TABLE_LEVEL || level > PT_PAGE_TABLE_LEVEL)
-		return;
+	if (max_level == PT_PAGE_TABLE_LEVEL)
+		return PT_PAGE_TABLE_LEVEL;
 
 	if (is_error_noslot_pfn(pfn) || kvm_is_reserved_pfn(pfn) ||
 	    kvm_is_zone_device_pfn(pfn))
-		return;
+		return PT_PAGE_TABLE_LEVEL;
 
 	level = host_pfn_mapping_level(vcpu, gfn, pfn);
 	if (level == PT_PAGE_TABLE_LEVEL)
-		return;
+		return level;
 
 	level = min(level, max_level);
 
@@ -3389,10 +3353,11 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu, gfn_t gfn,
 	 * mmu_notifier_retry() was successful and mmu_lock is held, so
 	 * the pmd can't be split from under us.
 	 */
-	*levelp = level;
 	mask = KVM_PAGES_PER_HPAGE(level) - 1;
 	VM_BUG_ON((gfn & mask) != (pfn & mask));
 	*pfnp = pfn & ~mask;
+
+	return level;
 }
 
 static void disallowed_hugepage_adjust(struct kvm_shadow_walk_iterator it,
@@ -3419,20 +3384,19 @@ static void disallowed_hugepage_adjust(struct kvm_shadow_walk_iterator it,
 }
 
 static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, int write,
-			int map_writable, int level, int max_level,
-			kvm_pfn_t pfn, bool prefault,
-			bool account_disallowed_nx_lpage)
+			int map_writable, int max_level, kvm_pfn_t pfn,
+			bool prefault, bool account_disallowed_nx_lpage)
 {
 	struct kvm_shadow_walk_iterator it;
 	struct kvm_mmu_page *sp;
-	int ret;
+	int level, ret;
 	gfn_t gfn = gpa >> PAGE_SHIFT;
 	gfn_t base_gfn = gfn;
 
 	if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa)))
 		return RET_PF_RETRY;
 
-	transparent_hugepage_adjust(vcpu, gfn, max_level, &pfn, &level);
+	level = kvm_mmu_hugepage_adjust(vcpu, gfn, max_level, &pfn);
 
 	trace_kvm_mmu_spte_requested(gpa, level, pfn);
 	for_each_shadow_entry(vcpu, gpa, it) {
@@ -4206,7 +4170,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 	gfn_t gfn = gpa >> PAGE_SHIFT;
 	unsigned long mmu_seq;
 	kvm_pfn_t pfn;
-	int level, r;
+	int r;
 
 	if (page_fault_handle_page_track(vcpu, error_code, gfn))
 		return RET_PF_EMULATE;
@@ -4218,9 +4182,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 	if (lpage_disallowed)
 		max_level = PT_PAGE_TABLE_LEVEL;
 
-	level = mapping_level(vcpu, gfn, &max_level);
-	if (level > PT_PAGE_TABLE_LEVEL)
-		gfn &= ~(KVM_PAGES_PER_HPAGE(level) - 1);
+	max_level = max_mapping_level(vcpu, gfn, max_level);
 
 	if (fast_page_fault(vcpu, gpa, error_code))
 		return RET_PF_RETRY;
@@ -4240,7 +4202,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 		goto out_unlock;
 	if (make_mmu_pages_available(vcpu) < 0)
 		goto out_unlock;
-	r = __direct_map(vcpu, gpa, write, map_writable, level, max_level, pfn,
+	r = __direct_map(vcpu, gpa, write, map_writable, max_level, pfn,
 			 prefault, is_tdp && lpage_disallowed);
 
 out_unlock:
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 0029f7870865..841506a55815 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -613,14 +613,14 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw,
  */
 static int FNAME(fetch)(struct kvm_vcpu *vcpu, gpa_t addr,
 			 struct guest_walker *gw,
-			 int write_fault, int hlevel, int max_level,
+			 int write_fault, int max_level,
 			 kvm_pfn_t pfn, bool map_writable, bool prefault,
 			 bool lpage_disallowed)
 {
 	struct kvm_mmu_page *sp = NULL;
 	struct kvm_shadow_walk_iterator it;
 	unsigned direct_access, access = gw->pt_access;
-	int top_level, ret;
+	int top_level, hlevel, ret;
 	gfn_t gfn, base_gfn;
 
 	direct_access = gw->pte_access;
@@ -673,7 +673,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gpa_t addr,
 	gfn = gw->gfn | ((addr & PT_LVL_OFFSET_MASK(gw->level)) >> PAGE_SHIFT);
 	base_gfn = gfn;
 
-	transparent_hugepage_adjust(vcpu, gw->gfn, max_level, &pfn, &hlevel);
+	hlevel = kvm_mmu_hugepage_adjust(vcpu, gw->gfn, max_level, &pfn);
 
 	trace_kvm_mmu_spte_requested(addr, gw->level, pfn);
 
@@ -775,7 +775,6 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
 	struct guest_walker walker;
 	int r;
 	kvm_pfn_t pfn;
-	int level;
 	unsigned long mmu_seq;
 	bool map_writable, is_self_change_mapping;
 	bool lpage_disallowed = (error_code & PFERR_FETCH_MASK) &&
@@ -825,9 +824,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
 	else
 		max_level = walker.level;
 
-	level = mapping_level(vcpu, walker.gfn, &max_level);
-	if (level > PT_PAGE_TABLE_LEVEL)
-		walker.gfn = walker.gfn & ~(KVM_PAGES_PER_HPAGE(level) - 1);
+	max_level = max_mapping_level(vcpu, walker.gfn, max_level);
 
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
@@ -867,8 +864,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
 	kvm_mmu_audit(vcpu, AUDIT_PRE_PAGE_FAULT);
 	if (make_mmu_pages_available(vcpu) < 0)
 		goto out_unlock;
-	r = FNAME(fetch)(vcpu, addr, &walker, write_fault, level, max_level,
-			 pfn, map_writable, prefault, lpage_disallowed);
+	r = FNAME(fetch)(vcpu, addr, &walker, write_fault, max_level, pfn,
+			 map_writable, prefault, lpage_disallowed);
 	kvm_mmu_audit(vcpu, AUDIT_POST_PAGE_FAULT);
 
 out_unlock:
-- 
2.24.1



* [PATCH 10/14] KVM: x86/mmu: Remove obsolete gfn restoration in FNAME(fetch)
  2020-01-08 20:24 [PATCH 00/14] KVM: x86/mmu: Huge page fixes, cleanup, and DAX Sean Christopherson
                   ` (8 preceding siblings ...)
  2020-01-08 20:24 ` [PATCH 09/14] KVM: x86/mmu: Rely on host page tables to find HugeTLB mappings Sean Christopherson
@ 2020-01-08 20:24 ` Sean Christopherson
  2020-01-08 20:24 ` [PATCH 11/14] KVM: x86/mmu: Zap any compound page when collapsing sptes Sean Christopherson
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Sean Christopherson @ 2020-01-08 20:24 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Paul Mackerras, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Andrew Morton, Marc Zyngier,
	James Morse, Julien Thierry, Suzuki K Poulose, kvm-ppc, kvm,
	linux-kernel, linux-mm, linux-arm-kernel, kvmarm,
	syzbot+c9d1fb51ac9d0d10c39d, Andrea Arcangeli, Dan Williams,
	Barret Rhoden, David Hildenbrand, Jason Zeng, Dave Jiang,
	Liran Alon, linux-nvdimm

Remove logic to retrieve the original gfn now that HugeTLB mappings are
identified in FNAME(fetch), i.e. FNAME(page_fault) no longer adjusts
the level or gfn.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/mmu/paging_tmpl.h | 13 +++----------
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 841506a55815..0560982eda8b 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -621,7 +621,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gpa_t addr,
 	struct kvm_shadow_walk_iterator it;
 	unsigned direct_access, access = gw->pt_access;
 	int top_level, hlevel, ret;
-	gfn_t gfn, base_gfn;
+	gfn_t base_gfn = gw->gfn;
 
 	direct_access = gw->pte_access;
 
@@ -666,13 +666,6 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gpa_t addr,
 			link_shadow_page(vcpu, it.sptep, sp);
 	}
 
-	/*
-	 * FNAME(page_fault) might have clobbered the bottom bits of
-	 * gw->gfn, restore them from the virtual address.
-	 */
-	gfn = gw->gfn | ((addr & PT_LVL_OFFSET_MASK(gw->level)) >> PAGE_SHIFT);
-	base_gfn = gfn;
-
 	hlevel = kvm_mmu_hugepage_adjust(vcpu, gw->gfn, max_level, &pfn);
 
 	trace_kvm_mmu_spte_requested(addr, gw->level, pfn);
@@ -684,9 +677,9 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gpa_t addr,
 		 * We cannot overwrite existing page tables with an NX
 		 * large page, as the leaf could be executable.
 		 */
-		disallowed_hugepage_adjust(it, gfn, &pfn, &hlevel);
+		disallowed_hugepage_adjust(it, gw->gfn, &pfn, &hlevel);
 
-		base_gfn = gfn & ~(KVM_PAGES_PER_HPAGE(it.level) - 1);
+		base_gfn = gw->gfn & ~(KVM_PAGES_PER_HPAGE(it.level) - 1);
 		if (it.level == hlevel)
 			break;
 
-- 
2.24.1



* [PATCH 11/14] KVM: x86/mmu: Zap any compound page when collapsing sptes
  2020-01-08 20:24 [PATCH 00/14] KVM: x86/mmu: Huge page fixes, cleanup, and DAX Sean Christopherson
                   ` (9 preceding siblings ...)
  2020-01-08 20:24 ` [PATCH 10/14] KVM: x86/mmu: Remove obsolete gfn restoration in FNAME(fetch) Sean Christopherson
@ 2020-01-08 20:24 ` Sean Christopherson
  2020-01-08 20:24 ` [PATCH 12/14] KVM: x86/mmu: Fold max_mapping_level() into kvm_mmu_hugepage_adjust() Sean Christopherson
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Sean Christopherson @ 2020-01-08 20:24 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Paul Mackerras, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Andrew Morton, Marc Zyngier,
	James Morse, Julien Thierry, Suzuki K Poulose, kvm-ppc, kvm,
	linux-kernel, linux-mm, linux-arm-kernel, kvmarm,
	syzbot+c9d1fb51ac9d0d10c39d, Andrea Arcangeli, Dan Williams,
	Barret Rhoden, David Hildenbrand, Jason Zeng, Dave Jiang,
	Liran Alon, linux-nvdimm

Zap any compound page, e.g. THP or HugeTLB pages, when zapping sptes
that can potentially be converted to huge sptes after disabling dirty
logging on the associated memslot.  Note, this approach could result in
false positives, e.g. if a random compound page is mapped into the
guest, but mapping non-huge compound pages into the guest is far from
the norm, and toggling dirty logging is not a frequent operation.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/mmu/mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 68aec984f953..f93b0c5e4170 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5956,7 +5956,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
 		 */
 		if (sp->role.direct && !kvm_is_reserved_pfn(pfn) &&
 		    !kvm_is_zone_device_pfn(pfn) &&
-		    kvm_is_transparent_hugepage(pfn)) {
+		    PageCompound(pfn_to_page(pfn))) {
 			pte_list_remove(rmap_head, sptep);
 
 			if (kvm_available_flush_tlb_with_range())
-- 
2.24.1



* [PATCH 12/14] KVM: x86/mmu: Fold max_mapping_level() into kvm_mmu_hugepage_adjust()
  2020-01-08 20:24 [PATCH 00/14] KVM: x86/mmu: Huge page fixes, cleanup, and DAX Sean Christopherson
                   ` (10 preceding siblings ...)
  2020-01-08 20:24 ` [PATCH 11/14] KVM: x86/mmu: Zap any compound page when collapsing sptes Sean Christopherson
@ 2020-01-08 20:24 ` Sean Christopherson
  2020-01-21 15:12   ` Paolo Bonzini
  2020-01-08 20:24 ` [PATCH 13/14] KVM: x86/mmu: Remove lpage_is_disallowed() check from set_spte() Sean Christopherson
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 22+ messages in thread
From: Sean Christopherson @ 2020-01-08 20:24 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Paul Mackerras, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Andrew Morton, Marc Zyngier,
	James Morse, Julien Thierry, Suzuki K Poulose, kvm-ppc, kvm,
	linux-kernel, linux-mm, linux-arm-kernel, kvmarm,
	syzbot+c9d1fb51ac9d0d10c39d, Andrea Arcangeli, Dan Williams,
	Barret Rhoden, David Hildenbrand, Jason Zeng, Dave Jiang,
	Liran Alon, linux-nvdimm

Fold max_mapping_level() into kvm_mmu_hugepage_adjust() now that HugeTLB
mappings are handled in kvm_mmu_hugepage_adjust(), i.e. there isn't a
need to pre-calculate the max mapping level.  Co-locating all hugepage
checks eliminates a memslot lookup, at the cost of performing the
__mmu_gfn_lpage_is_disallowed() checks while holding mmu_lock.

The latency of lpage_is_disallowed() is likely negligible relative to
the rest of the code run while holding mmu_lock, and can be offset to
some extent by eliminating the mmu_gfn_lpage_is_disallowed() check in
set_spte() in a future patch.  Eliminating the check in set_spte() is
made possible by performing the initial lpage_is_disallowed() checks
while holding mmu_lock.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/mmu/mmu.c         | 60 ++++++++++++++--------------------
 arch/x86/kvm/mmu/paging_tmpl.h |  2 --
 2 files changed, 24 insertions(+), 38 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f93b0c5e4170..f2667fe0dc75 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1310,27 +1310,6 @@ gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu, gfn_t gfn,
 	return slot;
 }
 
-static int max_mapping_level(struct kvm_vcpu *vcpu, gfn_t gfn,
-			     int max_level)
-{
-	struct kvm_memory_slot *slot;
-
-	if (unlikely(max_level == PT_PAGE_TABLE_LEVEL))
-		return PT_PAGE_TABLE_LEVEL;
-
-	slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
-	if (!memslot_valid_for_gpte(slot, true))
-		return PT_PAGE_TABLE_LEVEL;
-
-	max_level = min(max_level, kvm_x86_ops->get_lpage_level());
-	for ( ; max_level > PT_PAGE_TABLE_LEVEL; max_level--) {
-		if (!__mmu_gfn_lpage_is_disallowed(gfn, max_level, slot))
-			break;
-	}
-
-	return max_level;
-}
-
 /*
  * About rmap_head encoding:
  *
@@ -3101,10 +3080,11 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 	if (pte_access & ACC_WRITE_MASK) {
 
 		/*
-		 * Other vcpu creates new sp in the window between
-		 * max_mapping_level() and acquiring mmu-lock. We can
-		 * allow guest to retry the access, the mapping can
-		 * be fixed if guest refault.
+		 * Legacy code to handle an obsolete scenario where a different
+		 * vcpu creates new sp in the window between this vcpu's query
+		 * of lpage_is_disallowed() and acquiring mmu_lock.  No longer
+		 * necessary now that lpage_is_disallowed() is called after
+		 * acquiring mmu_lock.
 		 */
 		if (level > PT_PAGE_TABLE_LEVEL &&
 		    mmu_gfn_lpage_is_disallowed(vcpu, gfn, level))
@@ -3295,9 +3275,8 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep)
 }
 
 static int host_pfn_mapping_level(struct kvm_vcpu *vcpu, gfn_t gfn,
-				  kvm_pfn_t pfn)
+				  kvm_pfn_t pfn, struct kvm_memory_slot *slot)
 {
-	struct kvm_memory_slot *slot;
 	unsigned long hva;
 	pte_t *pte;
 	int level;
@@ -3310,16 +3289,13 @@ static int host_pfn_mapping_level(struct kvm_vcpu *vcpu, gfn_t gfn,
 		return PT_PAGE_TABLE_LEVEL;
 
 	/*
-	 * Manually do the equivalent of kvm_vcpu_gfn_to_hva() to avoid the
+	 * Note, using the already-retrieved memslot and __gfn_to_hva_memslot()
+	 * is not solely for performance, it's also necessary to avoid the
 	 * "writable" check in __gfn_to_hva_many(), which will always fail on
 	 * read-only memslots due to gfn_to_hva() assuming writes.  Earlier
 	 * page fault steps have already verified the guest isn't writing a
 	 * read-only memslot.
 	 */
-	slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
-	if (!memslot_valid_for_gpte(slot, true))
-		return PT_PAGE_TABLE_LEVEL;
-
 	hva = __gfn_to_hva_memslot(slot, gfn);
 
 	pte = lookup_address_in_mm(vcpu->kvm->mm, hva, &level);
@@ -3332,18 +3308,32 @@ static int host_pfn_mapping_level(struct kvm_vcpu *vcpu, gfn_t gfn,
 static int kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, gfn_t gfn,
 				   int max_level, kvm_pfn_t *pfnp)
 {
+	struct kvm_memory_slot *slot;
 	kvm_pfn_t pfn = *pfnp;
 	kvm_pfn_t mask;
 	int level;
 
-	if (max_level == PT_PAGE_TABLE_LEVEL)
+	if (unlikely(max_level == PT_PAGE_TABLE_LEVEL))
 		return PT_PAGE_TABLE_LEVEL;
 
 	if (is_error_noslot_pfn(pfn) || kvm_is_reserved_pfn(pfn) ||
 	    kvm_is_zone_device_pfn(pfn))
 		return PT_PAGE_TABLE_LEVEL;
 
-	level = host_pfn_mapping_level(vcpu, gfn, pfn);
+	slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
+	if (!memslot_valid_for_gpte(slot, true))
+		return PT_PAGE_TABLE_LEVEL;
+
+	max_level = min(max_level, kvm_x86_ops->get_lpage_level());
+	for ( ; max_level > PT_PAGE_TABLE_LEVEL; max_level--) {
+		if (!__mmu_gfn_lpage_is_disallowed(gfn, max_level, slot))
+			break;
+	}
+
+	if (max_level == PT_PAGE_TABLE_LEVEL)
+		return PT_PAGE_TABLE_LEVEL;
+
+	level = host_pfn_mapping_level(vcpu, gfn, pfn, slot);
 	if (level == PT_PAGE_TABLE_LEVEL)
 		return level;
 
@@ -4182,8 +4172,6 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 	if (lpage_disallowed)
 		max_level = PT_PAGE_TABLE_LEVEL;
 
-	max_level = max_mapping_level(vcpu, gfn, max_level);
-
 	if (fast_page_fault(vcpu, gpa, error_code))
 		return RET_PF_RETRY;
 
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 0560982eda8b..ea174d85700a 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -817,8 +817,6 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
 	else
 		max_level = walker.level;
 
-	max_level = max_mapping_level(vcpu, walker.gfn, max_level);
-
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
 
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 13/14] KVM: x86/mmu: Remove lpage_is_disallowed() check from set_spte()
  2020-01-08 20:24 [PATCH 00/14] KVM: x86/mmu: Huge page fixes, cleanup, and DAX Sean Christopherson
                   ` (11 preceding siblings ...)
  2020-01-08 20:24 ` [PATCH 12/14] KVM: x86/mmu: Fold max_mapping_level() into kvm_mmu_hugepage_adjust() Sean Christopherson
@ 2020-01-08 20:24 ` Sean Christopherson
  2020-01-08 20:24 ` [PATCH 14/14] KVM: x86/mmu: Use huge pages for DAX-backed files Sean Christopherson
  2020-01-09 19:47 ` [PATCH 00/14] KVM: x86/mmu: Huge page fixes, cleanup, and DAX Barret Rhoden
  14 siblings, 0 replies; 22+ messages in thread
From: Sean Christopherson @ 2020-01-08 20:24 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Paul Mackerras, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Andrew Morton, Marc Zyngier,
	James Morse, Julien Thierry, Suzuki K Poulose, kvm-ppc, kvm,
	linux-kernel, linux-mm, linux-arm-kernel, kvmarm,
	syzbot+c9d1fb51ac9d0d10c39d, Andrea Arcangeli, Dan Williams,
	Barret Rhoden, David Hildenbrand, Jason Zeng, Dave Jiang,
	Liran Alon, linux-nvdimm

Remove the late "lpage is disallowed" check from set_spte() now that the
initial check is performed after acquiring mmu_lock.  Fold the guts of
the remaining helper, __mmu_gfn_lpage_is_disallowed(), into
kvm_mmu_hugepage_adjust() to eliminate the now-unnecessary slot != NULL
check.
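
For illustration only, the sketch below uses a pthread mutex as a toy
stand-in for mmu_lock (none of these names are KVM's) to show the property
being relied on: once the "is a huge page disallowed?" query runs in the
same critical section as the update it guards, nothing can change the
answer in between, so a second query is redundant:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

/* Toy stand-ins, not kernel code. */
static pthread_mutex_t toy_mmu_lock = PTHREAD_MUTEX_INITIALIZER;
static bool huge_disallowed;	/* models lpage_info->disallow_lpage */
static int mapped_level;

/* Query and install happen in one critical section, so no re-check. */
static void toy_map_page(void)
{
	pthread_mutex_lock(&toy_mmu_lock);
	mapped_level = huge_disallowed ? 1 : 2;	/* 4K vs. 2M */
	pthread_mutex_unlock(&toy_mmu_lock);
}

/* Models e.g. another vCPU creating a shadow page that forbids the hugepage. */
static void toy_disallow_huge(void)
{
	pthread_mutex_lock(&toy_mmu_lock);
	huge_disallowed = true;
	pthread_mutex_unlock(&toy_mmu_lock);
}

int main(void)
{
	toy_map_page();
	printf("level %d\n", mapped_level);	/* 2 */
	toy_disallow_huge();
	toy_map_page();
	printf("level %d\n", mapped_level);	/* 1 */
	return 0;
}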

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/mmu/mmu.c | 39 +++------------------------------------
 1 file changed, 3 insertions(+), 36 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f2667fe0dc75..1e4e0ac169a7 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1264,28 +1264,6 @@ static void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 	list_del(&sp->lpage_disallowed_link);
 }
 
-static bool __mmu_gfn_lpage_is_disallowed(gfn_t gfn, int level,
-					  struct kvm_memory_slot *slot)
-{
-	struct kvm_lpage_info *linfo;
-
-	if (slot) {
-		linfo = lpage_info_slot(gfn, slot, level);
-		return !!linfo->disallow_lpage;
-	}
-
-	return true;
-}
-
-static bool mmu_gfn_lpage_is_disallowed(struct kvm_vcpu *vcpu, gfn_t gfn,
-					int level)
-{
-	struct kvm_memory_slot *slot;
-
-	slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
-	return __mmu_gfn_lpage_is_disallowed(gfn, level, slot);
-}
-
 static inline bool memslot_valid_for_gpte(struct kvm_memory_slot *slot,
 					  bool no_dirty_log)
 {
@@ -3078,18 +3056,6 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 	spte |= (u64)pfn << PAGE_SHIFT;
 
 	if (pte_access & ACC_WRITE_MASK) {
-
-		/*
-		 * Legacy code to handle an obsolete scenario where a different
-		 * vcpu creates new sp in the window between this vcpu's query
-		 * of lpage_is_disallowed() and acquiring mmu_lock.  No longer
-		 * necessary now that lpage_is_disallowed() is called after
-		 * acquiring mmu_lock.
-		 */
-		if (level > PT_PAGE_TABLE_LEVEL &&
-		    mmu_gfn_lpage_is_disallowed(vcpu, gfn, level))
-			goto done;
-
 		spte |= PT_WRITABLE_MASK | SPTE_MMU_WRITEABLE;
 
 		/*
@@ -3121,7 +3087,6 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 set_pte:
 	if (mmu_spte_update(sptep, spte))
 		ret |= SET_SPTE_NEED_REMOTE_TLB_FLUSH;
-done:
 	return ret;
 }
 
@@ -3309,6 +3274,7 @@ static int kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, gfn_t gfn,
 				   int max_level, kvm_pfn_t *pfnp)
 {
 	struct kvm_memory_slot *slot;
+	struct kvm_lpage_info *linfo;
 	kvm_pfn_t pfn = *pfnp;
 	kvm_pfn_t mask;
 	int level;
@@ -3326,7 +3292,8 @@ static int kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, gfn_t gfn,
 
 	max_level = min(max_level, kvm_x86_ops->get_lpage_level());
 	for ( ; max_level > PT_PAGE_TABLE_LEVEL; max_level--) {
-		if (!__mmu_gfn_lpage_is_disallowed(gfn, max_level, slot))
+		linfo = lpage_info_slot(gfn, slot, max_level);
+		if (!linfo->disallow_lpage)
 			break;
 	}
 
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 14/14] KVM: x86/mmu: Use huge pages for DAX-backed files
  2020-01-08 20:24 [PATCH 00/14] KVM: x86/mmu: Huge page fixes, cleanup, and DAX Sean Christopherson
                   ` (12 preceding siblings ...)
  2020-01-08 20:24 ` [PATCH 13/14] KVM: x86/mmu: Remove lpage_is_disallowed() check from set_spte() Sean Christopherson
@ 2020-01-08 20:24 ` Sean Christopherson
  2020-01-09 19:47 ` [PATCH 00/14] KVM: x86/mmu: Huge page fixes, cleanup, and DAX Barret Rhoden
  14 siblings, 0 replies; 22+ messages in thread
From: Sean Christopherson @ 2020-01-08 20:24 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Paul Mackerras, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Andrew Morton, Marc Zyngier,
	James Morse, Julien Thierry, Suzuki K Poulose, kvm-ppc, kvm,
	linux-kernel, linux-mm, linux-arm-kernel, kvmarm,
	syzbot+c9d1fb51ac9d0d10c39d, Andrea Arcangeli, Dan Williams,
	Barret Rhoden, David Hildenbrand, Jason Zeng, Dave Jiang,
	Liran Alon, linux-nvdimm

Walk the host page tables to identify hugepage mappings for ZONE_DEVICE
pfns, i.e. DAX pages.  Explicitly query kvm_is_zone_device_pfn() when
deciding whether or not to bother walking the host page tables, as DAX
pages do not set up the head/tail infrastructure, i.e. PageCompound()
returns false for DAX pages even when they are mapped with huge pages.

Zap ZONE_DEVICE sptes when disabling dirty logging, e.g. if live
migration fails, to allow KVM to rebuild large pages for DAX-based
mappings.  Presumably DAX favors large pages, so the worst case scenario
is a minor performance hit as KVM will need to re-fault all DAX-based
pages.
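
As a rough model of the resulting decision, the standalone sketch below
(toy types and fields, not the kernel's) shows why a ZONE_DEVICE pfn has to
be allowed to reach the host page table walk even though it never reports
PageCompound():

#include <stdbool.h>
#include <stdio.h>

#define PT_PAGE_TABLE_LEVEL	1	/* 4K */

/*
 * Toy description of a pfn: the fields stand in for PageCompound(),
 * kvm_is_zone_device_pfn() and the level a host page table walk would find.
 */
struct toy_pfn {
	bool compound;		/* THP/HugeTLB head/tail metadata present */
	bool zone_device;	/* DAX / ZONE_DEVICE backing */
	int host_level;		/* what host_pfn_mapping_level() would return */
};

static int mapping_level(const struct toy_pfn *p)
{
	/*
	 * DAX pages never look compound, so without the zone_device check
	 * they would always be forced down to 4K here.
	 */
	if (!p->compound && !p->zone_device)
		return PT_PAGE_TABLE_LEVEL;

	return p->host_level;	/* i.e. trust the host page tables */
}

int main(void)
{
	struct toy_pfn dax  = { .compound = false, .zone_device = true,  .host_level = 2 };
	struct toy_pfn anon = { .compound = false, .zone_device = false, .host_level = 2 };

	printf("DAX pfn  -> level %d\n", mapping_level(&dax));		/* 2, i.e. 2M */
	printf("anon pfn -> level %d\n", mapping_level(&anon));	/* 1, i.e. 4K */
	return 0;
}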

Suggested-by: Barret Rhoden <brho@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jason Zeng <jason.zeng@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Liran Alon <liran.alon@oracle.com>
Cc: linux-nvdimm <linux-nvdimm@lists.01.org>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/mmu/mmu.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1e4e0ac169a7..324e1919722f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3250,7 +3250,7 @@ static int host_pfn_mapping_level(struct kvm_vcpu *vcpu, gfn_t gfn,
 		     PT_DIRECTORY_LEVEL != (int)PG_LEVEL_2M ||
 		     PT_PDPE_LEVEL != (int)PG_LEVEL_1G);
 
-	if (!PageCompound(pfn_to_page(pfn)))
+	if (!PageCompound(pfn_to_page(pfn)) && !kvm_is_zone_device_pfn(pfn))
 		return PT_PAGE_TABLE_LEVEL;
 
 	/*
@@ -3282,8 +3282,7 @@ static int kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, gfn_t gfn,
 	if (unlikely(max_level == PT_PAGE_TABLE_LEVEL))
 		return PT_PAGE_TABLE_LEVEL;
 
-	if (is_error_noslot_pfn(pfn) || kvm_is_reserved_pfn(pfn) ||
-	    kvm_is_zone_device_pfn(pfn))
+	if (is_error_noslot_pfn(pfn) || kvm_is_reserved_pfn(pfn))
 		return PT_PAGE_TABLE_LEVEL;
 
 	slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
@@ -5910,8 +5909,8 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
 		 * mapping if the indirect sp has level = 1.
 		 */
 		if (sp->role.direct && !kvm_is_reserved_pfn(pfn) &&
-		    !kvm_is_zone_device_pfn(pfn) &&
-		    PageCompound(pfn_to_page(pfn))) {
+		    (kvm_is_zone_device_pfn(pfn) ||
+		     PageCompound(pfn_to_page(pfn)))) {
 			pte_list_remove(rmap_head, sptep);
 
 			if (kvm_available_flush_tlb_with_range())
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 00/14] KVM: x86/mmu: Huge page fixes, cleanup, and DAX
  2020-01-08 20:24 [PATCH 00/14] KVM: x86/mmu: Huge page fixes, cleanup, and DAX Sean Christopherson
                   ` (13 preceding siblings ...)
  2020-01-08 20:24 ` [PATCH 14/14] KVM: x86/mmu: Use huge pages for DAX-backed files Sean Christopherson
@ 2020-01-09 19:47 ` Barret Rhoden
  2020-01-21 15:10   ` Paolo Bonzini
  14 siblings, 1 reply; 22+ messages in thread
From: Barret Rhoden @ 2020-01-09 19:47 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Paul Mackerras, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Andrew Morton, Marc Zyngier, James Morse, Julien Thierry,
	Suzuki K Poulose, kvm-ppc, kvm, linux-kernel, linux-mm,
	linux-arm-kernel, kvmarm, syzbot+c9d1fb51ac9d0d10c39d,
	Andrea Arcangeli, Dan Williams, David Hildenbrand, Jason Zeng,
	Dave Jiang, Liran Alon, linux-nvdimm

Hi -

On 1/8/20 3:24 PM, Sean Christopherson wrote:
> This series is a mix of bug fixes, cleanup and new support in KVM's
> handling of huge pages.  The series initially stemmed from a syzkaller
> bug report[1], which is fixed by patch 02, "mm: thp: KVM: Explicitly
> check for THP when populating secondary MMU".
> 
> While investigating options for fixing the syzkaller bug, I realized KVM
> could reuse the approach from Barret's series to enable huge pages for DAX
> mappings in KVM[2] for all types of huge mappings, i.e. walk the host page
> tables instead of querying metadata (patches 05 - 09).

Thanks, Sean.  I tested this patch series out, and it works for me. 
(Huge KVM mappings of a DAX file, etc.).

Thanks,

Barret




^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 05/14] x86/mm: Introduce lookup_address_in_mm()
  2020-01-08 20:24 ` [PATCH 05/14] x86/mm: Introduce lookup_address_in_mm() Sean Christopherson
@ 2020-01-09 21:04   ` Thomas Gleixner
  2020-01-21 14:26     ` Paolo Bonzini
  0 siblings, 1 reply; 22+ messages in thread
From: Thomas Gleixner @ 2020-01-09 21:04 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Paul Mackerras, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Andrew Morton, Marc Zyngier,
	James Morse, Julien Thierry, Suzuki K Poulose, kvm-ppc, kvm,
	linux-kernel, linux-mm, linux-arm-kernel, kvmarm,
	syzbot+c9d1fb51ac9d0d10c39d, Andrea Arcangeli, Dan Williams,
	Barret Rhoden, David Hildenbrand, Jason Zeng, Dave Jiang,
	Liran Alon, linux-nvdimm

Sean Christopherson <sean.j.christopherson@intel.com> writes:

> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> index b5e49e6bac63..400ac8da75e8 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -561,6 +561,10 @@ static inline void update_page_count(int level, unsigned long pages) { }
>  extern pte_t *lookup_address(unsigned long address, unsigned int *level);
>  extern pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address,
>  				    unsigned int *level);
> +
> +struct mm_struct;
> +pte_t *lookup_address_in_mm(struct mm_struct *mm, unsigned long address,
> +			    unsigned int *level);

Please keep the file consistent and use extern even if not required.

Other than that:

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 04/14] KVM: Play nice with read-only memslots when querying host page size
  2020-01-08 20:24 ` [PATCH 04/14] KVM: Play nice with read-only memslots " Sean Christopherson
@ 2020-01-21 14:24   ` Paolo Bonzini
  0 siblings, 0 replies; 22+ messages in thread
From: Paolo Bonzini @ 2020-01-21 14:24 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paul Mackerras, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Andrew Morton, Marc Zyngier, James Morse, Julien Thierry,
	Suzuki K Poulose, kvm-ppc, kvm, linux-kernel, linux-mm,
	linux-arm-kernel, kvmarm, syzbot+c9d1fb51ac9d0d10c39d,
	Andrea Arcangeli, Dan Williams, Barret Rhoden, David Hildenbrand,
	Jason Zeng, Dave Jiang, Liran Alon, linux-nvdimm

On 08/01/20 21:24, Sean Christopherson wrote:
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 5f7f06824c2b..d9aced677ddd 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1418,15 +1418,23 @@ EXPORT_SYMBOL_GPL(kvm_is_visible_gfn);
>  
>  unsigned long kvm_host_page_size(struct kvm_vcpu *vcpu, gfn_t gfn)
>  {
> +	struct kvm_memory_slot *slot;
>  	struct vm_area_struct *vma;
>  	unsigned long addr, size;
>  
>  	size = PAGE_SIZE;
>  
> -	addr = kvm_vcpu_gfn_to_hva(vcpu, gfn);
> -	if (kvm_is_error_hva(addr))
> +	/*
> +	 * Manually do the equivalent of kvm_vcpu_gfn_to_hva() to avoid the
> +	 * "writable" check in __gfn_to_hva_many(), which will always fail on
> +	 * read-only memslots due to gfn_to_hva() assuming writes.
> +	 */
> +	slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
> +	if (!slot || slot->flags & KVM_MEMSLOT_INVALID)
>  		return PAGE_SIZE;
>  
> +	addr = __gfn_to_hva_memslot(slot, gfn);
> +
>  	down_read(&current->mm->mmap_sem);
>  	vma = find_vma(current->mm, addr);
>  	if (!vma)
> 

Even simpler: use kvm_vcpu_gfn_to_hva_prot

-	addr = kvm_vcpu_gfn_to_hva(vcpu, gfn);
+	addr = kvm_vcpu_gfn_to_hva_prot(vcpu, gfn, NULL);

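Spelled out against the function quoted above, that variant would roughly
read as follows (a sketch only; the vma lookup and size computation are
elided since they are unchanged).  The NULL "writable" argument is what
makes kvm_vcpu_gfn_to_hva_prot() skip the write-access check:

unsigned long kvm_host_page_size(struct kvm_vcpu *vcpu, gfn_t gfn)
{
	unsigned long addr, size;

	size = PAGE_SIZE;

	/* NULL "writable": no write-access check, so read-only memslots
	 * resolve to a valid hva as well. */
	addr = kvm_vcpu_gfn_to_hva_prot(vcpu, gfn, NULL);
	if (kvm_is_error_hva(addr))
		return PAGE_SIZE;

	/* ... vma lookup and size computation as in the quote above ... */

	return size;
}
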
"You are in a maze of twisty little functions, all alike".

Paolo


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 05/14] x86/mm: Introduce lookup_address_in_mm()
  2020-01-09 21:04   ` Thomas Gleixner
@ 2020-01-21 14:26     ` Paolo Bonzini
  0 siblings, 0 replies; 22+ messages in thread
From: Paolo Bonzini @ 2020-01-21 14:26 UTC (permalink / raw)
  To: Thomas Gleixner, Sean Christopherson
  Cc: Paul Mackerras, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Andrew Morton, Marc Zyngier, James Morse, Julien Thierry,
	Suzuki K Poulose, kvm-ppc, kvm, linux-kernel, linux-mm,
	linux-arm-kernel, kvmarm, syzbot+c9d1fb51ac9d0d10c39d,
	Andrea Arcangeli, Dan Williams, Barret Rhoden, David Hildenbrand,
	Jason Zeng, Dave Jiang, Liran Alon, linux-nvdimm

On 09/01/20 22:04, Thomas Gleixner wrote:
> Sean Christopherson <sean.j.christopherson@intel.com> writes:
> 
>> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
>> index b5e49e6bac63..400ac8da75e8 100644
>> --- a/arch/x86/include/asm/pgtable_types.h
>> +++ b/arch/x86/include/asm/pgtable_types.h
>> @@ -561,6 +561,10 @@ static inline void update_page_count(int level, unsigned long pages) { }
>>  extern pte_t *lookup_address(unsigned long address, unsigned int *level);
>>  extern pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address,
>>  				    unsigned int *level);
>> +
>> +struct mm_struct;
>> +pte_t *lookup_address_in_mm(struct mm_struct *mm, unsigned long address,
>> +			    unsigned int *level);
> 
> Please keep the file consistent and use extern even if not required.
> 
> Other than that:
> 
> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
> 

Adjusted, thanks for the review.

Paolo


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 07/14] KVM: x86/mmu: Walk host page tables to find THP mappings
  2020-01-08 20:24 ` [PATCH 07/14] KVM: x86/mmu: Walk host page tables to find THP mappings Sean Christopherson
@ 2020-01-21 14:40   ` Paolo Bonzini
  0 siblings, 0 replies; 22+ messages in thread
From: Paolo Bonzini @ 2020-01-21 14:40 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paul Mackerras, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Andrew Morton, Marc Zyngier, James Morse, Julien Thierry,
	Suzuki K Poulose, kvm-ppc, kvm, linux-kernel, linux-mm,
	linux-arm-kernel, kvmarm, syzbot+c9d1fb51ac9d0d10c39d,
	Andrea Arcangeli, Dan Williams, Barret Rhoden, David Hildenbrand,
	Jason Zeng, Dave Jiang, Liran Alon, linux-nvdimm

On 08/01/20 21:24, Sean Christopherson wrote:
> +
> +	/*
> +	 * Manually do the equivalent of kvm_vcpu_gfn_to_hva() to avoid the
> +	 * "writable" check in __gfn_to_hva_many(), which will always fail on
> +	 * read-only memslots due to gfn_to_hva() assuming writes.  Earlier
> +	 * page fault steps have already verified the guest isn't writing a
> +	 * read-only memslot.
> +	 */
> +	slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
> +	if (!memslot_valid_for_gpte(slot, true))
> +		return PT_PAGE_TABLE_LEVEL;
> +
> +	hva = __gfn_to_hva_memslot(slot, gfn);
> +

Using gfn_to_memslot_dirty_bitmap is also a good excuse to avoid
kvm_vcpu_gfn_to_hva.

+	slot = gfn_to_memslot_dirty_bitmap(vcpu, gfn, true);
+	if (!slot)
+		return PT_PAGE_TABLE_LEVEL;
+
+	hva = __gfn_to_hva_memslot(slot, gfn);

(I am planning to remove gfn_to_hva_memslot so that __gfn_to_hva_memslot
can lose the annoying underscores).

Paolo


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 00/14] KVM: x86/mmu: Huge page fixes, cleanup, and DAX
  2020-01-09 19:47 ` [PATCH 00/14] KVM: x86/mmu: Huge page fixes, cleanup, and DAX Barret Rhoden
@ 2020-01-21 15:10   ` Paolo Bonzini
  0 siblings, 0 replies; 22+ messages in thread
From: Paolo Bonzini @ 2020-01-21 15:10 UTC (permalink / raw)
  To: Barret Rhoden, Sean Christopherson
  Cc: Paul Mackerras, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Andrew Morton, Marc Zyngier, James Morse, Julien Thierry,
	Suzuki K Poulose, kvm-ppc, kvm, linux-kernel, linux-mm,
	linux-arm-kernel, kvmarm, syzbot+c9d1fb51ac9d0d10c39d,
	Andrea Arcangeli, Dan Williams, David Hildenbrand, Jason Zeng,
	Dave Jiang, Liran Alon, linux-nvdimm

On 09/01/20 20:47, Barret Rhoden wrote:
> Hi -
> 
> On 1/8/20 3:24 PM, Sean Christopherson wrote:
>> This series is a mix of bug fixes, cleanup and new support in KVM's
>> handling of huge pages.  The series initially stemmed from a syzkaller
>> bug report[1], which is fixed by patch 02, "mm: thp: KVM: Explicitly
>> check for THP when populating secondary MMU".
>>
>> While investigating options for fixing the syzkaller bug, I realized KVM
>> could reuse the approach from Barret's series to enable huge pages for
>> DAX
>> mappings in KVM[2] for all types of huge mappings, i.e. walk the host
>> page
>> tables instead of querying metadata (patches 05 - 09).
> 
> Thanks, Sean.  I tested this patch series out, and it works for me.
> (Huge KVM mappings of a DAX file, etc.).

Queued, thanks.

Paolo


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 12/14] KVM: x86/mmu: Fold max_mapping_level() into kvm_mmu_hugepage_adjust()
  2020-01-08 20:24 ` [PATCH 12/14] KVM: x86/mmu: Fold max_mapping_level() into kvm_mmu_hugepage_adjust() Sean Christopherson
@ 2020-01-21 15:12   ` Paolo Bonzini
  0 siblings, 0 replies; 22+ messages in thread
From: Paolo Bonzini @ 2020-01-21 15:12 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paul Mackerras, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Andrew Morton, Marc Zyngier, James Morse, Julien Thierry,
	Suzuki K Poulose, kvm-ppc, kvm, linux-kernel, linux-mm,
	linux-arm-kernel, kvmarm, syzbot+c9d1fb51ac9d0d10c39d,
	Andrea Arcangeli, Dan Williams, Barret Rhoden, David Hildenbrand,
	Jason Zeng, Dave Jiang, Liran Alon, linux-nvdimm

On 08/01/20 21:24, Sean Christopherson wrote:
> -	level = host_pfn_mapping_level(vcpu, gfn, pfn);
> +	slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
> +	if (!memslot_valid_for_gpte(slot, true))
> +		return PT_PAGE_TABLE_LEVEL;

Following up on my remark to patch 7, this can also use
gfn_to_memslot_dirty_bitmap.

Paolo

> +
> +	max_level = min(max_level, kvm_x86_ops->get_lpage_level());
> +	for ( ; max_level > PT_PAGE_TABLE_LEVEL; max_level--) {
> +		if (!__mmu_gfn_lpage_is_disallowed(gfn, max_level, slot))
> +			break;
> +	}
> +
> +	if (max_level == PT_PAGE_TABLE_LEVEL)
> +		return PT_PAGE_TABLE_LEVEL;
> +
> +	level = host_pfn_mapping_level(vcpu, gfn, pfn, slot);
>  	if (level == PT_PAGE_TABLE_LEVEL)
>  		return level;
>  
> @@ -4182,8 +4172,6 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>  	if (lpage_disallowed)
>  		max_level = PT_PAGE_TABLE_LEVEL;
>  
> -	max_level = max_mapping_level(vcpu, gfn, max_level);
> -
>  	if (fast_page_fault(vcpu, gpa, error_code))
>  		return RET_PF_RETRY;
>  
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 0560982eda8b..ea174d85700a 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -817,8 +817,6 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
>  	else
>  		max_level = walker.level;
>  
> -	max_level = max_mapping_level(vcpu, walker.gfn, max_level);
> -
>  	mmu_seq = vcpu->kvm->mmu_notifier_seq;
>  	smp_rmb();
>  
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2020-01-21 15:12 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-08 20:24 [PATCH 00/14] KVM: x86/mmu: Huge page fixes, cleanup, and DAX Sean Christopherson
2020-01-08 20:24 ` [PATCH 01/14] KVM: x86/mmu: Enforce max_level on HugeTLB mappings Sean Christopherson
2020-01-08 20:24 ` [PATCH 02/14] mm: thp: KVM: Explicitly check for THP when populating secondary MMU Sean Christopherson
2020-01-08 20:24 ` [PATCH 03/14] KVM: Use vcpu-specific gva->hva translation when querying host page size Sean Christopherson
2020-01-08 20:24 ` [PATCH 04/14] KVM: Play nice with read-only memslots " Sean Christopherson
2020-01-21 14:24   ` Paolo Bonzini
2020-01-08 20:24 ` [PATCH 05/14] x86/mm: Introduce lookup_address_in_mm() Sean Christopherson
2020-01-09 21:04   ` Thomas Gleixner
2020-01-21 14:26     ` Paolo Bonzini
2020-01-08 20:24 ` [PATCH 06/14] KVM: x86/mmu: Refactor THP adjust to prep for changing query Sean Christopherson
2020-01-08 20:24 ` [PATCH 07/14] KVM: x86/mmu: Walk host page tables to find THP mappings Sean Christopherson
2020-01-21 14:40   ` Paolo Bonzini
2020-01-08 20:24 ` [PATCH 08/14] KVM: x86/mmu: Drop level optimization from fast_page_fault() Sean Christopherson
2020-01-08 20:24 ` [PATCH 09/14] KVM: x86/mmu: Rely on host page tables to find HugeTLB mappings Sean Christopherson
2020-01-08 20:24 ` [PATCH 10/14] KVM: x86/mmu: Remove obsolete gfn restoration in FNAME(fetch) Sean Christopherson
2020-01-08 20:24 ` [PATCH 11/14] KVM: x86/mmu: Zap any compound page when collapsing sptes Sean Christopherson
2020-01-08 20:24 ` [PATCH 12/14] KVM: x86/mmu: Fold max_mapping_level() into kvm_mmu_hugepage_adjust() Sean Christopherson
2020-01-21 15:12   ` Paolo Bonzini
2020-01-08 20:24 ` [PATCH 13/14] KVM: x86/mmu: Remove lpage_is_disallowed() check from set_spte() Sean Christopherson
2020-01-08 20:24 ` [PATCH 14/14] KVM: x86/mmu: Use huge pages for DAX-backed files Sean Christopherson
2020-01-09 19:47 ` [PATCH 00/14] KVM: x86/mmu: Huge page fixes, cleanup, and DAX Barret Rhoden
2020-01-21 15:10   ` Paolo Bonzini
