[patch 0/5] KVM: support for pinning sptes

All of lore.kernel.org
 help / color / mirror / Atom feed

* [patch 0/5] KVM: support for pinning sptes
@ 2014-06-18 23:12 mtosatti
  2014-06-18 23:12 ` [patch 1/5] KVM: x86: add pinned parameter to page_fault methods mtosatti
                   ` (5 more replies)
  0 siblings, 6 replies; 27+ messages in thread
From: mtosatti @ 2014-06-18 23:12 UTC (permalink / raw)
  To: kvm, ak; +Cc: pbonzini, xiaoguangrong, gleb, avi

Required by PEBS support as discussed at 

Subject: [patch 0/5] Implement PEBS virtualization for Silvermont
Message-Id: <1401412327-14810-1-git-send-email-andi@firstfloor.org>

Thread.




^ permalink raw reply	[flat|nested] 27+ messages in thread

* [patch 1/5] KVM: x86: add pinned parameter to page_fault methods
  2014-06-18 23:12 [patch 0/5] KVM: support for pinning sptes mtosatti
@ 2014-06-18 23:12 ` mtosatti
  2014-06-18 23:12 ` [patch 2/5] KVM: MMU: allow pinning spte translations (TDP-only) mtosatti
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 27+ messages in thread
From: mtosatti @ 2014-06-18 23:12 UTC (permalink / raw)
  To: kvm, ak; +Cc: pbonzini, xiaoguangrong, gleb, avi, Marcelo Tosatti

[-- Attachment #1: add-pinned-parameter-to-page-fault --]
[-- Type: text/plain, Size: 3675 bytes --]

To be used by next patch.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 arch/x86/include/asm/kvm_host.h |    2 +-
 arch/x86/kvm/mmu.c              |   11 ++++++-----
 arch/x86/kvm/paging_tmpl.h      |    2 +-
 arch/x86/kvm/x86.c              |    2 +-
 4 files changed, 9 insertions(+), 8 deletions(-)

Index: kvm.pinned-sptes/arch/x86/include/asm/kvm_host.h
===================================================================
--- kvm.pinned-sptes.orig/arch/x86/include/asm/kvm_host.h	2014-06-18 17:27:47.579549247 -0300
+++ kvm.pinned-sptes/arch/x86/include/asm/kvm_host.h	2014-06-18 17:28:17.549456614 -0300
@@ -259,7 +259,7 @@
 	unsigned long (*get_cr3)(struct kvm_vcpu *vcpu);
 	u64 (*get_pdptr)(struct kvm_vcpu *vcpu, int index);
 	int (*page_fault)(struct kvm_vcpu *vcpu, gva_t gva, u32 err,
-			  bool prefault);
+			  bool prefault, bool pin, bool *pinned);
 	void (*inject_page_fault)(struct kvm_vcpu *vcpu,
 				  struct x86_exception *fault);
 	gpa_t (*gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t gva, u32 access,
Index: kvm.pinned-sptes/arch/x86/kvm/mmu.c
===================================================================
--- kvm.pinned-sptes.orig/arch/x86/kvm/mmu.c	2014-06-18 17:27:47.582549238 -0300
+++ kvm.pinned-sptes/arch/x86/kvm/mmu.c	2014-06-18 17:28:17.550456611 -0300
@@ -2899,7 +2899,7 @@
 static void make_mmu_pages_available(struct kvm_vcpu *vcpu);
 
 static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, u32 error_code,
-			 gfn_t gfn, bool prefault)
+			 gfn_t gfn, bool prefault, bool pin, bool *pinned)
 {
 	int r;
 	int level;
@@ -3299,7 +3299,8 @@
 }
 
 static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
-				u32 error_code, bool prefault)
+				u32 error_code, bool prefault, bool pin,
+				bool *pinned)
 {
 	gfn_t gfn;
 	int r;
@@ -3323,7 +3324,7 @@
 	gfn = gva >> PAGE_SHIFT;
 
 	return nonpaging_map(vcpu, gva & PAGE_MASK,
-			     error_code, gfn, prefault);
+			     error_code, gfn, prefault, pin, pinned);
 }
 
 static int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn)
@@ -3373,7 +3374,7 @@
 }
 
 static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
-			  bool prefault)
+			  bool prefault, bool pin, bool *pinned)
 {
 	pfn_t pfn;
 	int r;
@@ -4190,7 +4191,7 @@
 	int r, emulation_type = EMULTYPE_RETRY;
 	enum emulation_result er;
 
-	r = vcpu->arch.mmu.page_fault(vcpu, cr2, error_code, false);
+	r = vcpu->arch.mmu.page_fault(vcpu, cr2, error_code, false, false, NULL);
 	if (r < 0)
 		goto out;
 
Index: kvm.pinned-sptes/arch/x86/kvm/paging_tmpl.h
===================================================================
--- kvm.pinned-sptes.orig/arch/x86/kvm/paging_tmpl.h	2014-06-18 17:27:47.583549234 -0300
+++ kvm.pinned-sptes/arch/x86/kvm/paging_tmpl.h	2014-06-18 17:28:17.550456611 -0300
@@ -687,7 +687,7 @@
  *           a negative value on error.
  */
 static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
-			     bool prefault)
+			     bool prefault, bool pin, bool *pinned)
 {
 	int write_fault = error_code & PFERR_WRITE_MASK;
 	int user_fault = error_code & PFERR_USER_MASK;
Index: kvm.pinned-sptes/arch/x86/kvm/x86.c
===================================================================
--- kvm.pinned-sptes.orig/arch/x86/kvm/x86.c	2014-06-18 17:27:47.586549225 -0300
+++ kvm.pinned-sptes/arch/x86/kvm/x86.c	2014-06-18 17:28:17.552456605 -0300
@@ -7415,7 +7415,7 @@
 	      work->arch.cr3 != vcpu->arch.mmu.get_cr3(vcpu))
 		return;
 
-	vcpu->arch.mmu.page_fault(vcpu, work->gva, 0, true);
+	vcpu->arch.mmu.page_fault(vcpu, work->gva, 0, true, false, NULL);
 }
 
 static inline u32 kvm_async_pf_hash_fn(gfn_t gfn)



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [patch 2/5] KVM: MMU: allow pinning spte translations (TDP-only)
  2014-06-18 23:12 [patch 0/5] KVM: support for pinning sptes mtosatti
  2014-06-18 23:12 ` [patch 1/5] KVM: x86: add pinned parameter to page_fault methods mtosatti
@ 2014-06-18 23:12 ` mtosatti
  2014-06-19  7:21   ` Gleb Natapov
                     ` (2 more replies)
  2014-06-18 23:12 ` [patch 3/5] KVM: MMU: notifiers support for pinned sptes mtosatti
                   ` (3 subsequent siblings)
  5 siblings, 3 replies; 27+ messages in thread
From: mtosatti @ 2014-06-18 23:12 UTC (permalink / raw)
  To: kvm, ak; +Cc: pbonzini, xiaoguangrong, gleb, avi, Marcelo Tosatti

[-- Attachment #1: mmu-pinned-spte --]
[-- Type: text/plain, Size: 13207 bytes --]

Allow vcpus to pin spte translations by:

1) Creating a per-vcpu list of pinned ranges.
2) On mmu reload request:
	- Fault ranges.
	- Mark sptes with a pinned bit.
	- Mark shadow pages as pinned.

3) Then modify the following actions:
	- Page age => skip spte flush.
	- MMU notifiers => force mmu reload request (which kicks cpu out of
				guest mode).
	- GET_DIRTY_LOG => force mmu reload request.
	- SLAB shrinker => skip shadow page deletion.

TDP-only.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 arch/x86/include/asm/kvm_host.h |   14 ++
 arch/x86/kvm/mmu.c              |  202 ++++++++++++++++++++++++++++++++++++++--
 arch/x86/kvm/mmu.h              |    5 
 arch/x86/kvm/mmutrace.h         |   23 ++++
 arch/x86/kvm/paging_tmpl.h      |    2 
 arch/x86/kvm/x86.c              |    4 
 6 files changed, 241 insertions(+), 9 deletions(-)

Index: kvm.pinned-sptes/arch/x86/include/asm/kvm_host.h
===================================================================
--- kvm.pinned-sptes.orig/arch/x86/include/asm/kvm_host.h	2014-06-18 17:28:17.549456614 -0300
+++ kvm.pinned-sptes/arch/x86/include/asm/kvm_host.h	2014-06-18 17:28:24.338435658 -0300
@@ -221,6 +221,8 @@
 	/* hold the gfn of each spte inside spt */
 	gfn_t *gfns;
 	bool unsync;
+	bool pinned;
+
 	int root_count;          /* Currently serving as active root */
 	unsigned int unsync_children;
 	unsigned long parent_ptes;	/* Reverse mapping for parent_pte */
@@ -337,6 +339,14 @@
 	KVM_DEBUGREG_WONT_EXIT = 2,
 };
 
+struct kvm_pinned_page_range {
+	gfn_t base_gfn;
+	unsigned long npages;
+	struct list_head link;
+};
+
+#define KVM_MAX_PER_VCPU_PINNED_RANGE 10
+
 struct kvm_vcpu_arch {
 	/*
 	 * rip and regs accesses must go through
@@ -392,6 +402,10 @@
 	struct kvm_mmu_memory_cache mmu_page_cache;
 	struct kvm_mmu_memory_cache mmu_page_header_cache;
 
+	struct list_head pinned_mmu_pages;
+	struct mutex pinned_mmu_mutex;
+	unsigned int nr_pinned_ranges;
+
 	struct fpu guest_fpu;
 	u64 xcr0;
 	u64 guest_supported_xcr0;
Index: kvm.pinned-sptes/arch/x86/kvm/mmu.c
===================================================================
--- kvm.pinned-sptes.orig/arch/x86/kvm/mmu.c	2014-06-18 17:28:17.550456611 -0300
+++ kvm.pinned-sptes/arch/x86/kvm/mmu.c	2014-06-18 17:28:24.339435654 -0300
@@ -148,6 +148,9 @@
 
 #define SPTE_HOST_WRITEABLE	(1ULL << PT_FIRST_AVAIL_BITS_SHIFT)
 #define SPTE_MMU_WRITEABLE	(1ULL << (PT_FIRST_AVAIL_BITS_SHIFT + 1))
+#define SPTE_PINNED		(1ULL << (PT64_SECOND_AVAIL_BITS_SHIFT))
+
+#define SPTE_PINNED_BIT PT64_SECOND_AVAIL_BITS_SHIFT
 
 #define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
 
@@ -327,6 +330,11 @@
 	return pte & PT_PRESENT_MASK && !is_mmio_spte(pte);
 }
 
+static int is_pinned_spte(u64 spte)
+{
+	return spte & SPTE_PINNED && is_shadow_present_pte(spte);
+}
+
 static int is_large_pte(u64 pte)
 {
 	return pte & PT_PAGE_SIZE_MASK;
@@ -2818,7 +2826,7 @@
  * - false: let the real page fault path to fix it.
  */
 static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
-			    u32 error_code)
+			    u32 error_code, bool pin)
 {
 	struct kvm_shadow_walk_iterator iterator;
 	struct kvm_mmu_page *sp;
@@ -2828,6 +2836,9 @@
 	if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
 		return false;
 
+	if (pin)
+		return false;
+
 	if (!page_fault_can_be_fast(error_code))
 		return false;
 
@@ -2895,9 +2906,55 @@
 }
 
 static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
-			 gva_t gva, pfn_t *pfn, bool write, bool *writable);
+			 gva_t gva, pfn_t *pfn, bool write, bool *writable,
+			 bool pin);
 static void make_mmu_pages_available(struct kvm_vcpu *vcpu);
 
+
+static int get_sptep_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes[4])
+{
+	struct kvm_shadow_walk_iterator iterator;
+	int nr_sptes = 0;
+
+	if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
+		return nr_sptes;
+
+	for_each_shadow_entry(vcpu, addr, iterator) {
+		sptes[iterator.level-1] = iterator.sptep;
+		nr_sptes++;
+		if (!is_shadow_present_pte(*iterator.sptep))
+			break;
+	}
+
+	return nr_sptes;
+}
+
+static bool direct_pin_sptes(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	u64 *sptes[4];
+	int r, i, level;
+
+	r = get_sptep_hierarchy(vcpu, gfn << PAGE_SHIFT, sptes);
+	if (!r)
+		return false;
+
+	level = 5 - r;
+	if (!is_last_spte(*sptes[r-1], level))
+		return false;
+	if (!is_shadow_present_pte(*sptes[r-1]))
+		return false;
+
+	for (i = 0; i < r; i++) {
+		u64 *sptep = sptes[i];
+		struct kvm_mmu_page *sp = page_header(__pa(sptep));
+
+		sp->pinned = true;
+		set_bit(SPTE_PINNED_BIT, (unsigned long *)sptep);
+	}
+
+	return true;
+}
+
 static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, u32 error_code,
 			 gfn_t gfn, bool prefault, bool pin, bool *pinned)
 {
@@ -2923,13 +2980,14 @@
 	} else
 		level = PT_PAGE_TABLE_LEVEL;
 
-	if (fast_page_fault(vcpu, v, level, error_code))
+	if (fast_page_fault(vcpu, v, level, error_code, pin))
 		return 0;
 
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
 
-	if (try_async_pf(vcpu, prefault, gfn, v, &pfn, write, &map_writable))
+	if (try_async_pf(vcpu, prefault, gfn, v, &pfn, write, &map_writable,
+			 pin))
 		return 0;
 
 	if (handle_abnormal_pfn(vcpu, v, gfn, pfn, ACC_ALL, &r))
@@ -2943,6 +3001,8 @@
 		transparent_hugepage_adjust(vcpu, &gfn, &pfn, &level);
 	r = __direct_map(vcpu, v, write, map_writable, level, gfn, pfn,
 			 prefault);
+	if (pin)
+		*pinned = direct_pin_sptes(vcpu, gfn);
 	spin_unlock(&vcpu->kvm->mmu_lock);
 
 
@@ -3349,7 +3409,8 @@
 }
 
 static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
-			 gva_t gva, pfn_t *pfn, bool write, bool *writable)
+			 gva_t gva, pfn_t *pfn, bool write, bool *writable,
+			 bool pin)
 {
 	bool async;
 
@@ -3358,7 +3419,7 @@
 	if (!async)
 		return false; /* *pfn has correct page already */
 
-	if (!prefault && can_do_async_pf(vcpu)) {
+	if (!prefault && !pin && can_do_async_pf(vcpu)) {
 		trace_kvm_try_async_get_page(gva, gfn);
 		if (kvm_find_async_pf_gfn(vcpu, gfn)) {
 			trace_kvm_async_pf_doublefault(gva, gfn);
@@ -3406,13 +3467,14 @@
 	} else
 		level = PT_PAGE_TABLE_LEVEL;
 
-	if (fast_page_fault(vcpu, gpa, level, error_code))
+	if (fast_page_fault(vcpu, gpa, level, error_code, pin))
 		return 0;
 
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
 
-	if (try_async_pf(vcpu, prefault, gfn, gpa, &pfn, write, &map_writable))
+	if (try_async_pf(vcpu, prefault, gfn, gpa, &pfn, write, &map_writable,
+			 pin))
 		return 0;
 
 	if (handle_abnormal_pfn(vcpu, 0, gfn, pfn, ACC_ALL, &r))
@@ -3426,6 +3488,8 @@
 		transparent_hugepage_adjust(vcpu, &gfn, &pfn, &level);
 	r = __direct_map(vcpu, gpa, write, map_writable,
 			 level, gfn, pfn, prefault);
+	if (pin)
+		*pinned = direct_pin_sptes(vcpu, gfn);
 	spin_unlock(&vcpu->kvm->mmu_lock);
 
 	return r;
@@ -3903,6 +3967,127 @@
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_reset_context);
 
+int kvm_mmu_register_pinned_range(struct kvm_vcpu *vcpu,
+				  gfn_t base_gfn, unsigned long npages)
+{
+	struct kvm_pinned_page_range *p;
+
+	mutex_lock(&vcpu->arch.pinned_mmu_mutex);
+	list_for_each_entry(p, &vcpu->arch.pinned_mmu_pages, link) {
+		if (p->base_gfn == base_gfn && p->npages == npages) {
+			mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
+			return -EEXIST;
+		}
+	}
+	mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
+
+	if (vcpu->arch.nr_pinned_ranges >=
+	    KVM_MAX_PER_VCPU_PINNED_RANGE)
+		return -ENOSPC;
+
+	p = kzalloc(sizeof(struct kvm_pinned_page_range), GFP_KERNEL);
+	if (!p)
+		return -ENOMEM;
+
+	vcpu->arch.nr_pinned_ranges++;
+
+	trace_kvm_mmu_register_pinned_range(vcpu->vcpu_id, base_gfn, npages);
+
+	INIT_LIST_HEAD(&p->link);
+	p->base_gfn = base_gfn;
+	p->npages = npages;
+	mutex_lock(&vcpu->arch.pinned_mmu_mutex);
+	list_add(&p->link, &vcpu->arch.pinned_mmu_pages);
+	mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
+	kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
+
+	return 0;
+}
+
+int kvm_mmu_unregister_pinned_range(struct kvm_vcpu *vcpu,
+				    gfn_t base_gfn, unsigned long npages)
+{
+	struct kvm_pinned_page_range *p;
+
+	mutex_lock(&vcpu->arch.pinned_mmu_mutex);
+	list_for_each_entry(p, &vcpu->arch.pinned_mmu_pages, link) {
+		if (p->base_gfn == base_gfn && p->npages == npages) {
+			list_del(&p->link);
+			vcpu->arch.nr_pinned_ranges--;
+			mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
+			kfree(p);
+			return 0;
+		}
+	}
+
+	mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
+	return -ENOENT;
+}
+
+void kvm_mmu_free_pinned_ranges(struct kvm_vcpu *vcpu)
+{
+	struct kvm_pinned_page_range *p, *p2;
+
+	mutex_lock(&vcpu->arch.pinned_mmu_mutex);
+	list_for_each_entry_safe(p, p2, &vcpu->arch.pinned_mmu_pages, link) {
+		list_del(&p->link);
+		kfree(p);
+	}
+	mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
+}
+
+/*
+ * Pin KVM MMU page translations. This guarantees, for valid
+ * addresses registered by kvm_mmu_register_pinned_range (valid address
+ * meaning address which posses sufficient information for fault to
+ * be resolved), valid translations exist while in guest mode and
+ * therefore no VM-exits due to faults will occur.
+ *
+ * Failure to instantiate pages will abort guest entry.
+ *
+ * Page frames should be pinned with get_page in advance.
+ *
+ * Pinning is not guaranteed while executing as L2 guest.
+ *
+ */
+
+static void kvm_mmu_pin_pages(struct kvm_vcpu *vcpu)
+{
+	struct kvm_pinned_page_range *p;
+
+	if (is_guest_mode(vcpu))
+		return;
+
+	if (!vcpu->arch.mmu.direct_map)
+		return;
+
+	ASSERT(VALID_PAGE(vcpu->arch.mmu.root_hpa));
+
+	mutex_lock(&vcpu->arch.pinned_mmu_mutex);
+	list_for_each_entry(p, &vcpu->arch.pinned_mmu_pages, link) {
+		gfn_t gfn_offset;
+
+		for (gfn_offset = 0; gfn_offset < p->npages; gfn_offset++) {
+			gfn_t gfn = p->base_gfn + gfn_offset;
+			int r;
+			bool pinned = false;
+
+			r = vcpu->arch.mmu.page_fault(vcpu, gfn << PAGE_SHIFT,
+						     PFERR_WRITE_MASK, false,
+						     true, &pinned);
+			/* MMU notifier sequence window: retry */
+			if (!r && !pinned)
+				kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
+			if (r) {
+				kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
+				break;
+			}
+
+		}
+	}
+	mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
+}
+
 int kvm_mmu_load(struct kvm_vcpu *vcpu)
 {
 	int r;
@@ -3916,6 +4101,7 @@
 		goto out;
 	/* set_cr3() should ensure TLB has been flushed */
 	vcpu->arch.mmu.set_cr3(vcpu, vcpu->arch.mmu.root_hpa);
+	kvm_mmu_pin_pages(vcpu);
 out:
 	return r;
 }
Index: kvm.pinned-sptes/arch/x86/kvm/mmu.h
===================================================================
--- kvm.pinned-sptes.orig/arch/x86/kvm/mmu.h	2014-06-18 17:27:47.582549238 -0300
+++ kvm.pinned-sptes/arch/x86/kvm/mmu.h	2014-06-18 17:28:24.339435654 -0300
@@ -178,4 +178,9 @@
 }
 
 void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm);
+int kvm_mmu_register_pinned_range(struct kvm_vcpu *vcpu,
+				 gfn_t base_gfn, unsigned long npages);
+int kvm_mmu_unregister_pinned_range(struct kvm_vcpu *vcpu,
+				 gfn_t base_gfn, unsigned long npages);
+void kvm_mmu_free_pinned_ranges(struct kvm_vcpu *vcpu);
 #endif
Index: kvm.pinned-sptes/arch/x86/kvm/x86.c
===================================================================
--- kvm.pinned-sptes.orig/arch/x86/kvm/x86.c	2014-06-18 17:28:17.552456605 -0300
+++ kvm.pinned-sptes/arch/x86/kvm/x86.c	2014-06-18 17:28:24.340435651 -0300
@@ -7049,6 +7049,8 @@
 
 	kvm_async_pf_hash_reset(vcpu);
 	kvm_pmu_init(vcpu);
+	INIT_LIST_HEAD(&vcpu->arch.pinned_mmu_pages);
+	mutex_init(&vcpu->arch.pinned_mmu_mutex);
 
 	return 0;
 fail_free_wbinvd_dirty_mask:
@@ -7069,6 +7071,7 @@
 {
 	int idx;
 
+	kvm_mmu_free_pinned_ranges(vcpu);
 	kvm_pmu_destroy(vcpu);
 	kfree(vcpu->arch.mce_banks);
 	kvm_free_lapic(vcpu);
@@ -7113,6 +7116,7 @@
 	int r;
 	r = vcpu_load(vcpu);
 	BUG_ON(r);
+	kvm_mmu_free_pinned_ranges(vcpu);
 	kvm_mmu_unload(vcpu);
 	vcpu_put(vcpu);
 }
Index: kvm.pinned-sptes/arch/x86/kvm/paging_tmpl.h
===================================================================
--- kvm.pinned-sptes.orig/arch/x86/kvm/paging_tmpl.h	2014-06-18 17:28:17.550456611 -0300
+++ kvm.pinned-sptes/arch/x86/kvm/paging_tmpl.h	2014-06-18 17:28:24.340435651 -0300
@@ -747,7 +747,7 @@
 	smp_rmb();
 
 	if (try_async_pf(vcpu, prefault, walker.gfn, addr, &pfn, write_fault,
-			 &map_writable))
+			 &map_writable, false))
 		return 0;
 
 	if (handle_abnormal_pfn(vcpu, mmu_is_nested(vcpu) ? 0 : addr,
Index: kvm.pinned-sptes/arch/x86/kvm/mmutrace.h
===================================================================
--- kvm.pinned-sptes.orig/arch/x86/kvm/mmutrace.h	2014-06-18 17:27:47.583549234 -0300
+++ kvm.pinned-sptes/arch/x86/kvm/mmutrace.h	2014-06-18 17:28:24.340435651 -0300
@@ -322,6 +322,29 @@
 		  __entry->kvm_gen == __entry->spte_gen
 	)
 );
+
+TRACE_EVENT(
+	kvm_mmu_register_pinned_range,
+	TP_PROTO(unsigned int vcpu_id, gfn_t gfn, unsigned long npages),
+	TP_ARGS(vcpu_id, gfn, npages),
+
+	TP_STRUCT__entry(
+		__field(	unsigned int,	vcpu_id	)
+		__field(	gfn_t,		gfn	)
+		__field(	unsigned long,	npages	)
+	),
+
+	TP_fast_assign(
+		__entry->vcpu_id		= vcpu_id;
+		__entry->gfn			= gfn;
+		__entry->npages			= npages;
+	),
+
+	TP_printk("vcpu_id %u gfn %llx npages %lx",
+		  __entry->vcpu_id,
+		  __entry->gfn,
+		  __entry->npages)
+);
 #endif /* _TRACE_KVMMMU_H */
 
 #undef TRACE_INCLUDE_PATH



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [patch 3/5] KVM: MMU: notifiers support for pinned sptes
  2014-06-18 23:12 [patch 0/5] KVM: support for pinning sptes mtosatti
  2014-06-18 23:12 ` [patch 1/5] KVM: x86: add pinned parameter to page_fault methods mtosatti
  2014-06-18 23:12 ` [patch 2/5] KVM: MMU: allow pinning spte translations (TDP-only) mtosatti
@ 2014-06-18 23:12 ` mtosatti
  2014-06-19  6:48   ` Gleb Natapov
  2014-06-18 23:12 ` [patch 4/5] KVM: MMU: reload request from GET_DIRTY_LOG path mtosatti
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 27+ messages in thread
From: mtosatti @ 2014-06-18 23:12 UTC (permalink / raw)
  To: kvm, ak; +Cc: pbonzini, xiaoguangrong, gleb, avi, Marcelo Tosatti

[-- Attachment #1: mmu-notifier-handle-pinned --]
[-- Type: text/plain, Size: 4606 bytes --]

Request KVM_REQ_MMU_RELOAD when deleting sptes from MMU notifiers.

Keep pinned sptes intact if page aging.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 arch/x86/kvm/mmu.c |   71 ++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 62 insertions(+), 9 deletions(-)

Index: kvm.pinned-sptes/arch/x86/kvm/mmu.c
===================================================================
--- kvm.pinned-sptes.orig/arch/x86/kvm/mmu.c	2014-06-18 17:28:24.339435654 -0300
+++ kvm.pinned-sptes/arch/x86/kvm/mmu.c	2014-06-18 17:29:32.510225755 -0300
@@ -1184,6 +1184,42 @@
 		kvm_flush_remote_tlbs(vcpu->kvm);
 }
 
+static void ack_flush(void *_completed)
+{
+}
+
+static void mmu_reload_pinned_vcpus(struct kvm *kvm)
+{
+	int i, cpu, me;
+	cpumask_var_t cpus;
+	struct kvm_vcpu *vcpu;
+	unsigned int req = KVM_REQ_MMU_RELOAD;
+
+	zalloc_cpumask_var(&cpus, GFP_ATOMIC);
+
+	me = get_cpu();
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		if (list_empty(&vcpu->arch.pinned_mmu_pages))
+			continue;
+		kvm_make_request(req, vcpu);
+		cpu = vcpu->cpu;
+
+		/* Set ->requests bit before we read ->mode */
+		smp_mb();
+
+		if (cpus != NULL && cpu != -1 && cpu != me &&
+		      kvm_vcpu_exiting_guest_mode(vcpu) != OUTSIDE_GUEST_MODE)
+			cpumask_set_cpu(cpu, cpus);
+	}
+	if (unlikely(cpus == NULL))
+		smp_call_function_many(cpu_online_mask, ack_flush, NULL, 1);
+	else if (!cpumask_empty(cpus))
+		smp_call_function_many(cpus, ack_flush, NULL, 1);
+	put_cpu();
+	free_cpumask_var(cpus);
+	return;
+}
+
 /*
  * Write-protect on the specified @sptep, @pt_protect indicates whether
  * spte write-protection is caused by protecting shadow page table.
@@ -1276,7 +1312,8 @@
 }
 
 static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp,
-			   struct kvm_memory_slot *slot, unsigned long data)
+			   struct kvm_memory_slot *slot, unsigned long data,
+			   bool age)
 {
 	u64 *sptep;
 	struct rmap_iterator iter;
@@ -1286,6 +1323,14 @@
 		BUG_ON(!(*sptep & PT_PRESENT_MASK));
 		rmap_printk("kvm_rmap_unmap_hva: spte %p %llx\n", sptep, *sptep);
 
+		if (is_pinned_spte(*sptep)) {
+			/* don't nuke pinned sptes if page aging: return
+ 			 * young=yes instead.
+ 			 */
+			if (age)
+				return 1;
+			mmu_reload_pinned_vcpus(kvm);
+		}
 		drop_spte(kvm, sptep);
 		need_tlb_flush = 1;
 	}
@@ -1294,7 +1339,8 @@
 }
 
 static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long *rmapp,
-			     struct kvm_memory_slot *slot, unsigned long data)
+			     struct kvm_memory_slot *slot, unsigned long data,
+			     bool age)
 {
 	u64 *sptep;
 	struct rmap_iterator iter;
@@ -1312,6 +1358,9 @@
 
 		need_flush = 1;
 
+		if (is_pinned_spte(*sptep))
+			mmu_reload_pinned_vcpus(kvm);
+
 		if (pte_write(*ptep)) {
 			drop_spte(kvm, sptep);
 			sptep = rmap_get_first(*rmapp, &iter);
@@ -1342,7 +1391,8 @@
 				int (*handler)(struct kvm *kvm,
 					       unsigned long *rmapp,
 					       struct kvm_memory_slot *slot,
-					       unsigned long data))
+					       unsigned long data,
+					       bool age))
 {
 	int j;
 	int ret = 0;
@@ -1382,7 +1432,7 @@
 			rmapp = __gfn_to_rmap(gfn_start, j, memslot);
 
 			for (; idx <= idx_end; ++idx)
-				ret |= handler(kvm, rmapp++, memslot, data);
+				ret |= handler(kvm, rmapp++, memslot, data, false);
 		}
 	}
 
@@ -1393,7 +1443,8 @@
 			  unsigned long data,
 			  int (*handler)(struct kvm *kvm, unsigned long *rmapp,
 					 struct kvm_memory_slot *slot,
-					 unsigned long data))
+					 unsigned long data,
+					 bool age))
 {
 	return kvm_handle_hva_range(kvm, hva, hva + 1, data, handler);
 }
@@ -1414,7 +1465,8 @@
 }
 
 static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp,
-			 struct kvm_memory_slot *slot, unsigned long data)
+			 struct kvm_memory_slot *slot, unsigned long data,
+			 bool age)
 {
 	u64 *sptep;
 	struct rmap_iterator uninitialized_var(iter);
@@ -1429,7 +1481,7 @@
 	 * out actively used pages or breaking up actively used hugepages.
 	 */
 	if (!shadow_accessed_mask) {
-		young = kvm_unmap_rmapp(kvm, rmapp, slot, data);
+		young = kvm_unmap_rmapp(kvm, rmapp, slot, data, true);
 		goto out;
 	}
 
@@ -1450,7 +1502,8 @@
 }
 
 static int kvm_test_age_rmapp(struct kvm *kvm, unsigned long *rmapp,
-			      struct kvm_memory_slot *slot, unsigned long data)
+			      struct kvm_memory_slot *slot, unsigned long data,
+			      bool age)
 {
 	u64 *sptep;
 	struct rmap_iterator iter;
@@ -1488,7 +1541,7 @@
 
 	rmapp = gfn_to_rmap(vcpu->kvm, gfn, sp->role.level);
 
-	kvm_unmap_rmapp(vcpu->kvm, rmapp, NULL, 0);
+	kvm_unmap_rmapp(vcpu->kvm, rmapp, NULL, 0, false);
 	kvm_flush_remote_tlbs(vcpu->kvm);
 }
 



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [patch 4/5] KVM: MMU: reload request from GET_DIRTY_LOG path
  2014-06-18 23:12 [patch 0/5] KVM: support for pinning sptes mtosatti
                   ` (2 preceding siblings ...)
  2014-06-18 23:12 ` [patch 3/5] KVM: MMU: notifiers support for pinned sptes mtosatti
@ 2014-06-18 23:12 ` mtosatti
  2014-06-19  8:17   ` Gleb Natapov
  2014-06-18 23:12 ` [patch 5/5] KVM: MMU: pinned sps are not candidates for deletion mtosatti
  2014-06-19  1:44 ` [patch 0/5] KVM: support for pinning sptes Andi Kleen
  5 siblings, 1 reply; 27+ messages in thread
From: mtosatti @ 2014-06-18 23:12 UTC (permalink / raw)
  To: kvm, ak; +Cc: pbonzini, xiaoguangrong, gleb, avi, Marcelo Tosatti

[-- Attachment #1: mmu-get-dirty-log-pinned --]
[-- Type: text/plain, Size: 676 bytes --]

Reload remote vcpus MMU from GET_DIRTY_LOG codepath, before
deleting a pinned spte.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 arch/x86/kvm/mmu.c |    3 +++
 1 file changed, 3 insertions(+)

Index: kvm.pinned-sptes/arch/x86/kvm/mmu.c
===================================================================
--- kvm.pinned-sptes.orig/arch/x86/kvm/mmu.c	2014-06-13 16:50:50.040140594 -0300
+++ kvm.pinned-sptes/arch/x86/kvm/mmu.c	2014-06-13 16:51:05.620104451 -0300
@@ -1247,6 +1247,9 @@
 		spte &= ~SPTE_MMU_WRITEABLE;
 	spte = spte & ~PT_WRITABLE_MASK;
 
+	if (is_pinned_spte(spte))
+		mmu_reload_pinned_vcpus(kvm);
+
 	return mmu_spte_update(sptep, spte);
 }
 



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [patch 5/5] KVM: MMU: pinned sps are not candidates for deletion.
  2014-06-18 23:12 [patch 0/5] KVM: support for pinning sptes mtosatti
                   ` (3 preceding siblings ...)
  2014-06-18 23:12 ` [patch 4/5] KVM: MMU: reload request from GET_DIRTY_LOG path mtosatti
@ 2014-06-18 23:12 ` mtosatti
  2014-06-19  1:44 ` [patch 0/5] KVM: support for pinning sptes Andi Kleen
  5 siblings, 0 replies; 27+ messages in thread
From: mtosatti @ 2014-06-18 23:12 UTC (permalink / raw)
  To: kvm, ak; +Cc: pbonzini, xiaoguangrong, gleb, avi, Marcelo Tosatti

[-- Attachment #1: mmu-pinned-spte-skip --]
[-- Type: text/plain, Size: 1703 bytes --]

Skip pinned shadow pages when selecting pages to zap.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 arch/x86/kvm/mmu.c |   26 ++++++++++++++++++--------
 1 file changed, 18 insertions(+), 8 deletions(-)

Index: kvm/arch/x86/kvm/mmu.c
===================================================================
--- kvm.orig/arch/x86/kvm/mmu.c
+++ kvm/arch/x86/kvm/mmu.c
@@ -2279,16 +2279,24 @@ static void kvm_mmu_commit_zap_page(stru
 static bool prepare_zap_oldest_mmu_page(struct kvm *kvm,
 					struct list_head *invalid_list)
 {
-	struct kvm_mmu_page *sp;
-
-	if (list_empty(&kvm->arch.active_mmu_pages))
-		return false;
+	struct kvm_mmu_page *sp, *nsp;
+	LIST_HEAD(pinned_list);
 
-	sp = list_entry(kvm->arch.active_mmu_pages.prev,
-			struct kvm_mmu_page, link);
-	kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);
+	list_for_each_entry_safe_reverse(sp, nsp,
+					 &kvm->arch.active_mmu_pages, link) {
+		if (sp->pinned) {
+			list_move(&sp->link, &pinned_list);
+			continue;
+		}
+		if (!list_empty(&pinned_list))
+			list_move(&pinned_list, &kvm->arch.active_mmu_pages);
+		kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);
+		return true;
+	}
 
-	return true;
+	if (!list_empty(&pinned_list))
+		list_move(&pinned_list, &kvm->arch.active_mmu_pages);
+	return false;
 }
 
 /*
@@ -4660,6 +4668,8 @@ void kvm_mmu_invalidate_zap_all_pages(st
 	 * Notify all vcpus to reload its shadow page table
 	 * and flush TLB. Then all vcpus will switch to new
 	 * shadow page table with the new mmu_valid_gen.
+	 * MMU reload request also forces fault of
+	 * sptes for pinned ranges.
 	 *
 	 * Note: we should do this under the protection of
 	 * mmu-lock, otherwise, vcpu would purge shadow page



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch 0/5] KVM: support for pinning sptes
  2014-06-18 23:12 [patch 0/5] KVM: support for pinning sptes mtosatti
                   ` (4 preceding siblings ...)
  2014-06-18 23:12 ` [patch 5/5] KVM: MMU: pinned sps are not candidates for deletion mtosatti
@ 2014-06-19  1:44 ` Andi Kleen
  5 siblings, 0 replies; 27+ messages in thread
From: Andi Kleen @ 2014-06-19  1:44 UTC (permalink / raw)
  To: mtosatti; +Cc: kvm, pbonzini, xiaoguangrong, gleb, avi

On Wed, Jun 18, 2014 at 08:12:03PM -0300, mtosatti@redhat.com wrote:
> Required by PEBS support as discussed at 
> 
> Subject: [patch 0/5] Implement PEBS virtualization for Silvermont
> Message-Id: <1401412327-14810-1-git-send-email-andi@firstfloor.org>

Thanks marcelo. I'll give it a stress test here.

-Andi

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch 3/5] KVM: MMU: notifiers support for pinned sptes
  2014-06-18 23:12 ` [patch 3/5] KVM: MMU: notifiers support for pinned sptes mtosatti
@ 2014-06-19  6:48   ` Gleb Natapov
  2014-06-19 18:28     ` Marcelo Tosatti
  0 siblings, 1 reply; 27+ messages in thread
From: Gleb Natapov @ 2014-06-19  6:48 UTC (permalink / raw)
  To: mtosatti; +Cc: kvm, ak, pbonzini, xiaoguangrong, avi

On Wed, Jun 18, 2014 at 08:12:06PM -0300, mtosatti@redhat.com wrote:
> Request KVM_REQ_MMU_RELOAD when deleting sptes from MMU notifiers.
> 
> Keep pinned sptes intact if page aging.
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> ---
>  arch/x86/kvm/mmu.c |   71 ++++++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 62 insertions(+), 9 deletions(-)
> 
> Index: kvm.pinned-sptes/arch/x86/kvm/mmu.c
> ===================================================================
> --- kvm.pinned-sptes.orig/arch/x86/kvm/mmu.c	2014-06-18 17:28:24.339435654 -0300
> +++ kvm.pinned-sptes/arch/x86/kvm/mmu.c	2014-06-18 17:29:32.510225755 -0300
> @@ -1184,6 +1184,42 @@
>  		kvm_flush_remote_tlbs(vcpu->kvm);
>  }
>  
> +static void ack_flush(void *_completed)
> +{
> +}
> +
> +static void mmu_reload_pinned_vcpus(struct kvm *kvm)
> +{
> +	int i, cpu, me;
> +	cpumask_var_t cpus;
> +	struct kvm_vcpu *vcpu;
> +	unsigned int req = KVM_REQ_MMU_RELOAD;
> +
> +	zalloc_cpumask_var(&cpus, GFP_ATOMIC);
> +
> +	me = get_cpu();
> +	kvm_for_each_vcpu(i, vcpu, kvm) {
> +		if (list_empty(&vcpu->arch.pinned_mmu_pages))
> +			continue;
> +		kvm_make_request(req, vcpu);
> +		cpu = vcpu->cpu;
> +
> +		/* Set ->requests bit before we read ->mode */
> +		smp_mb();
> +
> +		if (cpus != NULL && cpu != -1 && cpu != me &&
> +		      kvm_vcpu_exiting_guest_mode(vcpu) != OUTSIDE_GUEST_MODE)
> +			cpumask_set_cpu(cpu, cpus);
> +	}
> +	if (unlikely(cpus == NULL))
> +		smp_call_function_many(cpu_online_mask, ack_flush, NULL, 1);
> +	else if (!cpumask_empty(cpus))
> +		smp_call_function_many(cpus, ack_flush, NULL, 1);
> +	put_cpu();
> +	free_cpumask_var(cpus);
> +	return;
> +}
This is a c&p of make_all_cpus_request(), the only difference is checking
of vcpu->arch.pinned_mmu_pages.  You can add make_some_cpus_request(..., bool (*predicate)(struct kvm_vcpu *))
to kvm_main.c and rewrite make_all_cpus_request() to use it instead.

> +
>  /*
>   * Write-protect on the specified @sptep, @pt_protect indicates whether
>   * spte write-protection is caused by protecting shadow page table.
> @@ -1276,7 +1312,8 @@
>  }
>  
>  static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp,
> -			   struct kvm_memory_slot *slot, unsigned long data)
> +			   struct kvm_memory_slot *slot, unsigned long data,
> +			   bool age)
>  {
>  	u64 *sptep;
>  	struct rmap_iterator iter;
> @@ -1286,6 +1323,14 @@
>  		BUG_ON(!(*sptep & PT_PRESENT_MASK));
>  		rmap_printk("kvm_rmap_unmap_hva: spte %p %llx\n", sptep, *sptep);
>  
> +		if (is_pinned_spte(*sptep)) {
> +			/* don't nuke pinned sptes if page aging: return
> + 			 * young=yes instead.
> + 			 */
> +			if (age)
> +				return 1;
> +			mmu_reload_pinned_vcpus(kvm);
> +		}
>  		drop_spte(kvm, sptep);
>  		need_tlb_flush = 1;
>  	}
> @@ -1294,7 +1339,8 @@
>  }
>  
>  static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long *rmapp,
> -			     struct kvm_memory_slot *slot, unsigned long data)
> +			     struct kvm_memory_slot *slot, unsigned long data,
> +			     bool age)
>  {
>  	u64 *sptep;
>  	struct rmap_iterator iter;
> @@ -1312,6 +1358,9 @@
>  
>  		need_flush = 1;
>  
> +		if (is_pinned_spte(*sptep))
> +			mmu_reload_pinned_vcpus(kvm);
> +
>  		if (pte_write(*ptep)) {
>  			drop_spte(kvm, sptep);
>  			sptep = rmap_get_first(*rmapp, &iter);
> @@ -1342,7 +1391,8 @@
>  				int (*handler)(struct kvm *kvm,
>  					       unsigned long *rmapp,
>  					       struct kvm_memory_slot *slot,
> -					       unsigned long data))
> +					       unsigned long data,
> +					       bool age))
>  {
>  	int j;
>  	int ret = 0;
> @@ -1382,7 +1432,7 @@
>  			rmapp = __gfn_to_rmap(gfn_start, j, memslot);
>  
>  			for (; idx <= idx_end; ++idx)
> -				ret |= handler(kvm, rmapp++, memslot, data);
> +				ret |= handler(kvm, rmapp++, memslot, data, false);
>  		}
>  	}
>  
> @@ -1393,7 +1443,8 @@
>  			  unsigned long data,
>  			  int (*handler)(struct kvm *kvm, unsigned long *rmapp,
>  					 struct kvm_memory_slot *slot,
> -					 unsigned long data))
> +					 unsigned long data,
> +					 bool age))
>  {
>  	return kvm_handle_hva_range(kvm, hva, hva + 1, data, handler);
>  }
> @@ -1414,7 +1465,8 @@
>  }
>  
>  static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp,
> -			 struct kvm_memory_slot *slot, unsigned long data)
> +			 struct kvm_memory_slot *slot, unsigned long data,
> +			 bool age)
>  {
>  	u64 *sptep;
>  	struct rmap_iterator uninitialized_var(iter);
> @@ -1429,7 +1481,7 @@
>  	 * out actively used pages or breaking up actively used hugepages.
>  	 */
>  	if (!shadow_accessed_mask) {
> -		young = kvm_unmap_rmapp(kvm, rmapp, slot, data);
> +		young = kvm_unmap_rmapp(kvm, rmapp, slot, data, true);
>  		goto out;
>  	}
>  
> @@ -1450,7 +1502,8 @@
>  }
>  
>  static int kvm_test_age_rmapp(struct kvm *kvm, unsigned long *rmapp,
> -			      struct kvm_memory_slot *slot, unsigned long data)
> +			      struct kvm_memory_slot *slot, unsigned long data,
> +			      bool age)
>  {
>  	u64 *sptep;
>  	struct rmap_iterator iter;
> @@ -1488,7 +1541,7 @@
>  
>  	rmapp = gfn_to_rmap(vcpu->kvm, gfn, sp->role.level);
>  
> -	kvm_unmap_rmapp(vcpu->kvm, rmapp, NULL, 0);
> +	kvm_unmap_rmapp(vcpu->kvm, rmapp, NULL, 0, false);
>  	kvm_flush_remote_tlbs(vcpu->kvm);
>  }
>  
> 
> 

--
			Gleb.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch 2/5] KVM: MMU: allow pinning spte translations (TDP-only)
  2014-06-18 23:12 ` [patch 2/5] KVM: MMU: allow pinning spte translations (TDP-only) mtosatti
@ 2014-06-19  7:21   ` Gleb Natapov
  2014-06-19 19:22     ` Marcelo Tosatti
  2014-06-19  8:01   ` Avi Kivity
  2014-07-02  0:58   ` Nadav Amit
  2 siblings, 1 reply; 27+ messages in thread
From: Gleb Natapov @ 2014-06-19  7:21 UTC (permalink / raw)
  To: mtosatti; +Cc: kvm, ak, pbonzini, xiaoguangrong, avi

On Wed, Jun 18, 2014 at 08:12:05PM -0300, mtosatti@redhat.com wrote:
> Allow vcpus to pin spte translations by:
> 
> 1) Creating a per-vcpu list of pinned ranges.
What if memory slot containing pinned range is going away?

> 2) On mmu reload request:
> 	- Fault ranges.
> 	- Mark sptes with a pinned bit.
Should also be marked "dirty" as per SDM:
 The three DS save area sections should be allocated from a non-paged pool, and marked accessed and dirty

Some comment below.

> 	- Mark shadow pages as pinned.
> 
> 3) Then modify the following actions:
> 	- Page age => skip spte flush.
> 	- MMU notifiers => force mmu reload request (which kicks cpu out of
> 				guest mode).
> 	- GET_DIRTY_LOG => force mmu reload request.
> 	- SLAB shrinker => skip shadow page deletion.
> 
> TDP-only.
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> ---
>  arch/x86/include/asm/kvm_host.h |   14 ++
>  arch/x86/kvm/mmu.c              |  202 ++++++++++++++++++++++++++++++++++++++--
>  arch/x86/kvm/mmu.h              |    5 
>  arch/x86/kvm/mmutrace.h         |   23 ++++
>  arch/x86/kvm/paging_tmpl.h      |    2 
>  arch/x86/kvm/x86.c              |    4 
>  6 files changed, 241 insertions(+), 9 deletions(-)
> 
> Index: kvm.pinned-sptes/arch/x86/include/asm/kvm_host.h
> ===================================================================
> --- kvm.pinned-sptes.orig/arch/x86/include/asm/kvm_host.h	2014-06-18 17:28:17.549456614 -0300
> +++ kvm.pinned-sptes/arch/x86/include/asm/kvm_host.h	2014-06-18 17:28:24.338435658 -0300
> @@ -221,6 +221,8 @@
>  	/* hold the gfn of each spte inside spt */
>  	gfn_t *gfns;
>  	bool unsync;
> +	bool pinned;
> +
>  	int root_count;          /* Currently serving as active root */
>  	unsigned int unsync_children;
>  	unsigned long parent_ptes;	/* Reverse mapping for parent_pte */
> @@ -337,6 +339,14 @@
>  	KVM_DEBUGREG_WONT_EXIT = 2,
>  };
>  
> +struct kvm_pinned_page_range {
> +	gfn_t base_gfn;
> +	unsigned long npages;
> +	struct list_head link;
> +};
> +
> +#define KVM_MAX_PER_VCPU_PINNED_RANGE 10
> +
>  struct kvm_vcpu_arch {
>  	/*
>  	 * rip and regs accesses must go through
> @@ -392,6 +402,10 @@
>  	struct kvm_mmu_memory_cache mmu_page_cache;
>  	struct kvm_mmu_memory_cache mmu_page_header_cache;
>  
> +	struct list_head pinned_mmu_pages;
> +	struct mutex pinned_mmu_mutex;
> +	unsigned int nr_pinned_ranges;
> +
>  	struct fpu guest_fpu;
>  	u64 xcr0;
>  	u64 guest_supported_xcr0;
> Index: kvm.pinned-sptes/arch/x86/kvm/mmu.c
> ===================================================================
> --- kvm.pinned-sptes.orig/arch/x86/kvm/mmu.c	2014-06-18 17:28:17.550456611 -0300
> +++ kvm.pinned-sptes/arch/x86/kvm/mmu.c	2014-06-18 17:28:24.339435654 -0300
> @@ -148,6 +148,9 @@
>  
>  #define SPTE_HOST_WRITEABLE	(1ULL << PT_FIRST_AVAIL_BITS_SHIFT)
>  #define SPTE_MMU_WRITEABLE	(1ULL << (PT_FIRST_AVAIL_BITS_SHIFT + 1))
> +#define SPTE_PINNED		(1ULL << (PT64_SECOND_AVAIL_BITS_SHIFT))
> +
> +#define SPTE_PINNED_BIT PT64_SECOND_AVAIL_BITS_SHIFT
>  
>  #define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
>  
> @@ -327,6 +330,11 @@
>  	return pte & PT_PRESENT_MASK && !is_mmio_spte(pte);
>  }
>  
> +static int is_pinned_spte(u64 spte)
> +{
> +	return spte & SPTE_PINNED && is_shadow_present_pte(spte);
> +}
> +
>  static int is_large_pte(u64 pte)
>  {
>  	return pte & PT_PAGE_SIZE_MASK;
> @@ -2818,7 +2826,7 @@
>   * - false: let the real page fault path to fix it.
>   */
>  static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
> -			    u32 error_code)
> +			    u32 error_code, bool pin)
>  {
>  	struct kvm_shadow_walk_iterator iterator;
>  	struct kvm_mmu_page *sp;
> @@ -2828,6 +2836,9 @@
>  	if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
>  		return false;
>  
> +	if (pin)
> +		return false;
> +
>  	if (!page_fault_can_be_fast(error_code))
>  		return false;
>  
> @@ -2895,9 +2906,55 @@
>  }
>  
>  static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
> -			 gva_t gva, pfn_t *pfn, bool write, bool *writable);
> +			 gva_t gva, pfn_t *pfn, bool write, bool *writable,
> +			 bool pin);
>  static void make_mmu_pages_available(struct kvm_vcpu *vcpu);
>  
> +
> +static int get_sptep_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes[4])
> +{
> +	struct kvm_shadow_walk_iterator iterator;
> +	int nr_sptes = 0;
> +
> +	if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
> +		return nr_sptes;
> +
> +	for_each_shadow_entry(vcpu, addr, iterator) {
> +		sptes[iterator.level-1] = iterator.sptep;
> +		nr_sptes++;
> +		if (!is_shadow_present_pte(*iterator.sptep))
> +			break;
> +	}
> +
> +	return nr_sptes;
> +}
> +
> +static bool direct_pin_sptes(struct kvm_vcpu *vcpu, gfn_t gfn)
> +{
> +	u64 *sptes[4];
> +	int r, i, level;
> +
> +	r = get_sptep_hierarchy(vcpu, gfn << PAGE_SHIFT, sptes);
> +	if (!r)
> +		return false;
> +
> +	level = 5 - r;
> +	if (!is_last_spte(*sptes[r-1], level))
> +		return false;
> +	if (!is_shadow_present_pte(*sptes[r-1]))
> +		return false;
> +
> +	for (i = 0; i < r; i++) {
> +		u64 *sptep = sptes[i];
> +		struct kvm_mmu_page *sp = page_header(__pa(sptep));
> +
> +		sp->pinned = true;
> +		set_bit(SPTE_PINNED_BIT, (unsigned long *)sptep);
> +	}
> +
> +	return true;
> +}
> +
>  static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, u32 error_code,
>  			 gfn_t gfn, bool prefault, bool pin, bool *pinned)
>  {
> @@ -2923,13 +2980,14 @@
>  	} else
>  		level = PT_PAGE_TABLE_LEVEL;
>  
> -	if (fast_page_fault(vcpu, v, level, error_code))
> +	if (fast_page_fault(vcpu, v, level, error_code, pin))
>  		return 0;
>  
>  	mmu_seq = vcpu->kvm->mmu_notifier_seq;
>  	smp_rmb();
>  
> -	if (try_async_pf(vcpu, prefault, gfn, v, &pfn, write, &map_writable))
> +	if (try_async_pf(vcpu, prefault, gfn, v, &pfn, write, &map_writable,
> +			 pin))
>  		return 0;
>  
>  	if (handle_abnormal_pfn(vcpu, v, gfn, pfn, ACC_ALL, &r))
> @@ -2943,6 +3001,8 @@
>  		transparent_hugepage_adjust(vcpu, &gfn, &pfn, &level);
>  	r = __direct_map(vcpu, v, write, map_writable, level, gfn, pfn,
>  			 prefault);
> +	if (pin)
> +		*pinned = direct_pin_sptes(vcpu, gfn);
>  	spin_unlock(&vcpu->kvm->mmu_lock);
>  
>  
> @@ -3349,7 +3409,8 @@
>  }
>  
>  static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
> -			 gva_t gva, pfn_t *pfn, bool write, bool *writable)
> +			 gva_t gva, pfn_t *pfn, bool write, bool *writable,
> +			 bool pin)
>  {
>  	bool async;
>  
> @@ -3358,7 +3419,7 @@
>  	if (!async)
>  		return false; /* *pfn has correct page already */
>  
> -	if (!prefault && can_do_async_pf(vcpu)) {
> +	if (!prefault && !pin && can_do_async_pf(vcpu)) {
>  		trace_kvm_try_async_get_page(gva, gfn);
>  		if (kvm_find_async_pf_gfn(vcpu, gfn)) {
>  			trace_kvm_async_pf_doublefault(gva, gfn);
> @@ -3406,13 +3467,14 @@
>  	} else
>  		level = PT_PAGE_TABLE_LEVEL;
>  
> -	if (fast_page_fault(vcpu, gpa, level, error_code))
> +	if (fast_page_fault(vcpu, gpa, level, error_code, pin))
>  		return 0;
>  
>  	mmu_seq = vcpu->kvm->mmu_notifier_seq;
>  	smp_rmb();
>  
> -	if (try_async_pf(vcpu, prefault, gfn, gpa, &pfn, write, &map_writable))
> +	if (try_async_pf(vcpu, prefault, gfn, gpa, &pfn, write, &map_writable,
> +			 pin))
>  		return 0;
>  
>  	if (handle_abnormal_pfn(vcpu, 0, gfn, pfn, ACC_ALL, &r))
> @@ -3426,6 +3488,8 @@
>  		transparent_hugepage_adjust(vcpu, &gfn, &pfn, &level);
>  	r = __direct_map(vcpu, gpa, write, map_writable,
>  			 level, gfn, pfn, prefault);
> +	if (pin)
> +		*pinned = direct_pin_sptes(vcpu, gfn);
>  	spin_unlock(&vcpu->kvm->mmu_lock);
>  
>  	return r;
> @@ -3903,6 +3967,127 @@
>  }
>  EXPORT_SYMBOL_GPL(kvm_mmu_reset_context);
>  
> +int kvm_mmu_register_pinned_range(struct kvm_vcpu *vcpu,
> +				  gfn_t base_gfn, unsigned long npages)
> +{
> +	struct kvm_pinned_page_range *p;
> +
> +	mutex_lock(&vcpu->arch.pinned_mmu_mutex);
> +	list_for_each_entry(p, &vcpu->arch.pinned_mmu_pages, link) {
> +		if (p->base_gfn == base_gfn && p->npages == npages) {
> +			mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
> +			return -EEXIST;
> +		}
> +	}
> +	mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
> +
> +	if (vcpu->arch.nr_pinned_ranges >=
> +	    KVM_MAX_PER_VCPU_PINNED_RANGE)
> +		return -ENOSPC;
Shouldn't we refuse to register pinned range if !TDP?

> +
> +	p = kzalloc(sizeof(struct kvm_pinned_page_range), GFP_KERNEL);
> +	if (!p)
> +		return -ENOMEM;
> +
> +	vcpu->arch.nr_pinned_ranges++;
> +
> +	trace_kvm_mmu_register_pinned_range(vcpu->vcpu_id, base_gfn, npages);
> +
> +	INIT_LIST_HEAD(&p->link);
> +	p->base_gfn = base_gfn;
> +	p->npages = npages;
> +	mutex_lock(&vcpu->arch.pinned_mmu_mutex);
> +	list_add(&p->link, &vcpu->arch.pinned_mmu_pages);
> +	mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
> +	kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
> +
> +	return 0;
> +}
> +
> +int kvm_mmu_unregister_pinned_range(struct kvm_vcpu *vcpu,
> +				    gfn_t base_gfn, unsigned long npages)
> +{
> +	struct kvm_pinned_page_range *p;
> +
> +	mutex_lock(&vcpu->arch.pinned_mmu_mutex);
> +	list_for_each_entry(p, &vcpu->arch.pinned_mmu_pages, link) {
> +		if (p->base_gfn == base_gfn && p->npages == npages) {
> +			list_del(&p->link);
> +			vcpu->arch.nr_pinned_ranges--;
> +			mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
> +			kfree(p);
> +			return 0;
> +		}
> +	}
> +
> +	mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
> +	return -ENOENT;
> +}
> +
> +void kvm_mmu_free_pinned_ranges(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_pinned_page_range *p, *p2;
> +
> +	mutex_lock(&vcpu->arch.pinned_mmu_mutex);
> +	list_for_each_entry_safe(p, p2, &vcpu->arch.pinned_mmu_pages, link) {
> +		list_del(&p->link);
> +		kfree(p);
> +	}
> +	mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
> +}
> +
> +/*
> + * Pin KVM MMU page translations. This guarantees, for valid
> + * addresses registered by kvm_mmu_register_pinned_range (valid address
> + * meaning address which posses sufficient information for fault to
> + * be resolved), valid translations exist while in guest mode and
> + * therefore no VM-exits due to faults will occur.
> + *
> + * Failure to instantiate pages will abort guest entry.
> + *
> + * Page frames should be pinned with get_page in advance.
> + *
> + * Pinning is not guaranteed while executing as L2 guest.
> + *
> + */
> +
> +static void kvm_mmu_pin_pages(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_pinned_page_range *p;
> +
> +	if (is_guest_mode(vcpu))
> +		return;
> +
> +	if (!vcpu->arch.mmu.direct_map)
> +		return;
> +
> +	ASSERT(VALID_PAGE(vcpu->arch.mmu.root_hpa));
> +
> +	mutex_lock(&vcpu->arch.pinned_mmu_mutex);
> +	list_for_each_entry(p, &vcpu->arch.pinned_mmu_pages, link) {
> +		gfn_t gfn_offset;
> +
> +		for (gfn_offset = 0; gfn_offset < p->npages; gfn_offset++) {
> +			gfn_t gfn = p->base_gfn + gfn_offset;
> +			int r;
> +			bool pinned = false;
> +
> +			r = vcpu->arch.mmu.page_fault(vcpu, gfn << PAGE_SHIFT,
> +						     PFERR_WRITE_MASK, false,
> +						     true, &pinned);
> +			/* MMU notifier sequence window: retry */
> +			if (!r && !pinned)
> +				kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
> +			if (r) {
> +				kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
I do not think triple fault is appropriate here. The reasons for triple fault are
documented in SDM and this is not one of them. What about error exit to user space?

> +				break;
> +			}
> +
> +		}
> +	}
> +	mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
> +}
> +
>  int kvm_mmu_load(struct kvm_vcpu *vcpu)
>  {
>  	int r;
> @@ -3916,6 +4101,7 @@
>  		goto out;
>  	/* set_cr3() should ensure TLB has been flushed */
>  	vcpu->arch.mmu.set_cr3(vcpu, vcpu->arch.mmu.root_hpa);
> +	kvm_mmu_pin_pages(vcpu);
>  out:
>  	return r;
>  }
> Index: kvm.pinned-sptes/arch/x86/kvm/mmu.h
> ===================================================================
> --- kvm.pinned-sptes.orig/arch/x86/kvm/mmu.h	2014-06-18 17:27:47.582549238 -0300
> +++ kvm.pinned-sptes/arch/x86/kvm/mmu.h	2014-06-18 17:28:24.339435654 -0300
> @@ -178,4 +178,9 @@
>  }
>  
>  void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm);
> +int kvm_mmu_register_pinned_range(struct kvm_vcpu *vcpu,
> +				 gfn_t base_gfn, unsigned long npages);
> +int kvm_mmu_unregister_pinned_range(struct kvm_vcpu *vcpu,
> +				 gfn_t base_gfn, unsigned long npages);
> +void kvm_mmu_free_pinned_ranges(struct kvm_vcpu *vcpu);
>  #endif
> Index: kvm.pinned-sptes/arch/x86/kvm/x86.c
> ===================================================================
> --- kvm.pinned-sptes.orig/arch/x86/kvm/x86.c	2014-06-18 17:28:17.552456605 -0300
> +++ kvm.pinned-sptes/arch/x86/kvm/x86.c	2014-06-18 17:28:24.340435651 -0300
> @@ -7049,6 +7049,8 @@
>  
>  	kvm_async_pf_hash_reset(vcpu);
>  	kvm_pmu_init(vcpu);
> +	INIT_LIST_HEAD(&vcpu->arch.pinned_mmu_pages);
> +	mutex_init(&vcpu->arch.pinned_mmu_mutex);
>  
>  	return 0;
>  fail_free_wbinvd_dirty_mask:
> @@ -7069,6 +7071,7 @@
>  {
>  	int idx;
>  
> +	kvm_mmu_free_pinned_ranges(vcpu);
>  	kvm_pmu_destroy(vcpu);
>  	kfree(vcpu->arch.mce_banks);
>  	kvm_free_lapic(vcpu);
> @@ -7113,6 +7116,7 @@
>  	int r;
>  	r = vcpu_load(vcpu);
>  	BUG_ON(r);
> +	kvm_mmu_free_pinned_ranges(vcpu);
>  	kvm_mmu_unload(vcpu);
>  	vcpu_put(vcpu);
>  }
> Index: kvm.pinned-sptes/arch/x86/kvm/paging_tmpl.h
> ===================================================================
> --- kvm.pinned-sptes.orig/arch/x86/kvm/paging_tmpl.h	2014-06-18 17:28:17.550456611 -0300
> +++ kvm.pinned-sptes/arch/x86/kvm/paging_tmpl.h	2014-06-18 17:28:24.340435651 -0300
> @@ -747,7 +747,7 @@
>  	smp_rmb();
>  
>  	if (try_async_pf(vcpu, prefault, walker.gfn, addr, &pfn, write_fault,
> -			 &map_writable))
> +			 &map_writable, false))
>  		return 0;
>  
>  	if (handle_abnormal_pfn(vcpu, mmu_is_nested(vcpu) ? 0 : addr,
> Index: kvm.pinned-sptes/arch/x86/kvm/mmutrace.h
> ===================================================================
> --- kvm.pinned-sptes.orig/arch/x86/kvm/mmutrace.h	2014-06-18 17:27:47.583549234 -0300
> +++ kvm.pinned-sptes/arch/x86/kvm/mmutrace.h	2014-06-18 17:28:24.340435651 -0300
> @@ -322,6 +322,29 @@
>  		  __entry->kvm_gen == __entry->spte_gen
>  	)
>  );
> +
> +TRACE_EVENT(
> +	kvm_mmu_register_pinned_range,
> +	TP_PROTO(unsigned int vcpu_id, gfn_t gfn, unsigned long npages),
> +	TP_ARGS(vcpu_id, gfn, npages),
> +
> +	TP_STRUCT__entry(
> +		__field(	unsigned int,	vcpu_id	)
> +		__field(	gfn_t,		gfn	)
> +		__field(	unsigned long,	npages	)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->vcpu_id		= vcpu_id;
> +		__entry->gfn			= gfn;
> +		__entry->npages			= npages;
> +	),
> +
> +	TP_printk("vcpu_id %u gfn %llx npages %lx",
> +		  __entry->vcpu_id,
> +		  __entry->gfn,
> +		  __entry->npages)
> +);
>  #endif /* _TRACE_KVMMMU_H */
>  
>  #undef TRACE_INCLUDE_PATH
> 
> 

--
			Gleb.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch 2/5] KVM: MMU: allow pinning spte translations (TDP-only)
  2014-06-18 23:12 ` [patch 2/5] KVM: MMU: allow pinning spte translations (TDP-only) mtosatti
  2014-06-19  7:21   ` Gleb Natapov
@ 2014-06-19  8:01   ` Avi Kivity
  2014-06-19 14:06     ` Andi Kleen
  2014-06-19 18:26     ` Marcelo Tosatti
  2014-07-02  0:58   ` Nadav Amit
  2 siblings, 2 replies; 27+ messages in thread
From: Avi Kivity @ 2014-06-19  8:01 UTC (permalink / raw)
  To: mtosatti, kvm, ak; +Cc: pbonzini, xiaoguangrong, gleb


On 06/19/2014 02:12 AM, mtosatti@redhat.com wrote:
> Allow vcpus to pin spte translations by:
>
> 1) Creating a per-vcpu list of pinned ranges.
> 2) On mmu reload request:
> 	- Fault ranges.
> 	- Mark sptes with a pinned bit.
> 	- Mark shadow pages as pinned.
>
> 3) Then modify the following actions:
> 	- Page age => skip spte flush.
> 	- MMU notifiers => force mmu reload request (which kicks cpu out of
> 				guest mode).
> 	- GET_DIRTY_LOG => force mmu reload request.
> 	- SLAB shrinker => skip shadow page deletion.
>
> TDP-only.
>
>   
> +int kvm_mmu_register_pinned_range(struct kvm_vcpu *vcpu,
> +				  gfn_t base_gfn, unsigned long npages)
> +{
> +	struct kvm_pinned_page_range *p;
> +
> +	mutex_lock(&vcpu->arch.pinned_mmu_mutex);
> +	list_for_each_entry(p, &vcpu->arch.pinned_mmu_pages, link) {
> +		if (p->base_gfn == base_gfn && p->npages == npages) {
> +			mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
> +			return -EEXIST;
> +		}
> +	}
> +	mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
> +
> +	if (vcpu->arch.nr_pinned_ranges >=
> +	    KVM_MAX_PER_VCPU_PINNED_RANGE)
> +		return -ENOSPC;
> +
> +	p = kzalloc(sizeof(struct kvm_pinned_page_range), GFP_KERNEL);
> +	if (!p)
> +		return -ENOMEM;
> +
> +	vcpu->arch.nr_pinned_ranges++;
> +
> +	trace_kvm_mmu_register_pinned_range(vcpu->vcpu_id, base_gfn, npages);
> +
> +	INIT_LIST_HEAD(&p->link);
> +	p->base_gfn = base_gfn;
> +	p->npages = npages;
> +	mutex_lock(&vcpu->arch.pinned_mmu_mutex);
> +	list_add(&p->link, &vcpu->arch.pinned_mmu_pages);
> +	mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
> +	kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
> +
> +	return 0;
> +}
> +

What happens if ranges overlap (within a vcpu, cross-vcpu)? Or if a 
range overflows and wraps around 0?  Or if it does not refer to RAM?

Looks like you're limiting the number of ranges, but not the number of 
pages, so a guest can lock all of its memory.

> +
> +/*
> + * Pin KVM MMU page translations. This guarantees, for valid
> + * addresses registered by kvm_mmu_register_pinned_range (valid address
> + * meaning address which posses sufficient information for fault to
> + * be resolved), valid translations exist while in guest mode and
> + * therefore no VM-exits due to faults will occur.
> + *
> + * Failure to instantiate pages will abort guest entry.
> + *
> + * Page frames should be pinned with get_page in advance.
> + *
> + * Pinning is not guaranteed while executing as L2 guest.

Does this undermine security?

> + *
> + */
> +
> +static void kvm_mmu_pin_pages(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_pinned_page_range *p;
> +
> +	if (is_guest_mode(vcpu))
> +		return;
> +
> +	if (!vcpu->arch.mmu.direct_map)
> +		return;
> +
> +	ASSERT(VALID_PAGE(vcpu->arch.mmu.root_hpa));
> +
> +	mutex_lock(&vcpu->arch.pinned_mmu_mutex);

Is the mutex actually needed? It seems it's only taken in vcpu context, 
so the vcpu mutex should be sufficient.

> +	list_for_each_entry(p, &vcpu->arch.pinned_mmu_pages, link) {
> +		gfn_t gfn_offset;
> +
> +		for (gfn_offset = 0; gfn_offset < p->npages; gfn_offset++) {
> +			gfn_t gfn = p->base_gfn + gfn_offset;
> +			int r;
> +			bool pinned = false;
> +
> +			r = vcpu->arch.mmu.page_fault(vcpu, gfn << PAGE_SHIFT,
> +						     PFERR_WRITE_MASK, false,
> +						     true, &pinned);
> +			/* MMU notifier sequence window: retry */
> +			if (!r && !pinned)
> +				kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
> +			if (r) {
> +				kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
> +				break;
> +			}
> +
> +		}
> +	}
> +	mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
> +}
> +
>   int kvm_mmu_load(struct kvm_vcpu *vcpu)
>   {
>   	int r;
> @@ -3916,6 +4101,7 @@
>   		goto out;
>   	/* set_cr3() should ensure TLB has been flushed */
>   	vcpu->arch.mmu.set_cr3(vcpu, vcpu->arch.mmu.root_hpa);
> +	kvm_mmu_pin_pages(vcpu);
>   out:
>   	return r;
>   }
>

I don't see where  you unpin pages, so even if you limit the number of 
pinned pages, a guest can pin all of memory by iterating over all of 
memory and pinning it a chunk at a time.

You might try something similar to guest MTRR handling.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch 4/5] KVM: MMU: reload request from GET_DIRTY_LOG path
  2014-06-18 23:12 ` [patch 4/5] KVM: MMU: reload request from GET_DIRTY_LOG path mtosatti
@ 2014-06-19  8:17   ` Gleb Natapov
  2014-06-19 18:40     ` Marcelo Tosatti
  0 siblings, 1 reply; 27+ messages in thread
From: Gleb Natapov @ 2014-06-19  8:17 UTC (permalink / raw)
  To: mtosatti; +Cc: kvm, ak, pbonzini, xiaoguangrong, avi

On Wed, Jun 18, 2014 at 08:12:07PM -0300, mtosatti@redhat.com wrote:
> Reload remote vcpus MMU from GET_DIRTY_LOG codepath, before
> deleting a pinned spte.
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> ---
>  arch/x86/kvm/mmu.c |    3 +++
>  1 file changed, 3 insertions(+)
> 
> Index: kvm.pinned-sptes/arch/x86/kvm/mmu.c
> ===================================================================
> --- kvm.pinned-sptes.orig/arch/x86/kvm/mmu.c	2014-06-13 16:50:50.040140594 -0300
> +++ kvm.pinned-sptes/arch/x86/kvm/mmu.c	2014-06-13 16:51:05.620104451 -0300
> @@ -1247,6 +1247,9 @@
>  		spte &= ~SPTE_MMU_WRITEABLE;
>  	spte = spte & ~PT_WRITABLE_MASK;
>  
> +	if (is_pinned_spte(spte))
> +		mmu_reload_pinned_vcpus(kvm);
> +
Why write protect it at all? mmu_reload_pinned_vcpus() will unprotected it anyway
on the next vmentry. Isn't it better to just report all pinned pages as dirty alway.

>  	return mmu_spte_update(sptep, spte);
>  }
>  
> 
> 

--
			Gleb.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch 2/5] KVM: MMU: allow pinning spte translations (TDP-only)
  2014-06-19  8:01   ` Avi Kivity
@ 2014-06-19 14:06     ` Andi Kleen
  2014-06-19 18:26     ` Marcelo Tosatti
  1 sibling, 0 replies; 27+ messages in thread
From: Andi Kleen @ 2014-06-19 14:06 UTC (permalink / raw)
  To: Avi Kivity; +Cc: mtosatti, kvm, pbonzini, xiaoguangrong, gleb

> >+ * Failure to instantiate pages will abort guest entry.
> >+ *
> >+ * Page frames should be pinned with get_page in advance.
> >+ *
> >+ * Pinning is not guaranteed while executing as L2 guest.
> 
> Does this undermine security?

It should not. In the worst case it'll randomly lose PEBS records.

-Andi

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch 2/5] KVM: MMU: allow pinning spte translations (TDP-only)
  2014-06-19  8:01   ` Avi Kivity
  2014-06-19 14:06     ` Andi Kleen
@ 2014-06-19 18:26     ` Marcelo Tosatti
  2014-06-22 13:35       ` Avi Kivity
  1 sibling, 1 reply; 27+ messages in thread
From: Marcelo Tosatti @ 2014-06-19 18:26 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, ak, pbonzini, xiaoguangrong, gleb

On Thu, Jun 19, 2014 at 11:01:06AM +0300, Avi Kivity wrote:
> 
> On 06/19/2014 02:12 AM, mtosatti@redhat.com wrote:
> >Allow vcpus to pin spte translations by:
> >
> >1) Creating a per-vcpu list of pinned ranges.
> >2) On mmu reload request:
> >	- Fault ranges.
> >	- Mark sptes with a pinned bit.
> >	- Mark shadow pages as pinned.
> >
> >3) Then modify the following actions:
> >	- Page age => skip spte flush.
> >	- MMU notifiers => force mmu reload request (which kicks cpu out of
> >				guest mode).
> >	- GET_DIRTY_LOG => force mmu reload request.
> >	- SLAB shrinker => skip shadow page deletion.
> >
> >TDP-only.
> >
> >+int kvm_mmu_register_pinned_range(struct kvm_vcpu *vcpu,
> >+				  gfn_t base_gfn, unsigned long npages)
> >+{
> >+	struct kvm_pinned_page_range *p;
> >+
> >+	mutex_lock(&vcpu->arch.pinned_mmu_mutex);
> >+	list_for_each_entry(p, &vcpu->arch.pinned_mmu_pages, link) {
> >+		if (p->base_gfn == base_gfn && p->npages == npages) {
> >+			mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
> >+			return -EEXIST;
> >+		}
> >+	}
> >+	mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
> >+
> >+	if (vcpu->arch.nr_pinned_ranges >=
> >+	    KVM_MAX_PER_VCPU_PINNED_RANGE)
> >+		return -ENOSPC;
> >+
> >+	p = kzalloc(sizeof(struct kvm_pinned_page_range), GFP_KERNEL);
> >+	if (!p)
> >+		return -ENOMEM;
> >+
> >+	vcpu->arch.nr_pinned_ranges++;
> >+
> >+	trace_kvm_mmu_register_pinned_range(vcpu->vcpu_id, base_gfn, npages);
> >+
> >+	INIT_LIST_HEAD(&p->link);
> >+	p->base_gfn = base_gfn;
> >+	p->npages = npages;
> >+	mutex_lock(&vcpu->arch.pinned_mmu_mutex);
> >+	list_add(&p->link, &vcpu->arch.pinned_mmu_pages);
> >+	mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
> >+	kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
> >+
> >+	return 0;
> >+}
> >+
> 
> What happens if ranges overlap (within a vcpu, cross-vcpu)?

The page(s) are faulted multiple times if ranges overlap within a vcpu.

I see no reason to disallow overlapping ranges. Do you?

> Or if a range overflows and wraps around 0? 

Pagefault fails on vm-entry -> KVM_REQ_TRIPLE_FAULT.

Will double check for overflows to make sure.

> Or if it does not refer to RAM?

User should have pinned the page(s) before with gfn_to_page / get_page,
which ensures it is guest RAM ? (hum, although it might be good to 
double check here as well).

> Looks like you're limiting the number of ranges, but not the number
> of pages, so a guest can lock all of its memory.

Yes. The page pinning at get_page time can also lock all of
guest memory.

> >+
> >+/*
> >+ * Pin KVM MMU page translations. This guarantees, for valid
> >+ * addresses registered by kvm_mmu_register_pinned_range (valid address
> >+ * meaning address which posses sufficient information for fault to
> >+ * be resolved), valid translations exist while in guest mode and
> >+ * therefore no VM-exits due to faults will occur.
> >+ *
> >+ * Failure to instantiate pages will abort guest entry.
> >+ *
> >+ * Page frames should be pinned with get_page in advance.
> >+ *
> >+ * Pinning is not guaranteed while executing as L2 guest.
> 
> Does this undermine security?

PEBS writes should not be enabled when L2 guest is executing.

> >+static void kvm_mmu_pin_pages(struct kvm_vcpu *vcpu)
> >+{
> >+	struct kvm_pinned_page_range *p;
> >+
> >+	if (is_guest_mode(vcpu))
> >+		return;
> >+
> >+	if (!vcpu->arch.mmu.direct_map)
> >+		return;
> >+
> >+	ASSERT(VALID_PAGE(vcpu->arch.mmu.root_hpa));
> >+
> >+	mutex_lock(&vcpu->arch.pinned_mmu_mutex);
> 
> Is the mutex actually needed? It seems it's only taken in vcpu
> context, so the vcpu mutex should be sufficient.

Right. Actually the list_empty() access from kicker function might be unsafe.
Will double check.

> >+	list_for_each_entry(p, &vcpu->arch.pinned_mmu_pages, link) {
> >+		gfn_t gfn_offset;
> >+
> >+		for (gfn_offset = 0; gfn_offset < p->npages; gfn_offset++) {
> >+			gfn_t gfn = p->base_gfn + gfn_offset;
> >+			int r;
> >+			bool pinned = false;
> >+
> >+			r = vcpu->arch.mmu.page_fault(vcpu, gfn << PAGE_SHIFT,
> >+						     PFERR_WRITE_MASK, false,
> >+						     true, &pinned);
> >+			/* MMU notifier sequence window: retry */
> >+			if (!r && !pinned)
> >+				kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
> >+			if (r) {
> >+				kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
> >+				break;
> >+			}
> >+
> >+		}
> >+	}
> >+	mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
> >+}
> >+
> >  int kvm_mmu_load(struct kvm_vcpu *vcpu)
> >  {
> >  	int r;
> >@@ -3916,6 +4101,7 @@
> >  		goto out;
> >  	/* set_cr3() should ensure TLB has been flushed */
> >  	vcpu->arch.mmu.set_cr3(vcpu, vcpu->arch.mmu.root_hpa);
> >+	kvm_mmu_pin_pages(vcpu);
> >  out:
> >  	return r;
> >  }
> >
> 
> I don't see where  you unpin pages, so even if you limit the number
> of pinned pages, a guest can pin all of memory by iterating over all
> of memory and pinning it a chunk at a time.

The caller should be responsible for limiting number of pages pinned it
is pinning the struct pages?

And in that case, should remove any limiting from this interface, as
that is confusing.

> You might try something similar to guest MTRR handling.

Please be more verbose.



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch 3/5] KVM: MMU: notifiers support for pinned sptes
  2014-06-19  6:48   ` Gleb Natapov
@ 2014-06-19 18:28     ` Marcelo Tosatti
  2014-06-20 10:11       ` Gleb Natapov
  0 siblings, 1 reply; 27+ messages in thread
From: Marcelo Tosatti @ 2014-06-19 18:28 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, ak, pbonzini, xiaoguangrong, avi

On Thu, Jun 19, 2014 at 09:48:50AM +0300, Gleb Natapov wrote:
> On Wed, Jun 18, 2014 at 08:12:06PM -0300, mtosatti@redhat.com wrote:
> > Request KVM_REQ_MMU_RELOAD when deleting sptes from MMU notifiers.
> > 
> > Keep pinned sptes intact if page aging.
> > 
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > 
> > ---
> >  arch/x86/kvm/mmu.c |   71 ++++++++++++++++++++++++++++++++++++++++++++++-------
> >  1 file changed, 62 insertions(+), 9 deletions(-)
> > 
> > Index: kvm.pinned-sptes/arch/x86/kvm/mmu.c
> > ===================================================================
> > --- kvm.pinned-sptes.orig/arch/x86/kvm/mmu.c	2014-06-18 17:28:24.339435654 -0300
> > +++ kvm.pinned-sptes/arch/x86/kvm/mmu.c	2014-06-18 17:29:32.510225755 -0300
> > @@ -1184,6 +1184,42 @@
> >  		kvm_flush_remote_tlbs(vcpu->kvm);
> >  }
> >  
> > +static void ack_flush(void *_completed)
> > +{
> > +}
> > +
> > +static void mmu_reload_pinned_vcpus(struct kvm *kvm)
> > +{
> > +	int i, cpu, me;
> > +	cpumask_var_t cpus;
> > +	struct kvm_vcpu *vcpu;
> > +	unsigned int req = KVM_REQ_MMU_RELOAD;
> > +
> > +	zalloc_cpumask_var(&cpus, GFP_ATOMIC);
> > +
> > +	me = get_cpu();
> > +	kvm_for_each_vcpu(i, vcpu, kvm) {
> > +		if (list_empty(&vcpu->arch.pinned_mmu_pages))
> > +			continue;
> > +		kvm_make_request(req, vcpu);
> > +		cpu = vcpu->cpu;
> > +
> > +		/* Set ->requests bit before we read ->mode */
> > +		smp_mb();
> > +
> > +		if (cpus != NULL && cpu != -1 && cpu != me &&
> > +		      kvm_vcpu_exiting_guest_mode(vcpu) != OUTSIDE_GUEST_MODE)
> > +			cpumask_set_cpu(cpu, cpus);
> > +	}
> > +	if (unlikely(cpus == NULL))
> > +		smp_call_function_many(cpu_online_mask, ack_flush, NULL, 1);
> > +	else if (!cpumask_empty(cpus))
> > +		smp_call_function_many(cpus, ack_flush, NULL, 1);
> > +	put_cpu();
> > +	free_cpumask_var(cpus);
> > +	return;
> > +}
> This is a c&p of make_all_cpus_request(), the only difference is checking
> of vcpu->arch.pinned_mmu_pages.  You can add make_some_cpus_request(..., bool (*predicate)(struct kvm_vcpu *))
> to kvm_main.c and rewrite make_all_cpus_request() to use it instead.

Half-way through it i decided it was better to c&p.

Can change make_all_cpus_request() though if it makes more sense to you.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch 4/5] KVM: MMU: reload request from GET_DIRTY_LOG path
  2014-06-19  8:17   ` Gleb Natapov
@ 2014-06-19 18:40     ` Marcelo Tosatti
  2014-06-20 10:46       ` Gleb Natapov
  0 siblings, 1 reply; 27+ messages in thread
From: Marcelo Tosatti @ 2014-06-19 18:40 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, ak, pbonzini, xiaoguangrong, avi

On Thu, Jun 19, 2014 at 11:17:19AM +0300, Gleb Natapov wrote:
> On Wed, Jun 18, 2014 at 08:12:07PM -0300, mtosatti@redhat.com wrote:
> > Reload remote vcpus MMU from GET_DIRTY_LOG codepath, before
> > deleting a pinned spte.
> > 
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > 
> > ---
> >  arch/x86/kvm/mmu.c |    3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > Index: kvm.pinned-sptes/arch/x86/kvm/mmu.c
> > ===================================================================
> > --- kvm.pinned-sptes.orig/arch/x86/kvm/mmu.c	2014-06-13 16:50:50.040140594 -0300
> > +++ kvm.pinned-sptes/arch/x86/kvm/mmu.c	2014-06-13 16:51:05.620104451 -0300
> > @@ -1247,6 +1247,9 @@
> >  		spte &= ~SPTE_MMU_WRITEABLE;
> >  	spte = spte & ~PT_WRITABLE_MASK;
> >  
> > +	if (is_pinned_spte(spte))
> > +		mmu_reload_pinned_vcpus(kvm);
> > +
> Why write protect it at all? mmu_reload_pinned_vcpus() will unprotected it anyway
> on the next vmentry. Isn't it better to just report all pinned pages as dirty alway.

That was the initial plan, however its awkward to stop vcpus, execute
get_dirty_log twice, and have pages marked as dirty on the second
execution.

That is, it is in "incorrect" to report pages as dirty when they are
clean.

Moreover, if the number of pinned pages is larger than the dirty
threshold to stop VM and migrate, you'll never migrate. If vcpus are
in HLT and don't VM-enter immediately, the pages should not be refaulted
right away.

Do you think the optimization is worthwhile ?


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch 2/5] KVM: MMU: allow pinning spte translations (TDP-only)
  2014-06-19  7:21   ` Gleb Natapov
@ 2014-06-19 19:22     ` Marcelo Tosatti
  2014-06-20 10:09       ` Gleb Natapov
  0 siblings, 1 reply; 27+ messages in thread
From: Marcelo Tosatti @ 2014-06-19 19:22 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, ak, pbonzini, xiaoguangrong, avi

On Thu, Jun 19, 2014 at 10:21:16AM +0300, Gleb Natapov wrote:
> On Wed, Jun 18, 2014 at 08:12:05PM -0300, mtosatti@redhat.com wrote:
> > Allow vcpus to pin spte translations by:
> > 
> > 1) Creating a per-vcpu list of pinned ranges.
> What if memory slot containing pinned range is going away?

->page_fault() should fail and guest abort. Will double check.

> > 2) On mmu reload request:
> > 	- Fault ranges.
> > 	- Mark sptes with a pinned bit.
> Should also be marked "dirty" as per SDM:
>  The three DS save area sections should be allocated from a non-paged pool, and marked accessed and dirty

This (SDM text) is about guest pagetable AFAICS.

> > +	mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
> > +
> > +	if (vcpu->arch.nr_pinned_ranges >=
> > +	    KVM_MAX_PER_VCPU_PINNED_RANGE)
> > +		return -ENOSPC;
> Shouldn't we refuse to register pinned range if !TDP?

Sure.

> > +			/* MMU notifier sequence window: retry */
> > +			if (!r && !pinned)
> > +				kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
> > +			if (r) {
> > +				kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
> I do not think triple fault is appropriate here. The reasons for triple fault are
> documented in SDM and this is not one of them. What about error exit to user space?

Agree, will change.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch 2/5] KVM: MMU: allow pinning spte translations (TDP-only)
  2014-06-19 19:22     ` Marcelo Tosatti
@ 2014-06-20 10:09       ` Gleb Natapov
  2014-06-30 20:46         ` Marcelo Tosatti
  0 siblings, 1 reply; 27+ messages in thread
From: Gleb Natapov @ 2014-06-20 10:09 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, ak, pbonzini, xiaoguangrong, avi

On Thu, Jun 19, 2014 at 04:22:57PM -0300, Marcelo Tosatti wrote:
> On Thu, Jun 19, 2014 at 10:21:16AM +0300, Gleb Natapov wrote:
> > On Wed, Jun 18, 2014 at 08:12:05PM -0300, mtosatti@redhat.com wrote:
> > > Allow vcpus to pin spte translations by:
> > > 
> > > 1) Creating a per-vcpu list of pinned ranges.
> > What if memory slot containing pinned range is going away?
> 
> ->page_fault() should fail and guest abort. Will double check.
> 
> > > 2) On mmu reload request:
> > > 	- Fault ranges.
> > > 	- Mark sptes with a pinned bit.
> > Should also be marked "dirty" as per SDM:
> >  The three DS save area sections should be allocated from a non-paged pool, and marked accessed and dirty
> 
> This (SDM text) is about guest pagetable AFAICS.
> 
Its hard to say. SDM does not mention virtualization or two dimensional
paging in that section at all. My reading is that this section talks about
all translations that CPU should perform to get to the physical address,
otherwise why are we trying hard to make sure that EPT translations are
always present? Because the same paragraph say in the next sentence:

 It is the responsibility of the operating system to keep the pages that
 contain the buffer present and to mark them accessed and dirty

So it we take from it that translation should be present the same goes for
accessed and dirty. If Andi can clarify this within Intel it would be great.

--
			Gleb.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch 3/5] KVM: MMU: notifiers support for pinned sptes
  2014-06-19 18:28     ` Marcelo Tosatti
@ 2014-06-20 10:11       ` Gleb Natapov
  0 siblings, 0 replies; 27+ messages in thread
From: Gleb Natapov @ 2014-06-20 10:11 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, ak, pbonzini, xiaoguangrong, avi

On Thu, Jun 19, 2014 at 03:28:25PM -0300, Marcelo Tosatti wrote:
> On Thu, Jun 19, 2014 at 09:48:50AM +0300, Gleb Natapov wrote:
> > On Wed, Jun 18, 2014 at 08:12:06PM -0300, mtosatti@redhat.com wrote:
> > > Request KVM_REQ_MMU_RELOAD when deleting sptes from MMU notifiers.
> > > 
> > > Keep pinned sptes intact if page aging.
> > > 
> > > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > > 
> > > ---
> > >  arch/x86/kvm/mmu.c |   71 ++++++++++++++++++++++++++++++++++++++++++++++-------
> > >  1 file changed, 62 insertions(+), 9 deletions(-)
> > > 
> > > Index: kvm.pinned-sptes/arch/x86/kvm/mmu.c
> > > ===================================================================
> > > --- kvm.pinned-sptes.orig/arch/x86/kvm/mmu.c	2014-06-18 17:28:24.339435654 -0300
> > > +++ kvm.pinned-sptes/arch/x86/kvm/mmu.c	2014-06-18 17:29:32.510225755 -0300
> > > @@ -1184,6 +1184,42 @@
> > >  		kvm_flush_remote_tlbs(vcpu->kvm);
> > >  }
> > >  
> > > +static void ack_flush(void *_completed)
> > > +{
> > > +}
> > > +
> > > +static void mmu_reload_pinned_vcpus(struct kvm *kvm)
> > > +{
> > > +	int i, cpu, me;
> > > +	cpumask_var_t cpus;
> > > +	struct kvm_vcpu *vcpu;
> > > +	unsigned int req = KVM_REQ_MMU_RELOAD;
> > > +
> > > +	zalloc_cpumask_var(&cpus, GFP_ATOMIC);
> > > +
> > > +	me = get_cpu();
> > > +	kvm_for_each_vcpu(i, vcpu, kvm) {
> > > +		if (list_empty(&vcpu->arch.pinned_mmu_pages))
> > > +			continue;
> > > +		kvm_make_request(req, vcpu);
> > > +		cpu = vcpu->cpu;
> > > +
> > > +		/* Set ->requests bit before we read ->mode */
> > > +		smp_mb();
> > > +
> > > +		if (cpus != NULL && cpu != -1 && cpu != me &&
> > > +		      kvm_vcpu_exiting_guest_mode(vcpu) != OUTSIDE_GUEST_MODE)
> > > +			cpumask_set_cpu(cpu, cpus);
> > > +	}
> > > +	if (unlikely(cpus == NULL))
> > > +		smp_call_function_many(cpu_online_mask, ack_flush, NULL, 1);
> > > +	else if (!cpumask_empty(cpus))
> > > +		smp_call_function_many(cpus, ack_flush, NULL, 1);
> > > +	put_cpu();
> > > +	free_cpumask_var(cpus);
> > > +	return;
> > > +}
> > This is a c&p of make_all_cpus_request(), the only difference is checking
> > of vcpu->arch.pinned_mmu_pages.  You can add make_some_cpus_request(..., bool (*predicate)(struct kvm_vcpu *))
> > to kvm_main.c and rewrite make_all_cpus_request() to use it instead.
> 
> Half-way through it i decided it was better to c&p.
> 
> Can change make_all_cpus_request() though if it makes more sense to you.
> 
If I haven't missed anything and checking of pinned_mmu_pages is indeed the
only difference, then yes, reusing make_all_cpus_request() makes more sense.

--
			Gleb.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch 4/5] KVM: MMU: reload request from GET_DIRTY_LOG path
  2014-06-19 18:40     ` Marcelo Tosatti
@ 2014-06-20 10:46       ` Gleb Natapov
  2014-06-30 20:59         ` Marcelo Tosatti
  0 siblings, 1 reply; 27+ messages in thread
From: Gleb Natapov @ 2014-06-20 10:46 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, ak, pbonzini, xiaoguangrong, avi

On Thu, Jun 19, 2014 at 03:40:31PM -0300, Marcelo Tosatti wrote:
> On Thu, Jun 19, 2014 at 11:17:19AM +0300, Gleb Natapov wrote:
> > On Wed, Jun 18, 2014 at 08:12:07PM -0300, mtosatti@redhat.com wrote:
> > > Reload remote vcpus MMU from GET_DIRTY_LOG codepath, before
> > > deleting a pinned spte.
> > > 
> > > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > > 
> > > ---
> > >  arch/x86/kvm/mmu.c |    3 +++
> > >  1 file changed, 3 insertions(+)
> > > 
> > > Index: kvm.pinned-sptes/arch/x86/kvm/mmu.c
> > > ===================================================================
> > > --- kvm.pinned-sptes.orig/arch/x86/kvm/mmu.c	2014-06-13 16:50:50.040140594 -0300
> > > +++ kvm.pinned-sptes/arch/x86/kvm/mmu.c	2014-06-13 16:51:05.620104451 -0300
> > > @@ -1247,6 +1247,9 @@
> > >  		spte &= ~SPTE_MMU_WRITEABLE;
> > >  	spte = spte & ~PT_WRITABLE_MASK;
> > >  
> > > +	if (is_pinned_spte(spte))
> > > +		mmu_reload_pinned_vcpus(kvm);
> > > +
> > Why write protect it at all? mmu_reload_pinned_vcpus() will unprotected it anyway
> > on the next vmentry. Isn't it better to just report all pinned pages as dirty alway.
> 
> That was the initial plan, however its awkward to stop vcpus, execute
> get_dirty_log twice, and have pages marked as dirty on the second
> execution.
Indeed, but I think it may happen today with vhost (or even other devices
that emulate dma), so userspace should be able to deal with it already.

> 
> That is, it is in "incorrect" to report pages as dirty when they are
> clean.
In case of PEBS area we cannot know. CPU writes there directly without
KVM even knowing about it so the only sane thing is to assume that after
a vmentry PEBS area is dirty. This is what happens with this patch BTW,
mmu_reload_pinned_vcpus() will force ->page_fault(FERR_WRITE) which will
mark all pinned pages as dirty even if pages are never written to. You
can achieve the same by having vcpu->pinned_page_dirty which is set to
true on each vmentry and to false on each GET_DIRTY_LOG.

> 
> Moreover, if the number of pinned pages is larger than the dirty
> threshold to stop VM and migrate, you'll never migrate. If vcpus are
> in HLT and don't VM-enter immediately, the pages should not be refaulted
> right away.
We should not allow that many pinned pages for security reasons. And
having a lot of page to fault in on a next vmentry after each
GET_DIRTY_LOG will slow down a guest during migration considerably.

> 
> Do you think the optimization is worthwhile ?
> 
I think it's simpler, but even if we will go with your current approach
it should be improved: there is no point sending IPI to all vcpus in
spte_write_protect() like you do here since the IPI will be send anyway at
the end of write protect because of ptes writable->nonwritable transition,
so all you have to do here is to set KVM_REQ_MMU_RELOAD, not need for IPI.

In fact this makes me thinking that write protecting pinned page here is
incorrect because old translation may not be in TLB and if CPU will try to
write PEBS entry after pte is write protected but before MMU is reloaded
it will encounter non writable pte and god knows what will happen, SDM
does not tell. So spte_write_protect() should be something like that IMO:

static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool pt_protect)
{
    u64 spte = *sptep;

    if (!is_writable_pte(spte) &&
          !(pt_protect && spte_is_locklessly_modifiable(spte)))
        return false;

    if (!pt_protect && !is_pinned_spte(spte)) {
        for_each_vcpu()
             kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
        return true;
    }

    rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);

    if (pt_protect)
        spte &= ~SPTE_MMU_WRITEABLE;

    spte = spte & ~PT_WRITABLE_MASK;

    return mmu_spte_update(sptep, spte);
}



--
			Gleb.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch 2/5] KVM: MMU: allow pinning spte translations (TDP-only)
  2014-06-19 18:26     ` Marcelo Tosatti
@ 2014-06-22 13:35       ` Avi Kivity
  2014-07-09 13:25         ` Marcelo Tosatti
  0 siblings, 1 reply; 27+ messages in thread
From: Avi Kivity @ 2014-06-22 13:35 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, ak, pbonzini, xiaoguangrong, gleb


On 06/19/2014 09:26 PM, Marcelo Tosatti wrote:
> On Thu, Jun 19, 2014 at 11:01:06AM +0300, Avi Kivity wrote:
>> On 06/19/2014 02:12 AM, mtosatti@redhat.com wrote:
>>> Allow vcpus to pin spte translations by:
>>>
>>> 1) Creating a per-vcpu list of pinned ranges.
>>> 2) On mmu reload request:
>>> 	- Fault ranges.
>>> 	- Mark sptes with a pinned bit.
>>> 	- Mark shadow pages as pinned.
>>>
>>> 3) Then modify the following actions:
>>> 	- Page age => skip spte flush.
>>> 	- MMU notifiers => force mmu reload request (which kicks cpu out of
>>> 				guest mode).
>>> 	- GET_DIRTY_LOG => force mmu reload request.
>>> 	- SLAB shrinker => skip shadow page deletion.
>>>
>>> TDP-only.
>>>
>>> +int kvm_mmu_register_pinned_range(struct kvm_vcpu *vcpu,
>>> +				  gfn_t base_gfn, unsigned long npages)
>>> +{
>>> +	struct kvm_pinned_page_range *p;
>>> +
>>> +	mutex_lock(&vcpu->arch.pinned_mmu_mutex);
>>> +	list_for_each_entry(p, &vcpu->arch.pinned_mmu_pages, link) {
>>> +		if (p->base_gfn == base_gfn && p->npages == npages) {
>>> +			mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
>>> +			return -EEXIST;
>>> +		}
>>> +	}
>>> +	mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
>>> +
>>> +	if (vcpu->arch.nr_pinned_ranges >=
>>> +	    KVM_MAX_PER_VCPU_PINNED_RANGE)
>>> +		return -ENOSPC;
>>> +
>>> +	p = kzalloc(sizeof(struct kvm_pinned_page_range), GFP_KERNEL);
>>> +	if (!p)
>>> +		return -ENOMEM;
>>> +
>>> +	vcpu->arch.nr_pinned_ranges++;
>>> +
>>> +	trace_kvm_mmu_register_pinned_range(vcpu->vcpu_id, base_gfn, npages);
>>> +
>>> +	INIT_LIST_HEAD(&p->link);
>>> +	p->base_gfn = base_gfn;
>>> +	p->npages = npages;
>>> +	mutex_lock(&vcpu->arch.pinned_mmu_mutex);
>>> +	list_add(&p->link, &vcpu->arch.pinned_mmu_pages);
>>> +	mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
>>> +	kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
>>> +
>>> +	return 0;
>>> +}
>>> +
>> What happens if ranges overlap (within a vcpu, cross-vcpu)?
> The page(s) are faulted multiple times if ranges overlap within a vcpu.
>
> I see no reason to disallow overlapping ranges. Do you?

Not really.  Just making sure nothing horrible happens.

>
>> Or if a range overflows and wraps around 0?
> Pagefault fails on vm-entry -> KVM_REQ_TRIPLE_FAULT.
>
> Will double check for overflows to make sure.

Will the loop terminate?

>> Looks like you're limiting the number of ranges, but not the number
>> of pages, so a guest can lock all of its memory.
> Yes. The page pinning at get_page time can also lock all of
> guest memory.

I'm sure that can't be good.  Maybe subject this pinning to the task 
mlock limit.

>
>>> +
>>> +/*
>>> + * Pin KVM MMU page translations. This guarantees, for valid
>>> + * addresses registered by kvm_mmu_register_pinned_range (valid address
>>> + * meaning address which posses sufficient information for fault to
>>> + * be resolved), valid translations exist while in guest mode and
>>> + * therefore no VM-exits due to faults will occur.
>>> + *
>>> + * Failure to instantiate pages will abort guest entry.
>>> + *
>>> + * Page frames should be pinned with get_page in advance.
>>> + *
>>> + * Pinning is not guaranteed while executing as L2 guest.
>> Does this undermine security?
> PEBS writes should not be enabled when L2 guest is executing.

What prevents L1 for setting up PEBS MSRs for L2?

>>> +	list_for_each_entry(p, &vcpu->arch.pinned_mmu_pages, link) {
>>> +		gfn_t gfn_offset;
>>> +
>>> +		for (gfn_offset = 0; gfn_offset < p->npages; gfn_offset++) {
>>> +			gfn_t gfn = p->base_gfn + gfn_offset;
>>> +			int r;
>>> +			bool pinned = false;
>>> +
>>> +			r = vcpu->arch.mmu.page_fault(vcpu, gfn << PAGE_SHIFT,
>>> +						     PFERR_WRITE_MASK, false,
>>> +						     true, &pinned);
>>> +			/* MMU notifier sequence window: retry */
>>> +			if (!r && !pinned)
>>> +				kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
>>> +			if (r) {
>>> +				kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
>>> +				break;
>>> +			}
>>> +
>>> +		}
>>> +	}
>>> +	mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
>>> +}
>>> +
>>>   int kvm_mmu_load(struct kvm_vcpu *vcpu)
>>>   {
>>>   	int r;
>>> @@ -3916,6 +4101,7 @@
>>>   		goto out;
>>>   	/* set_cr3() should ensure TLB has been flushed */
>>>   	vcpu->arch.mmu.set_cr3(vcpu, vcpu->arch.mmu.root_hpa);
>>> +	kvm_mmu_pin_pages(vcpu);
>>>   out:
>>>   	return r;
>>>   }
>>>
>> I don't see where  you unpin pages, so even if you limit the number
>> of pinned pages, a guest can pin all of memory by iterating over all
>> of memory and pinning it a chunk at a time.
> The caller should be responsible for limiting number of pages pinned it
> is pinning the struct pages?

The caller would be the debug store data are MSR callbacks. How would 
they know what the limit it?

>
> And in that case, should remove any limiting from this interface, as
> that is confusing.
>
>> You might try something similar to guest MTRR handling.
> Please be more verbose.
>

mtrr_state already provides physical range attributes that are looked up 
on every fault, so I thought you could get the pinned attribute from 
there. But I guess that's too late, you want to pre-fault eveything.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch 2/5] KVM: MMU: allow pinning spte translations (TDP-only)
  2014-06-20 10:09       ` Gleb Natapov
@ 2014-06-30 20:46         ` Marcelo Tosatti
  2014-06-30 22:00           ` Andi Kleen
  0 siblings, 1 reply; 27+ messages in thread
From: Marcelo Tosatti @ 2014-06-30 20:46 UTC (permalink / raw)
  To: Gleb Natapov, ak; +Cc: kvm, ak, pbonzini, xiaoguangrong, avi

On Fri, Jun 20, 2014 at 01:09:12PM +0300, Gleb Natapov wrote:
> On Thu, Jun 19, 2014 at 04:22:57PM -0300, Marcelo Tosatti wrote:
> > On Thu, Jun 19, 2014 at 10:21:16AM +0300, Gleb Natapov wrote:
> > > On Wed, Jun 18, 2014 at 08:12:05PM -0300, mtosatti@redhat.com wrote:
> > > > Allow vcpus to pin spte translations by:
> > > > 
> > > > 1) Creating a per-vcpu list of pinned ranges.
> > > What if memory slot containing pinned range is going away?
> > 
> > ->page_fault() should fail and guest abort. Will double check.
> > 
> > > > 2) On mmu reload request:
> > > > 	- Fault ranges.
> > > > 	- Mark sptes with a pinned bit.
> > > Should also be marked "dirty" as per SDM:
> > >  The three DS save area sections should be allocated from a non-paged pool, and marked accessed and dirty
> > 
> > This (SDM text) is about guest pagetable AFAICS.
> > 
> Its hard to say. SDM does not mention virtualization or two dimensional
> paging in that section at all. My reading is that this section talks about
> all translations that CPU should perform to get to the physical address,
> otherwise why are we trying hard to make sure that EPT translations are
> always present? Because the same paragraph say in the next sentence:
> 
>  It is the responsibility of the operating system to keep the pages that
>  contain the buffer present and to mark them accessed and dirty
> 
> So it we take from it that translation should be present the same goes for
> accessed and dirty. If Andi can clarify this within Intel it would be great.

Andi?


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch 4/5] KVM: MMU: reload request from GET_DIRTY_LOG path
  2014-06-20 10:46       ` Gleb Natapov
@ 2014-06-30 20:59         ` Marcelo Tosatti
  2014-07-01  6:27           ` Gleb Natapov
  0 siblings, 1 reply; 27+ messages in thread
From: Marcelo Tosatti @ 2014-06-30 20:59 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, ak, pbonzini, xiaoguangrong, avi

On Fri, Jun 20, 2014 at 01:46:10PM +0300, Gleb Natapov wrote:
> On Thu, Jun 19, 2014 at 03:40:31PM -0300, Marcelo Tosatti wrote:
> > On Thu, Jun 19, 2014 at 11:17:19AM +0300, Gleb Natapov wrote:
> > > On Wed, Jun 18, 2014 at 08:12:07PM -0300, mtosatti@redhat.com wrote:
> > > > Reload remote vcpus MMU from GET_DIRTY_LOG codepath, before
> > > > deleting a pinned spte.
> > > > 
> > > > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > > > 
> > > > ---
> > > >  arch/x86/kvm/mmu.c |    3 +++
> > > >  1 file changed, 3 insertions(+)
> > > > 
> > > > Index: kvm.pinned-sptes/arch/x86/kvm/mmu.c
> > > > ===================================================================
> > > > --- kvm.pinned-sptes.orig/arch/x86/kvm/mmu.c	2014-06-13 16:50:50.040140594 -0300
> > > > +++ kvm.pinned-sptes/arch/x86/kvm/mmu.c	2014-06-13 16:51:05.620104451 -0300
> > > > @@ -1247,6 +1247,9 @@
> > > >  		spte &= ~SPTE_MMU_WRITEABLE;
> > > >  	spte = spte & ~PT_WRITABLE_MASK;
> > > >  
> > > > +	if (is_pinned_spte(spte))
> > > > +		mmu_reload_pinned_vcpus(kvm);
> > > > +
> > > Why write protect it at all? mmu_reload_pinned_vcpus() will unprotected it anyway
> > > on the next vmentry. Isn't it better to just report all pinned pages as dirty alway.
> > 
> > That was the initial plan, however its awkward to stop vcpus, execute
> > get_dirty_log twice, and have pages marked as dirty on the second
> > execution.
> Indeed, but I think it may happen today with vhost (or even other devices
> that emulate dma), so userspace should be able to deal with it already.
> 
> > 
> > That is, it is in "incorrect" to report pages as dirty when they are
> > clean.
> In case of PEBS area we cannot know. CPU writes there directly without
> KVM even knowing about it so the only sane thing is to assume that after
> a vmentry PEBS area is dirty. This is what happens with this patch BTW,
> mmu_reload_pinned_vcpus() will force ->page_fault(FERR_WRITE) which will
> mark all pinned pages as dirty even if pages are never written to. You
> can achieve the same by having vcpu->pinned_page_dirty which is set to
> true on each vmentry and to false on each GET_DIRTY_LOG.
> 
> > 
> > Moreover, if the number of pinned pages is larger than the dirty
> > threshold to stop VM and migrate, you'll never migrate. If vcpus are
> > in HLT and don't VM-enter immediately, the pages should not be refaulted
> > right away.
> We should not allow that many pinned pages for security reasons. And
> having a lot of page to fault in on a next vmentry after each
> GET_DIRTY_LOG will slow down a guest during migration considerably.
> 
> > 
> > Do you think the optimization is worthwhile ?
> > 
> I think it's simpler, but even if we will go with your current approach
> it should be improved: there is no point sending IPI to all vcpus in
> spte_write_protect() like you do here since the IPI will be send anyway at
> the end of write protect because of ptes writable->nonwritable transition,
> so all you have to do here is to set KVM_REQ_MMU_RELOAD, not need for IPI.

No, you have to make sure the vcpu is out of guest mode.

> In fact this makes me thinking that write protecting pinned page here is
> incorrect because old translation may not be in TLB and if CPU will try to
> write PEBS entry after pte is write protected but before MMU is reloaded
> it will encounter non writable pte and god knows what will happen, SDM
> does not tell. So spte_write_protect() should be something like that IMO:

The vcpu will never see a read-only spte because the VM-exit (due to
IPI) guarantees vcpu is outside of guest mode _before_ it is write
protected.

So i ask you: do you still hold the "current approach should be
improved" position ?


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch 2/5] KVM: MMU: allow pinning spte translations (TDP-only)
  2014-06-30 20:46         ` Marcelo Tosatti
@ 2014-06-30 22:00           ` Andi Kleen
  0 siblings, 0 replies; 27+ messages in thread
From: Andi Kleen @ 2014-06-30 22:00 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Gleb Natapov, kvm, pbonzini, xiaoguangrong, avi

> > So it we take from it that translation should be present the same goes for
> > accessed and dirty. If Andi can clarify this within Intel it would be great.
> 
> Andi?

There were some problems on really old CPUs with non dirty/accessed pages
(P4 generation) with PEBS. But PEBS virtualization is white listed and will
never run on those.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch 4/5] KVM: MMU: reload request from GET_DIRTY_LOG path
  2014-06-30 20:59         ` Marcelo Tosatti
@ 2014-07-01  6:27           ` Gleb Natapov
  2014-07-01 17:50             ` Marcelo Tosatti
  0 siblings, 1 reply; 27+ messages in thread
From: Gleb Natapov @ 2014-07-01  6:27 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, ak, pbonzini, xiaoguangrong, avi

On Mon, Jun 30, 2014 at 05:59:02PM -0300, Marcelo Tosatti wrote:
> On Fri, Jun 20, 2014 at 01:46:10PM +0300, Gleb Natapov wrote:
> > On Thu, Jun 19, 2014 at 03:40:31PM -0300, Marcelo Tosatti wrote:
> > > On Thu, Jun 19, 2014 at 11:17:19AM +0300, Gleb Natapov wrote:
> > > > On Wed, Jun 18, 2014 at 08:12:07PM -0300, mtosatti@redhat.com wrote:
> > > > > Reload remote vcpus MMU from GET_DIRTY_LOG codepath, before
> > > > > deleting a pinned spte.
> > > > > 
> > > > > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > > > > 
> > > > > ---
> > > > >  arch/x86/kvm/mmu.c |    3 +++
> > > > >  1 file changed, 3 insertions(+)
> > > > > 
> > > > > Index: kvm.pinned-sptes/arch/x86/kvm/mmu.c
> > > > > ===================================================================
> > > > > --- kvm.pinned-sptes.orig/arch/x86/kvm/mmu.c	2014-06-13 16:50:50.040140594 -0300
> > > > > +++ kvm.pinned-sptes/arch/x86/kvm/mmu.c	2014-06-13 16:51:05.620104451 -0300
> > > > > @@ -1247,6 +1247,9 @@
> > > > >  		spte &= ~SPTE_MMU_WRITEABLE;
> > > > >  	spte = spte & ~PT_WRITABLE_MASK;
> > > > >  
> > > > > +	if (is_pinned_spte(spte))
> > > > > +		mmu_reload_pinned_vcpus(kvm);
> > > > > +
> > > > Why write protect it at all? mmu_reload_pinned_vcpus() will unprotected it anyway
> > > > on the next vmentry. Isn't it better to just report all pinned pages as dirty alway.
> > > 
> > > That was the initial plan, however its awkward to stop vcpus, execute
> > > get_dirty_log twice, and have pages marked as dirty on the second
> > > execution.
> > Indeed, but I think it may happen today with vhost (or even other devices
> > that emulate dma), so userspace should be able to deal with it already.
> > 
> > > 
> > > That is, it is in "incorrect" to report pages as dirty when they are
> > > clean.
> > In case of PEBS area we cannot know. CPU writes there directly without
> > KVM even knowing about it so the only sane thing is to assume that after
> > a vmentry PEBS area is dirty. This is what happens with this patch BTW,
> > mmu_reload_pinned_vcpus() will force ->page_fault(FERR_WRITE) which will
> > mark all pinned pages as dirty even if pages are never written to. You
> > can achieve the same by having vcpu->pinned_page_dirty which is set to
> > true on each vmentry and to false on each GET_DIRTY_LOG.
> > 
> > > 
> > > Moreover, if the number of pinned pages is larger than the dirty
> > > threshold to stop VM and migrate, you'll never migrate. If vcpus are
> > > in HLT and don't VM-enter immediately, the pages should not be refaulted
> > > right away.
> > We should not allow that many pinned pages for security reasons. And
> > having a lot of page to fault in on a next vmentry after each
> > GET_DIRTY_LOG will slow down a guest during migration considerably.
> > 
> > > 
> > > Do you think the optimization is worthwhile ?
> > > 
> > I think it's simpler, but even if we will go with your current approach
> > it should be improved: there is no point sending IPI to all vcpus in
> > spte_write_protect() like you do here since the IPI will be send anyway at
> > the end of write protect because of ptes writable->nonwritable transition,
> > so all you have to do here is to set KVM_REQ_MMU_RELOAD, not need for IPI.
> 
> No, you have to make sure the vcpu is out of guest mode.
> 
> > In fact this makes me thinking that write protecting pinned page here is
> > incorrect because old translation may not be in TLB and if CPU will try to
> > write PEBS entry after pte is write protected but before MMU is reloaded
> > it will encounter non writable pte and god knows what will happen, SDM
> > does not tell. So spte_write_protect() should be something like that IMO:
> 
> The vcpu will never see a read-only spte because the VM-exit (due to
> IPI) guarantees vcpu is outside of guest mode _before_ it is write
> protected.
Right. Now I see why you absolutely have to send IPI in mmu_reload_pinned_vcpus()
before marking pte as read only. And kvm->mmu_lock is what will prevent vcpu from
re-entering guest mode again before pte is marked read only, right?

> 
> So i ask you: do you still hold the "current approach should be
> improved" position ?
> 
As I said IMO what I proposed is much simpler and not as tricky as what you have here.
It also has an advantage of not slowing down next guest entry after GET_DIRTY_LOG because
it does not require mmu reload and page_faulting in pinned pages.

--
			Gleb.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch 4/5] KVM: MMU: reload request from GET_DIRTY_LOG path
  2014-07-01  6:27           ` Gleb Natapov
@ 2014-07-01 17:50             ` Marcelo Tosatti
  0 siblings, 0 replies; 27+ messages in thread
From: Marcelo Tosatti @ 2014-07-01 17:50 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, ak, pbonzini, xiaoguangrong, avi

On Tue, Jul 01, 2014 at 09:27:19AM +0300, Gleb Natapov wrote:
> > The vcpu will never see a read-only spte because the VM-exit (due to
> > IPI) guarantees vcpu is outside of guest mode _before_ it is write
> > protected.
> Right. Now I see why you absolutely have to send IPI in mmu_reload_pinned_vcpus()
> before marking pte as read only. And kvm->mmu_lock is what will prevent vcpu from
> re-entering guest mode again before pte is marked read only, right?

Yes.

> > So i ask you: do you still hold the "current approach should be
> > improved" position ?
> > 
> As I said IMO what I proposed is much simpler and not as tricky as what you have here.
> It also has an advantage of not slowing down next guest entry after GET_DIRTY_LOG because
> it does not require mmu reload and page_faulting in pinned pages.

Ok sure.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch 2/5] KVM: MMU: allow pinning spte translations (TDP-only)
  2014-06-18 23:12 ` [patch 2/5] KVM: MMU: allow pinning spte translations (TDP-only) mtosatti
  2014-06-19  7:21   ` Gleb Natapov
  2014-06-19  8:01   ` Avi Kivity
@ 2014-07-02  0:58   ` Nadav Amit
  2 siblings, 0 replies; 27+ messages in thread
From: Nadav Amit @ 2014-07-02  0:58 UTC (permalink / raw)
  To: mtosatti, kvm, ak; +Cc: pbonzini, xiaoguangrong, gleb, avi

I tried to use the patches (regardless of PEBS).
When the host uses huge-pages, I get an a oops.
I did not delve into the patch details, but I think there is some mess 
with the indexes - see below:

On 6/19/14, 2:12 AM, mtosatti@redhat.com wrote:
> +
> +static bool direct_pin_sptes(struct kvm_vcpu *vcpu, gfn_t gfn)
> +{
> +	u64 *sptes[4];
> +	int r, i, level;
> +
> +	r = get_sptep_hierarchy(vcpu, gfn << PAGE_SHIFT, sptes);
> +	if (!r)
> +		return false;
> +
> +	level = 5 - r;
> +	if (!is_last_spte(*sptes[r-1], level))
should be *sptes[level-1]
> +		return false;
> +	if (!is_shadow_present_pte(*sptes[r-1]))
should be *sptes[level-1]
> +		return false;
> +
> +	for (i = 0; i < r; i++) {
> +		u64 *sptep = sptes[i];
should be sptes[3 - i];
> +		struct kvm_mmu_page *sp = page_header(__pa(sptep));
> +
> +		sp->pinned = true;
> +		set_bit(SPTE_PINNED_BIT, (unsigned long *)sptep);
> +	}
> +
> +	return true;
> +}
>

Regards,
Nadav

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch 2/5] KVM: MMU: allow pinning spte translations (TDP-only)
  2014-06-22 13:35       ` Avi Kivity
@ 2014-07-09 13:25         ` Marcelo Tosatti
  0 siblings, 0 replies; 27+ messages in thread
From: Marcelo Tosatti @ 2014-07-09 13:25 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, ak, pbonzini, xiaoguangrong, gleb

On Sun, Jun 22, 2014 at 04:35:24PM +0300, Avi Kivity wrote:
> >>>+ * Failure to instantiate pages will abort guest entry.
> >>>+ *
> >>>+ * Page frames should be pinned with get_page in advance.
> >>>+ *
> >>>+ * Pinning is not guaranteed while executing as L2 guest.
> >>Does this undermine security?
> >PEBS writes should not be enabled when L2 guest is executing.
> 
> What prevents L1 for setting up PEBS MSRs for L2?

L2 should set up PEBS MSR, not L1 setup MSRs for L2.

In case L2 sets up PEBS, L1->L2 switch should pin pages 
as well. 

But since PEBS is not supported for L2 ATM, i'll keep the code
as is so it can be fixed later.


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2014-07-09 13:26 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-06-18 23:12 [patch 0/5] KVM: support for pinning sptes mtosatti
2014-06-18 23:12 ` [patch 1/5] KVM: x86: add pinned parameter to page_fault methods mtosatti
2014-06-18 23:12 ` [patch 2/5] KVM: MMU: allow pinning spte translations (TDP-only) mtosatti
2014-06-19  7:21   ` Gleb Natapov
2014-06-19 19:22     ` Marcelo Tosatti
2014-06-20 10:09       ` Gleb Natapov
2014-06-30 20:46         ` Marcelo Tosatti
2014-06-30 22:00           ` Andi Kleen
2014-06-19  8:01   ` Avi Kivity
2014-06-19 14:06     ` Andi Kleen
2014-06-19 18:26     ` Marcelo Tosatti
2014-06-22 13:35       ` Avi Kivity
2014-07-09 13:25         ` Marcelo Tosatti
2014-07-02  0:58   ` Nadav Amit
2014-06-18 23:12 ` [patch 3/5] KVM: MMU: notifiers support for pinned sptes mtosatti
2014-06-19  6:48   ` Gleb Natapov
2014-06-19 18:28     ` Marcelo Tosatti
2014-06-20 10:11       ` Gleb Natapov
2014-06-18 23:12 ` [patch 4/5] KVM: MMU: reload request from GET_DIRTY_LOG path mtosatti
2014-06-19  8:17   ` Gleb Natapov
2014-06-19 18:40     ` Marcelo Tosatti
2014-06-20 10:46       ` Gleb Natapov
2014-06-30 20:59         ` Marcelo Tosatti
2014-07-01  6:27           ` Gleb Natapov
2014-07-01 17:50             ` Marcelo Tosatti
2014-06-18 23:12 ` [patch 5/5] KVM: MMU: pinned sps are not candidates for deletion mtosatti
2014-06-19  1:44 ` [patch 0/5] KVM: support for pinning sptes Andi Kleen

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.