* [PATCH RFC 0/5] KVM: x86: KVM_MEM_ALLONES memory
@ 2020-05-14 18:05 Vitaly Kuznetsov
  2020-05-14 18:05 ` [PATCH RFC 1/5] KVM: rename labels in kvm_init() Vitaly Kuznetsov
                   ` (5 more replies)
  0 siblings, 6 replies; 18+ messages in thread
From: Vitaly Kuznetsov @ 2020-05-14 18:05 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, Michael Tsirkin, Julia Suvorova, Paolo Bonzini,
	Sean Christopherson, Wanpeng Li, Jim Mattson, x86

The idea of the patchset was suggested by Michael S. Tsirkin.

PCIe config space can (depending on the configuration) be quite big but
is usually sparsely populated. A guest may scan it by accessing each
device's page which, when the device is missing, is supposed to have 'PCI
hole' semantics: reads return '0xff' and writes get discarded. Currently,
userspace has to allocate real memory for these holes and fill it with
'0xff'. Moreover, each VM requires its own separate copy of this memory.

The idea behind the feature introduced by this patchset is: let's have a
single read-only page filled with '0xff' in KVM and map it into all such
PCI holes in all VMs. This frees userspace from the obligation to allocate
real memory and also allows us to speed up access to these holes, as we
can aggressively map the whole slot upon the first fault.
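
For illustration, this is roughly how userspace would use the proposed
flag (a sketch against the uapi from PATCH 2, mirroring the selftest in
PATCH 5; error handling omitted):

  #include <linux/kvm.h>
  #include <sys/ioctl.h>

  /*
   * Back a 4-page PCI hole at GPA 0xc0000000 with the shared all-ones
   * page instead of real memory. The slot number is arbitrary.
   */
  static int add_pci_hole(int vm_fd)
  {
          struct kvm_userspace_memory_region region = {
                  .slot = 10,
                  .flags = KVM_MEM_ALLONES,
                  .guest_phys_addr = 0xc0000000,
                  .memory_size = 0x4000,
                  .userspace_addr = 0,    /* must be NULL for such slots */
          };

          return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
  }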

RFC. I've only tested the feature with the selftest (PATCH5) on Intel/AMD
with and without EPT/NPT. I haven't tested memslot modifications yet.

Patches are against kvm/next.

Vitaly Kuznetsov (5):
  KVM: rename labels in kvm_init()
  KVM: x86: introduce KVM_MEM_ALLONES memory
  KVM: x86: move kvm_vcpu_gfn_to_memslot() out of try_async_pf()
  KVM: x86: aggressively map PTEs in KVM_MEM_ALLONES slots
  KVM: selftests: add KVM_MEM_ALLONES test

 Documentation/virt/kvm/api.rst                |  22 ++--
 arch/x86/include/uapi/asm/kvm.h               |   1 +
 arch/x86/kvm/mmu/mmu.c                        |  34 ++++--
 arch/x86/kvm/mmu/paging_tmpl.h                |  30 ++++-
 arch/x86/kvm/x86.c                            |   9 +-
 include/linux/kvm_host.h                      |  15 ++-
 include/uapi/linux/kvm.h                      |   2 +
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../testing/selftests/kvm/include/kvm_util.h  |   1 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |  81 +++++++------
 .../kvm/x86_64/memory_region_allones.c        | 112 ++++++++++++++++++
 virt/kvm/kvm_main.c                           | 110 +++++++++++++----
 12 files changed, 342 insertions(+), 76 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/x86_64/memory_region_allones.c

-- 
2.25.4



* [PATCH RFC 1/5] KVM: rename labels in kvm_init()
  2020-05-14 18:05 [PATCH RFC 0/5] KVM: x86: KVM_MEM_ALLONES memory Vitaly Kuznetsov
@ 2020-05-14 18:05 ` Vitaly Kuznetsov
  2020-05-14 18:05 ` [PATCH RFC 2/5] KVM: x86: introduce KVM_MEM_ALLONES memory Vitaly Kuznetsov
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Vitaly Kuznetsov @ 2020-05-14 18:05 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, Michael Tsirkin, Julia Suvorova, Paolo Bonzini,
	Sean Christopherson, Wanpeng Li, Jim Mattson, x86

Label names in kvm_init() are horrible; rename them to make it obvious
what we are going to do on each failure path.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 virt/kvm/kvm_main.c | 33 ++++++++++++++++-----------------
 1 file changed, 16 insertions(+), 17 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 33e1eee96f75..892ea0b9087e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4674,7 +4674,7 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
 
 	r = kvm_arch_init(opaque);
 	if (r)
-		goto out_fail;
+		return r;
 
 	/*
 	 * kvm_arch_init makes sure there's at most one caller
@@ -4685,29 +4685,29 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
 	 */
 	r = kvm_irqfd_init();
 	if (r)
-		goto out_irqfd;
+		goto out_arch_exit;
 
 	if (!zalloc_cpumask_var(&cpus_hardware_enabled, GFP_KERNEL)) {
 		r = -ENOMEM;
-		goto out_free_0;
+		goto out_irqfd_exit;
 	}
 
 	r = kvm_arch_hardware_setup(opaque);
 	if (r < 0)
-		goto out_free_1;
+		goto out_free_hardware_enabled;
 
 	c.ret = &r;
 	c.opaque = opaque;
 	for_each_online_cpu(cpu) {
 		smp_call_function_single(cpu, check_processor_compat, &c, 1);
 		if (r < 0)
-			goto out_free_2;
+			goto out_free_hardware_unsetup;
 	}
 
 	r = cpuhp_setup_state_nocalls(CPUHP_AP_KVM_STARTING, "kvm/cpu:starting",
 				      kvm_starting_cpu, kvm_dying_cpu);
 	if (r)
-		goto out_free_2;
+		goto out_free_hardware_unsetup;
 	register_reboot_notifier(&kvm_reboot_notifier);
 
 	/* A kmem cache lets us meet the alignment requirements of fx_save. */
@@ -4721,12 +4721,12 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
 					   NULL);
 	if (!kvm_vcpu_cache) {
 		r = -ENOMEM;
-		goto out_free_3;
+		goto out_free_cpuhp_unregister;
 	}
 
 	r = kvm_async_pf_init();
 	if (r)
-		goto out_free;
+		goto out_free_vcpu_cache;
 
 	kvm_chardev_ops.owner = module;
 	kvm_vm_fops.owner = module;
@@ -4735,7 +4735,7 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
 	r = misc_register(&kvm_dev);
 	if (r) {
 		pr_err("kvm: misc device register failed\n");
-		goto out_unreg;
+		goto out_async_pf_deinit;
 	}
 
 	register_syscore_ops(&kvm_syscore_ops);
@@ -4750,22 +4750,21 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
 
 	return 0;
 
-out_unreg:
+out_async_pf_deinit:
 	kvm_async_pf_deinit();
-out_free:
+out_free_vcpu_cache:
 	kmem_cache_destroy(kvm_vcpu_cache);
-out_free_3:
+out_free_cpuhp_unregister:
 	unregister_reboot_notifier(&kvm_reboot_notifier);
 	cpuhp_remove_state_nocalls(CPUHP_AP_KVM_STARTING);
-out_free_2:
+out_free_hardware_unsetup:
 	kvm_arch_hardware_unsetup();
-out_free_1:
+out_free_hardware_enabled:
 	free_cpumask_var(cpus_hardware_enabled);
-out_free_0:
+out_irqfd_exit:
 	kvm_irqfd_exit();
-out_irqfd:
+out_arch_exit:
 	kvm_arch_exit();
-out_fail:
 	return r;
 }
 EXPORT_SYMBOL_GPL(kvm_init);
-- 
2.25.4



* [PATCH RFC 2/5] KVM: x86: introduce KVM_MEM_ALLONES memory
  2020-05-14 18:05 [PATCH RFC 0/5] KVM: x86: KVM_MEM_ALLONES memory Vitaly Kuznetsov
  2020-05-14 18:05 ` [PATCH RFC 1/5] KVM: rename labels in kvm_init() Vitaly Kuznetsov
@ 2020-05-14 18:05 ` Vitaly Kuznetsov
  2020-05-14 19:18   ` Sean Christopherson
  2020-05-14 18:05 ` [PATCH RFC 3/5] KVM: x86: move kvm_vcpu_gfn_to_memslot() out of try_async_pf() Vitaly Kuznetsov
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 18+ messages in thread
From: Vitaly Kuznetsov @ 2020-05-14 18:05 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, Michael Tsirkin, Julia Suvorova, Paolo Bonzini,
	Sean Christopherson, Wanpeng Li, Jim Mattson, x86

PCIe config space can (depending on the configuration) be quite big but
is usually sparsely populated. A guest may scan it by accessing each
device's page which, when the device is missing, is supposed to have 'PCI
hole' semantics: reads return '0xff' and writes get discarded. Currently,
userspace has to allocate real memory for these holes and fill it with
'0xff'. Moreover, each VM requires its own separate copy of this memory.

The idea behind the feature introduced by this patch is: let's have a
single read-only page filled with '0xff' in KVM and map it into all such
PCI holes in all VMs. This frees userspace from the obligation to allocate
real memory. Later, this will also allow us to speed up access to these
holes, as we can aggressively map the whole slot upon the first fault.

Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 Documentation/virt/kvm/api.rst  | 22 ++++++---
 arch/x86/include/uapi/asm/kvm.h |  1 +
 arch/x86/kvm/x86.c              |  9 ++--
 include/linux/kvm_host.h        | 15 ++++++-
 include/uapi/linux/kvm.h        |  2 +
 virt/kvm/kvm_main.c             | 79 +++++++++++++++++++++++++++++++--
 6 files changed, 113 insertions(+), 15 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index d871dacb984e..2b87d588a7e0 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1236,7 +1236,8 @@ yet and must be cleared on entry.
 
   /* for kvm_memory_region::flags */
   #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
-  #define KVM_MEM_READONLY	(1UL << 1)
+  #define KVM_MEM_READONLY		(1UL << 1)
+  #define KVM_MEM_ALLONES		(1UL << 2)
 
 This ioctl allows the user to create, modify or delete a guest physical
 memory slot.  Bits 0-15 of "slot" specify the slot id and this value
@@ -1264,12 +1265,19 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
 be identical.  This allows large pages in the guest to be backed by large
 pages in the host.
 
-The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
-KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
-writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
-use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
-to make a new slot read-only.  In this case, writes to this memory will be
-posted to userspace as KVM_EXIT_MMIO exits.
+The flags field supports the following flags: KVM_MEM_LOG_DIRTY_PAGES,
+KVM_MEM_READONLY and KVM_MEM_ALLONES:
+- KVM_MEM_LOG_DIRTY_PAGES can be set to instruct KVM to keep track of writes to
+  memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to use it.
+- KVM_MEM_READONLY can be set, if KVM_CAP_READONLY_MEM capability allows it,
+  to make a new slot read-only.  In this case, writes to this memory will be
+  posted to userspace as KVM_EXIT_MMIO exits.
+- KVM_MEM_ALLONES can be set, if the KVM_CAP_ALLONES_MEM capability allows it,
+  to create a new virtual read-only slot which always returns '0xff' when the
+  guest reads from it. 'userspace_addr' has to be set to NULL; KVM maps all
+  pages in such slots to a single read-only page. This flag is mutually
+  exclusive with KVM_MEM_LOG_DIRTY_PAGES and KVM_MEM_READONLY. All writes to
+  this memory will be posted to userspace as KVM_EXIT_MMIO exits.
 
 When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
 the memory region are automatically reflected into the guest.  For example, an
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 3f3f780c8c65..97ad1d90636e 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -48,6 +48,7 @@
 #define __KVM_HAVE_XSAVE
 #define __KVM_HAVE_XCRS
 #define __KVM_HAVE_READONLY_MEM
+#define __KVM_HAVE_ALLONES_MEM
 
 /* Architectural interrupt line count. */
 #define KVM_NR_INTERRUPTS 256
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8c0b77ac8dc6..2c0fcab8ea1e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3389,6 +3389,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_GET_MSR_FEATURES:
 	case KVM_CAP_MSR_PLATFORM_INFO:
 	case KVM_CAP_EXCEPTION_PAYLOAD:
+	case KVM_CAP_ALLONES_MEM:
 		r = 1;
 		break;
 	case KVM_CAP_SYNC_REGS:
@@ -9962,9 +9963,11 @@ static int kvm_alloc_memslot_metadata(struct kvm_memory_slot *slot,
 		ugfn = slot->userspace_addr >> PAGE_SHIFT;
 		/*
 		 * If the gfn and userspace address are not aligned wrt each
-		 * other, disable large page support for this slot.
+		 * other, disable large page support for this slot. Also,
+		 * disable large page support for KVM_MEM_ALLONES slots.
 		 */
-		if ((slot->base_gfn ^ ugfn) & (KVM_PAGES_PER_HPAGE(level) - 1)) {
+		if (slot->flags & KVM_MEM_ALLONES || ((slot->base_gfn ^ ugfn) &
+				      (KVM_PAGES_PER_HPAGE(level) - 1))) {
 			unsigned long j;
 
 			for (j = 0; j < lpages; ++j)
@@ -10021,7 +10024,7 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
 				     struct kvm_memory_slot *new)
 {
 	/* Still write protect RO slot */
-	if (new->flags & KVM_MEM_READONLY) {
+	if (new->flags & KVM_MEM_READONLY || new->flags & KVM_MEM_ALLONES) {
 		kvm_mmu_slot_remove_write_access(kvm, new, PT_PAGE_TABLE_LEVEL);
 		return;
 	}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3cc6ccbb1183..47a01ebe019d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1064,10 +1064,23 @@ __gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn)
 	return search_memslots(slots, gfn);
 }
 
+#ifdef __KVM_HAVE_ALLONES_MEM
+extern void *kvm_allones_pg;
+#endif
+
 static inline unsigned long
 __gfn_to_hva_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
 {
-	return slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE;
+	if (likely(!(slot->flags & KVM_MEM_ALLONES))) {
+		return slot->userspace_addr +
+			(gfn - slot->base_gfn) * PAGE_SIZE;
+	} else {
+#ifdef __KVM_HAVE_ALLONES_MEM
+		return (unsigned long)kvm_allones_pg;
+#else
+		BUG();
+#endif
+	}
 }
 
 static inline int memslot_id(struct kvm *kvm, gfn_t gfn)
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index ac9eba0289d1..2d28890d6c7d 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -109,6 +109,7 @@ struct kvm_userspace_memory_region {
  */
 #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
 #define KVM_MEM_READONLY	(1UL << 1)
+#define KVM_MEM_ALLONES		(1UL << 2)
 
 /* for KVM_IRQ_LINE */
 struct kvm_irq_level {
@@ -1018,6 +1019,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_S390_PROTECTED 180
 #define KVM_CAP_PPC_SECURE_GUEST 181
 #define KVM_CAP_HALT_POLL 182
+#define KVM_CAP_ALLONES_MEM 183
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 892ea0b9087e..95e645c6efbc 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -155,6 +155,11 @@ static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm);
 static unsigned long long kvm_createvm_count;
 static unsigned long long kvm_active_vms;
 
+#ifdef __KVM_HAVE_ALLONES_MEM
+void *kvm_allones_pg;
+EXPORT_SYMBOL_GPL(kvm_allones_pg);
+#endif
+
 __weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
 		unsigned long start, unsigned long end, bool blockable)
 {
@@ -1039,6 +1044,10 @@ static int check_memory_region_flags(const struct kvm_userspace_memory_region *m
 	valid_flags |= KVM_MEM_READONLY;
 #endif
 
+#ifdef __KVM_HAVE_ALLONES_MEM
+	valid_flags |= KVM_MEM_ALLONES;
+#endif
+
 	if (mem->flags & ~valid_flags)
 		return -EINVAL;
 
@@ -1216,8 +1225,24 @@ int __kvm_set_memory_region(struct kvm *kvm,
 		return -EINVAL;
 	if (mem->guest_phys_addr & (PAGE_SIZE - 1))
 		return -EINVAL;
-	/* We can read the guest memory with __xxx_user() later on. */
-	if ((id < KVM_USER_MEM_SLOTS) &&
+
+	/* KVM_MEM_ALLONES doesn't need userspace_addr */
+	if ((mem->flags & KVM_MEM_ALLONES) && mem->userspace_addr)
+		return -EINVAL;
+
+	/*
+	 * KVM_MEM_ALLONES is mutually exclusive with KVM_MEM_READONLY/
+	 * KVM_MEM_LOG_DIRTY_PAGES.
+	 */
+	if ((mem->flags & KVM_MEM_ALLONES) &&
+	    (mem->flags & (KVM_MEM_READONLY | KVM_MEM_LOG_DIRTY_PAGES)))
+		return -EINVAL;
+
+	/*
+	 * We can read the guest memory with __xxx_user() later on.
+	 * KVM_MEM_ALLONES slots don't have any userspace memory attached.
+	 */
+	if ((id < KVM_USER_MEM_SLOTS) && !(mem->flags & KVM_MEM_ALLONES) &&
 	    ((mem->userspace_addr & (PAGE_SIZE - 1)) ||
 	     !access_ok((void __user *)(unsigned long)mem->userspace_addr,
 			mem->memory_size)))
@@ -1920,7 +1945,24 @@ kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
 			       bool atomic, bool *async, bool write_fault,
 			       bool *writable)
 {
-	unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
+	unsigned long addr;
+
+#ifdef __KVM_HAVE_ALLONES_MEM
+	if (unlikely(slot->flags & KVM_MEM_ALLONES)) {
+		if (writable)
+			*writable = false;
+		if (async)
+			*async = false;
+		if (!write_fault) {
+			addr = vmalloc_to_pfn(kvm_allones_pg);
+			kvm_get_pfn(addr);
+			return addr;
+		} else {
+			return KVM_PFN_NOSLOT;
+		}
+	}
+#endif
+	addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
 
 	if (addr == KVM_HVA_ERR_RO_BAD) {
 		if (writable)
@@ -4671,6 +4713,9 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
 	struct kvm_cpu_compat_check c;
 	int r;
 	int cpu;
+#ifdef __KVM_HAVE_ALLONES_MEM
+	struct page *page;
+#endif
 
 	r = kvm_arch_init(opaque);
 	if (r)
@@ -4728,6 +4773,18 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
 	if (r)
 		goto out_free_vcpu_cache;
 
+#ifdef __KVM_HAVE_ALLONES_MEM
+	page = alloc_page(GFP_KERNEL);
+	if (!page)
+		goto out_async_pf_deinit;
+	memset(page_address(page), 0xff, PAGE_SIZE);
+	kvm_allones_pg = vmap(&page, 1, VM_MAP, PAGE_KERNEL_RO);
+	if (!kvm_allones_pg) {
+		__free_page(page);
+		goto out_async_pf_deinit;
+	}
+#endif
+
 	kvm_chardev_ops.owner = module;
 	kvm_vm_fops.owner = module;
 	kvm_vcpu_fops.owner = module;
@@ -4735,7 +4792,7 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
 	r = misc_register(&kvm_dev);
 	if (r) {
 		pr_err("kvm: misc device register failed\n");
-		goto out_async_pf_deinit;
+		goto out_free_allones_page;
 	}
 
 	register_syscore_ops(&kvm_syscore_ops);
@@ -4750,6 +4807,12 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
 
 	return 0;
 
+out_free_allones_page:
+#ifdef __KVM_HAVE_ALLONES_MEM
+	page = vmalloc_to_page(kvm_allones_pg);
+	vunmap(kvm_allones_pg);
+	__free_page(page);
+#endif
 out_async_pf_deinit:
 	kvm_async_pf_deinit();
 out_free_vcpu_cache:
@@ -4771,8 +4834,16 @@ EXPORT_SYMBOL_GPL(kvm_init);
 
 void kvm_exit(void)
 {
+#ifdef __KVM_HAVE_ALLONES_MEM
+	struct page *page;
+#endif
 	debugfs_remove_recursive(kvm_debugfs_dir);
 	misc_deregister(&kvm_dev);
+#ifdef __KVM_HAVE_ALLONES_MEM
+	page = vmalloc_to_page(kvm_allones_pg);
+	vunmap(kvm_allones_pg);
+	__free_page(page);
+#endif
 	kmem_cache_destroy(kvm_vcpu_cache);
 	kvm_async_pf_deinit();
 	unregister_syscore_ops(&kvm_syscore_ops);
-- 
2.25.4



* [PATCH RFC 3/5] KVM: x86: move kvm_vcpu_gfn_to_memslot() out of try_async_pf()
  2020-05-14 18:05 [PATCH RFC 0/5] KVM: x86: KVM_MEM_ALLONES memory Vitaly Kuznetsov
  2020-05-14 18:05 ` [PATCH RFC 1/5] KVM: rename labels in kvm_init() Vitaly Kuznetsov
  2020-05-14 18:05 ` [PATCH RFC 2/5] KVM: x86: introduce KVM_MEM_ALLONES memory Vitaly Kuznetsov
@ 2020-05-14 18:05 ` Vitaly Kuznetsov
  2020-05-14 18:05 ` [PATCH RFC 4/5] KVM: x86: aggressively map PTEs in KVM_MEM_ALLONES slots Vitaly Kuznetsov
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Vitaly Kuznetsov @ 2020-05-14 18:05 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, Michael Tsirkin, Julia Suvorova, Paolo Bonzini,
	Sean Christopherson, Wanpeng Li, Jim Mattson, x86

No functional change intended. We will need to analyze slot information
to map PTEs for KVM_MEM_ALLONES slots aggressively.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 arch/x86/kvm/mmu/mmu.c         | 14 ++++++++------
 arch/x86/kvm/mmu/paging_tmpl.h |  7 +++++--
 2 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e618472c572b..3db499df2dfc 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4078,11 +4078,10 @@ static int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 				  kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
 }
 
-static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
-			 gpa_t cr2_or_gpa, kvm_pfn_t *pfn, bool write,
-			 bool *writable)
+static bool try_async_pf(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
+			 bool prefault, gfn_t gfn, gpa_t cr2_or_gpa,
+			 kvm_pfn_t *pfn, bool write, bool *writable)
 {
-	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
 	bool async;
 
 	/* Don't expose private memslots to L2. */
@@ -4118,7 +4117,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 	bool exec = error_code & PFERR_FETCH_MASK;
 	bool lpage_disallowed = exec && is_nx_huge_page_enabled();
 	bool map_writable;
-
+	struct kvm_memory_slot *slot;
 	gfn_t gfn = gpa >> PAGE_SHIFT;
 	unsigned long mmu_seq;
 	kvm_pfn_t pfn;
@@ -4140,7 +4139,10 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
 
-	if (try_async_pf(vcpu, prefault, gfn, gpa, &pfn, write, &map_writable))
+	slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
+
+	if (try_async_pf(vcpu, slot, prefault, gfn, gpa, &pfn, write,
+			 &map_writable))
 		return RET_PF_RETRY;
 
 	if (handle_abnormal_pfn(vcpu, is_tdp ? 0 : gpa, gfn, pfn, ACC_ALL, &r))
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index efec7d27b8c5..98e368788e8b 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -781,6 +781,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
 	int write_fault = error_code & PFERR_WRITE_MASK;
 	int user_fault = error_code & PFERR_USER_MASK;
 	struct guest_walker walker;
+	struct kvm_memory_slot *slot;
 	int r;
 	kvm_pfn_t pfn;
 	unsigned long mmu_seq;
@@ -835,8 +836,10 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
 
-	if (try_async_pf(vcpu, prefault, walker.gfn, addr, &pfn, write_fault,
-			 &map_writable))
+	slot = kvm_vcpu_gfn_to_memslot(vcpu, walker.gfn);
+
+	if (try_async_pf(vcpu, slot, prefault, walker.gfn, addr, &pfn,
+			 write_fault, &map_writable))
 		return RET_PF_RETRY;
 
 	if (handle_abnormal_pfn(vcpu, addr, walker.gfn, pfn, walker.pte_access, &r))
-- 
2.25.4



* [PATCH RFC 4/5] KVM: x86: aggressively map PTEs in KVM_MEM_ALLONES slots
  2020-05-14 18:05 [PATCH RFC 0/5] KVM: x86: KVM_MEM_ALLONES memory Vitaly Kuznetsov
                   ` (2 preceding siblings ...)
  2020-05-14 18:05 ` [PATCH RFC 3/5] KVM: x86: move kvm_vcpu_gfn_to_memslot() out of try_async_pf() Vitaly Kuznetsov
@ 2020-05-14 18:05 ` Vitaly Kuznetsov
  2020-05-14 19:46   ` Sean Christopherson
  2020-05-14 18:05 ` [PATCH RFC 5/5] KVM: selftests: add KVM_MEM_ALLONES test Vitaly Kuznetsov
  2020-05-14 22:05 ` [PATCH RFC 0/5] KVM: x86: KVM_MEM_ALLONES memory Peter Xu
  5 siblings, 1 reply; 18+ messages in thread
From: Vitaly Kuznetsov @ 2020-05-14 18:05 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, Michael Tsirkin, Julia Suvorova, Paolo Bonzini,
	Sean Christopherson, Wanpeng Li, Jim Mattson, x86

All PTEs in KVM_MEM_ALLONES slots point to the same read-only page
in KVM so instead of mapping each page upon first access we can map
everything aggressively.

Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 arch/x86/kvm/mmu/mmu.c         | 20 ++++++++++++++++++--
 arch/x86/kvm/mmu/paging_tmpl.h | 23 +++++++++++++++++++++--
 2 files changed, 39 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3db499df2dfc..e92ca9ed3ff5 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4154,8 +4154,24 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 		goto out_unlock;
 	if (make_mmu_pages_available(vcpu) < 0)
 		goto out_unlock;
-	r = __direct_map(vcpu, gpa, write, map_writable, max_level, pfn,
-			 prefault, is_tdp && lpage_disallowed);
+
+	if (likely(!(slot->flags & KVM_MEM_ALLONES) || write)) {
+		r = __direct_map(vcpu, gpa, write, map_writable, max_level, pfn,
+				 prefault, is_tdp && lpage_disallowed);
+	} else {
+		/*
+		 * KVM_MEM_ALLONES are 4k only slots fully mapped to the same
+		 * readonly 'allones' page, map all PTEs aggressively here.
+		 */
+		for (gfn = slot->base_gfn; gfn < slot->base_gfn + slot->npages;
+		     gfn++) {
+			r = __direct_map(vcpu, gfn << PAGE_SHIFT, write,
+					 map_writable, max_level, pfn, prefault,
+					 is_tdp && lpage_disallowed);
+			if (r)
+				break;
+		}
+	}
 
 out_unlock:
 	spin_unlock(&vcpu->kvm->mmu_lock);
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 98e368788e8b..7bf0c48b858f 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -789,6 +789,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
 	bool lpage_disallowed = (error_code & PFERR_FETCH_MASK) &&
 				is_nx_huge_page_enabled();
 	int max_level;
+	gfn_t gfn;
 
 	pgprintk("%s: addr %lx err %x\n", __func__, addr, error_code);
 
@@ -873,8 +874,26 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
 	kvm_mmu_audit(vcpu, AUDIT_PRE_PAGE_FAULT);
 	if (make_mmu_pages_available(vcpu) < 0)
 		goto out_unlock;
-	r = FNAME(fetch)(vcpu, addr, &walker, write_fault, max_level, pfn,
-			 map_writable, prefault, lpage_disallowed);
+	if (likely(!(slot->flags & KVM_MEM_ALLONES) || write_fault)) {
+		r = FNAME(fetch)(vcpu, addr, &walker, write_fault, max_level,
+				 pfn, map_writable, prefault, lpage_disallowed);
+	} else {
+		/*
+		 * KVM_MEM_ALLONES are 4k only slots fully mapped to the same
+		 * readonly 'allones' page, map all PTEs aggressively here.
+		 */
+		for (gfn = slot->base_gfn; gfn < slot->base_gfn + slot->npages;
+		     gfn++) {
+			walker.gfn = gfn;
+			r = FNAME(fetch)(vcpu, gfn << PAGE_SHIFT, &walker,
+					 write_fault, max_level, pfn,
+					 map_writable, prefault,
+					 lpage_disallowed);
+			if (r)
+				break;
+		}
+	}
+
 	kvm_mmu_audit(vcpu, AUDIT_POST_PAGE_FAULT);
 
 out_unlock:
-- 
2.25.4



* [PATCH RFC 5/5] KVM: selftests: add KVM_MEM_ALLONES test
  2020-05-14 18:05 [PATCH RFC 0/5] KVM: x86: KVM_MEM_ALLONES memory Vitaly Kuznetsov
                   ` (3 preceding siblings ...)
  2020-05-14 18:05 ` [PATCH RFC 4/5] KVM: x86: aggressively map PTEs in KVM_MEM_ALLONES slots Vitaly Kuznetsov
@ 2020-05-14 18:05 ` Vitaly Kuznetsov
  2020-05-14 22:05 ` [PATCH RFC 0/5] KVM: x86: KVM_MEM_ALLONES memory Peter Xu
  5 siblings, 0 replies; 18+ messages in thread
From: Vitaly Kuznetsov @ 2020-05-14 18:05 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, Michael Tsirkin, Julia Suvorova, Paolo Bonzini,
	Sean Christopherson, Wanpeng Li, Jim Mattson, x86

Test the newly introduced KVM_MEM_ALLONES memslots:
- Reads from all pages return '0xff'
- Writes to all pages cause KVM_EXIT_MMIO

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../testing/selftests/kvm/include/kvm_util.h  |   1 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |  81 +++++++------
 .../kvm/x86_64/memory_region_allones.c        | 112 ++++++++++++++++++
 4 files changed, 162 insertions(+), 33 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/x86_64/memory_region_allones.c

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index 7af62030c12f..0c9aff445755 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -15,6 +15,7 @@ LIBKVM_s390x = lib/s390x/processor.c lib/s390x/ucall.c
 TEST_GEN_PROGS_x86_64 = x86_64/cr4_cpuid_sync_test
 TEST_GEN_PROGS_x86_64 += x86_64/evmcs_test
 TEST_GEN_PROGS_x86_64 += x86_64/hyperv_cpuid
+TEST_GEN_PROGS_x86_64 += x86_64/memory_region_allones
 TEST_GEN_PROGS_x86_64 += x86_64/mmio_warning_test
 TEST_GEN_PROGS_x86_64 += x86_64/platform_info_test
 TEST_GEN_PROGS_x86_64 += x86_64/set_sregs_test
diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index 53b11d725d81..8f5ebc8520b8 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -59,6 +59,7 @@ enum vm_mem_backing_src_type {
 	VM_MEM_SRC_ANONYMOUS,
 	VM_MEM_SRC_ANONYMOUS_THP,
 	VM_MEM_SRC_ANONYMOUS_HUGETLB,
+	VM_MEM_SRC_NONE,
 };
 
 int kvm_check_cap(long cap);
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 33ab0a36d230..232b63ba0b4b 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -446,8 +446,11 @@ static void __vm_mem_region_delete(struct kvm_vm *vm,
 		    "rc: %i errno: %i", ret, errno);
 
 	sparsebit_free(&region->unused_phy_pages);
-	ret = munmap(region->mmap_start, region->mmap_size);
-	TEST_ASSERT(ret == 0, "munmap failed, rc: %i errno: %i", ret, errno);
+	if (region->mmap_start) {
+		ret = munmap(region->mmap_start, region->mmap_size);
+		TEST_ASSERT(ret == 0, "munmap failed, rc: %i errno: %i", ret,
+			    errno);
+	}
 
 	free(region);
 }
@@ -636,34 +639,42 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
 	alignment = 1;
 #endif
 
-	if (src_type == VM_MEM_SRC_ANONYMOUS_THP)
-		alignment = max(huge_page_size, alignment);
-
-	/* Add enough memory to align up if necessary */
-	if (alignment > 1)
-		region->mmap_size += alignment;
-
-	region->mmap_start = mmap(NULL, region->mmap_size,
-				  PROT_READ | PROT_WRITE,
-				  MAP_PRIVATE | MAP_ANONYMOUS
-				  | (src_type == VM_MEM_SRC_ANONYMOUS_HUGETLB ? MAP_HUGETLB : 0),
-				  -1, 0);
-	TEST_ASSERT(region->mmap_start != MAP_FAILED,
-		    "test_malloc failed, mmap_start: %p errno: %i",
-		    region->mmap_start, errno);
-
-	/* Align host address */
-	region->host_mem = align(region->mmap_start, alignment);
-
-	/* As needed perform madvise */
-	if (src_type == VM_MEM_SRC_ANONYMOUS || src_type == VM_MEM_SRC_ANONYMOUS_THP) {
-		ret = madvise(region->host_mem, npages * vm->page_size,
-			     src_type == VM_MEM_SRC_ANONYMOUS ? MADV_NOHUGEPAGE : MADV_HUGEPAGE);
-		TEST_ASSERT(ret == 0, "madvise failed,\n"
-			    "  addr: %p\n"
-			    "  length: 0x%lx\n"
-			    "  src_type: %x",
-			    region->host_mem, npages * vm->page_size, src_type);
+	if (src_type != VM_MEM_SRC_NONE) {
+		if (src_type == VM_MEM_SRC_ANONYMOUS_THP)
+			alignment = max(huge_page_size, alignment);
+
+		/* Add enough memory to align up if necessary */
+		if (alignment > 1)
+			region->mmap_size += alignment;
+
+		region->mmap_start = mmap(NULL, region->mmap_size,
+			  PROT_READ | PROT_WRITE,
+			  MAP_PRIVATE | MAP_ANONYMOUS
+			  | (src_type == VM_MEM_SRC_ANONYMOUS_HUGETLB ?
+			     MAP_HUGETLB : 0), -1, 0);
+		TEST_ASSERT(region->mmap_start != MAP_FAILED,
+			    "test_malloc failed, mmap_start: %p errno: %i",
+			    region->mmap_start, errno);
+
+		/* Align host address */
+		region->host_mem = align(region->mmap_start, alignment);
+
+		/* As needed perform madvise */
+		if (src_type == VM_MEM_SRC_ANONYMOUS ||
+		    src_type == VM_MEM_SRC_ANONYMOUS_THP) {
+			ret = madvise(region->host_mem, npages * vm->page_size,
+				      src_type == VM_MEM_SRC_ANONYMOUS ?
+				      MADV_NOHUGEPAGE : MADV_HUGEPAGE);
+			TEST_ASSERT(ret == 0, "madvise failed,\n"
+				    "  addr: %p\n"
+				    "  length: 0x%lx\n"
+				    "  src_type: %x",
+				    region->host_mem, npages * vm->page_size,
+				    src_type);
+		}
+	} else {
+		region->mmap_start = NULL;
+		region->host_mem = NULL;
 	}
 
 	region->unused_phy_pages = sparsebit_alloc();
@@ -1069,9 +1080,13 @@ void *addr_gpa2hva(struct kvm_vm *vm, vm_paddr_t gpa)
 	list_for_each_entry(region, &vm->userspace_mem_regions, list) {
 		if ((gpa >= region->region.guest_phys_addr)
 			&& (gpa <= (region->region.guest_phys_addr
-				+ region->region.memory_size - 1)))
-			return (void *) ((uintptr_t) region->host_mem
-				+ (gpa - region->region.guest_phys_addr));
+				+ region->region.memory_size - 1))) {
+			if (region->host_mem)
+				return (void *) ((uintptr_t) region->host_mem
+				 + (gpa - region->region.guest_phys_addr));
+			else
+				return NULL;
+		}
 	}
 
 	TEST_FAIL("No vm physical memory at 0x%lx", gpa);
diff --git a/tools/testing/selftests/kvm/x86_64/memory_region_allones.c b/tools/testing/selftests/kvm/x86_64/memory_region_allones.c
new file mode 100644
index 000000000000..23fec4873422
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/memory_region_allones.c
@@ -0,0 +1,112 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE /* for program_invocation_short_name */
+#include <fcntl.h>
+#include <pthread.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+
+#include <linux/compiler.h>
+
+#include <test_util.h>
+#include <kvm_util.h>
+#include <processor.h>
+
+#define VCPU_ID 0
+
+#define MEM_REGION_GPA		0xc0000000
+#define MEM_REGION_SIZE		0x4000
+#define MEM_REGION_SLOT		10
+
+static void guest_code(void)
+{
+	uint8_t val;
+
+	/* First byte in the first page */
+	val = READ_ONCE(*((uint8_t *)MEM_REGION_GPA));
+	GUEST_ASSERT(val == 0xff);
+
+	GUEST_SYNC(1);
+
+	/* Random byte in the second page */
+	val = READ_ONCE(*((uint8_t *)MEM_REGION_GPA + 5000));
+	GUEST_ASSERT(val == 0xff);
+
+	GUEST_SYNC(2);
+
+	/* Write to the first page */
+	WRITE_ONCE(*((uint64_t *)MEM_REGION_GPA + 1024/8), 0xdeafbeef);
+
+	GUEST_SYNC(3);
+
+	/* Write to the second page */
+	WRITE_ONCE(*((uint64_t *)MEM_REGION_GPA + 8000/8), 0xdeafbeef);
+
+	GUEST_SYNC(4);
+
+	GUEST_DONE();
+}
+
+int main(int argc, char *argv[])
+{
+	struct kvm_vm *vm;
+	struct kvm_run *run;
+	struct ucall uc;
+	int stage, rv;
+
+	rv = kvm_check_cap(KVM_CAP_ALLONES_MEM);
+	if (!rv) {
+		print_skip("KVM_CAP_ALLONES_MEM not supported");
+		exit(KSFT_SKIP);
+	}
+
+	vm = vm_create_default(VCPU_ID, 0, guest_code);
+
+	run = vcpu_state(vm, VCPU_ID);
+
+	vcpu_set_cpuid(vm, VCPU_ID, kvm_get_supported_cpuid());
+
+	vm_userspace_mem_region_add(vm, VM_MEM_SRC_NONE,
+				    MEM_REGION_GPA, MEM_REGION_SLOT,
+				    MEM_REGION_SIZE / getpagesize(),
+				    KVM_MEM_ALLONES);
+
+	virt_map(vm, MEM_REGION_GPA, MEM_REGION_GPA,
+		 MEM_REGION_SIZE / getpagesize(), 0);
+
+	for (stage = 1;; stage++) {
+		_vcpu_run(vm, VCPU_ID);
+
+		if (stage == 3 || stage == 5) {
+			TEST_ASSERT(run->exit_reason == KVM_EXIT_MMIO,
+			    "Write to ALLONES page should cause KVM_EXIT_MMIO");
+			continue;
+		}
+
+		TEST_ASSERT(run->exit_reason == KVM_EXIT_IO,
+			    "Stage %d: unexpected exit reason: %u (%s),\n",
+			    stage, run->exit_reason,
+			    exit_reason_str(run->exit_reason));
+
+		switch (get_ucall(vm, VCPU_ID, &uc)) {
+		case UCALL_ABORT:
+			TEST_FAIL("%s at %s:%ld", (const char *)uc.args[0],
+				  __FILE__, uc.args[1]);
+			/* NOT REACHED */
+		case UCALL_SYNC:
+			break;
+		case UCALL_DONE:
+			goto done;
+		default:
+			TEST_FAIL("Unknown ucall %lu", uc.cmd);
+		}
+	}
+
+done:
+	kvm_vm_free(vm);
+
+	return 0;
+}
-- 
2.25.4



* Re: [PATCH RFC 2/5] KVM: x86: introduce KVM_MEM_ALLONES memory
  2020-05-14 18:05 ` [PATCH RFC 2/5] KVM: x86: introduce KVM_MEM_ALLONES memory Vitaly Kuznetsov
@ 2020-05-14 19:18   ` Sean Christopherson
  2020-05-15  8:24     ` Vitaly Kuznetsov
  0 siblings, 1 reply; 18+ messages in thread
From: Sean Christopherson @ 2020-05-14 19:18 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: kvm, linux-kernel, Michael Tsirkin, Julia Suvorova,
	Paolo Bonzini, Wanpeng Li, Jim Mattson, x86

On Thu, May 14, 2020 at 08:05:37PM +0200, Vitaly Kuznetsov wrote:
> PCIe config space can (depending on the configuration) be quite big but
> is usually sparsely populated. A guest may scan it by accessing each
> device's page which, when the device is missing, is supposed to have 'PCI
> hole' semantics: reads return '0xff' and writes get discarded. Currently,
> userspace has to allocate real memory for these holes and fill it with
> '0xff'. Moreover, each VM requires its own separate copy of this memory.
> 
> The idea behind the feature introduced by this patch is: let's have a
> single read-only page filled with '0xff' in KVM and map it into all such
> PCI holes in all VMs. This frees userspace from the obligation to allocate
> real memory. Later, this will also allow us to speed up access to these
> holes, as we can aggressively map the whole slot upon the first fault.
> 
> Suggested-by: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> ---
>  Documentation/virt/kvm/api.rst  | 22 ++++++---
>  arch/x86/include/uapi/asm/kvm.h |  1 +
>  arch/x86/kvm/x86.c              |  9 ++--
>  include/linux/kvm_host.h        | 15 ++++++-
>  include/uapi/linux/kvm.h        |  2 +
>  virt/kvm/kvm_main.c             | 79 +++++++++++++++++++++++++++++++--
>  6 files changed, 113 insertions(+), 15 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index d871dacb984e..2b87d588a7e0 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -1236,7 +1236,8 @@ yet and must be cleared on entry.
>  
>    /* for kvm_memory_region::flags */
>    #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
> -  #define KVM_MEM_READONLY	(1UL << 1)
> +  #define KVM_MEM_READONLY		(1UL << 1)
> +  #define KVM_MEM_ALLONES		(1UL << 2)

Why not call this KVM_MEM_PCI_HOLE or something else that better conveys
that this is memslot is intended to emulate PCI master abort semantics?


* Re: [PATCH RFC 4/5] KVM: x86: aggressively map PTEs in KVM_MEM_ALLONES slots
  2020-05-14 18:05 ` [PATCH RFC 4/5] KVM: x86: aggressively map PTEs in KVM_MEM_ALLONES slots Vitaly Kuznetsov
@ 2020-05-14 19:46   ` Sean Christopherson
  2020-05-15  8:36     ` Vitaly Kuznetsov
  0 siblings, 1 reply; 18+ messages in thread
From: Sean Christopherson @ 2020-05-14 19:46 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: kvm, linux-kernel, Michael Tsirkin, Julia Suvorova,
	Paolo Bonzini, Wanpeng Li, Jim Mattson, x86

On Thu, May 14, 2020 at 08:05:39PM +0200, Vitaly Kuznetsov wrote:
> All PTEs in KVM_MEM_ALLONES slots point to the same read-only page
> in KVM so instead of mapping each page upon first access we can map
> everything aggressively.
> 
> Suggested-by: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> ---
>  arch/x86/kvm/mmu/mmu.c         | 20 ++++++++++++++++++--
>  arch/x86/kvm/mmu/paging_tmpl.h | 23 +++++++++++++++++++++--
>  2 files changed, 39 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 3db499df2dfc..e92ca9ed3ff5 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4154,8 +4154,24 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>  		goto out_unlock;
>  	if (make_mmu_pages_available(vcpu) < 0)
>  		goto out_unlock;
> -	r = __direct_map(vcpu, gpa, write, map_writable, max_level, pfn,
> -			 prefault, is_tdp && lpage_disallowed);
> +
> +	if (likely(!(slot->flags & KVM_MEM_ALLONES) || write)) {

The 'write' check is wrong.  More specifically, patch 2/5 is missing code
to add KVM_MEM_ALLONES to memslot_is_readonly().  If we end up going with
an actual kvm_allones_pg backing, writes to an ALLONES memslot should be
handled the same as writes to RO memslots; MMIO occurs but no MMIO spte is
created.

> +		r = __direct_map(vcpu, gpa, write, map_writable, max_level, pfn,
> +				 prefault, is_tdp && lpage_disallowed);
> +	} else {
> +		/*
> +		 * KVM_MEM_ALLONES are 4k only slots fully mapped to the same
> +		 * readonly 'allones' page, map all PTEs aggressively here.
> +		 */
> +		for (gfn = slot->base_gfn; gfn < slot->base_gfn + slot->npages;
> +		     gfn++) {
> +			r = __direct_map(vcpu, gfn << PAGE_SHIFT, write,
> +					 map_writable, max_level, pfn, prefault,
> +					 is_tdp && lpage_disallowed);

IMO this is a waste of memory and TLB entries.  Why not treat the access as
the MMIO it is and emulate the access with a 0xff return value?  I think
it'd be a simple change to have __kvm_read_guest_page() stuff 0xff, i.e. a
kvm_allones_pg wouldn't be needed.  I would even vote to never create an
MMIO SPTE.  The guest has bigger issues if reading from a PCI hole is
performance sensitive.
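
(For illustration, a minimal sketch of this alternative, assuming the
KVM_MEM_ALLONES flag from patch 2; this is illustrative only, not code
from the series:)

  static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
                                   void *data, int offset, int len)
  {
          int r;
          unsigned long addr;

          /* Satisfy reads from ALLONES slots without any backing page. */
          if (unlikely(slot && (slot->flags & KVM_MEM_ALLONES))) {
                  memset(data, 0xff, len);
                  return 0;
          }

          addr = gfn_to_hva_memslot_prot(slot, gfn, NULL);
          if (kvm_is_error_hva(addr))
                  return -EFAULT;
          r = __copy_from_user(data, (void __user *)addr + offset, len);
          if (r)
                  return -EFAULT;
          return 0;
  }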

Regarding memory, looping wantonly on __direct_map() will eventually trigger
the BUG_ON() in mmu_memory_cache_alloc().  mmu_topup_memory_caches() only
ensures there are enough objects available to map a single translation, i.e.
one entry per level, sans the root[*].

[*] The gorilla math in mmu_topup_memory_caches() is horrendously misleading,
    e.g. the '8' pages is really 2*(ROOT_LEVEL - 1), but the 2x part has been
    obsolete for the better part of a decade, and the '- 1' wasn't actually
    originally intended or needed, but is now required because of 5-level
    paging.  I have the beginning of a series to clean up that mess; it was
    low on my todo list because I didn't expect anyone to be mucking with
    related code :-)

> +			if (r)
> +				break;
> +		}
> +	}


* Re: [PATCH RFC 0/5] KVM: x86: KVM_MEM_ALLONES memory
  2020-05-14 18:05 [PATCH RFC 0/5] KVM: x86: KVM_MEM_ALLONES memory Vitaly Kuznetsov
                   ` (4 preceding siblings ...)
  2020-05-14 18:05 ` [PATCH RFC 5/5] KVM: selftests: add KVM_MEM_ALLONES test Vitaly Kuznetsov
@ 2020-05-14 22:05 ` Peter Xu
  2020-05-14 22:56   ` Sean Christopherson
  5 siblings, 1 reply; 18+ messages in thread
From: Peter Xu @ 2020-05-14 22:05 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: kvm, linux-kernel, Michael Tsirkin, Julia Suvorova,
	Paolo Bonzini, Sean Christopherson, Wanpeng Li, Jim Mattson, x86

On Thu, May 14, 2020 at 08:05:35PM +0200, Vitaly Kuznetsov wrote:
> The idea of the patchset was suggested by Michael S. Tsirkin.
> 
> PCIe config space can (depending on the configuration) be quite big but
> is usually sparsely populated. A guest may scan it by accessing each
> device's page which, when the device is missing, is supposed to have 'PCI
> hole' semantics: reads return '0xff' and writes get discarded. Currently,
> userspace has to allocate real memory for these holes and fill it with
> '0xff'. Moreover, each VM requires its own separate copy of this memory.
> 
> The idea behind the feature introduced by this patchset is: let's have a
> single read-only page filled with '0xff' in KVM and map it into all such
> PCI holes in all VMs. This frees userspace from the obligation to allocate
> real memory and also allows us to speed up access to these holes, as we
> can aggressively map the whole slot upon the first fault.
> 
> RFC. I've only tested the feature with the selftest (PATCH5) on Intel/AMD
> with and without EPT/NPT. I haven't tested memslot modifications yet.
> 
> Patches are against kvm/next.

Hi, Vitaly,

Could this be done in userspace with existing techniques?

E.g., shm_open() with a handle and fill one 0xff page, then remap it to
anywhere needed in QEMU?
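
(An illustrative sketch of that userspace approach; the name and error
handling are made up:)

  #include <fcntl.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <unistd.h>

  /* Map one shared 0xff-filled page read-only over every page of a hole. */
  static void map_pci_hole(void *hva, size_t size)
  {
          int fd = shm_open("/ones-page", O_CREAT | O_RDWR, 0600);
          size_t off;
          void *p;

          ftruncate(fd, 4096);
          p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
          memset(p, 0xff, 4096);
          munmap(p, 4096);

          /* Every mapping reuses file offset 0, i.e. the same physical page. */
          for (off = 0; off < size; off += 4096)
                  mmap((char *)hva + off, 4096, PROT_READ,
                       MAP_SHARED | MAP_FIXED, fd, 0);
  }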

Thanks,

-- 
Peter Xu



* Re: [PATCH RFC 0/5] KVM: x86: KVM_MEM_ALLONES memory
  2020-05-14 22:05 ` [PATCH RFC 0/5] KVM: x86: KVM_MEM_ALLONES memory Peter Xu
@ 2020-05-14 22:56   ` Sean Christopherson
  2020-05-14 23:22     ` Peter Xu
  2020-05-15  1:03     ` Andy Lutomirski
  0 siblings, 2 replies; 18+ messages in thread
From: Sean Christopherson @ 2020-05-14 22:56 UTC (permalink / raw)
  To: Peter Xu
  Cc: Vitaly Kuznetsov, kvm, linux-kernel, Michael Tsirkin,
	Julia Suvorova, Paolo Bonzini, Wanpeng Li, Jim Mattson, x86

On Thu, May 14, 2020 at 06:05:16PM -0400, Peter Xu wrote:
> On Thu, May 14, 2020 at 08:05:35PM +0200, Vitaly Kuznetsov wrote:
> > The idea of the patchset was suggested by Michael S. Tsirkin.
> > 
> > PCIe config space can (depending on the configuration) be quite big but
> > is usually sparsely populated. A guest may scan it by accessing each
> > device's page which, when the device is missing, is supposed to have 'PCI
> > hole' semantics: reads return '0xff' and writes get discarded. Currently,
> > userspace has to allocate real memory for these holes and fill it with
> > '0xff'. Moreover, each VM requires its own separate copy of this memory.
> > 
> > The idea behind the feature introduced by this patchset is: let's have a
> > single read-only page filled with '0xff' in KVM and map it into all such
> > PCI holes in all VMs. This frees userspace from the obligation to allocate
> > real memory and also allows us to speed up access to these holes, as we
> > can aggressively map the whole slot upon the first fault.
> > 
> > RFC. I've only tested the feature with the selftest (PATCH5) on Intel/AMD
> > with and without EPT/NPT. I haven't tested memslot modifications yet.
> > 
> > Patches are against kvm/next.
> 
> Hi, Vitaly,
> 
> Could this be done in userspace with existing techniques?
> 
> E.g., shm_open() with a handle and fill one 0xff page, then remap it to
> anywhere needed in QEMU?

Mapping that 4k page over and over is going to get expensive, e.g. each
duplicate will need a VMA and a memslot, plus any PTE overhead.  If the
total sum of the holes is >2MB it'll even overflow the number of allowed
memslots.


* Re: [PATCH RFC 0/5] KVM: x86: KVM_MEM_ALLONES memory
  2020-05-14 22:56   ` Sean Christopherson
@ 2020-05-14 23:22     ` Peter Xu
  2020-05-14 23:32       ` Sean Christopherson
  2020-05-15  1:03     ` Andy Lutomirski
  1 sibling, 1 reply; 18+ messages in thread
From: Peter Xu @ 2020-05-14 23:22 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Vitaly Kuznetsov, kvm, linux-kernel, Michael Tsirkin,
	Julia Suvorova, Paolo Bonzini, Wanpeng Li, Jim Mattson, x86

On Thu, May 14, 2020 at 03:56:24PM -0700, Sean Christopherson wrote:
> On Thu, May 14, 2020 at 06:05:16PM -0400, Peter Xu wrote:
> > On Thu, May 14, 2020 at 08:05:35PM +0200, Vitaly Kuznetsov wrote:
> > > The idea of the patchset was suggested by Michael S. Tsirkin.
> > > 
> > > PCIe config space can (depending on the configuration) be quite big but
> > > is usually sparsely populated. A guest may scan it by accessing each
> > > device's page which, when the device is missing, is supposed to have 'PCI
> > > hole' semantics: reads return '0xff' and writes get discarded. Currently,
> > > userspace has to allocate real memory for these holes and fill it with
> > > '0xff'. Moreover, each VM requires its own separate copy of this memory.
> > > 
> > > The idea behind the feature introduced by this patchset is: let's have a
> > > single read-only page filled with '0xff' in KVM and map it into all such
> > > PCI holes in all VMs. This frees userspace from the obligation to allocate
> > > real memory and also allows us to speed up access to these holes, as we
> > > can aggressively map the whole slot upon the first fault.
> > > 
> > > RFC. I've only tested the feature with the selftest (PATCH5) on Intel/AMD
> > > with and without EPT/NPT. I haven't tested memslot modifications yet.
> > > 
> > > Patches are against kvm/next.
> > 
> > Hi, Vitaly,
> > 
> > Could this be done in userspace with existing techniques?
> > 
> > E.g., shm_open() with a handle and fill one 0xff page, then remap it to
> > anywhere needed in QEMU?
> 
> Mapping that 4k page over and over is going to get expensive, e.g. each
> duplicate will need a VMA and a memslot, plus any PTE overhead.  If the
> total sum of the holes is >2MB it'll even overflow the number of allowed
> memslots.

What's the PTE overhead you mentioned?  We need to fill PTEs one by one on
fault even if the page is allocated in the kernel, am I right?

4K is only an example - we can also use more pages as the template.  However,
I guess the KVM memslot count could be a limit...  Could I ask what's the
normal size of this 0xff region, and its distribution?

Thanks,

-- 
Peter Xu



* Re: [PATCH RFC 0/5] KVM: x86: KVM_MEM_ALLONES memory
  2020-05-14 23:22     ` Peter Xu
@ 2020-05-14 23:32       ` Sean Christopherson
  2020-05-15  8:42         ` Vitaly Kuznetsov
  0 siblings, 1 reply; 18+ messages in thread
From: Sean Christopherson @ 2020-05-14 23:32 UTC (permalink / raw)
  To: Peter Xu
  Cc: Vitaly Kuznetsov, kvm, linux-kernel, Michael Tsirkin,
	Julia Suvorova, Paolo Bonzini, Wanpeng Li, Jim Mattson, x86

On Thu, May 14, 2020 at 07:22:50PM -0400, Peter Xu wrote:
> On Thu, May 14, 2020 at 03:56:24PM -0700, Sean Christopherson wrote:
> > On Thu, May 14, 2020 at 06:05:16PM -0400, Peter Xu wrote:
> > > E.g., shm_open() with a handle and fill one 0xff page, then remap it to
> > > anywhere needed in QEMU?
> > 
> > Mapping that 4k page over and over is going to get expensive, e.g. each
> > duplicate will need a VMA and a memslot, plus any PTE overhead.  If the
> > total sum of the holes is >2MB it'll even overflow the number of allowed
> > memslots.
> 
> What's the PTE overhead you mentioned?  We need to fill PTEs one by one on
> fault even if the page is allocated in the kernel, am I right?

It won't require host PTEs for every page if it's a kernel page.  I doubt
PTEs are a significant overhead, especially compared to memslots, but it's
still worth considering.

My thought was to skimp on both host PTEs _and_ KVM SPTEs by always sending
the PCI hole accesses down the slow MMIO path[*].

[*] https://lkml.kernel.org/r/20200514194624.GB15847@linux.intel.com

> 4K is only an example - we can also use more pages as the template.  However I
> guess the kvm memslot count could be a limit..  Could I ask what's the normal
> size of this 0xff region, and its distribution?
> 
> Thanks,
> 
> -- 
> Peter Xu
> 


* Re: [PATCH RFC 0/5] KVM: x86: KVM_MEM_ALLONES memory
  2020-05-14 22:56   ` Sean Christopherson
  2020-05-14 23:22     ` Peter Xu
@ 2020-05-15  1:03     ` Andy Lutomirski
  2020-05-15 11:15       ` Peter Xu
  1 sibling, 1 reply; 18+ messages in thread
From: Andy Lutomirski @ 2020-05-15  1:03 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Peter Xu, Vitaly Kuznetsov, kvm list, LKML, Michael Tsirkin,
	Julia Suvorova, Paolo Bonzini, Wanpeng Li, Jim Mattson, X86 ML

On Thu, May 14, 2020 at 3:56 PM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Thu, May 14, 2020 at 06:05:16PM -0400, Peter Xu wrote:
> > On Thu, May 14, 2020 at 08:05:35PM +0200, Vitaly Kuznetsov wrote:
> > > The idea of the patchset was suggested by Michael S. Tsirkin.
> > >
> > > PCIe config space can (depending on the configuration) be quite big but
> > > is usually sparsely populated. A guest may scan it by accessing each
> > > device's page which, when the device is missing, is supposed to have 'PCI
> > > hole' semantics: reads return '0xff' and writes get discarded. Currently,
> > > userspace has to allocate real memory for these holes and fill it with
> > > '0xff'. Moreover, each VM requires its own separate copy of this memory.
> > >
> > > The idea behind the feature introduced by this patchset is: let's have a
> > > single read-only page filled with '0xff' in KVM and map it into all such
> > > PCI holes in all VMs. This frees userspace from the obligation to allocate
> > > real memory and also allows us to speed up access to these holes, as we
> > > can aggressively map the whole slot upon the first fault.
> > >
> > > RFC. I've only tested the feature with the selftest (PATCH5) on Intel/AMD
> > > with and without EPT/NPT. I haven't tested memslot modifications yet.
> > >
> > > Patches are against kvm/next.
> >
> > Hi, Vitaly,
> >
> > Could this be done in userspace with existing techniques?
> >
> > E.g., shm_open() with a handle and fill one 0xff page, then remap it to
> > anywhere needed in QEMU?
>
> Mapping that 4k page over and over is going to get expensive, e.g. each
> duplicate will need a VMA and a memslot, plus any PTE overhead.  If the
> total sum of the holes is >2MB it'll even overflow the number of allowed
> memslots.

How about a tiny character device driver /dev/ones?
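
(A sketch of what such a driver could look like; hypothetical, no such
device exists today:)

  #include <linux/fs.h>
  #include <linux/miscdevice.h>
  #include <linux/mm.h>
  #include <linux/module.h>

  /* Hypothetical /dev/ones: read-only mappings backed by one 0xff page. */
  static struct page *ones_page;

  static vm_fault_t ones_fault(struct vm_fault *vmf)
  {
          get_page(ones_page);
          vmf->page = ones_page;
          return 0;
  }

  static const struct vm_operations_struct ones_vm_ops = {
          .fault = ones_fault,
  };

  static int ones_mmap(struct file *file, struct vm_area_struct *vma)
  {
          if (vma->vm_flags & VM_WRITE)
                  return -EPERM;
          vma->vm_ops = &ones_vm_ops;
          return 0;
  }

  static const struct file_operations ones_fops = {
          .owner = THIS_MODULE,
          .mmap  = ones_mmap,
  };

  static struct miscdevice ones_dev = {
          .minor = MISC_DYNAMIC_MINOR,
          .name  = "ones",
          .fops  = &ones_fops,
  };

  static int __init ones_init(void)
  {
          ones_page = alloc_page(GFP_KERNEL);
          if (!ones_page)
                  return -ENOMEM;
          memset(page_address(ones_page), 0xff, PAGE_SIZE);
          return misc_register(&ones_dev);
  }
  module_init(ones_init);

Userspace could then mmap() it read-only behind every hole, though each
hole would still consume a memslot.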


* Re: [PATCH RFC 2/5] KVM: x86: introduce KVM_MEM_ALLONES memory
  2020-05-14 19:18   ` Sean Christopherson
@ 2020-05-15  8:24     ` Vitaly Kuznetsov
  0 siblings, 0 replies; 18+ messages in thread
From: Vitaly Kuznetsov @ 2020-05-15  8:24 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, Michael Tsirkin, Julia Suvorova,
	Paolo Bonzini, Wanpeng Li, Jim Mattson, x86

Sean Christopherson <sean.j.christopherson@intel.com> writes:

> On Thu, May 14, 2020 at 08:05:37PM +0200, Vitaly Kuznetsov wrote:
>> PCIe config space can (depending on the configuration) be quite big but
>> is usually sparsely populated. A guest may scan it by accessing each
>> device's page which, when the device is missing, is supposed to have 'PCI
>> hole' semantics: reads return '0xff' and writes get discarded. Currently,
>> userspace has to allocate real memory for these holes and fill it with
>> '0xff'. Moreover, each VM requires its own separate copy of this memory.
>> 
>> The idea behind the feature introduced by this patch is: let's have a
>> single read-only page filled with '0xff' in KVM and map it into all such
>> PCI holes in all VMs. This frees userspace from the obligation to allocate
>> real memory. Later, this will also allow us to speed up access to these
>> holes, as we can aggressively map the whole slot upon the first fault.
>> 
>> Suggested-by: Michael S. Tsirkin <mst@redhat.com>
>> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
>> ---
>>  Documentation/virt/kvm/api.rst  | 22 ++++++---
>>  arch/x86/include/uapi/asm/kvm.h |  1 +
>>  arch/x86/kvm/x86.c              |  9 ++--
>>  include/linux/kvm_host.h        | 15 ++++++-
>>  include/uapi/linux/kvm.h        |  2 +
>>  virt/kvm/kvm_main.c             | 79 +++++++++++++++++++++++++++++++--
>>  6 files changed, 113 insertions(+), 15 deletions(-)
>> 
>> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
>> index d871dacb984e..2b87d588a7e0 100644
>> --- a/Documentation/virt/kvm/api.rst
>> +++ b/Documentation/virt/kvm/api.rst
>> @@ -1236,7 +1236,8 @@ yet and must be cleared on entry.
>>  
>>    /* for kvm_memory_region::flags */
>>    #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
>> -  #define KVM_MEM_READONLY	(1UL << 1)
>> +  #define KVM_MEM_READONLY		(1UL << 1)
>> +  #define KVM_MEM_ALLONES		(1UL << 2)
>
> Why not call this KVM_MEM_PCI_HOLE or something else that better conveys
> that this is memslot is intended to emulate PCI master abort semantics?
>

Because there's always hope this can be useful for something other than
PCI? :-) Actually, I was thinking about generalizing this a little bit
to something like KVM_MEM_CONSTANT with a way to set the pattern, but I'm
failing to see any need for anything but all-ones or all-zeroes. Maybe
other-than-x86 architectures have some needs?
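
(Purely hypothetical sketch of what such a generalized uapi could look
like; nothing like this exists:)

  /* Hypothetical KVM_MEM_CONSTANT: a slot filled with a per-slot byte. */
  #define KVM_MEM_CONSTANT	(1UL << 2)

  struct kvm_constant_memory_region {
          struct kvm_userspace_memory_region region; /* userspace_addr == 0 */
          __u8 pattern;   /* 0xff for PCI holes, 0x00 for zeroed slots */
  };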

I'm definitely fine with renaming this to KVM_MEM_PCI_HOLE.

-- 
Vitaly



* Re: [PATCH RFC 4/5] KVM: x86: aggressively map PTEs in KVM_MEM_ALLONES slots
  2020-05-14 19:46   ` Sean Christopherson
@ 2020-05-15  8:36     ` Vitaly Kuznetsov
  2020-05-15 13:58       ` Sean Christopherson
  0 siblings, 1 reply; 18+ messages in thread
From: Vitaly Kuznetsov @ 2020-05-15  8:36 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, Michael Tsirkin, Julia Suvorova,
	Paolo Bonzini, Wanpeng Li, Jim Mattson, x86

Sean Christopherson <sean.j.christopherson@intel.com> writes:

> On Thu, May 14, 2020 at 08:05:39PM +0200, Vitaly Kuznetsov wrote:
>> All PTEs in KVM_MEM_ALLONES slots point to the same read-only page
>> in KVM so instead of mapping each page upon first access we can map
>> everything aggressively.
>> 
>> Suggested-by: Michael S. Tsirkin <mst@redhat.com>
>> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
>> ---
>>  arch/x86/kvm/mmu/mmu.c         | 20 ++++++++++++++++++--
>>  arch/x86/kvm/mmu/paging_tmpl.h | 23 +++++++++++++++++++++--
>>  2 files changed, 39 insertions(+), 4 deletions(-)
>> 
>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> index 3db499df2dfc..e92ca9ed3ff5 100644
>> --- a/arch/x86/kvm/mmu/mmu.c
>> +++ b/arch/x86/kvm/mmu/mmu.c
>> @@ -4154,8 +4154,24 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>>  		goto out_unlock;
>>  	if (make_mmu_pages_available(vcpu) < 0)
>>  		goto out_unlock;
>> -	r = __direct_map(vcpu, gpa, write, map_writable, max_level, pfn,
>> -			 prefault, is_tdp && lpage_disallowed);
>> +
>> +	if (likely(!(slot->flags & KVM_MEM_ALLONES) || write)) {
>
> The 'write' check is wrong.  More specifically, patch 2/5 is missing code
> to add KVM_MEM_ALLONES to memslot_is_readonly().  If we end up going with
> an actual kvm_allones_pg backing, writes to an ALLONES memslot should be
> handled the same as writes to RO memslots; MMIO occurs but no MMIO spte is
> created.
>

Missed that, thanks!

>> +		r = __direct_map(vcpu, gpa, write, map_writable, max_level, pfn,
>> +				 prefault, is_tdp && lpage_disallowed);
>> +	} else {
>> +		/*
>> +		 * KVM_MEM_ALLONES slots are 4k-only and fully mapped to the
>> +		 * same read-only 'allones' page, so map all PTEs aggressively here.
>> +		 */
>> +		for (gfn = slot->base_gfn; gfn < slot->base_gfn + slot->npages;
>> +		     gfn++) {
>> +			r = __direct_map(vcpu, gfn << PAGE_SHIFT, write,
>> +					 map_writable, max_level, pfn, prefault,
>> +					 is_tdp && lpage_disallowed);
>
> IMO this is a waste of memory and TLB entries.  Why not treat the access as
> the MMIO it is and emulate the access with a 0xff return value?  I think
> it'd be a simple change to have __kvm_read_guest_page() stuff 0xff, i.e. a
> kvm_allones_pg wouldn't be needed.  I would even vote to never create an
> MMIO SPTE.  The guest has bigger issues if reading from a PCI hole is
> performance sensitive.

You're trying to defeat the sole purpose of the feature :-) I also saw
the option you suggest but Michael convinced me we should go further.
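
(If we went that way, I understand it'd be roughly the below in
__kvm_read_guest_page() -- hypothetical sketch, untested:)

static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
				 void *data, int offset, int len)
{
	int r;
	unsigned long addr;

	/* Hypothetical: satisfy ALLONES reads with 0xff directly, no
	 * backing page and no MMIO SPTE needed. */
	if (unlikely(slot && (slot->flags & KVM_MEM_ALLONES))) {
		memset(data, 0xff, len);
		return 0;
	}

	addr = gfn_to_hva_memslot_prot(slot, gfn, NULL);
	if (kvm_is_error_hva(addr))
		return -EFAULT;
	r = __copy_from_user(data, (void __user *)addr + offset, len);
	if (r)
		return -EFAULT;
	return 0;
}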

The idea (besides memory waste) was that the time we spend on PCI scan
during boot is significant. Unfortunately, I don't have any numbers but
we can certainly try to get them. With this feature (AFAIU) we're not
aiming at 'classic' long-lived VMs but rather at something like Kata
containers/FaaS/... where boot time is crucial.

>
> Regarding memory, looping wantonly on __direct_map() will eventually trigger
> the BUG_ON() in mmu_memory_cache_alloc().  mmu_topup_memory_caches() only
> ensures there are enough objects available to map a single translation, i.e.
> one entry per level, sans the root[*].
>
> [*] The gorilla math in mmu_topup_memory_caches() is horrendously misleading,
>     e.g. the '8' pages is really 2*(ROOT_LEVEL - 1), but the 2x part has been
>     obsolete for the better part of a decade, and the '- 1' wasn't actually
>     originally intended or needed, but is now required because of 5-level
>     paging.  I have the beginning of a series to clean up that mess; it was
>     low on my todo list because I didn't expect anyone to be mucking with
>     related code :-)

I missed that too, but oh well, this is the famous KVM MMU, I shouldn't
feel that bad about it :-) Thanks for your review!
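
(I guess the loop would have to re-top-up the caches between
iterations, roughly like the below -- untested sketch; and since
mmu_topup_memory_caches() can sleep, the real thing would also need to
drop mmu_lock around the top-up and re-check mmu_notifier_retry() after
re-taking it:)

	for (gfn = slot->base_gfn; gfn < slot->base_gfn + slot->npages;
	     gfn++) {
		r = __direct_map(vcpu, gfn << PAGE_SHIFT, write,
				 map_writable, max_level, pfn, prefault,
				 is_tdp && lpage_disallowed);
		if (r)
			break;
		/* Refill the caches so __direct_map() never hits the
		 * BUG_ON() in mmu_memory_cache_alloc(). */
		r = mmu_topup_memory_caches(vcpu);
		if (r)
			break;
	}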

>
>> +			if (r)
>> +				break;
>> +		}
>> +	}
>

-- 
Vitaly


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 0/5] KVM: x86: KVM_MEM_ALLONES memory
  2020-05-14 23:32       ` Sean Christopherson
@ 2020-05-15  8:42         ` Vitaly Kuznetsov
  0 siblings, 0 replies; 18+ messages in thread
From: Vitaly Kuznetsov @ 2020-05-15  8:42 UTC (permalink / raw)
  To: Sean Christopherson, Peter Xu
  Cc: kvm, linux-kernel, Michael Tsirkin, Julia Suvorova,
	Paolo Bonzini, Wanpeng Li, Jim Mattson, x86

Sean Christopherson <sean.j.christopherson@intel.com> writes:

> On Thu, May 14, 2020 at 07:22:50PM -0400, Peter Xu wrote:
>> On Thu, May 14, 2020 at 03:56:24PM -0700, Sean Christopherson wrote:
>> > On Thu, May 14, 2020 at 06:05:16PM -0400, Peter Xu wrote:
>> > > E.g., shm_open() with a handle and fill one 0xff page, then remap it to
>> > > anywhere needed in QEMU?
>> > 
>> > Mapping that 4k page over and over is going to get expensive, e.g. each
>> > duplicate will need a VMA and a memslot, plus any PTE overhead.  If the
>> > total sum of the holes is >2mb it'll even overflow the number of allowed
>> > memslots.
>> 
>> What's the PTE overhead you mentioned?  We need to fill PTEs one by one on
>> fault even if the page is allocated in the kernel, am I right?
>
> It won't require host PTEs for every page if it's a kernel page.  I doubt
> PTEs are a significant overhead, especially compared to memslots, but it's
> still worth considering.
>
> My thought was to skimp on both host PTEs _and_ KVM SPTEs by always sending
> the PCI hole accesses down the slow MMIO path[*].
>
> [*] https://lkml.kernel.org/r/20200514194624.GB15847@linux.intel.com
>

If we drop the 'aggressive' patch from this patchset we can probably
get away with KVM_MEM_READONLY and userspace VMAs, but that will only
help us save some memory; it won't speed things up.
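
(i.e. something like the below on the userspace side -- hypothetical
sketch; note every 4k mapping is a separate VMA since they all share
file offset 0 and thus can't merge, which is exactly Sean's objection:)

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create one shared page filled with 0xff. */
static int make_ones_fd(void)
{
	int fd = shm_open("/vm-ones", O_RDWR | O_CREAT, 0600);
	void *p;

	if (fd < 0 || ftruncate(fd, 4096))
		return -1;
	p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return -1;
	memset(p, 0xff, 4096);
	munmap(p, 4096);
	return fd;
}

/* Map it read-only over every page of a PCI hole. */
static int map_pci_hole(int ones_fd, char *hva, size_t len)
{
	size_t off;

	for (off = 0; off < len; off += 4096)
		if (mmap(hva + off, 4096, PROT_READ,
			 MAP_SHARED | MAP_FIXED, ones_fd, 0) == MAP_FAILED)
			return -1;
	return 0;
}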

>> 4K is only an example - we can also use more pages as the template. However,
>> I guess the kvm memslot count could be a limit... Could I ask what's the normal
>> size of this 0xff region, and its distribution?

Julia/Michael, could you please provide some 'normal' configuration for
a Q35 machine and its PCIe config space?

-- 
Vitaly


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 0/5] KVM: x86: KVM_MEM_ALLONES memory
  2020-05-15  1:03     ` Andy Lutomirski
@ 2020-05-15 11:15       ` Peter Xu
  0 siblings, 0 replies; 18+ messages in thread
From: Peter Xu @ 2020-05-15 11:15 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Sean Christopherson, Vitaly Kuznetsov, kvm list, LKML,
	Michael Tsirkin, Julia Suvorova, Paolo Bonzini, Wanpeng Li,
	Jim Mattson, X86 ML

On Thu, May 14, 2020 at 06:03:20PM -0700, Andy Lutomirski wrote:
> On Thu, May 14, 2020 at 3:56 PM Sean Christopherson
> <sean.j.christopherson@intel.com> wrote:
> >
> > On Thu, May 14, 2020 at 06:05:16PM -0400, Peter Xu wrote:
> > > On Thu, May 14, 2020 at 08:05:35PM +0200, Vitaly Kuznetsov wrote:
> > > > The idea of the patchset was suggested by Michael S. Tsirkin.
> > > >
> > > > PCIe config space can (depending on the configuration) be quite big but
> > > > usually is sparsely populated. Guest may scan it by accessing individual
> > > > device's page which, when device is missing, is supposed to have 'pci
> > > > holes' semantics: reads return '0xff' and writes get discarded. Currently,
> > > > userspace has to allocate real memory for these holes and fill them with
> > > > '0xff'. Moreover, different VMs usually require different memory.
> > > >
> > > > The idea behind the feature introduced by this patch is: let's have a
> > > > single read-only page filled with '0xff' in KVM and map it to all such
> > > > PCI holes in all VMs. This will free userspace of obligation to allocate
> > > > real memory and also allow us to speed up access to these holes as we
> > > > can aggressively map the whole slot upon first fault.
> > > >
> > > > RFC. I've only tested the feature with the selftest (PATCH5) on Intel/AMD
> > > > with and without EPT/NPT. I haven't tested memslot modifications yet.
> > > >
> > > > Patches are against kvm/next.
> > >
> > > Hi, Vitaly,
> > >
> > > Could this be done in userspace with existing techniques?
> > >
> > > E.g., shm_open() with a handle and fill one 0xff page, then remap it to
> > > anywhere needed in QEMU?
> >
> > Mapping that 4k page over and over is going to get expensive, e.g. each
> > duplicate will need a VMA and a memslot, plus any PTE overhead.  If the
> > total sum of the holes is >2mb it'll even overflow the number of allowed
> > memslots.
> 
> How about a tiny character device driver /dev/ones?

Yeah, this looks very clean.
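
Something like this, I imagine (completely hypothetical sketch of a
minimal misc driver whose every mapping faults in the same read-only
0xff page):

#include <linux/module.h>
#include <linux/miscdevice.h>
#include <linux/mm.h>
#include <linux/fs.h>

static struct page *ones_page;

static vm_fault_t ones_fault(struct vm_fault *vmf)
{
	/* Every offset maps the same all-ones page. */
	get_page(ones_page);
	vmf->page = ones_page;
	return 0;
}

static const struct vm_operations_struct ones_vm_ops = {
	.fault = ones_fault,
};

static int ones_mmap(struct file *file, struct vm_area_struct *vma)
{
	if (vma->vm_flags & VM_WRITE)
		return -EPERM;
	vma->vm_ops = &ones_vm_ops;
	return 0;
}

static const struct file_operations ones_fops = {
	.owner = THIS_MODULE,
	.mmap  = ones_mmap,
};

static struct miscdevice ones_dev = {
	.minor = MISC_DYNAMIC_MINOR,
	.name  = "ones",
	.fops  = &ones_fops,
};

static int __init ones_init(void)
{
	ones_page = alloc_page(GFP_KERNEL);
	if (!ones_page)
		return -ENOMEM;
	memset(page_address(ones_page), 0xff, PAGE_SIZE);
	return misc_register(&ones_dev);
}
module_init(ones_init);

static void __exit ones_exit(void)
{
	misc_deregister(&ones_dev);
	__free_page(ones_page);
}
module_exit(ones_exit);

MODULE_LICENSE("GPL");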

Alternatively, I also like Sean's idea of using the slow path - I think
the answer could depend on a better understanding of the problem to
solve (PCI scan for small VM boots), to first justify that a fast path
is required at all.  E.g., could we even work around the inefficient
reading of 0xff's for our use case?  After all, what the BIOS really
needs is not those 0xff's, but some other facts.

Thanks!

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 4/5] KVM: x86: aggressively map PTEs in KVM_MEM_ALLONES slots
  2020-05-15  8:36     ` Vitaly Kuznetsov
@ 2020-05-15 13:58       ` Sean Christopherson
  0 siblings, 0 replies; 18+ messages in thread
From: Sean Christopherson @ 2020-05-15 13:58 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: kvm, linux-kernel, Michael Tsirkin, Julia Suvorova,
	Paolo Bonzini, Wanpeng Li, Jim Mattson, x86

On Fri, May 15, 2020 at 10:36:19AM +0200, Vitaly Kuznetsov wrote:
> Sean Christopherson <sean.j.christopherson@intel.com> writes:
> > IMO this is a waste of memory and TLB entries.  Why not treat the access as
> > the MMIO it is and emulate the access with a 0xff return value?  I think
> > it'd be a simple change to have __kvm_read_guest_page() stuff 0xff, i.e. a
> > kvm_allones_pg wouldn't be needed.  I would even vote to never create an
> > MMIO SPTE.  The guest has bigger issues if reading from a PCI hole is
> > performance sensitive.
> 
> You're trying to defeat the sole purpose of the feature :-) I also saw
> the option you suggest but Michael convinced me we should go further.
> 
> The idea (besides memory waste) was that the time we spend on PCI scan
> during boot is significant.

Put that in the cover letter.  The impression I got from the current cover
letter is that the focus was entirely on memory consumption.

> Unfortunately, I don't have any numbers but we can certainly try to get
> them.

Numbers are definitely required, otherwise we'll have no idea whether doing
something like the aggressive prefetch actually has a meaningful impact.

> With this feature (AFAIU) we're not aiming at 'classic' long-lived VMs but
> rather at something like Kata containers/FaaS/... where boot time is crucial.

Isn't the guest kernel fully controlled by the VMM in those use cases?
Why not enlighten the guest kernel in some way so that it doesn't have to
spend time scanning PCI space in the first place?

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2020-05-15 13:58 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-14 18:05 [PATCH RFC 0/5] KVM: x86: KVM_MEM_ALLONES memory Vitaly Kuznetsov
2020-05-14 18:05 ` [PATCH RFC 1/5] KVM: rename labels in kvm_init() Vitaly Kuznetsov
2020-05-14 18:05 ` [PATCH RFC 2/5] KVM: x86: introduce KVM_MEM_ALLONES memory Vitaly Kuznetsov
2020-05-14 19:18   ` Sean Christopherson
2020-05-15  8:24     ` Vitaly Kuznetsov
2020-05-14 18:05 ` [PATCH RFC 3/5] KVM: x86: move kvm_vcpu_gfn_to_memslot() out of try_async_pf() Vitaly Kuznetsov
2020-05-14 18:05 ` [PATCH RFC 4/5] KVM: x86: aggressively map PTEs in KVM_MEM_ALLONES slots Vitaly Kuznetsov
2020-05-14 19:46   ` Sean Christopherson
2020-05-15  8:36     ` Vitaly Kuznetsov
2020-05-15 13:58       ` Sean Christopherson
2020-05-14 18:05 ` [PATCH RFC 5/5] KVM: selftests: add KVM_MEM_ALLONES test Vitaly Kuznetsov
2020-05-14 22:05 ` [PATCH RFC 0/5] KVM: x86: KVM_MEM_ALLONES memory Peter Xu
2020-05-14 22:56   ` Sean Christopherson
2020-05-14 23:22     ` Peter Xu
2020-05-14 23:32       ` Sean Christopherson
2020-05-15  8:42         ` Vitaly Kuznetsov
2020-05-15  1:03     ` Andy Lutomirski
2020-05-15 11:15       ` Peter Xu
