* [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory
@ 2020-08-07 14:12 Vitaly Kuznetsov
  2020-08-07 14:12 ` [PATCH v2 1/3] KVM: x86: move kvm_vcpu_gfn_to_memslot() out of try_async_pf() Vitaly Kuznetsov
                   ` (3 more replies)
  0 siblings, 4 replies; 31+ messages in thread
From: Vitaly Kuznetsov @ 2020-08-07 14:12 UTC (permalink / raw)
  To: kvm, Paolo Bonzini
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, Peter Xu,
	Michael Tsirkin, Julia Suvorova, Andy Lutomirski, Andrew Jones,
	linux-kernel

Changes since v1:
- Better KVM_SET_USER_MEMORY_REGION flags description, minor tweaks to
  the code [Drew Jones]
- BUG_ON() condition in __gfn_to_hva_memslot() adjusted.

This is a continuation of "[PATCH RFC 0/5] KVM: x86: KVM_MEM_ALLONES
memory" work: 
https://lore.kernel.org/kvm/20200514180540.52407-1-vkuznets@redhat.com/
and pairs with Julia's "x86/PCI: Use MMCONFIG by default for KVM guests":
https://lore.kernel.org/linux-pci/20200722001513.298315-1-jusual@redhat.com/

PCIe config space can (depending on the configuration) be quite big but
is usually sparsely populated. A guest may scan it by accessing each
individual device's page which, when the device is missing, is supposed
to have 'pci hole' semantics: reads return '0xff' and writes get discarded.

When testing Linux kernel boot with a QEMU q35 VM and direct kernel boot
I observed 8193 accesses to PCI hole memory. When such an access is handled
in KVM without exiting to userspace, it takes roughly 0.000001 sec.
Handling the same access in userspace is six times slower (0.000006 sec),
so the overall difference is 8193 * 0.000005 = ~0.04 sec. This may be
significant for 'microvm' ideas.

Note, the same speed can already be achieved by using KVM_MEM_READONLY
but doing so would require allocating real memory for all missing
devices: e.g. 8192 pages at 4 KiB each is 32 MB. This would have to be
allocated for each guest separately, which for 'microvm' use-cases is
likely a no-go.

Introduce a special KVM_MEM_PCI_HOLE memory type: userspace doesn't need
to back it with real memory, and all reads from it are handled inside KVM,
returning '0xff'. Writes still go to userspace, but these should be
extremely rare.
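
From the VMM side this is a single KVM_SET_USER_MEMORY_REGION call with no
backing memory. A minimal usage sketch (the slot id, ECAM base address and
size below are made-up examples; the #define fallback assumes headers that
don't yet carry the new flag):

  #include <linux/kvm.h>
  #include <sys/ioctl.h>

  #ifndef KVM_MEM_PCI_HOLE
  #define KVM_MEM_PCI_HOLE (1UL << 2)  /* value added by this series */
  #endif

  /* Register a 256 MiB 'pci hole' slot covering a hypothetical ECAM range. */
  static int add_pci_hole_slot(int vm_fd)
  {
          struct kvm_userspace_memory_region region = {
                  .slot            = 10,          /* any free slot id */
                  .flags           = KVM_MEM_PCI_HOLE,
                  .guest_phys_addr = 0xe0000000,  /* example ECAM base */
                  .memory_size     = 256 << 20,
                  .userspace_addr  = 0,           /* must be NULL for PCI_HOLE slots */
          };

          return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
  }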

The original 'KVM_MEM_ALLONES' idea had additional optimizations: KVM
was mapping all 'PCI hole' pages to a single read-only page stuffed with
0xff. This is omitted in this submission as the benefits are unclear:
KVM will have to allocate SPTEs (either on demand or aggressively) and
this also consumes time/memory. We can always take a look at possible
optimizations later.

Vitaly Kuznetsov (3):
  KVM: x86: move kvm_vcpu_gfn_to_memslot() out of try_async_pf()
  KVM: x86: introduce KVM_MEM_PCI_HOLE memory
  KVM: selftests: add KVM_MEM_PCI_HOLE test

 Documentation/virt/kvm/api.rst                |  18 ++-
 arch/x86/include/uapi/asm/kvm.h               |   1 +
 arch/x86/kvm/mmu/mmu.c                        |  19 +--
 arch/x86/kvm/mmu/paging_tmpl.h                |  10 +-
 arch/x86/kvm/x86.c                            |  10 +-
 include/linux/kvm_host.h                      |   3 +
 include/uapi/linux/kvm.h                      |   2 +
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../testing/selftests/kvm/include/kvm_util.h  |   1 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |  81 +++++++------
 .../kvm/x86_64/memory_slot_pci_hole.c         | 112 ++++++++++++++++++
 virt/kvm/kvm_main.c                           |  39 ++++--
 12 files changed, 239 insertions(+), 58 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/x86_64/memory_slot_pci_hole.c

-- 
2.25.4



* [PATCH v2 1/3] KVM: x86: move kvm_vcpu_gfn_to_memslot() out of try_async_pf()
  2020-08-07 14:12 [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory Vitaly Kuznetsov
@ 2020-08-07 14:12 ` Vitaly Kuznetsov
  2020-08-14  1:40   ` Sean Christopherson
  2020-08-07 14:12 ` [PATCH v2 2/3] KVM: x86: introduce KVM_MEM_PCI_HOLE memory Vitaly Kuznetsov
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 31+ messages in thread
From: Vitaly Kuznetsov @ 2020-08-07 14:12 UTC (permalink / raw)
  To: kvm, Paolo Bonzini
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, Peter Xu,
	Michael Tsirkin, Julia Suvorova, Andy Lutomirski, Andrew Jones,
	linux-kernel

No functional change intended. Slot flags will need to be analyzed
prior to try_async_pf() when KVM_MEM_PCI_HOLE is implemented.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 arch/x86/kvm/mmu/mmu.c         | 14 ++++++++------
 arch/x86/kvm/mmu/paging_tmpl.h |  7 +++++--
 2 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 862bf418214e..fef6956393f7 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4042,11 +4042,10 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 				  kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
 }
 
-static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
-			 gpa_t cr2_or_gpa, kvm_pfn_t *pfn, bool write,
-			 bool *writable)
+static bool try_async_pf(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
+			 bool prefault, gfn_t gfn, gpa_t cr2_or_gpa,
+			 kvm_pfn_t *pfn, bool write, bool *writable)
 {
-	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
 	bool async;
 
 	/* Don't expose private memslots to L2. */
@@ -4082,7 +4081,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 	bool exec = error_code & PFERR_FETCH_MASK;
 	bool lpage_disallowed = exec && is_nx_huge_page_enabled();
 	bool map_writable;
-
+	struct kvm_memory_slot *slot;
 	gfn_t gfn = gpa >> PAGE_SHIFT;
 	unsigned long mmu_seq;
 	kvm_pfn_t pfn;
@@ -4104,7 +4103,10 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
 
-	if (try_async_pf(vcpu, prefault, gfn, gpa, &pfn, write, &map_writable))
+	slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
+
+	if (try_async_pf(vcpu, slot, prefault, gfn, gpa, &pfn, write,
+			 &map_writable))
 		return RET_PF_RETRY;
 
 	if (handle_abnormal_pfn(vcpu, is_tdp ? 0 : gpa, gfn, pfn, ACC_ALL, &r))
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 0172a949f6a7..5c6a895f67c3 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -779,6 +779,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
 	int write_fault = error_code & PFERR_WRITE_MASK;
 	int user_fault = error_code & PFERR_USER_MASK;
 	struct guest_walker walker;
+	struct kvm_memory_slot *slot;
 	int r;
 	kvm_pfn_t pfn;
 	unsigned long mmu_seq;
@@ -833,8 +834,10 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
 
-	if (try_async_pf(vcpu, prefault, walker.gfn, addr, &pfn, write_fault,
-			 &map_writable))
+	slot = kvm_vcpu_gfn_to_memslot(vcpu, walker.gfn);
+
+	if (try_async_pf(vcpu, slot, prefault, walker.gfn, addr, &pfn,
+			 write_fault, &map_writable))
 		return RET_PF_RETRY;
 
 	if (handle_abnormal_pfn(vcpu, addr, walker.gfn, pfn, walker.pte_access, &r))
-- 
2.25.4



* [PATCH v2 2/3] KVM: x86: introduce KVM_MEM_PCI_HOLE memory
  2020-08-07 14:12 [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory Vitaly Kuznetsov
  2020-08-07 14:12 ` [PATCH v2 1/3] KVM: x86: move kvm_vcpu_gfn_to_memslot() out of try_async_pf() Vitaly Kuznetsov
@ 2020-08-07 14:12 ` Vitaly Kuznetsov
  2020-08-14  2:31   ` Sean Christopherson
  2020-08-07 14:12 ` [PATCH v2 3/3] KVM: selftests: add KVM_MEM_PCI_HOLE test Vitaly Kuznetsov
  2020-08-25 21:25 ` [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory Peter Xu
  3 siblings, 1 reply; 31+ messages in thread
From: Vitaly Kuznetsov @ 2020-08-07 14:12 UTC (permalink / raw)
  To: kvm, Paolo Bonzini
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, Peter Xu,
	Michael Tsirkin, Julia Suvorova, Andy Lutomirski, Andrew Jones,
	linux-kernel

PCIe config space can (depending on the configuration) be quite big but
usually is sparsely populated. Guest may scan it by accessing individual
device's page which, when device is missing, is supposed to have 'pci
hole' semantics: reads return '0xff' and writes get discarded. Compared
to the already existing KVM_MEM_READONLY, VMM doesn't need to allocate
real memory and stuff it with '0xff'.

Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 Documentation/virt/kvm/api.rst  | 18 ++++++++++-----
 arch/x86/include/uapi/asm/kvm.h |  1 +
 arch/x86/kvm/mmu/mmu.c          |  5 ++++-
 arch/x86/kvm/mmu/paging_tmpl.h  |  3 +++
 arch/x86/kvm/x86.c              | 10 ++++++---
 include/linux/kvm_host.h        |  3 +++
 include/uapi/linux/kvm.h        |  2 ++
 virt/kvm/kvm_main.c             | 39 +++++++++++++++++++++++++++------
 8 files changed, 64 insertions(+), 17 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 644e5326aa50..dc4172352635 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1241,6 +1241,7 @@ yet and must be cleared on entry.
   /* for kvm_memory_region::flags */
   #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
   #define KVM_MEM_READONLY	(1UL << 1)
+  #define KVM_MEM_PCI_HOLE		(1UL << 2)
 
 This ioctl allows the user to create, modify or delete a guest physical
 memory slot.  Bits 0-15 of "slot" specify the slot id and this value
@@ -1268,12 +1269,17 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
 be identical.  This allows large pages in the guest to be backed by large
 pages in the host.
 
-The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
-KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
-writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
-use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
-to make a new slot read-only.  In this case, writes to this memory will be
-posted to userspace as KVM_EXIT_MMIO exits.
+The flags field supports the following flags: KVM_MEM_LOG_DIRTY_PAGES,
+KVM_MEM_READONLY, KVM_MEM_PCI_HOLE:
+- KVM_MEM_LOG_DIRTY_PAGES: log writes.  Use KVM_GET_DIRTY_LOG to retrieve
+  the log.
+- KVM_MEM_READONLY: exit to userspace with KVM_EXIT_MMIO on writes.  Only
+  available when KVM_CAP_READONLY_MEM is present.
+- KVM_MEM_PCI_HOLE: always return 0xff on reads, exit to userspace with
+  KVM_EXIT_MMIO on writes.  Only available when KVM_CAP_PCI_HOLE_MEM is
+  present.  When setting the memory region 'userspace_addr' must be NULL.
+  This flag is mutually exclusive with KVM_MEM_LOG_DIRTY_PAGES and with
+  KVM_MEM_READONLY.
 
 When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
 the memory region are automatically reflected into the guest.  For example, an
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 17c5a038f42d..cf80a26d74f5 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -48,6 +48,7 @@
 #define __KVM_HAVE_XSAVE
 #define __KVM_HAVE_XCRS
 #define __KVM_HAVE_READONLY_MEM
+#define __KVM_HAVE_PCI_HOLE_MEM
 
 /* Architectural interrupt line count. */
 #define KVM_NR_INTERRUPTS 256
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index fef6956393f7..4a2a7fface1e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3254,7 +3254,7 @@ static int kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, gfn_t gfn,
 		return PG_LEVEL_4K;
 
 	slot = gfn_to_memslot_dirty_bitmap(vcpu, gfn, true);
-	if (!slot)
+	if (!slot || (slot->flags & KVM_MEM_PCI_HOLE))
 		return PG_LEVEL_4K;
 
 	max_level = min(max_level, max_huge_page_level);
@@ -4105,6 +4105,9 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 
 	slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
 
+	if (!write && slot && (slot->flags & KVM_MEM_PCI_HOLE))
+		return RET_PF_EMULATE;
+
 	if (try_async_pf(vcpu, slot, prefault, gfn, gpa, &pfn, write,
 			 &map_writable))
 		return RET_PF_RETRY;
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 5c6a895f67c3..27abd69e69f6 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -836,6 +836,9 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
 
 	slot = kvm_vcpu_gfn_to_memslot(vcpu, walker.gfn);
 
+	if (!write_fault && slot && (slot->flags & KVM_MEM_PCI_HOLE))
+		return RET_PF_EMULATE;
+
 	if (try_async_pf(vcpu, slot, prefault, walker.gfn, addr, &pfn,
 			 write_fault, &map_writable))
 		return RET_PF_RETRY;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index dc4370394ab8..538bc58a22db 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3515,6 +3515,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_EXCEPTION_PAYLOAD:
 	case KVM_CAP_SET_GUEST_DEBUG:
 	case KVM_CAP_LAST_CPU:
+	case KVM_CAP_PCI_HOLE_MEM:
 		r = 1;
 		break;
 	case KVM_CAP_SYNC_REGS:
@@ -10114,9 +10115,11 @@ static int kvm_alloc_memslot_metadata(struct kvm_memory_slot *slot,
 		ugfn = slot->userspace_addr >> PAGE_SHIFT;
 		/*
 		 * If the gfn and userspace address are not aligned wrt each
-		 * other, disable large page support for this slot.
+		 * other, disable large page support for this slot. Also,
+		 * disable large page support for KVM_MEM_PCI_HOLE slots.
 		 */
-		if ((slot->base_gfn ^ ugfn) & (KVM_PAGES_PER_HPAGE(level) - 1)) {
+		if ((slot->flags & KVM_MEM_PCI_HOLE) || ((slot->base_gfn ^ ugfn) &
+				      (KVM_PAGES_PER_HPAGE(level) - 1))) {
 			unsigned long j;
 
 			for (j = 0; j < lpages; ++j)
@@ -10178,7 +10181,8 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
 	 * Nothing to do for RO slots or CREATE/MOVE/DELETE of a slot.
 	 * See comments below.
 	 */
-	if ((change != KVM_MR_FLAGS_ONLY) || (new->flags & KVM_MEM_READONLY))
+	if ((change != KVM_MR_FLAGS_ONLY) ||
+	    (new->flags & (KVM_MEM_READONLY | KVM_MEM_PCI_HOLE)))
 		return;
 
 	/*
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 989afcbe642f..de1faa64a8ef 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1081,6 +1081,9 @@ __gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn)
 static inline unsigned long
 __gfn_to_hva_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
 {
+	/* Should never be called with a KVM_MEM_PCI_HOLE slot */
+	BUG_ON(!slot->userspace_addr);
+
 	return slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE;
 }
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index f6d86033c4fa..b56137f4f96f 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -109,6 +109,7 @@ struct kvm_userspace_memory_region {
  */
 #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
 #define KVM_MEM_READONLY	(1UL << 1)
+#define KVM_MEM_PCI_HOLE		(1UL << 2)
 
 /* for KVM_IRQ_LINE */
 struct kvm_irq_level {
@@ -1035,6 +1036,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_LAST_CPU 184
 #define KVM_CAP_SMALLER_MAXPHYADDR 185
 #define KVM_CAP_S390_DIAG318 186
+#define KVM_CAP_PCI_HOLE_MEM 187
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 2c2c0254c2d8..3f69ae711021 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1107,6 +1107,10 @@ static int check_memory_region_flags(const struct kvm_userspace_memory_region *m
 	valid_flags |= KVM_MEM_READONLY;
 #endif
 
+#ifdef __KVM_HAVE_PCI_HOLE_MEM
+	valid_flags |= KVM_MEM_PCI_HOLE;
+#endif
+
 	if (mem->flags & ~valid_flags)
 		return -EINVAL;
 
@@ -1284,11 +1288,26 @@ int __kvm_set_memory_region(struct kvm *kvm,
 		return -EINVAL;
 	if (mem->guest_phys_addr & (PAGE_SIZE - 1))
 		return -EINVAL;
-	/* We can read the guest memory with __xxx_user() later on. */
-	if ((mem->userspace_addr & (PAGE_SIZE - 1)) ||
-	     !access_ok((void __user *)(unsigned long)mem->userspace_addr,
-			mem->memory_size))
+
+	/*
+	 * KVM_MEM_PCI_HOLE is mutually exclusive with KVM_MEM_READONLY/
+	 * KVM_MEM_LOG_DIRTY_PAGES.
+	 */
+	if ((mem->flags & KVM_MEM_PCI_HOLE) &&
+	    (mem->flags & (KVM_MEM_READONLY | KVM_MEM_LOG_DIRTY_PAGES)))
 		return -EINVAL;
+
+	if (!(mem->flags & KVM_MEM_PCI_HOLE)) {
+		/* We can read the guest memory with __xxx_user() later on. */
+		if ((mem->userspace_addr & (PAGE_SIZE - 1)) ||
+		    !access_ok((void __user *)(unsigned long)mem->userspace_addr,
+			       mem->memory_size))
+			return -EINVAL;
+	} else {
+		if (mem->userspace_addr)
+			return -EINVAL;
+	}
+
 	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
 		return -EINVAL;
 	if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
@@ -1328,7 +1347,8 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	} else { /* Modify an existing slot. */
 		if ((new.userspace_addr != old.userspace_addr) ||
 		    (new.npages != old.npages) ||
-		    ((new.flags ^ old.flags) & KVM_MEM_READONLY))
+		    ((new.flags ^ old.flags) & KVM_MEM_READONLY) ||
+		    ((new.flags ^ old.flags) & KVM_MEM_PCI_HOLE))
 			return -EINVAL;
 
 		if (new.base_gfn != old.base_gfn)
@@ -1715,13 +1735,13 @@ unsigned long kvm_host_page_size(struct kvm_vcpu *vcpu, gfn_t gfn)
 
 static bool memslot_is_readonly(struct kvm_memory_slot *slot)
 {
-	return slot->flags & KVM_MEM_READONLY;
+	return slot->flags & (KVM_MEM_READONLY | KVM_MEM_PCI_HOLE);
 }
 
 static unsigned long __gfn_to_hva_many(struct kvm_memory_slot *slot, gfn_t gfn,
 				       gfn_t *nr_pages, bool write)
 {
-	if (!slot || slot->flags & KVM_MEMSLOT_INVALID)
+	if (!slot || (slot->flags & (KVM_MEMSLOT_INVALID | KVM_MEM_PCI_HOLE)))
 		return KVM_HVA_ERR_BAD;
 
 	if (memslot_is_readonly(slot) && write)
@@ -2318,6 +2338,11 @@ static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
 	int r;
 	unsigned long addr;
 
+	if (unlikely(slot && (slot->flags & KVM_MEM_PCI_HOLE))) {
+		memset(data, 0xff, len);
+		return 0;
+	}
+
 	addr = gfn_to_hva_memslot_prot(slot, gfn, NULL);
 	if (kvm_is_error_hva(addr))
 		return -EFAULT;
-- 
2.25.4



* [PATCH v2 3/3] KVM: selftests: add KVM_MEM_PCI_HOLE test
  2020-08-07 14:12 [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory Vitaly Kuznetsov
  2020-08-07 14:12 ` [PATCH v2 1/3] KVM: x86: move kvm_vcpu_gfn_to_memslot() out of try_async_pf() Vitaly Kuznetsov
  2020-08-07 14:12 ` [PATCH v2 2/3] KVM: x86: introduce KVM_MEM_PCI_HOLE memory Vitaly Kuznetsov
@ 2020-08-07 14:12 ` Vitaly Kuznetsov
  2020-08-25 21:25 ` [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory Peter Xu
  3 siblings, 0 replies; 31+ messages in thread
From: Vitaly Kuznetsov @ 2020-08-07 14:12 UTC (permalink / raw)
  To: kvm, Paolo Bonzini
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, Peter Xu,
	Michael Tsirkin, Julia Suvorova, Andy Lutomirski, Andrew Jones,
	linux-kernel

Test the newly introduced KVM_MEM_PCI_HOLE memslots:
- Reads from all pages return '0xff'
- Writes to all pages cause KVM_EXIT_MMIO

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../testing/selftests/kvm/include/kvm_util.h  |   1 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |  81 +++++++------
 .../kvm/x86_64/memory_slot_pci_hole.c         | 112 ++++++++++++++++++
 4 files changed, 162 insertions(+), 33 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/x86_64/memory_slot_pci_hole.c

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index 4a166588d99f..a6fe303fbf6a 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -41,6 +41,7 @@ LIBKVM_s390x = lib/s390x/processor.c lib/s390x/ucall.c
 TEST_GEN_PROGS_x86_64 = x86_64/cr4_cpuid_sync_test
 TEST_GEN_PROGS_x86_64 += x86_64/evmcs_test
 TEST_GEN_PROGS_x86_64 += x86_64/hyperv_cpuid
+TEST_GEN_PROGS_x86_64 += x86_64/memory_slot_pci_hole
 TEST_GEN_PROGS_x86_64 += x86_64/mmio_warning_test
 TEST_GEN_PROGS_x86_64 += x86_64/platform_info_test
 TEST_GEN_PROGS_x86_64 += x86_64/set_sregs_test
diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index 919e161dd289..8e7bec7bd287 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -59,6 +59,7 @@ enum vm_mem_backing_src_type {
 	VM_MEM_SRC_ANONYMOUS,
 	VM_MEM_SRC_ANONYMOUS_THP,
 	VM_MEM_SRC_ANONYMOUS_HUGETLB,
+	VM_MEM_SRC_NONE,
 };
 
 int kvm_check_cap(long cap);
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 74776ee228f2..46bb28ea34ec 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -453,8 +453,11 @@ static void __vm_mem_region_delete(struct kvm_vm *vm,
 		    "rc: %i errno: %i", ret, errno);
 
 	sparsebit_free(&region->unused_phy_pages);
-	ret = munmap(region->mmap_start, region->mmap_size);
-	TEST_ASSERT(ret == 0, "munmap failed, rc: %i errno: %i", ret, errno);
+	if (region->mmap_start) {
+		ret = munmap(region->mmap_start, region->mmap_size);
+		TEST_ASSERT(ret == 0, "munmap failed, rc: %i errno: %i", ret,
+			    errno);
+	}
 
 	free(region);
 }
@@ -643,34 +646,42 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
 	alignment = 1;
 #endif
 
-	if (src_type == VM_MEM_SRC_ANONYMOUS_THP)
-		alignment = max(huge_page_size, alignment);
-
-	/* Add enough memory to align up if necessary */
-	if (alignment > 1)
-		region->mmap_size += alignment;
-
-	region->mmap_start = mmap(NULL, region->mmap_size,
-				  PROT_READ | PROT_WRITE,
-				  MAP_PRIVATE | MAP_ANONYMOUS
-				  | (src_type == VM_MEM_SRC_ANONYMOUS_HUGETLB ? MAP_HUGETLB : 0),
-				  -1, 0);
-	TEST_ASSERT(region->mmap_start != MAP_FAILED,
-		    "test_malloc failed, mmap_start: %p errno: %i",
-		    region->mmap_start, errno);
-
-	/* Align host address */
-	region->host_mem = align(region->mmap_start, alignment);
-
-	/* As needed perform madvise */
-	if (src_type == VM_MEM_SRC_ANONYMOUS || src_type == VM_MEM_SRC_ANONYMOUS_THP) {
-		ret = madvise(region->host_mem, npages * vm->page_size,
-			     src_type == VM_MEM_SRC_ANONYMOUS ? MADV_NOHUGEPAGE : MADV_HUGEPAGE);
-		TEST_ASSERT(ret == 0, "madvise failed,\n"
-			    "  addr: %p\n"
-			    "  length: 0x%lx\n"
-			    "  src_type: %x",
-			    region->host_mem, npages * vm->page_size, src_type);
+	if (src_type != VM_MEM_SRC_NONE) {
+		if (src_type == VM_MEM_SRC_ANONYMOUS_THP)
+			alignment = max(huge_page_size, alignment);
+
+		/* Add enough memory to align up if necessary */
+		if (alignment > 1)
+			region->mmap_size += alignment;
+
+		region->mmap_start = mmap(NULL, region->mmap_size,
+			  PROT_READ | PROT_WRITE,
+			  MAP_PRIVATE | MAP_ANONYMOUS
+			  | (src_type == VM_MEM_SRC_ANONYMOUS_HUGETLB ?
+			     MAP_HUGETLB : 0), -1, 0);
+		TEST_ASSERT(region->mmap_start != MAP_FAILED,
+			    "test_malloc failed, mmap_start: %p errno: %i",
+			    region->mmap_start, errno);
+
+		/* Align host address */
+		region->host_mem = align(region->mmap_start, alignment);
+
+		/* As needed perform madvise */
+		if (src_type == VM_MEM_SRC_ANONYMOUS ||
+		    src_type == VM_MEM_SRC_ANONYMOUS_THP) {
+			ret = madvise(region->host_mem, npages * vm->page_size,
+				      src_type == VM_MEM_SRC_ANONYMOUS ?
+				      MADV_NOHUGEPAGE : MADV_HUGEPAGE);
+			TEST_ASSERT(ret == 0, "madvise failed,\n"
+				    "  addr: %p\n"
+				    "  length: 0x%lx\n"
+				    "  src_type: %x",
+				    region->host_mem, npages * vm->page_size,
+				    src_type);
+		}
+	} else {
+		region->mmap_start = NULL;
+		region->host_mem = NULL;
 	}
 
 	region->unused_phy_pages = sparsebit_alloc();
@@ -1076,9 +1087,13 @@ void *addr_gpa2hva(struct kvm_vm *vm, vm_paddr_t gpa)
 	list_for_each_entry(region, &vm->userspace_mem_regions, list) {
 		if ((gpa >= region->region.guest_phys_addr)
 			&& (gpa <= (region->region.guest_phys_addr
-				+ region->region.memory_size - 1)))
-			return (void *) ((uintptr_t) region->host_mem
-				+ (gpa - region->region.guest_phys_addr));
+				+ region->region.memory_size - 1))) {
+			if (region->host_mem)
+				return (void *) ((uintptr_t) region->host_mem
+				 + (gpa - region->region.guest_phys_addr));
+			else
+				return NULL;
+		}
 	}
 
 	TEST_FAIL("No vm physical memory at 0x%lx", gpa);
diff --git a/tools/testing/selftests/kvm/x86_64/memory_slot_pci_hole.c b/tools/testing/selftests/kvm/x86_64/memory_slot_pci_hole.c
new file mode 100644
index 000000000000..f5fa80dfcba7
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/memory_slot_pci_hole.c
@@ -0,0 +1,112 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE /* for program_invocation_short_name */
+#include <fcntl.h>
+#include <pthread.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+
+#include <linux/compiler.h>
+
+#include <test_util.h>
+#include <kvm_util.h>
+#include <processor.h>
+
+#define VCPU_ID 0
+
+#define MEM_REGION_GPA		0xc0000000
+#define MEM_REGION_SIZE		0x4000
+#define MEM_REGION_SLOT		10
+
+static void guest_code(void)
+{
+	uint8_t val;
+
+	/* First byte in the first page */
+	val = READ_ONCE(*((uint8_t *)MEM_REGION_GPA));
+	GUEST_ASSERT(val == 0xff);
+
+	GUEST_SYNC(1);
+
+	/* Random byte in the second page */
+	val = READ_ONCE(*((uint8_t *)MEM_REGION_GPA + 5000));
+	GUEST_ASSERT(val == 0xff);
+
+	GUEST_SYNC(2);
+
+	/* Write to the first page */
+	WRITE_ONCE(*((uint64_t *)MEM_REGION_GPA + 1024/8), 0xdeafbeef);
+
+	GUEST_SYNC(3);
+
+	/* Write to the second page */
+	WRITE_ONCE(*((uint64_t *)MEM_REGION_GPA + 8000/8), 0xdeafbeef);
+
+	GUEST_SYNC(4);
+
+	GUEST_DONE();
+}
+
+int main(int argc, char *argv[])
+{
+	struct kvm_vm *vm;
+	struct kvm_run *run;
+	struct ucall uc;
+	int stage, rv;
+
+	rv = kvm_check_cap(KVM_CAP_PCI_HOLE_MEM);
+	if (!rv) {
+		print_skip("KVM_CAP_PCI_HOLE_MEM not supported");
+		exit(KSFT_SKIP);
+	}
+
+	vm = vm_create_default(VCPU_ID, 0, guest_code);
+
+	run = vcpu_state(vm, VCPU_ID);
+
+	vcpu_set_cpuid(vm, VCPU_ID, kvm_get_supported_cpuid());
+
+	vm_userspace_mem_region_add(vm, VM_MEM_SRC_NONE,
+				    MEM_REGION_GPA, MEM_REGION_SLOT,
+				    MEM_REGION_SIZE / getpagesize(),
+				    KVM_MEM_PCI_HOLE);
+
+	virt_map(vm, MEM_REGION_GPA, MEM_REGION_GPA,
+		 MEM_REGION_SIZE / getpagesize(), 0);
+
+	for (stage = 1;; stage++) {
+		_vcpu_run(vm, VCPU_ID);
+
+		if (stage == 3 || stage == 5) {
+			TEST_ASSERT(run->exit_reason == KVM_EXIT_MMIO,
+			   "Write to PCI_HOLE page should cause KVM_EXIT_MMIO");
+			continue;
+		}
+
+		TEST_ASSERT(run->exit_reason == KVM_EXIT_IO,
+			    "Stage %d: unexpected exit reason: %u (%s),\n",
+			    stage, run->exit_reason,
+			    exit_reason_str(run->exit_reason));
+
+		switch (get_ucall(vm, VCPU_ID, &uc)) {
+		case UCALL_ABORT:
+			TEST_FAIL("%s at %s:%ld", (const char *)uc.args[0],
+				  __FILE__, uc.args[1]);
+			/* NOT REACHED */
+		case UCALL_SYNC:
+			break;
+		case UCALL_DONE:
+			goto done;
+		default:
+			TEST_FAIL("Unknown ucall %lu", uc.cmd);
+		}
+	}
+
+done:
+	kvm_vm_free(vm);
+
+	return 0;
+}
-- 
2.25.4



* Re: [PATCH v2 1/3] KVM: x86: move kvm_vcpu_gfn_to_memslot() out of try_async_pf()
  2020-08-07 14:12 ` [PATCH v2 1/3] KVM: x86: move kvm_vcpu_gfn_to_memslot() out of try_async_pf() Vitaly Kuznetsov
@ 2020-08-14  1:40   ` Sean Christopherson
  2020-09-01 14:15     ` Vitaly Kuznetsov
  0 siblings, 1 reply; 31+ messages in thread
From: Sean Christopherson @ 2020-08-14  1:40 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: kvm, Paolo Bonzini, Wanpeng Li, Jim Mattson, Peter Xu,
	Michael Tsirkin, Julia Suvorova, Andy Lutomirski, Andrew Jones,
	linux-kernel

On Fri, Aug 07, 2020 at 04:12:30PM +0200, Vitaly Kuznetsov wrote:
> No functional change intended. Slot flags will need to be analyzed
> prior to try_async_pf() when KVM_MEM_PCI_HOLE is implemented.

Why?  Wouldn't it be just as easy, and arguably more appropriate, to add
KVM_PFN_ERR_PCI_HOLE and update handle_abnormal_pfn() accordingly?

> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> ---
>  arch/x86/kvm/mmu/mmu.c         | 14 ++++++++------
>  arch/x86/kvm/mmu/paging_tmpl.h |  7 +++++--
>  2 files changed, 13 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 862bf418214e..fef6956393f7 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4042,11 +4042,10 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  				  kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
>  }
>  
> -static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
> -			 gpa_t cr2_or_gpa, kvm_pfn_t *pfn, bool write,
> -			 bool *writable)
> +static bool try_async_pf(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
> +			 bool prefault, gfn_t gfn, gpa_t cr2_or_gpa,
> +			 kvm_pfn_t *pfn, bool write, bool *writable)
>  {
> -	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
>  	bool async;
>  
>  	/* Don't expose private memslots to L2. */
> @@ -4082,7 +4081,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>  	bool exec = error_code & PFERR_FETCH_MASK;
>  	bool lpage_disallowed = exec && is_nx_huge_page_enabled();
>  	bool map_writable;
> -
> +	struct kvm_memory_slot *slot;
>  	gfn_t gfn = gpa >> PAGE_SHIFT;
>  	unsigned long mmu_seq;
>  	kvm_pfn_t pfn;
> @@ -4104,7 +4103,10 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>  	mmu_seq = vcpu->kvm->mmu_notifier_seq;
>  	smp_rmb();
>  
> -	if (try_async_pf(vcpu, prefault, gfn, gpa, &pfn, write, &map_writable))
> +	slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
> +
> +	if (try_async_pf(vcpu, slot, prefault, gfn, gpa, &pfn, write,
> +			 &map_writable))
>  		return RET_PF_RETRY;
>  
>  	if (handle_abnormal_pfn(vcpu, is_tdp ? 0 : gpa, gfn, pfn, ACC_ALL, &r))
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 0172a949f6a7..5c6a895f67c3 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -779,6 +779,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
>  	int write_fault = error_code & PFERR_WRITE_MASK;
>  	int user_fault = error_code & PFERR_USER_MASK;
>  	struct guest_walker walker;
> +	struct kvm_memory_slot *slot;
>  	int r;
>  	kvm_pfn_t pfn;
>  	unsigned long mmu_seq;
> @@ -833,8 +834,10 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
>  	mmu_seq = vcpu->kvm->mmu_notifier_seq;
>  	smp_rmb();
>  
> -	if (try_async_pf(vcpu, prefault, walker.gfn, addr, &pfn, write_fault,
> -			 &map_writable))
> +	slot = kvm_vcpu_gfn_to_memslot(vcpu, walker.gfn);
> +
> +	if (try_async_pf(vcpu, slot, prefault, walker.gfn, addr, &pfn,
> +			 write_fault, &map_writable))
>  		return RET_PF_RETRY;
>  
>  	if (handle_abnormal_pfn(vcpu, addr, walker.gfn, pfn, walker.pte_access, &r))
> -- 
> 2.25.4
> 


* Re: [PATCH v2 2/3] KVM: x86: introduce KVM_MEM_PCI_HOLE memory
  2020-08-07 14:12 ` [PATCH v2 2/3] KVM: x86: introduce KVM_MEM_PCI_HOLE memory Vitaly Kuznetsov
@ 2020-08-14  2:31   ` Sean Christopherson
  2020-08-14 14:30     ` Michael S. Tsirkin
  2020-09-01 14:39     ` Vitaly Kuznetsov
  0 siblings, 2 replies; 31+ messages in thread
From: Sean Christopherson @ 2020-08-14  2:31 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: kvm, Paolo Bonzini, Wanpeng Li, Jim Mattson, Peter Xu,
	Michael Tsirkin, Julia Suvorova, Andy Lutomirski, Andrew Jones,
	linux-kernel

On Fri, Aug 07, 2020 at 04:12:31PM +0200, Vitaly Kuznetsov wrote:
> PCIe config space can (depending on the configuration) be quite big but
> usually is sparsely populated. Guest may scan it by accessing individual
> device's page which, when device is missing, is supposed to have 'pci
> hole' semantics: reads return '0xff' and writes get discarded. Compared
> to the already existing KVM_MEM_READONLY, VMM doesn't need to allocate
> real memory and stuff it with '0xff'.
> 
> Suggested-by: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> ---
>  Documentation/virt/kvm/api.rst  | 18 ++++++++++-----
>  arch/x86/include/uapi/asm/kvm.h |  1 +
>  arch/x86/kvm/mmu/mmu.c          |  5 ++++-
>  arch/x86/kvm/mmu/paging_tmpl.h  |  3 +++
>  arch/x86/kvm/x86.c              | 10 ++++++---
>  include/linux/kvm_host.h        |  3 +++
>  include/uapi/linux/kvm.h        |  2 ++
>  virt/kvm/kvm_main.c             | 39 +++++++++++++++++++++++++++------
>  8 files changed, 64 insertions(+), 17 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 644e5326aa50..dc4172352635 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -1241,6 +1241,7 @@ yet and must be cleared on entry.
>    /* for kvm_memory_region::flags */
>    #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
>    #define KVM_MEM_READONLY	(1UL << 1)
> +  #define KVM_MEM_PCI_HOLE		(1UL << 2)
>  
>  This ioctl allows the user to create, modify or delete a guest physical
>  memory slot.  Bits 0-15 of "slot" specify the slot id and this value
> @@ -1268,12 +1269,17 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
>  be identical.  This allows large pages in the guest to be backed by large
>  pages in the host.
>  
> -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
> -KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
> -writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
> -use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
> -to make a new slot read-only.  In this case, writes to this memory will be
> -posted to userspace as KVM_EXIT_MMIO exits.
> +The flags field supports the following flags: KVM_MEM_LOG_DIRTY_PAGES,
> +KVM_MEM_READONLY, KVM_MEM_PCI_HOLE:
> +- KVM_MEM_LOG_DIRTY_PAGES: log writes.  Use KVM_GET_DIRTY_LOG to retrieve
> +  the log.
> +- KVM_MEM_READONLY: exit to userspace with KVM_EXIT_MMIO on writes.  Only
> +  available when KVM_CAP_READONLY_MEM is present.
> +- KVM_MEM_PCI_HOLE: always return 0xff on reads, exit to userspace with
> +  KVM_EXIT_MMIO on writes.  Only available when KVM_CAP_PCI_HOLE_MEM is
> +  present.  When setting the memory region 'userspace_addr' must be NULL.
> +  This flag is mutually exclusive with KVM_MEM_LOG_DIRTY_PAGES and with
> +  KVM_MEM_READONLY.
>  
>  When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
>  the memory region are automatically reflected into the guest.  For example, an
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index 17c5a038f42d..cf80a26d74f5 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -48,6 +48,7 @@
>  #define __KVM_HAVE_XSAVE
>  #define __KVM_HAVE_XCRS
>  #define __KVM_HAVE_READONLY_MEM
> +#define __KVM_HAVE_PCI_HOLE_MEM
>  
>  /* Architectural interrupt line count. */
>  #define KVM_NR_INTERRUPTS 256
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index fef6956393f7..4a2a7fface1e 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3254,7 +3254,7 @@ static int kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, gfn_t gfn,
>  		return PG_LEVEL_4K;
>  
>  	slot = gfn_to_memslot_dirty_bitmap(vcpu, gfn, true);
> -	if (!slot)
> +	if (!slot || (slot->flags & KVM_MEM_PCI_HOLE))

This is unnecessary since you're setting disallow_lpage in
kvm_alloc_memslot_metadata().

>  		return PG_LEVEL_4K;
>  
>  	max_level = min(max_level, max_huge_page_level);
> @@ -4105,6 +4105,9 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>  
>  	slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
>  
> +	if (!write && slot && (slot->flags & KVM_MEM_PCI_HOLE))

I'm confused.  Why does this short circuit reads but not writes?

> +		return RET_PF_EMULATE;
> +
>  	if (try_async_pf(vcpu, slot, prefault, gfn, gpa, &pfn, write,
>  			 &map_writable))
>  		return RET_PF_RETRY;
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 5c6a895f67c3..27abd69e69f6 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -836,6 +836,9 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
>  
>  	slot = kvm_vcpu_gfn_to_memslot(vcpu, walker.gfn);
>  
> +	if (!write_fault && slot && (slot->flags & KVM_MEM_PCI_HOLE))
> +		return RET_PF_EMULATE;
> +
>  	if (try_async_pf(vcpu, slot, prefault, walker.gfn, addr, &pfn,
>  			 write_fault, &map_writable))
>  		return RET_PF_RETRY;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index dc4370394ab8..538bc58a22db 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3515,6 +3515,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  	case KVM_CAP_EXCEPTION_PAYLOAD:
>  	case KVM_CAP_SET_GUEST_DEBUG:
>  	case KVM_CAP_LAST_CPU:
> +	case KVM_CAP_PCI_HOLE_MEM:
>  		r = 1;
>  		break;
>  	case KVM_CAP_SYNC_REGS:
> @@ -10114,9 +10115,11 @@ static int kvm_alloc_memslot_metadata(struct kvm_memory_slot *slot,
>  		ugfn = slot->userspace_addr >> PAGE_SHIFT;
>  		/*
>  		 * If the gfn and userspace address are not aligned wrt each
> -		 * other, disable large page support for this slot.
> +		 * other, disable large page support for this slot. Also,
> +		 * disable large page support for KVM_MEM_PCI_HOLE slots.
>  		 */
> -		if ((slot->base_gfn ^ ugfn) & (KVM_PAGES_PER_HPAGE(level) - 1)) {
> +		if ((slot->flags & KVM_MEM_PCI_HOLE) || ((slot->base_gfn ^ ugfn) &
> +				      (KVM_PAGES_PER_HPAGE(level) - 1))) {
>  			unsigned long j;
>  
>  			for (j = 0; j < lpages; ++j)
> @@ -10178,7 +10181,8 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
>  	 * Nothing to do for RO slots or CREATE/MOVE/DELETE of a slot.
>  	 * See comments below.
>  	 */
> -	if ((change != KVM_MR_FLAGS_ONLY) || (new->flags & KVM_MEM_READONLY))
> +	if ((change != KVM_MR_FLAGS_ONLY) ||
> +	    (new->flags & (KVM_MEM_READONLY | KVM_MEM_PCI_HOLE)))
>  		return;
>  
>  	/*
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 989afcbe642f..de1faa64a8ef 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1081,6 +1081,9 @@ __gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn)
>  static inline unsigned long
>  __gfn_to_hva_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
>  {
> +	/* Should never be called with a KVM_MEM_PCI_HOLE slot */
> +	BUG_ON(!slot->userspace_addr);

So _technically_, userspace can hit this by allowing virtual address 0,
which is very much non-standard, but theoretically legal.  It'd probably be
better to use a value that can't possibly be a valid userspace_addr, e.g. a
non-canonical value.

> +
>  	return slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE;
>  }
>  

...

> @@ -2318,6 +2338,11 @@ static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
>  	int r;
>  	unsigned long addr;
>  
> +	if (unlikely(slot && (slot->flags & KVM_MEM_PCI_HOLE))) {
> +		memset(data, 0xff, len);
> +		return 0;
> +	}

This feels wrong, shouldn't we be treating PCI_HOLE as MMIO?  Given that
this is performance oriented, I would think we'd want to leverage the
GPA from the VMCS instead of doing a full translation.

That brings up a potential alternative to adding a memslot flag.  What if
we instead add a KVM_MMIO_BUS device similar to coalesced MMIO?  I think
it'd be about the same amount of KVM code, and it would provide userspace
with more flexibility, e.g. I assume it would allow handling even writes
wholly within the kernel for certain ranges and/or use cases, and it'd
allow stuffing a value other than 0xff (though I have no idea if there is
a use case for this).
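
E.g. a rough, untested sketch of such a device, reusing the existing
kvm_io_device plumbing from include/kvm/iodev.h; the pci_hole_* names and
the registration helper are made up, and the ioctl that would reach it is
omitted:

  #include <linux/kvm_host.h>
  #include <linux/slab.h>
  #include <linux/string.h>
  #include <kvm/iodev.h>

  struct pci_hole_dev {
          struct kvm_io_device dev;
  };

  /* Reads are answered in-kernel with all-ones. */
  static int pci_hole_read(struct kvm_vcpu *vcpu, struct kvm_io_device *dev,
                           gpa_t addr, int len, void *val)
  {
          memset(val, 0xff, len);
          return 0;
  }

  /* Let writes fall through so they reach userspace as KVM_EXIT_MMIO. */
  static int pci_hole_write(struct kvm_vcpu *vcpu, struct kvm_io_device *dev,
                            gpa_t addr, int len, const void *val)
  {
          return -EOPNOTSUPP;
  }

  static const struct kvm_io_device_ops pci_hole_ops = {
          .read  = pci_hole_read,
          .write = pci_hole_write,
  };

  /* Hypothetical registration path, e.g. called from a new VM ioctl. */
  static int kvm_register_pci_hole(struct kvm *kvm, gpa_t addr, int len)
  {
          struct pci_hole_dev *hole = kzalloc(sizeof(*hole), GFP_KERNEL);
          int ret;

          if (!hole)
                  return -ENOMEM;

          kvm_iodevice_init(&hole->dev, &pci_hole_ops);

          mutex_lock(&kvm->slots_lock);
          ret = kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS, addr, len, &hole->dev);
          mutex_unlock(&kvm->slots_lock);
          if (ret)
                  kfree(hole);

          return ret;
  }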

Speaking of which, why do writes go to userspace in this series?

> +
>  	addr = gfn_to_hva_memslot_prot(slot, gfn, NULL);
>  	if (kvm_is_error_hva(addr))
>  		return -EFAULT;
> -- 
> 2.25.4
> 


* Re: [PATCH v2 2/3] KVM: x86: introduce KVM_MEM_PCI_HOLE memory
  2020-08-14  2:31   ` Sean Christopherson
@ 2020-08-14 14:30     ` Michael S. Tsirkin
  2020-08-17 16:32       ` Sean Christopherson
  2020-09-01 14:39     ` Vitaly Kuznetsov
  1 sibling, 1 reply; 31+ messages in thread
From: Michael S. Tsirkin @ 2020-08-14 14:30 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Vitaly Kuznetsov, kvm, Paolo Bonzini, Wanpeng Li, Jim Mattson,
	Peter Xu, Julia Suvorova, Andy Lutomirski, Andrew Jones,
	linux-kernel

On Thu, Aug 13, 2020 at 07:31:39PM -0700, Sean Christopherson wrote:
> On Fri, Aug 07, 2020 at 04:12:31PM +0200, Vitaly Kuznetsov wrote:
> > PCIe config space can (depending on the configuration) be quite big but
> > usually is sparsely populated. Guest may scan it by accessing individual
> > device's page which, when device is missing, is supposed to have 'pci
> > hole' semantics: reads return '0xff' and writes get discarded. Compared
> > to the already existing KVM_MEM_READONLY, VMM doesn't need to allocate
> > real memory and stuff it with '0xff'.
> > 
> > Suggested-by: Michael S. Tsirkin <mst@redhat.com>
> > Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> > ---
> >  Documentation/virt/kvm/api.rst  | 18 ++++++++++-----
> >  arch/x86/include/uapi/asm/kvm.h |  1 +
> >  arch/x86/kvm/mmu/mmu.c          |  5 ++++-
> >  arch/x86/kvm/mmu/paging_tmpl.h  |  3 +++
> >  arch/x86/kvm/x86.c              | 10 ++++++---
> >  include/linux/kvm_host.h        |  3 +++
> >  include/uapi/linux/kvm.h        |  2 ++
> >  virt/kvm/kvm_main.c             | 39 +++++++++++++++++++++++++++------
> >  8 files changed, 64 insertions(+), 17 deletions(-)
> > 
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 644e5326aa50..dc4172352635 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -1241,6 +1241,7 @@ yet and must be cleared on entry.
> >    /* for kvm_memory_region::flags */
> >    #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
> >    #define KVM_MEM_READONLY	(1UL << 1)
> > +  #define KVM_MEM_PCI_HOLE		(1UL << 2)
> >  
> >  This ioctl allows the user to create, modify or delete a guest physical
> >  memory slot.  Bits 0-15 of "slot" specify the slot id and this value
> > @@ -1268,12 +1269,17 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
> >  be identical.  This allows large pages in the guest to be backed by large
> >  pages in the host.
> >  
> > -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
> > -KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
> > -writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
> > -use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
> > -to make a new slot read-only.  In this case, writes to this memory will be
> > -posted to userspace as KVM_EXIT_MMIO exits.
> > +The flags field supports the following flags: KVM_MEM_LOG_DIRTY_PAGES,
> > +KVM_MEM_READONLY, KVM_MEM_PCI_HOLE:
> > +- KVM_MEM_LOG_DIRTY_PAGES: log writes.  Use KVM_GET_DIRTY_LOG to retrieve
> > +  the log.
> > +- KVM_MEM_READONLY: exit to userspace with KVM_EXIT_MMIO on writes.  Only
> > +  available when KVM_CAP_READONLY_MEM is present.
> > +- KVM_MEM_PCI_HOLE: always return 0xff on reads, exit to userspace with
> > +  KVM_EXIT_MMIO on writes.  Only available when KVM_CAP_PCI_HOLE_MEM is
> > +  present.  When setting the memory region 'userspace_addr' must be NULL.
> > +  This flag is mutually exclusive with KVM_MEM_LOG_DIRTY_PAGES and with
> > +  KVM_MEM_READONLY.
> >  
> >  When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
> >  the memory region are automatically reflected into the guest.  For example, an
> > diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> > index 17c5a038f42d..cf80a26d74f5 100644
> > --- a/arch/x86/include/uapi/asm/kvm.h
> > +++ b/arch/x86/include/uapi/asm/kvm.h
> > @@ -48,6 +48,7 @@
> >  #define __KVM_HAVE_XSAVE
> >  #define __KVM_HAVE_XCRS
> >  #define __KVM_HAVE_READONLY_MEM
> > +#define __KVM_HAVE_PCI_HOLE_MEM
> >  
> >  /* Architectural interrupt line count. */
> >  #define KVM_NR_INTERRUPTS 256
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index fef6956393f7..4a2a7fface1e 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3254,7 +3254,7 @@ static int kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, gfn_t gfn,
> >  		return PG_LEVEL_4K;
> >  
> >  	slot = gfn_to_memslot_dirty_bitmap(vcpu, gfn, true);
> > -	if (!slot)
> > +	if (!slot || (slot->flags & KVM_MEM_PCI_HOLE))
> 
> This is unnecessary since you're setting disallow_lpage in
> kvm_alloc_memslot_metadata().
> 
> >  		return PG_LEVEL_4K;
> >  
> >  	max_level = min(max_level, max_huge_page_level);
> > @@ -4105,6 +4105,9 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> >  
> >  	slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
> >  
> > +	if (!write && slot && (slot->flags & KVM_MEM_PCI_HOLE))
> 
> I'm confused.  Why does this short circuit reads but not writes?
> 
> > +		return RET_PF_EMULATE;
> > +
> >  	if (try_async_pf(vcpu, slot, prefault, gfn, gpa, &pfn, write,
> >  			 &map_writable))
> >  		return RET_PF_RETRY;
> > diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> > index 5c6a895f67c3..27abd69e69f6 100644
> > --- a/arch/x86/kvm/mmu/paging_tmpl.h
> > +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> > @@ -836,6 +836,9 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
> >  
> >  	slot = kvm_vcpu_gfn_to_memslot(vcpu, walker.gfn);
> >  
> > +	if (!write_fault && slot && (slot->flags & KVM_MEM_PCI_HOLE))
> > +		return RET_PF_EMULATE;
> > +
> >  	if (try_async_pf(vcpu, slot, prefault, walker.gfn, addr, &pfn,
> >  			 write_fault, &map_writable))
> >  		return RET_PF_RETRY;
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index dc4370394ab8..538bc58a22db 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -3515,6 +3515,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> >  	case KVM_CAP_EXCEPTION_PAYLOAD:
> >  	case KVM_CAP_SET_GUEST_DEBUG:
> >  	case KVM_CAP_LAST_CPU:
> > +	case KVM_CAP_PCI_HOLE_MEM:
> >  		r = 1;
> >  		break;
> >  	case KVM_CAP_SYNC_REGS:
> > @@ -10114,9 +10115,11 @@ static int kvm_alloc_memslot_metadata(struct kvm_memory_slot *slot,
> >  		ugfn = slot->userspace_addr >> PAGE_SHIFT;
> >  		/*
> >  		 * If the gfn and userspace address are not aligned wrt each
> > -		 * other, disable large page support for this slot.
> > +		 * other, disable large page support for this slot. Also,
> > +		 * disable large page support for KVM_MEM_PCI_HOLE slots.
> >  		 */
> > -		if ((slot->base_gfn ^ ugfn) & (KVM_PAGES_PER_HPAGE(level) - 1)) {
> > +		if ((slot->flags & KVM_MEM_PCI_HOLE) || ((slot->base_gfn ^ ugfn) &
> > +				      (KVM_PAGES_PER_HPAGE(level) - 1))) {
> >  			unsigned long j;
> >  
> >  			for (j = 0; j < lpages; ++j)
> > @@ -10178,7 +10181,8 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
> >  	 * Nothing to do for RO slots or CREATE/MOVE/DELETE of a slot.
> >  	 * See comments below.
> >  	 */
> > -	if ((change != KVM_MR_FLAGS_ONLY) || (new->flags & KVM_MEM_READONLY))
> > +	if ((change != KVM_MR_FLAGS_ONLY) ||
> > +	    (new->flags & (KVM_MEM_READONLY | KVM_MEM_PCI_HOLE)))
> >  		return;
> >  
> >  	/*
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 989afcbe642f..de1faa64a8ef 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -1081,6 +1081,9 @@ __gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn)
> >  static inline unsigned long
> >  __gfn_to_hva_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
> >  {
> > +	/* Should never be called with a KVM_MEM_PCI_HOLE slot */
> > +	BUG_ON(!slot->userspace_addr);
> 
> So _technically_, userspace can hit this by allowing virtual address 0,
> which is very much non-standard, but theoretically legal.  It'd probably be
> better to use a value that can't possibly be a valid userspace_addr, e.g. a
> non-canonical value.
> 
> > +
> >  	return slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE;
> >  }
> >  
> 
> ...
> 
> > @@ -2318,6 +2338,11 @@ static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
> >  	int r;
> >  	unsigned long addr;
> >  
> > +	if (unlikely(slot && (slot->flags & KVM_MEM_PCI_HOLE))) {
> > +		memset(data, 0xff, len);
> > +		return 0;
> > +	}
> 
> This feels wrong, shouldn't we be treating PCI_HOLE as MMIO?  Given that
> this is performance oriented, I would think we'd want to leverage the
> GPA from the VMCS instead of doing a full translation.
> 
> That brings up a potential alternative to adding a memslot flag.  What if
> we instead add a KVM_MMIO_BUS device similar to coalesced MMIO?  I think
> it'd be about the same amount of KVM code, and it would provide userspace
> with more flexibility, e.g. I assume it would allow handling even writes
> wholly within the kernel for certain ranges and/or use cases, and it'd
> allow stuffing a value other than 0xff (though I have no idea if there is
> a use case for this).

I still think down the road the way to go is to map
valid RO page full of 0xff to avoid exit on read.
I don't think a KVM_MMIO_BUS device will allow this, will it?


> Speaking of which, why do writes go to userspace in this series?
> 
> > +
> >  	addr = gfn_to_hva_memslot_prot(slot, gfn, NULL);
> >  	if (kvm_is_error_hva(addr))
> >  		return -EFAULT;
> > -- 
> > 2.25.4
> > 



* Re: [PATCH v2 2/3] KVM: x86: introduce KVM_MEM_PCI_HOLE memory
  2020-08-14 14:30     ` Michael S. Tsirkin
@ 2020-08-17 16:32       ` Sean Christopherson
  2020-08-21  1:46         ` Michael S. Tsirkin
  0 siblings, 1 reply; 31+ messages in thread
From: Sean Christopherson @ 2020-08-17 16:32 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Vitaly Kuznetsov, kvm, Paolo Bonzini, Wanpeng Li, Jim Mattson,
	Peter Xu, Julia Suvorova, Andy Lutomirski, Andrew Jones,
	linux-kernel

On Fri, Aug 14, 2020 at 10:30:14AM -0400, Michael S. Tsirkin wrote:
> On Thu, Aug 13, 2020 at 07:31:39PM -0700, Sean Christopherson wrote:
> > > @@ -2318,6 +2338,11 @@ static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
> > >  	int r;
> > >  	unsigned long addr;
> > >  
> > > +	if (unlikely(slot && (slot->flags & KVM_MEM_PCI_HOLE))) {
> > > +		memset(data, 0xff, len);
> > > +		return 0;
> > > +	}
> > 
> > This feels wrong, shouldn't we be treating PCI_HOLE as MMIO?  Given that
> > this is performance oriented, I would think we'd want to leverage the
> > GPA from the VMCS instead of doing a full translation.
> > 
> > That brings up a potential alternative to adding a memslot flag.  What if
> > we instead add a KVM_MMIO_BUS device similar to coalesced MMIO?  I think
> > it'd be about the same amount of KVM code, and it would provide userspace
> > with more flexibility, e.g. I assume it would allow handling even writes
> > wholly within the kernel for certain ranges and/or use cases, and it'd
> > allow stuffing a value other than 0xff (though I have no idea if there is
> > a use case for this).
> 
> I still think down the road the way to go is to map
> valid RO page full of 0xff to avoid exit on read.
> I don't think a KVM_MMIO_BUS device will allow this, will it?

No, it would not, but adding KVM_MEM_PCI_HOLE doesn't get us any closer to
solving that problem either.

What if we add a flag to allow routing all GFNs in a memslot to a single
HVA?  At a glance, it doesn't seem to heinous.  It would have several of the
same touchpoints as this series, e.g. __kvm_set_memory_region() and
kvm_alloc_memslot_metadata().

The functional changes (for x86) would be a few lines in
__gfn_to_hva_memslot() and some new logic in kvm_handle_hva_range().  The
biggest concern is probably the fragility of such an implementation, as KVM
has a habit of open coding operations on memslots.

The new flags could then be paired with KVM_MEM_READONLY to yield the desired
behavior of reading out 0xff for an arbitrary range without requiring copious
memslots and/or host pages.

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 852fc8274bdd..875243a0ab36 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1103,6 +1103,9 @@ __gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn)
 static inline unsigned long
 __gfn_to_hva_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
 {
+       if (unlikely(slot->flags & KVM_MEM_SINGLE_HVA))
+               return slot->userspace_addr;
+
        return slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE;
 }



* Re: [PATCH v2 2/3] KVM: x86: introduce KVM_MEM_PCI_HOLE memory
  2020-08-17 16:32       ` Sean Christopherson
@ 2020-08-21  1:46         ` Michael S. Tsirkin
  2020-08-22  3:19           ` Sean Christopherson
  0 siblings, 1 reply; 31+ messages in thread
From: Michael S. Tsirkin @ 2020-08-21  1:46 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Vitaly Kuznetsov, kvm, Paolo Bonzini, Wanpeng Li, Jim Mattson,
	Peter Xu, Julia Suvorova, Andy Lutomirski, Andrew Jones,
	linux-kernel

On Mon, Aug 17, 2020 at 09:32:07AM -0700, Sean Christopherson wrote:
> On Fri, Aug 14, 2020 at 10:30:14AM -0400, Michael S. Tsirkin wrote:
> > On Thu, Aug 13, 2020 at 07:31:39PM -0700, Sean Christopherson wrote:
> > > > @@ -2318,6 +2338,11 @@ static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
> > > >  	int r;
> > > >  	unsigned long addr;
> > > >  
> > > > +	if (unlikely(slot && (slot->flags & KVM_MEM_PCI_HOLE))) {
> > > > +		memset(data, 0xff, len);
> > > > +		return 0;
> > > > +	}
> > > 
> > > This feels wrong, shouldn't we be treating PCI_HOLE as MMIO?  Given that
> > > this is performance oriented, I would think we'd want to leverage the
> > > GPA from the VMCS instead of doing a full translation.
> > > 
> > > That brings up a potential alternative to adding a memslot flag.  What if
> > > we instead add a KVM_MMIO_BUS device similar to coalesced MMIO?  I think
> > > it'd be about the same amount of KVM code, and it would provide userspace
> > > with more flexibility, e.g. I assume it would allow handling even writes
> > > wholly within the kernel for certain ranges and/or use cases, and it'd
> > > allow stuffing a value other than 0xff (though I have no idea if there is
> > > a use case for this).
> > 
> > I still think down the road the way to go is to map
> > valid RO page full of 0xff to avoid exit on read.
> > I don't think a KVM_MMIO_BUS device will allow this, will it?
> 
> No, it would not, but adding KVM_MEM_PCI_HOLE doesn't get us any closer to
> solving that problem either.

I'm not sure why. Care to elaborate?

> What if we add a flag to allow routing all GFNs in a memslot to a single
> HVA?

An issue here would be this breaks attempts to use a hugepage for this.


>  At a glance, it doesn't seem to heinous.  It would have several of the
> same touchpoints as this series, e.g. __kvm_set_memory_region() and
> kvm_alloc_memslot_metadata().
> 
> The functional changes (for x86) would be a few lines in
> __gfn_to_hva_memslot() and some new logic in kvm_handle_hva_range().  The
> biggest concern is probably the fragility of such an implementation, as KVM
> has a habit of open coding operations on memslots.
> 
> The new flags could then be paired with KVM_MEM_READONLY to yield the desired
> behavior of reading out 0xff for an arbitrary range without requiring copious
> memslots and/or host pages.
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 852fc8274bdd..875243a0ab36 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1103,6 +1103,9 @@ __gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn)
>  static inline unsigned long
>  __gfn_to_hva_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
>  {
> +       if (unlikely(slot->flags & KVM_MEM_SINGLE_HVA))
> +               return slot->userspace_addr;
> +
>         return slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE;
>  }



* Re: [PATCH v2 2/3] KVM: x86: introduce KVM_MEM_PCI_HOLE memory
  2020-08-21  1:46         ` Michael S. Tsirkin
@ 2020-08-22  3:19           ` Sean Christopherson
  0 siblings, 0 replies; 31+ messages in thread
From: Sean Christopherson @ 2020-08-22  3:19 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Vitaly Kuznetsov, kvm, Paolo Bonzini, Wanpeng Li, Jim Mattson,
	Peter Xu, Julia Suvorova, Andy Lutomirski, Andrew Jones,
	linux-kernel

On Thu, Aug 20, 2020 at 09:46:25PM -0400, Michael S. Tsirkin wrote:
> On Mon, Aug 17, 2020 at 09:32:07AM -0700, Sean Christopherson wrote:
> > On Fri, Aug 14, 2020 at 10:30:14AM -0400, Michael S. Tsirkin wrote:
> > > On Thu, Aug 13, 2020 at 07:31:39PM -0700, Sean Christopherson wrote:
> > > > > @@ -2318,6 +2338,11 @@ static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
> > > > >  	int r;
> > > > >  	unsigned long addr;
> > > > >  
> > > > > +	if (unlikely(slot && (slot->flags & KVM_MEM_PCI_HOLE))) {
> > > > > +		memset(data, 0xff, len);
> > > > > +		return 0;
> > > > > +	}
> > > > 
> > > > This feels wrong, shouldn't we be treating PCI_HOLE as MMIO?  Given that
> > > > this is performance oriented, I would think we'd want to leverage the
> > > > GPA from the VMCS instead of doing a full translation.
> > > > 
> > > > That brings up a potential alternative to adding a memslot flag.  What if
> > > > we instead add a KVM_MMIO_BUS device similar to coalesced MMIO?  I think
> > > > it'd be about the same amount of KVM code, and it would provide userspace
> > > > with more flexibility, e.g. I assume it would allow handling even writes
> > > > wholly within the kernel for certain ranges and/or use cases, and it'd
> > > > allow stuffing a value other than 0xff (though I have no idea if there is
> > > > a use case for this).
> > > 
> > > I still think down the road the way to go is to map
> > > valid RO page full of 0xff to avoid exit on read.
> > > I don't think a KVM_MMIO_BUS device will allow this, will it?
> > 
> > No, it would not, but adding KVM_MEM_PCI_HOLE doesn't get us any closer to
> > solving that problem either.
> 
> I'm not sure why. Care to elaborate?

The bulk of the code in this series would get thrown away if KVM_MEM_PCI_HOLE
were reworked to be backed by a physical page.  If we really want a physical
page, then let's use a physical page from the get-go.

I realize I suggested the specialized MMIO idea, but that's when I thought the
primary motivation was memory, not performance.

> > What if we add a flag to allow routing all GFNs in a memslot to a single
> > HVA?
> 
> An issue here would be this breaks attempts to use a hugepage for this.

What are the performance numbers of hugepage vs. aggressively prefetching
SPTEs?  Note, the unbounded prefetching from the original RFC won't fly,
but prefetching 2mb ranges might be reasonable.

Reraising an earlier unanswered question, is enlightening the guest an
option for this use case?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory
  2020-08-07 14:12 [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory Vitaly Kuznetsov
                   ` (2 preceding siblings ...)
  2020-08-07 14:12 ` [PATCH v2 3/3] KVM: selftests: add KVM_MEM_PCI_HOLE test Vitaly Kuznetsov
@ 2020-08-25 21:25 ` Peter Xu
  2020-09-01 14:43   ` Vitaly Kuznetsov
  2020-09-04  7:20   ` Gerd Hoffmann
  3 siblings, 2 replies; 31+ messages in thread
From: Peter Xu @ 2020-08-25 21:25 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: kvm, Paolo Bonzini, Sean Christopherson, Wanpeng Li, Jim Mattson,
	Michael Tsirkin, Julia Suvorova, Andy Lutomirski, Andrew Jones,
	linux-kernel

On Fri, Aug 07, 2020 at 04:12:29PM +0200, Vitaly Kuznetsov wrote:
> When testing Linux kernel boot with QEMU q35 VM and direct kernel boot
> I observed 8193 accesses to PCI hole memory. When such exit is handled
> in KVM without exiting to userspace, it takes roughly 0.000001 sec.
> Handling the same exit in userspace is six times slower (0.000006 sec) so
the overall difference is 0.04 sec. This may be significant for 'microvm'
> ideas.

Sorry to comment so late, but just curious... have you looked at what those
8000+ accesses to PCI holes are and what they're used for?  What I can think
of are some port IO reads (e.g. of the vendor ID field) while the BIOS scans
the attached devices.  Though those should be far fewer than 8000+, and they
should also be pio rather than mmio.

If this is only an overhead for virt (since baremetal mmios should be fast),
I'm also wondering whether we can do even better and skip those pci hole
reads altogether.  Since we know we're virt, that also gives us the
possibility of providing that information in a better way than having the
guest read PCI holes?

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 1/3] KVM: x86: move kvm_vcpu_gfn_to_memslot() out of try_async_pf()
  2020-08-14  1:40   ` Sean Christopherson
@ 2020-09-01 14:15     ` Vitaly Kuznetsov
  2020-09-04  3:47       ` Sean Christopherson
  0 siblings, 1 reply; 31+ messages in thread
From: Vitaly Kuznetsov @ 2020-09-01 14:15 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, Paolo Bonzini, Wanpeng Li, Jim Mattson, Peter Xu,
	Michael Tsirkin, Julia Suvorova, Andy Lutomirski, Andrew Jones,
	linux-kernel

Sean Christopherson <sean.j.christopherson@intel.com> writes:

> On Fri, Aug 07, 2020 at 04:12:30PM +0200, Vitaly Kuznetsov wrote:
>> No functional change intended. Slot flags will need to be analyzed
>> prior to try_async_pf() when KVM_MEM_PCI_HOLE is implemented.
>

(Sorry it took me so long to reply. No, I wasn't hoping for Paolo's
magical "queued, thanks", I just tried to not read my email while on
vacation).

> Why?  Wouldn't it be just as easy, and arguably more appropriate, to add
> KVM_PFN_ERR_PCI_HOLE and update handle_abnormal_pfn() accordingly?
>

Yes, we can do that, but what I don't quite like here is that
try_async_pf() does much more than 'trying async PF'. In particular, it
extracts 'pfn' and this is far from being obvious. Maybe we can rename
try_async_pf() somewhat smartly (e.g. 'try_handle_pf()')? Your
suggestion will make perfect sense to me then.
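
For concreteness, a rough sketch of what that combination could look like;
KVM_PFN_ERR_PCI_HOLE doesn't exist yet, the signature just mirrors the call
sites quoted below, and the existing handle_abnormal_pfn() body is elided, so
treat this as an illustration only:

static bool handle_abnormal_pfn(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn,
				kvm_pfn_t pfn, unsigned int access,
				int *ret_val)
{
	if (unlikely(pfn == KVM_PFN_ERR_PCI_HOLE)) {
		/*
		 * Hand the access to the emulator, which then reads back
		 * all-ones via the __kvm_read_guest_page() hunk in patch 2.
		 */
		*ret_val = RET_PF_EMULATE;
		return true;
	}

	/* ... existing error-pfn and MMIO-caching handling ... */
	return false;
}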

>> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
>> ---
>>  arch/x86/kvm/mmu/mmu.c         | 14 ++++++++------
>>  arch/x86/kvm/mmu/paging_tmpl.h |  7 +++++--
>>  2 files changed, 13 insertions(+), 8 deletions(-)
>> 
>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> index 862bf418214e..fef6956393f7 100644
>> --- a/arch/x86/kvm/mmu/mmu.c
>> +++ b/arch/x86/kvm/mmu/mmu.c
>> @@ -4042,11 +4042,10 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>>  				  kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
>>  }
>>  
>> -static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
>> -			 gpa_t cr2_or_gpa, kvm_pfn_t *pfn, bool write,
>> -			 bool *writable)
>> +static bool try_async_pf(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
>> +			 bool prefault, gfn_t gfn, gpa_t cr2_or_gpa,
>> +			 kvm_pfn_t *pfn, bool write, bool *writable)
>>  {
>> -	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
>>  	bool async;
>>  
>>  	/* Don't expose private memslots to L2. */
>> @@ -4082,7 +4081,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>>  	bool exec = error_code & PFERR_FETCH_MASK;
>>  	bool lpage_disallowed = exec && is_nx_huge_page_enabled();
>>  	bool map_writable;
>> -
>> +	struct kvm_memory_slot *slot;
>>  	gfn_t gfn = gpa >> PAGE_SHIFT;
>>  	unsigned long mmu_seq;
>>  	kvm_pfn_t pfn;
>> @@ -4104,7 +4103,10 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>>  	mmu_seq = vcpu->kvm->mmu_notifier_seq;
>>  	smp_rmb();
>>  
>> -	if (try_async_pf(vcpu, prefault, gfn, gpa, &pfn, write, &map_writable))
>> +	slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
>> +
>> +	if (try_async_pf(vcpu, slot, prefault, gfn, gpa, &pfn, write,
>> +			 &map_writable))
>>  		return RET_PF_RETRY;
>>  
>>  	if (handle_abnormal_pfn(vcpu, is_tdp ? 0 : gpa, gfn, pfn, ACC_ALL, &r))
>> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
>> index 0172a949f6a7..5c6a895f67c3 100644
>> --- a/arch/x86/kvm/mmu/paging_tmpl.h
>> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
>> @@ -779,6 +779,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
>>  	int write_fault = error_code & PFERR_WRITE_MASK;
>>  	int user_fault = error_code & PFERR_USER_MASK;
>>  	struct guest_walker walker;
>> +	struct kvm_memory_slot *slot;
>>  	int r;
>>  	kvm_pfn_t pfn;
>>  	unsigned long mmu_seq;
>> @@ -833,8 +834,10 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
>>  	mmu_seq = vcpu->kvm->mmu_notifier_seq;
>>  	smp_rmb();
>>  
>> -	if (try_async_pf(vcpu, prefault, walker.gfn, addr, &pfn, write_fault,
>> -			 &map_writable))
>> +	slot = kvm_vcpu_gfn_to_memslot(vcpu, walker.gfn);
>> +
>> +	if (try_async_pf(vcpu, slot, prefault, walker.gfn, addr, &pfn,
>> +			 write_fault, &map_writable))
>>  		return RET_PF_RETRY;
>>  
>>  	if (handle_abnormal_pfn(vcpu, addr, walker.gfn, pfn, walker.pte_access, &r))
>> -- 
>> 2.25.4
>> 
>

-- 
Vitaly


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 2/3] KVM: x86: introduce KVM_MEM_PCI_HOLE memory
  2020-08-14  2:31   ` Sean Christopherson
  2020-08-14 14:30     ` Michael S. Tsirkin
@ 2020-09-01 14:39     ` Vitaly Kuznetsov
  2020-09-03  9:35       ` Michael S. Tsirkin
  1 sibling, 1 reply; 31+ messages in thread
From: Vitaly Kuznetsov @ 2020-09-01 14:39 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, Paolo Bonzini, Wanpeng Li, Jim Mattson, Peter Xu,
	Michael Tsirkin, Julia Suvorova, Andy Lutomirski, Andrew Jones,
	linux-kernel

Sean Christopherson <sean.j.christopherson@intel.com> writes:

> On Fri, Aug 07, 2020 at 04:12:31PM +0200, Vitaly Kuznetsov wrote:
>> PCIe config space can (depending on the configuration) be quite big but
>> usually is sparsely populated. Guest may scan it by accessing individual
>> device's page which, when device is missing, is supposed to have 'pci
>> hole' semantics: reads return '0xff' and writes get discarded. Compared
>> to the already existing KVM_MEM_READONLY, VMM doesn't need to allocate
>> real memory and stuff it with '0xff'.
>> 
>> Suggested-by: Michael S. Tsirkin <mst@redhat.com>
>> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
>> ---
>>  Documentation/virt/kvm/api.rst  | 18 ++++++++++-----
>>  arch/x86/include/uapi/asm/kvm.h |  1 +
>>  arch/x86/kvm/mmu/mmu.c          |  5 ++++-
>>  arch/x86/kvm/mmu/paging_tmpl.h  |  3 +++
>>  arch/x86/kvm/x86.c              | 10 ++++++---
>>  include/linux/kvm_host.h        |  3 +++
>>  include/uapi/linux/kvm.h        |  2 ++
>>  virt/kvm/kvm_main.c             | 39 +++++++++++++++++++++++++++------
>>  8 files changed, 64 insertions(+), 17 deletions(-)
>> 
>> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
>> index 644e5326aa50..dc4172352635 100644
>> --- a/Documentation/virt/kvm/api.rst
>> +++ b/Documentation/virt/kvm/api.rst
>> @@ -1241,6 +1241,7 @@ yet and must be cleared on entry.
>>    /* for kvm_memory_region::flags */
>>    #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
>>    #define KVM_MEM_READONLY	(1UL << 1)
>> +  #define KVM_MEM_PCI_HOLE		(1UL << 2)
>>  
>>  This ioctl allows the user to create, modify or delete a guest physical
>>  memory slot.  Bits 0-15 of "slot" specify the slot id and this value
>> @@ -1268,12 +1269,17 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
>>  be identical.  This allows large pages in the guest to be backed by large
>>  pages in the host.
>>  
>> -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
>> -KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
>> -writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
>> -use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
>> -to make a new slot read-only.  In this case, writes to this memory will be
>> -posted to userspace as KVM_EXIT_MMIO exits.
>> +The flags field supports the following flags: KVM_MEM_LOG_DIRTY_PAGES,
>> +KVM_MEM_READONLY, KVM_MEM_PCI_HOLE:
>> +- KVM_MEM_LOG_DIRTY_PAGES: log writes.  Use KVM_GET_DIRTY_LOG to retrieve
>> +  the log.
>> +- KVM_MEM_READONLY: exit to userspace with KVM_EXIT_MMIO on writes.  Only
>> +  available when KVM_CAP_READONLY_MEM is present.
>> +- KVM_MEM_PCI_HOLE: always return 0xff on reads, exit to userspace with
>> +  KVM_EXIT_MMIO on writes.  Only available when KVM_CAP_PCI_HOLE_MEM is
>> +  present.  When setting the memory region 'userspace_addr' must be NULL.
>> +  This flag is mutually exclusive with KVM_MEM_LOG_DIRTY_PAGES and with
>> +  KVM_MEM_READONLY.
>>  
>>  When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
>>  the memory region are automatically reflected into the guest.  For example, an
>> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
>> index 17c5a038f42d..cf80a26d74f5 100644
>> --- a/arch/x86/include/uapi/asm/kvm.h
>> +++ b/arch/x86/include/uapi/asm/kvm.h
>> @@ -48,6 +48,7 @@
>>  #define __KVM_HAVE_XSAVE
>>  #define __KVM_HAVE_XCRS
>>  #define __KVM_HAVE_READONLY_MEM
>> +#define __KVM_HAVE_PCI_HOLE_MEM
>>  
>>  /* Architectural interrupt line count. */
>>  #define KVM_NR_INTERRUPTS 256
>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> index fef6956393f7..4a2a7fface1e 100644
>> --- a/arch/x86/kvm/mmu/mmu.c
>> +++ b/arch/x86/kvm/mmu/mmu.c
>> @@ -3254,7 +3254,7 @@ static int kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, gfn_t gfn,
>>  		return PG_LEVEL_4K;
>>  
>>  	slot = gfn_to_memslot_dirty_bitmap(vcpu, gfn, true);
>> -	if (!slot)
>> +	if (!slot || (slot->flags & KVM_MEM_PCI_HOLE))
>
> This is unnecessary since you're setting disallow_lpage in
> kvm_alloc_memslot_metadata().
>

Yea, redundant precaution, can be dropped.

>>  		return PG_LEVEL_4K;
>>  
>>  	max_level = min(max_level, max_huge_page_level);
>> @@ -4105,6 +4105,9 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>>  
>>  	slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
>>  
>> +	if (!write && slot && (slot->flags & KVM_MEM_PCI_HOLE))
>
> I'm confused.  Why does this short circuit reads but not writes?
>

The idea was that guests shouldn't normally write to these regions and
we may want to catch them if they do. We can short circuit writes too by
simply ignoring them.

>> +		return RET_PF_EMULATE;
>> +
>>  	if (try_async_pf(vcpu, slot, prefault, gfn, gpa, &pfn, write,
>>  			 &map_writable))
>>  		return RET_PF_RETRY;
>> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
>> index 5c6a895f67c3..27abd69e69f6 100644
>> --- a/arch/x86/kvm/mmu/paging_tmpl.h
>> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
>> @@ -836,6 +836,9 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
>>  
>>  	slot = kvm_vcpu_gfn_to_memslot(vcpu, walker.gfn);
>>  
>> +	if (!write_fault && slot && (slot->flags & KVM_MEM_PCI_HOLE))
>> +		return RET_PF_EMULATE;
>> +
>>  	if (try_async_pf(vcpu, slot, prefault, walker.gfn, addr, &pfn,
>>  			 write_fault, &map_writable))
>>  		return RET_PF_RETRY;
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index dc4370394ab8..538bc58a22db 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -3515,6 +3515,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>>  	case KVM_CAP_EXCEPTION_PAYLOAD:
>>  	case KVM_CAP_SET_GUEST_DEBUG:
>>  	case KVM_CAP_LAST_CPU:
>> +	case KVM_CAP_PCI_HOLE_MEM:
>>  		r = 1;
>>  		break;
>>  	case KVM_CAP_SYNC_REGS:
>> @@ -10114,9 +10115,11 @@ static int kvm_alloc_memslot_metadata(struct kvm_memory_slot *slot,
>>  		ugfn = slot->userspace_addr >> PAGE_SHIFT;
>>  		/*
>>  		 * If the gfn and userspace address are not aligned wrt each
>> -		 * other, disable large page support for this slot.
>> +		 * other, disable large page support for this slot. Also,
>> +		 * disable large page support for KVM_MEM_PCI_HOLE slots.
>>  		 */
>> -		if ((slot->base_gfn ^ ugfn) & (KVM_PAGES_PER_HPAGE(level) - 1)) {
>> +		if ((slot->flags & KVM_MEM_PCI_HOLE) || ((slot->base_gfn ^ ugfn) &
>> +				      (KVM_PAGES_PER_HPAGE(level) - 1))) {
>>  			unsigned long j;
>>  
>>  			for (j = 0; j < lpages; ++j)
>> @@ -10178,7 +10181,8 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
>>  	 * Nothing to do for RO slots or CREATE/MOVE/DELETE of a slot.
>>  	 * See comments below.
>>  	 */
>> -	if ((change != KVM_MR_FLAGS_ONLY) || (new->flags & KVM_MEM_READONLY))
>> +	if ((change != KVM_MR_FLAGS_ONLY) ||
>> +	    (new->flags & (KVM_MEM_READONLY | KVM_MEM_PCI_HOLE)))
>>  		return;
>>  
>>  	/*
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index 989afcbe642f..de1faa64a8ef 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -1081,6 +1081,9 @@ __gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn)
>>  static inline unsigned long
>>  __gfn_to_hva_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
>>  {
>> +	/* Should never be called with a KVM_MEM_PCI_HOLE slot */
>> +	BUG_ON(!slot->userspace_addr);
>
> So _technically_, userspace can hit this by allowing virtual address 0,
> which is very much non-standard, but theoretically legal.  It'd probably be
> better to use a value that can't possibly be a valid userspace_addr, e.g. a
> non-canonical value.
>

I think I had a '!(slot->flags & KVM_MEM_PCI_HOLE)' check in a previous
version; we can restore it (if needed) or drop the thing completely.

>> +
>>  	return slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE;
>>  }
>>  
>
> ...
>
>> @@ -2318,6 +2338,11 @@ static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
>>  	int r;
>>  	unsigned long addr;
>>  
>> +	if (unlikely(slot && (slot->flags & KVM_MEM_PCI_HOLE))) {
>> +		memset(data, 0xff, len);
>> +		return 0;
>> +	}
>
> This feels wrong, shouldn't we be treating PCI_HOLE as MMIO?  Given that
> this is performance oriented, I would think we'd want to leverage the
> GPA from the VMCS instead of doing a full translation.
>
> That brings up a potential alternative to adding a memslot flag.  What if
> we instead add a KVM_MMIO_BUS device similar to coalesced MMIO?  I think
> it'd be about the same amount of KVM code, and it would provide userspace
> with more flexibility, e.g. I assume it would allow handling even writes
> wholly within the kernel for certain ranges and/or use cases, and it'd
> allow stuffing a value other than 0xff (though I have no idea if there is
> a use case for this).

I was thinking about making this a bit more generic, like 'immutable'
memory with a userspace-supplied value, e.g. userspace would provide
a region (e.g. a single page which all pages of the slot would map to),
but then I failed to find a use-case for that. The PCI hole semantics
seem to be the only ones we actually need in real life.

Overall, making these PCI holes 'special' memory regions (slots) and
sticking to KVM_SET_USER_MEMORY_REGION feels natural to me. I also think
it would be much easier to consume from the QEMU side as we won't need to
use a 'special' API when things change (e.g. a device gets added and we
need to [un]punch a hole in the 'PCI hole' space).
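
For reference, consuming the flag from userspace is then just an ordinary
memslot update. A minimal sketch, assuming the flag proposed by this series
and made-up slot/address numbers:

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Register a 256 MiB chunk of MMCONFIG space as a 'PCI hole' slot. */
static int set_pci_hole_slot(int vm_fd)
{
	struct kvm_userspace_memory_region region = {
		.slot = 10,			/* example slot id */
		.flags = KVM_MEM_PCI_HOLE,
		.guest_phys_addr = 0xb0000000,	/* example MMCONFIG base */
		.memory_size = 256ULL << 20,
		.userspace_addr = 0,		/* must be NULL per the patch */
	};

	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}

Punching a real device into the hole later is just more
KVM_SET_USER_MEMORY_REGION calls (delete/re-add the affected slots), no new
API involved.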

>
> Speaking of which, why do writes go to userspace in this series?
>

No particular reason, actually; if there is no need to catch such
(stray?) writes we can simply short-circuit them to a nop.
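
Something along these lines, mirroring the read-side hunk (sketch only, not
part of the posted series; the existing function body is elided):

static int __kvm_write_guest_page(struct kvm_memory_slot *memslot, gfn_t gfn,
				  const void *data, int offset, int len)
{
	/*
	 * Hypothetical: silently discard writes to a KVM_MEM_PCI_HOLE slot
	 * instead of exiting to userspace with KVM_EXIT_MMIO.
	 */
	if (unlikely(memslot && (memslot->flags & KVM_MEM_PCI_HOLE)))
		return 0;

	/* ... existing gfn_to_hva_memslot() + __copy_to_user() path ... */
	return 0;
}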

>> +
>>  	addr = gfn_to_hva_memslot_prot(slot, gfn, NULL);
>>  	if (kvm_is_error_hva(addr))
>>  		return -EFAULT;
>> -- 
>> 2.25.4
>> 
>

-- 
Vitaly


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory
  2020-08-25 21:25 ` [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory Peter Xu
@ 2020-09-01 14:43   ` Vitaly Kuznetsov
  2020-09-01 20:00     ` Peter Xu
  2020-09-04  7:20   ` Gerd Hoffmann
  1 sibling, 1 reply; 31+ messages in thread
From: Vitaly Kuznetsov @ 2020-09-01 14:43 UTC (permalink / raw)
  To: Peter Xu
  Cc: kvm, Paolo Bonzini, Sean Christopherson, Wanpeng Li, Jim Mattson,
	Michael Tsirkin, Julia Suvorova, Andy Lutomirski, Andrew Jones,
	linux-kernel

Peter Xu <peterx@redhat.com> writes:

> On Fri, Aug 07, 2020 at 04:12:29PM +0200, Vitaly Kuznetsov wrote:
>> When testing Linux kernel boot with QEMU q35 VM and direct kernel boot
>> I observed 8193 accesses to PCI hole memory. When such exit is handled
>> in KVM without exiting to userspace, it takes roughly 0.000001 sec.
>> Handling the same exit in userspace is six times slower (0.000006 sec) so
> the overall difference is 0.04 sec. This may be significant for 'microvm'
>> ideas.
>
> Sorry to comment so late, but just curious... have you looked at what's those
> 8000+ accesses to PCI holes and what they're used for?  What I can think of are
> some port IO reads (e.g. upon vendor ID field) during BIOS to scan the devices
> attached.  Though those should be far less than 8000+, and those should also be
> pio rather than mmio.

And sorry for replying late)

We explicitly want MMIO instead of PIO to speed things up, afaiu PIO
requires two exits per device (and we exit all the way to
QEMU). Julia/Michael know better about the size of the space.
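
For context, a rough illustration (not from the series) of the legacy
mechanism being avoided: every config dword read through the 0xCF8/0xCFC
ports costs two trapped port accesses, whereas the same register read via
MMCONFIG is a single MMIO access:

#include <stdint.h>
#include <sys/io.h>	/* outl()/inl(); needs ioperm() or iopl() to run */

/* Legacy PCI config read: select the register via port 0xCF8, then read
 * the data window at 0xCFC.  Under KVM each port access is a VM exit. */
static uint32_t pci_legacy_read32(uint8_t bus, uint8_t dev, uint8_t fn,
				  uint8_t reg)
{
	uint32_t addr = (1u << 31) | ((uint32_t)bus << 16) |
			((uint32_t)dev << 11) | ((uint32_t)fn << 8) |
			(reg & 0xfc);

	outl(addr, 0xcf8);	/* exit #1: address */
	return inl(0xcfc);	/* exit #2: data */
}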

>
> If this is only an overhead for virt (since baremetal mmios should be fast),
> I'm also thinking whether we can make it even better to skip those pci hole
> reads.  Because we know we're virt, so it also gives us possibility that we may
> provide those information in a better way than reading PCI holes in the guest?

This means inventing a PV interface, and if we decide to go down this
road, I'd even argue for abandoning PCI completely. E.g. we can do
something similar to Hyper-V's Vmbus.

-- 
Vitaly


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory
  2020-09-01 14:43   ` Vitaly Kuznetsov
@ 2020-09-01 20:00     ` Peter Xu
  2020-09-02  8:59       ` Vitaly Kuznetsov
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Xu @ 2020-09-01 20:00 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: kvm, Paolo Bonzini, Sean Christopherson, Wanpeng Li, Jim Mattson,
	Michael Tsirkin, Julia Suvorova, Andy Lutomirski, Andrew Jones,
	linux-kernel

On Tue, Sep 01, 2020 at 04:43:25PM +0200, Vitaly Kuznetsov wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Fri, Aug 07, 2020 at 04:12:29PM +0200, Vitaly Kuznetsov wrote:
> >> When testing Linux kernel boot with QEMU q35 VM and direct kernel boot
> >> I observed 8193 accesses to PCI hole memory. When such exit is handled
> >> in KVM without exiting to userspace, it takes roughly 0.000001 sec.
> >> Handling the same exit in userspace is six times slower (0.000006 sec) so
> >> the overall difference is 0.04 sec. This may be significant for 'microvm'
> >> ideas.
> >
> > Sorry to comment so late, but just curious... have you looked at what's those
> > 8000+ accesses to PCI holes and what they're used for?  What I can think of are
> > some port IO reads (e.g. upon vendor ID field) during BIOS to scan the devices
> > attached.  Though those should be far less than 8000+, and those should also be
> > pio rather than mmio.
> 
> And sorry for replying late)
> 
> We explicitly want MMIO instead of PIO to speed things up, afaiu PIO
> requires two exits per device (and we exit all the way to
> QEMU). Julia/Michael know better about the size of the space.
> 
> >
> > If this is only an overhead for virt (since baremetal mmios should be fast),
> > I'm also thinking whether we can make it even better to skip those pci hole
> > reads.  Because we know we're virt, so it also gives us possibility that we may
> > provide those information in a better way than reading PCI holes in the guest?
> 
> This means let's invent a PV interface and if we decide to go down this
> road, I'd even argue for abandoning PCI completely. E.g. we can do
> something similar to Hyper-V's Vmbus.

My whole point was more about trying to understand the problem behind it.
Providing a fast path for reading pci holes seems reasonable as is;
it's just that I'm confused about why there are so many reads of the pci
holes after all.  Another important question is how this series will
finally help the microvm use case.  I'm not sure I get the whole point
of it, but... if microvm is the major use case of this, it would be good to
provide some quick numbers on those if possible.

For example, IIUC microvm uses qboot (as a better alternative than seabios) for
fast boot, and qboot has:

https://github.com/bonzini/qboot/blob/master/pci.c#L20

I'm kind of curious whether qboot will still be used when this series is used
with microvm VMs?  Since those are still at least PIO based.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory
  2020-09-01 20:00     ` Peter Xu
@ 2020-09-02  8:59       ` Vitaly Kuznetsov
  2020-09-04  6:12         ` Sean Christopherson
  0 siblings, 1 reply; 31+ messages in thread
From: Vitaly Kuznetsov @ 2020-09-02  8:59 UTC (permalink / raw)
  To: Peter Xu
  Cc: kvm, Paolo Bonzini, Sean Christopherson, Wanpeng Li, Jim Mattson,
	Michael Tsirkin, Julia Suvorova, Andy Lutomirski, Andrew Jones,
	linux-kernel

Peter Xu <peterx@redhat.com> writes:

> On Tue, Sep 01, 2020 at 04:43:25PM +0200, Vitaly Kuznetsov wrote:
>> Peter Xu <peterx@redhat.com> writes:
>> 
>> > On Fri, Aug 07, 2020 at 04:12:29PM +0200, Vitaly Kuznetsov wrote:
>> >> When testing Linux kernel boot with QEMU q35 VM and direct kernel boot
>> >> I observed 8193 accesses to PCI hole memory. When such exit is handled
>> >> in KVM without exiting to userspace, it takes roughly 0.000001 sec.
>> >> Handling the same exit in userspace is six times slower (0.000006 sec) so
> >> >> the overall difference is 0.04 sec. This may be significant for 'microvm'
>> >> ideas.
>> >
>> > Sorry to comment so late, but just curious... have you looked at what's those
>> > 8000+ accesses to PCI holes and what they're used for?  What I can think of are
>> > some port IO reads (e.g. upon vendor ID field) during BIOS to scan the devices
>> > attached.  Though those should be far less than 8000+, and those should also be
>> > pio rather than mmio.
>> 
>> And sorry for replying late)
>> 
>> We explicitly want MMIO instead of PIO to speed things up, afaiu PIO
>> requires two exits per device (and we exit all the way to
>> QEMU). Julia/Michael know better about the size of the space.
>> 
>> >
>> > If this is only an overhead for virt (since baremetal mmios should be fast),
>> > I'm also thinking whether we can make it even better to skip those pci hole
>> > reads.  Because we know we're virt, so it also gives us possibility that we may
>> > provide those information in a better way than reading PCI holes in the guest?
>> 
>> This means let's invent a PV interface and if we decide to go down this
>> road, I'd even argue for abandoning PCI completely. E.g. we can do
>> something similar to Hyper-V's Vmbus.
>
> My whole point was more about trying to understand the problem behind.
> Providing a fast path for reading pci holes seems to be reasonable as is,
> however it's just that I'm confused on why there're so many reads on the pci
> holes after all.  Another important question is I'm wondering how this series
> will finally help the use case of microvm.  I'm not sure I get the whole point
> of it, but... if microvm is the major use case of this, it would be good to
> provide some quick numbers on those if possible.
>
> For example, IIUC microvm uses qboot (as a better alternative than seabios) for
> fast boot, and qboot has:
>
> https://github.com/bonzini/qboot/blob/master/pci.c#L20
>
> I'm kind of curious whether qboot will still be used when this series is used
> with microvm VMs?  Since those are still at least PIO based.

I'm afraid there is no 'grand plan' for everything at this moment :-(
For traditional VMs 0.04 sec per boot is negligible and definitely not
worth adding a feature for; memory requirements are also very
different. When it comes to microvm-style usage things change.

The '8193' PCI hole accesses I mention in the PATCH0 blurb are just from
Linux as I was doing a direct kernel boot; we can't get better than that
(if PCI is in the game, of course). Firmware (qboot, seabios,...) can
only add more. I *think* the plan is to eventually switch them all to
MMCFG, at least for KVM guests, by default, but we need something to put
in the advertisement.

We can, in theory, short circuit PIO in KVM instead but:
- We will need a completely different API
- We will never be able to reach the speed of the exit-less 'single 0xff
page' solution (see my RFC).

-- 
Vitaly


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 2/3] KVM: x86: introduce KVM_MEM_PCI_HOLE memory
  2020-09-01 14:39     ` Vitaly Kuznetsov
@ 2020-09-03  9:35       ` Michael S. Tsirkin
  0 siblings, 0 replies; 31+ messages in thread
From: Michael S. Tsirkin @ 2020-09-03  9:35 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Sean Christopherson, kvm, Paolo Bonzini, Wanpeng Li, Jim Mattson,
	Peter Xu, Julia Suvorova, Andy Lutomirski, Andrew Jones,
	linux-kernel

On Tue, Sep 01, 2020 at 04:39:22PM +0200, Vitaly Kuznetsov wrote:
> Sean Christopherson <sean.j.christopherson@intel.com> writes:
> 
> > On Fri, Aug 07, 2020 at 04:12:31PM +0200, Vitaly Kuznetsov wrote:
> >> PCIe config space can (depending on the configuration) be quite big but
> >> usually is sparsely populated. Guest may scan it by accessing individual
> >> device's page which, when device is missing, is supposed to have 'pci
> >> hole' semantics: reads return '0xff' and writes get discarded. Compared
> >> to the already existing KVM_MEM_READONLY, VMM doesn't need to allocate
> >> real memory and stuff it with '0xff'.
> >> 
> >> Suggested-by: Michael S. Tsirkin <mst@redhat.com>
> >> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> >> ---
> >>  Documentation/virt/kvm/api.rst  | 18 ++++++++++-----
> >>  arch/x86/include/uapi/asm/kvm.h |  1 +
> >>  arch/x86/kvm/mmu/mmu.c          |  5 ++++-
> >>  arch/x86/kvm/mmu/paging_tmpl.h  |  3 +++
> >>  arch/x86/kvm/x86.c              | 10 ++++++---
> >>  include/linux/kvm_host.h        |  3 +++
> >>  include/uapi/linux/kvm.h        |  2 ++
> >>  virt/kvm/kvm_main.c             | 39 +++++++++++++++++++++++++++------
> >>  8 files changed, 64 insertions(+), 17 deletions(-)
> >> 
> >> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> >> index 644e5326aa50..dc4172352635 100644
> >> --- a/Documentation/virt/kvm/api.rst
> >> +++ b/Documentation/virt/kvm/api.rst
> >> @@ -1241,6 +1241,7 @@ yet and must be cleared on entry.
> >>    /* for kvm_memory_region::flags */
> >>    #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
> >>    #define KVM_MEM_READONLY	(1UL << 1)
> >> +  #define KVM_MEM_PCI_HOLE		(1UL << 2)
> >>  
> >>  This ioctl allows the user to create, modify or delete a guest physical
> >>  memory slot.  Bits 0-15 of "slot" specify the slot id and this value
> >> @@ -1268,12 +1269,17 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
> >>  be identical.  This allows large pages in the guest to be backed by large
> >>  pages in the host.
> >>  
> >> -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
> >> -KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
> >> -writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
> >> -use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
> >> -to make a new slot read-only.  In this case, writes to this memory will be
> >> -posted to userspace as KVM_EXIT_MMIO exits.
> >> +The flags field supports the following flags: KVM_MEM_LOG_DIRTY_PAGES,
> >> +KVM_MEM_READONLY, KVM_MEM_PCI_HOLE:
> >> +- KVM_MEM_LOG_DIRTY_PAGES: log writes.  Use KVM_GET_DIRTY_LOG to retrieve
> >> +  the log.
> >> +- KVM_MEM_READONLY: exit to userspace with KVM_EXIT_MMIO on writes.  Only
> >> +  available when KVM_CAP_READONLY_MEM is present.
> >> +- KVM_MEM_PCI_HOLE: always return 0xff on reads, exit to userspace with
> >> +  KVM_EXIT_MMIO on writes.  Only available when KVM_CAP_PCI_HOLE_MEM is
> >> +  present.  When setting the memory region 'userspace_addr' must be NULL.
> >> +  This flag is mutually exclusive with KVM_MEM_LOG_DIRTY_PAGES and with
> >> +  KVM_MEM_READONLY.
> >>  
> >>  When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
> >>  the memory region are automatically reflected into the guest.  For example, an
> >> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> >> index 17c5a038f42d..cf80a26d74f5 100644
> >> --- a/arch/x86/include/uapi/asm/kvm.h
> >> +++ b/arch/x86/include/uapi/asm/kvm.h
> >> @@ -48,6 +48,7 @@
> >>  #define __KVM_HAVE_XSAVE
> >>  #define __KVM_HAVE_XCRS
> >>  #define __KVM_HAVE_READONLY_MEM
> >> +#define __KVM_HAVE_PCI_HOLE_MEM
> >>  
> >>  /* Architectural interrupt line count. */
> >>  #define KVM_NR_INTERRUPTS 256
> >> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> >> index fef6956393f7..4a2a7fface1e 100644
> >> --- a/arch/x86/kvm/mmu/mmu.c
> >> +++ b/arch/x86/kvm/mmu/mmu.c
> >> @@ -3254,7 +3254,7 @@ static int kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, gfn_t gfn,
> >>  		return PG_LEVEL_4K;
> >>  
> >>  	slot = gfn_to_memslot_dirty_bitmap(vcpu, gfn, true);
> >> -	if (!slot)
> >> +	if (!slot || (slot->flags & KVM_MEM_PCI_HOLE))
> >
> > This is unnecessary since you're setting disallow_lpage in
> > kvm_alloc_memslot_metadata().
> >
> 
> Yea, redundant precaution, can be dropped.
> 
> >>  		return PG_LEVEL_4K;
> >>  
> >>  	max_level = min(max_level, max_huge_page_level);
> >> @@ -4105,6 +4105,9 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> >>  
> >>  	slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
> >>  
> >> +	if (!write && slot && (slot->flags & KVM_MEM_PCI_HOLE))
> >
> > I'm confused.  Why does this short circuit reads but not writes?
> >
> 
> The idea was that guests shouldn't normally write to these regions and
> we may want to catch them if they do. We can short circuit writes too by
> simply ignoring them.

Another point is that writes by guests might need to set bits
in the root port error reporting cap.

> >> +		return RET_PF_EMULATE;
> >> +
> >>  	if (try_async_pf(vcpu, slot, prefault, gfn, gpa, &pfn, write,
> >>  			 &map_writable))
> >>  		return RET_PF_RETRY;
> >> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> >> index 5c6a895f67c3..27abd69e69f6 100644
> >> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> >> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> >> @@ -836,6 +836,9 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
> >>  
> >>  	slot = kvm_vcpu_gfn_to_memslot(vcpu, walker.gfn);
> >>  
> >> +	if (!write_fault && slot && (slot->flags & KVM_MEM_PCI_HOLE))
> >> +		return RET_PF_EMULATE;
> >> +
> >>  	if (try_async_pf(vcpu, slot, prefault, walker.gfn, addr, &pfn,
> >>  			 write_fault, &map_writable))
> >>  		return RET_PF_RETRY;
> >> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >> index dc4370394ab8..538bc58a22db 100644
> >> --- a/arch/x86/kvm/x86.c
> >> +++ b/arch/x86/kvm/x86.c
> >> @@ -3515,6 +3515,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> >>  	case KVM_CAP_EXCEPTION_PAYLOAD:
> >>  	case KVM_CAP_SET_GUEST_DEBUG:
> >>  	case KVM_CAP_LAST_CPU:
> >> +	case KVM_CAP_PCI_HOLE_MEM:
> >>  		r = 1;
> >>  		break;
> >>  	case KVM_CAP_SYNC_REGS:
> >> @@ -10114,9 +10115,11 @@ static int kvm_alloc_memslot_metadata(struct kvm_memory_slot *slot,
> >>  		ugfn = slot->userspace_addr >> PAGE_SHIFT;
> >>  		/*
> >>  		 * If the gfn and userspace address are not aligned wrt each
> >> -		 * other, disable large page support for this slot.
> >> +		 * other, disable large page support for this slot. Also,
> >> +		 * disable large page support for KVM_MEM_PCI_HOLE slots.
> >>  		 */
> >> -		if ((slot->base_gfn ^ ugfn) & (KVM_PAGES_PER_HPAGE(level) - 1)) {
> >> +		if ((slot->flags & KVM_MEM_PCI_HOLE) || ((slot->base_gfn ^ ugfn) &
> >> +				      (KVM_PAGES_PER_HPAGE(level) - 1))) {
> >>  			unsigned long j;
> >>  
> >>  			for (j = 0; j < lpages; ++j)
> >> @@ -10178,7 +10181,8 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
> >>  	 * Nothing to do for RO slots or CREATE/MOVE/DELETE of a slot.
> >>  	 * See comments below.
> >>  	 */
> >> -	if ((change != KVM_MR_FLAGS_ONLY) || (new->flags & KVM_MEM_READONLY))
> >> +	if ((change != KVM_MR_FLAGS_ONLY) ||
> >> +	    (new->flags & (KVM_MEM_READONLY | KVM_MEM_PCI_HOLE)))
> >>  		return;
> >>  
> >>  	/*
> >> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> >> index 989afcbe642f..de1faa64a8ef 100644
> >> --- a/include/linux/kvm_host.h
> >> +++ b/include/linux/kvm_host.h
> >> @@ -1081,6 +1081,9 @@ __gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn)
> >>  static inline unsigned long
> >>  __gfn_to_hva_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
> >>  {
> >> +	/* Should never be called with a KVM_MEM_PCI_HOLE slot */
> >> +	BUG_ON(!slot->userspace_addr);
> >
> > So _technically_, userspace can hit this by allowing virtual address 0,
> > which is very much non-standard, but theoretically legal.  It'd probably be
> > better to use a value that can't possibly be a valid userspace_addr, e.g. a
> > non-canonical value.
> >
> 
> I think I had '!(slot->flags & KVM_MEM_PCI_HOLE)' check in a previous
> version, we can restore it (if needed) or drop the thing completely.
> 
> >> +
> >>  	return slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE;
> >>  }
> >>  
> >
> > ...
> >
> >> @@ -2318,6 +2338,11 @@ static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
> >>  	int r;
> >>  	unsigned long addr;
> >>  
> >> +	if (unlikely(slot && (slot->flags & KVM_MEM_PCI_HOLE))) {
> >> +		memset(data, 0xff, len);
> >> +		return 0;
> >> +	}
> >
> > This feels wrong, shouldn't we be treating PCI_HOLE as MMIO?  Given that
> > this is performance oriented, I would think we'd want to leverage the
> > GPA from the VMCS instead of doing a full translation.
> >
> > That brings up a potential alternative to adding a memslot flag.  What if
> > we instead add a KVM_MMIO_BUS device similar to coalesced MMIO?  I think
> > it'd be about the same amount of KVM code, and it would provide userspace
> > with more flexibility, e.g. I assume it would allow handling even writes
> > wholly within the kernel for certain ranges and/or use cases, and it'd
> > allow stuffing a value other than 0xff (though I have no idea if there is
> > a use case for this).
> 
> I was thinking about making this a bit more generic, like 'immutable'
> memory with a userspace-supplied values, e.g. userspace would be
> providing a region (e.g. a single page which will be mapped to by all
> pages of the slot) but then I failed to find a use-case for that. The
> PCI hole semantics seems to be the only one we actually need in the real
> life.
> 
> Overall, making these PCI holes 'special' memory regions (slots) and
> sticking to KVM_SET_USER_MEMORY_REGION feels natural to me. I also think
> it would be much easier to consume from QEMU side as we won't need to
> use a 'special' API when things change (e.g. a device gets added and we
> need to [un]punch a hole in the 'PCI hole' space).
> 
> >
> > Speaking of which, why do writes go to userspace in this series?
> >
> 
> No particular reason actually, if there is no need to catch such
> (stray?) writes we can simply short-circuit them to nop.  

With AER we might want to handle them I think.

> >> +
> >>  	addr = gfn_to_hva_memslot_prot(slot, gfn, NULL);
> >>  	if (kvm_is_error_hva(addr))
> >>  		return -EFAULT;
> >> -- 
> >> 2.25.4
> >> 
> >
> 
> -- 
> Vitaly


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 1/3] KVM: x86: move kvm_vcpu_gfn_to_memslot() out of try_async_pf()
  2020-09-01 14:15     ` Vitaly Kuznetsov
@ 2020-09-04  3:47       ` Sean Christopherson
  0 siblings, 0 replies; 31+ messages in thread
From: Sean Christopherson @ 2020-09-04  3:47 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: kvm, Paolo Bonzini, Wanpeng Li, Jim Mattson, Peter Xu,
	Michael Tsirkin, Julia Suvorova, Andy Lutomirski, Andrew Jones,
	linux-kernel

On Tue, Sep 01, 2020 at 04:15:07PM +0200, Vitaly Kuznetsov wrote:
> Sean Christopherson <sean.j.christopherson@intel.com> writes:
> 
> > On Fri, Aug 07, 2020 at 04:12:30PM +0200, Vitaly Kuznetsov wrote:
> >> No functional change intended. Slot flags will need to be analyzed
> >> prior to try_async_pf() when KVM_MEM_PCI_HOLE is implemented.
> >
> 
> (Sorry it took me so long to reply. No, I wasn't hoping for Paolo's
> magical "queued, thanks", I just tried to not read my email while on
> vacation).
> 
> > Why?  Wouldn't it be just as easy, and arguably more appropriate, to add
> > KVM_PFN_ERR_PCI_HOLE and update handle_abnormal_pfn() accordingly?
> >
> 
> Yes, we can do that, but what I don't quite like here is that
> try_async_pf() does much more than 'trying async PF'. In particular, it
> extracts 'pfn' and this is far from being obvious. Maybe we can rename
> try_async_pf() somewhat smartly (e.g. 'try_handle_pf()')? Your
> suggestion will make perfect sense to me then.

Ya, try_async_pf() is a horrible name.  try_handle_pf() isn't bad, but it's
not technically handling the fault.  Maybe try_get_pfn() with an inverted
return?

	if (!try_get_pfn(...))
		return RET_PF_RETRY;

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory
  2020-09-02  8:59       ` Vitaly Kuznetsov
@ 2020-09-04  6:12         ` Sean Christopherson
  2020-09-04  7:29           ` Gerd Hoffmann
                             ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Sean Christopherson @ 2020-09-04  6:12 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Peter Xu, kvm, Paolo Bonzini, Wanpeng Li, Jim Mattson,
	Michael Tsirkin, Julia Suvorova, Andy Lutomirski, Andrew Jones,
	linux-kernel

On Wed, Sep 02, 2020 at 10:59:20AM +0200, Vitaly Kuznetsov wrote:
> Peter Xu <peterx@redhat.com> writes:
> > My whole point was more about trying to understand the problem behind.
> > Providing a fast path for reading pci holes seems to be reasonable as is,
> > however it's just that I'm confused on why there're so many reads on the pci
> > holes after all.  Another important question is I'm wondering how this series
> > will finally help the use case of microvm.  I'm not sure I get the whole point
> > of it, but... if microvm is the major use case of this, it would be good to
> > provide some quick numbers on those if possible.
> >
> > For example, IIUC microvm uses qboot (as a better alternative than seabios) for
> > fast boot, and qboot has:
> >
> > https://github.com/bonzini/qboot/blob/master/pci.c#L20
> >
> > I'm kind of curious whether qboot will still be used when this series is used
> > with microvm VMs?  Since those are still at least PIO based.
> 
> I'm afraid there is no 'grand plan' for everything at this moment :-(
> For traditional VMs 0.04 sec per boot is negligible and definitely not
> worth adding a feature, memory requirements are also very
> different. When it comes to microvm-style usage things change.
> 
> '8193' PCI hole accesses I mention in the PATCH0 blurb are just from
> Linux as I was doing direct kernel boot, we can't get better than that
> (if PCI is in the game of course). Firmware (qboot, seabios,...) can
> only add more. I *think* the plan is to eventually switch them all to
> MMCFG, at least for KVM guests, by default but we need something to put
> to the advertisement. 

I see a similar ~8k PCI hole reads with a -kernel boot w/ OVMF.  All but 60
of those are from pcibios_fixup_peer_bridges(), and all are from the kernel.
My understanding is that pcibios_fixup_peer_bridges() is useful if and only
if there are multiple root buses.  And AFAICT, when running under QEMU, the only
way for there to be multiple buses is if there is an explicit bridge
created ("pxb" or "pxb-pcie").  Based on the cover letter from those[*], the
main reason for creating a bridge is to handle pinned CPUs on a NUMA system
with pass-through devices.  That use case seems highly unlikely to cross
paths with micro VMs, i.e. micro VMs will only ever have a single bus.
Unless I'm mistaken, microvm doesn't even support PCI, does it?

If all of the above is true, this can be handled by adding "pci=lastbus=0"
as a guest kernel param to override its scanning of buses.  And couldn't
that be done by QEMU's microvm_fix_kernel_cmdline() to make it transparent
to the end user?

[*] https://www.redhat.com/archives/libvir-list/2016-March/msg01213.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory
  2020-08-25 21:25 ` [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory Peter Xu
  2020-09-01 14:43   ` Vitaly Kuznetsov
@ 2020-09-04  7:20   ` Gerd Hoffmann
  1 sibling, 0 replies; 31+ messages in thread
From: Gerd Hoffmann @ 2020-09-04  7:20 UTC (permalink / raw)
  To: Peter Xu
  Cc: Vitaly Kuznetsov, kvm, Paolo Bonzini, Sean Christopherson,
	Wanpeng Li, Jim Mattson, Michael Tsirkin, Julia Suvorova,
	Andy Lutomirski, Andrew Jones, linux-kernel

  Hi,

> attached.  Though those should be far less than 8000+, and those should also be
> pio rather than mmio.

Well, seabios 1.14 (qemu 5.1) added mmio support (to speed up boot a
little bit; mmio is one vmexit and pio is two).

Depends on q35 obviously, and a few pio accesses remain b/c seabios has
to first setup mmconfig.

take care,
  Gerd


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory
  2020-09-04  6:12         ` Sean Christopherson
@ 2020-09-04  7:29           ` Gerd Hoffmann
  2020-09-04 16:00             ` Sean Christopherson
  2020-09-07 10:52             ` Michael S. Tsirkin
  2020-09-07 10:54           ` Michael S. Tsirkin
  2020-09-18  8:49           ` Gerd Hoffmann
  2 siblings, 2 replies; 31+ messages in thread
From: Gerd Hoffmann @ 2020-09-04  7:29 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Vitaly Kuznetsov, Peter Xu, kvm, Paolo Bonzini, Wanpeng Li,
	Jim Mattson, Michael Tsirkin, Julia Suvorova, Andy Lutomirski,
	Andrew Jones, linux-kernel

  Hi,

> Unless I'm mistaken, microvm doesn't even support PCI, does it?

Correct, no pci support right now.

We could probably wire up ecam (arm/virt style) for pcie support, once
the acpi support for microvm finally landed (we need acpi for that
because otherwise the kernel wouldn't find the pcie bus).

Question is whether there is a good reason to do so.  Why would someone
prefer microvm with pcie support over q35?

> If all of the above is true, this can be handled by adding "pci=lastbus=0"
> as a guest kernel param to override its scanning of buses.  And couldn't
> that be done by QEMU's microvm_fix_kernel_cmdline() to make it transparent
> to the end user?

microvm_fix_kernel_cmdline() is a hack, not a solution.

Beside that I doubt this has much of an effect on microvm because
it doesn't support pcie in the first place.

take care,
  Gerd


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory
  2020-09-04  7:29           ` Gerd Hoffmann
@ 2020-09-04 16:00             ` Sean Christopherson
  2020-09-07  8:37               ` Vitaly Kuznetsov
  2020-09-07 10:52             ` Michael S. Tsirkin
  1 sibling, 1 reply; 31+ messages in thread
From: Sean Christopherson @ 2020-09-04 16:00 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: Vitaly Kuznetsov, Peter Xu, kvm, Paolo Bonzini, Wanpeng Li,
	Jim Mattson, Michael Tsirkin, Julia Suvorova, Andy Lutomirski,
	Andrew Jones, linux-kernel

On Fri, Sep 04, 2020 at 09:29:05AM +0200, Gerd Hoffmann wrote:
>   Hi,
> 
> > Unless I'm mistaken, microvm doesn't even support PCI, does it?
> 
> Correct, no pci support right now.
> 
> We could probably wire up ecam (arm/virt style) for pcie support, once
> the acpi support for mictovm finally landed (we need acpi for that
> because otherwise the kernel wouldn't find the pcie bus).
> 
> Question is whether there is a good reason to do so.  Why would someone
> prefer microvm with pcie support over q35?
> 
> > If all of the above is true, this can be handled by adding "pci=lastbus=0"
> > as a guest kernel param to override its scanning of buses.  And couldn't
> > that be done by QEMU's microvm_fix_kernel_cmdline() to make it transparent
> > to the end user?
> 
> microvm_fix_kernel_cmdline() is a hack, not a solution.
> 
> Beside that I doubt this has much of an effect on microvm because
> it doesn't support pcie in the first place.

I am so confused.  Vitaly, can you clarify exactly what QEMU VM type this
series is intended to help?  If this is for microvm, then why is the guest
doing PCI scanning in the first place?  If it's for q35, why is the
justification for microvm-like workloads?

Either way, I think it makes sense to explore other options before throwing
something into KVM, e.g. modifying guest command line, adding a KVM hint,
"fixing" QEMU, etc... 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory
  2020-09-04 16:00             ` Sean Christopherson
@ 2020-09-07  8:37               ` Vitaly Kuznetsov
  2020-09-07 11:32                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 31+ messages in thread
From: Vitaly Kuznetsov @ 2020-09-07  8:37 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Peter Xu, kvm, Paolo Bonzini, Wanpeng Li, Jim Mattson,
	Michael Tsirkin, Julia Suvorova, Andy Lutomirski, Andrew Jones,
	linux-kernel, Gerd Hoffmann

Sean Christopherson <sean.j.christopherson@intel.com> writes:

> On Fri, Sep 04, 2020 at 09:29:05AM +0200, Gerd Hoffmann wrote:
>>   Hi,
>> 
>> > Unless I'm mistaken, microvm doesn't even support PCI, does it?
>> 
>> Correct, no pci support right now.
>> 
>> We could probably wire up ecam (arm/virt style) for pcie support, once
>> the acpi support for microvm finally landed (we need acpi for that
>> because otherwise the kernel wouldn't find the pcie bus).
>> 
>> Question is whether there is a good reason to do so.  Why would someone
>> prefer microvm with pcie support over q35?
>> 
>> > If all of the above is true, this can be handled by adding "pci=lastbus=0"
>> > as a guest kernel param to override its scanning of buses.  And couldn't
>> > that be done by QEMU's microvm_fix_kernel_cmdline() to make it transparent
>> > to the end user?
>> 
>> microvm_fix_kernel_cmdline() is a hack, not a solution.
>> 
>> Beside that I doubt this has much of an effect on microvm because
>> it doesn't support pcie in the first place.
>
> I am so confused.  Vitaly, can you clarify exactly what QEMU VM type this
> series is intended to help?  If this is for microvm, then why is the guest
> doing PCI scanning in the first place?  If it's for q35, why is the
> justification for microvm-like workloads?

I'm not exactly sure about the plans for particular machine types; the
intention was to use this for pcie in QEMU in general, so whatever
machine type uses pcie will benefit.

Now, it seems that we have a more sophisticated landscape. The
optimization will only make sense to speed up boot, so all 'traditional'
VM types with 'traditional' firmware are out of the
question. 'Container-like' VMs seem to avoid PCI for now; I'm not sure
if it's because they're in the early stages of their development, because
they can get away without PCI or, actually, because of slowness at boot
(which we're trying to tackle with this feature). I'd definitely like to
hear what people think about this.

>
> Either way, I think it makes sense explore other options before throwing
> something into KVM, e.g. modifying guest command line, adding a KVM hint,
> "fixing" QEMU, etc... 
>

Initially, this feature looked like a small and straightforward
(micro-)optimization to me: memory regions with 'PCI hole' semantics do
exist and we can speed up access to them. Ideally, I'd like to find
other 'constant memory' regions requiring fast access and come up with
an interface to create them in KVM but so far nothing interesting came
up...

-- 
Vitaly


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory
  2020-09-04  7:29           ` Gerd Hoffmann
  2020-09-04 16:00             ` Sean Christopherson
@ 2020-09-07 10:52             ` Michael S. Tsirkin
  2020-09-18  9:33               ` Gerd Hoffmann
  1 sibling, 1 reply; 31+ messages in thread
From: Michael S. Tsirkin @ 2020-09-07 10:52 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: Sean Christopherson, Vitaly Kuznetsov, Peter Xu, kvm,
	Paolo Bonzini, Wanpeng Li, Jim Mattson, Julia Suvorova,
	Andy Lutomirski, Andrew Jones, linux-kernel

On Fri, Sep 04, 2020 at 09:29:05AM +0200, Gerd Hoffmann wrote:
>   Hi,
> 
> > Unless I'm mistaken, microvm doesn't even support PCI, does it?
> 
> Correct, no pci support right now.
> 
> We could probably wire up ecam (arm/virt style) for pcie support, once
> the acpi support for mictovm finally landed (we need acpi for that
> because otherwise the kernel wouldn't find the pcie bus).
> 
> Question is whether there is a good reason to do so.  Why would someone
> prefer microvm with pcie support over q35?

The usual reasons to use pcie apply to microvm just the same.
E.g.: pass through of pcie devices?

> > If all of the above is true, this can be handled by adding "pci=lastbus=0"
> > as a guest kernel param to override its scanning of buses.  And couldn't
> > that be done by QEMU's microvm_fix_kernel_cmdline() to make it transparent
> > to the end user?
> 
> microvm_fix_kernel_cmdline() is a hack, not a solution.
> 
> Beside that I doubt this has much of an effect on microvm because
> it doesn't support pcie in the first place.
> 
> take care,
>   Gerd


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory
  2020-09-04  6:12         ` Sean Christopherson
  2020-09-04  7:29           ` Gerd Hoffmann
@ 2020-09-07 10:54           ` Michael S. Tsirkin
  2020-09-18  8:49           ` Gerd Hoffmann
  2 siblings, 0 replies; 31+ messages in thread
From: Michael S. Tsirkin @ 2020-09-07 10:54 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Vitaly Kuznetsov, Peter Xu, kvm, Paolo Bonzini, Wanpeng Li,
	Jim Mattson, Julia Suvorova, Andy Lutomirski, Andrew Jones,
	linux-kernel

On Thu, Sep 03, 2020 at 11:12:12PM -0700, Sean Christopherson wrote:
> On Wed, Sep 02, 2020 at 10:59:20AM +0200, Vitaly Kuznetsov wrote:
> > Peter Xu <peterx@redhat.com> writes:
> > > My whole point was more about trying to understand the problem behind.
> > > Providing a fast path for reading pci holes seems to be reasonable as is,
> > > however it's just that I'm confused on why there're so many reads on the pci
> > > holes after all.  Another important question is I'm wondering how this series
> > > will finally help the use case of microvm.  I'm not sure I get the whole point
> > > of it, but... if microvm is the major use case of this, it would be good to
> > > provide some quick numbers on those if possible.
> > >
> > > For example, IIUC microvm uses qboot (as a better alternative than seabios) for
> > > fast boot, and qboot has:
> > >
> > > https://github.com/bonzini/qboot/blob/master/pci.c#L20
> > >
> > > I'm kind of curious whether qboot will still be used when this series is used
> > > with microvm VMs?  Since those are still at least PIO based.
> > 
> > I'm afraid there is no 'grand plan' for everything at this moment :-(
> > For traditional VMs 0.04 sec per boot is negligible and definitely not
> > worth adding a feature, memory requirements are also very
> > different. When it comes to microvm-style usage things change.
> > 
> > '8193' PCI hole accesses I mention in the PATCH0 blurb are just from
> > Linux as I was doing direct kernel boot, we can't get better than that
> > (if PCI is in the game of course). Firmware (qboot, seabios,...) can
> > only add more. I *think* the plan is to eventually switch them all to
> > MMCFG, at least for KVM guests, by default but we need something to put
> > to the advertisement. 
> 
> I see a similar ~8k PCI hole reads with a -kernel boot w/ OVMF.  All but 60
> of those are from pcibios_fixup_peer_bridges(), and all are from the kernel.
> My understanding is that pcibios_fixup_peer_bridges() is useful if and only
> if there are multiple root buses.  And AFAICT, when running under QEMU, the only
> way for there to be multiple buses is if there is an explicit bridge
> created ("pxb" or "pxb-pcie").  Based on the cover letter from those[*], the
> main reason for creating a bridge is to handle pinned CPUs on a NUMA system
> with pass-through devices.  That use case seems highly unlikely to cross
> paths with micro VMs, i.e. micro VMs will only ever have a single bus.

My position is that it's not all black and white; workloads do not
cleanly partition into those that care about boot speed and those
that don't. So IMHO we care about boot speed with pcie even if
microvm does not use it at the moment.


> Unless I'm mistaken, microvm doesn't even support PCI, does it?
> 
> If all of the above is true, this can be handled by adding "pci=lastbus=0"
> as a guest kernel param to override its scanning of buses.  And couldn't
> that be done by QEMU's microvm_fix_kernel_cmdline() to make it transparent
> to the end user?
> 
> [*] https://www.redhat.com/archives/libvir-list/2016-March/msg01213.html
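
For illustration, a minimal sketch of what that override would look like on a
direct-kernel-boot command line (the binary, the image paths and the rest of
the -append string are placeholders; only pci=lastbus=0 comes from the
suggestion quoted above):

    qemu-system-x86_64 -machine microvm -enable-kvm -m 512 \
        -kernel bzImage \
        -append "console=ttyS0 pci=lastbus=0"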


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory
  2020-09-07  8:37               ` Vitaly Kuznetsov
@ 2020-09-07 11:32                 ` Michael S. Tsirkin
  2020-09-11 17:00                   ` Sean Christopherson
  0 siblings, 1 reply; 31+ messages in thread
From: Michael S. Tsirkin @ 2020-09-07 11:32 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Sean Christopherson, Peter Xu, kvm, Paolo Bonzini, Wanpeng Li,
	Jim Mattson, Julia Suvorova, Andy Lutomirski, Andrew Jones,
	linux-kernel, Gerd Hoffmann

On Mon, Sep 07, 2020 at 10:37:39AM +0200, Vitaly Kuznetsov wrote:
> Sean Christopherson <sean.j.christopherson@intel.com> writes:
> 
> > On Fri, Sep 04, 2020 at 09:29:05AM +0200, Gerd Hoffmann wrote:
> >>   Hi,
> >> 
> >> > Unless I'm mistaken, microvm doesn't even support PCI, does it?
> >> 
> >> Correct, no pci support right now.
> >> 
> >> We could probably wire up ecam (arm/virt style) for pcie support, once
> >> the acpi support for microvm has finally landed (we need acpi for that
> >> because otherwise the kernel wouldn't find the pcie bus).
> >> 
> >> Question is whether there is a good reason to do so.  Why would someone
> >> prefer microvm with pcie support over q35?
> >> 
> >> > If all of the above is true, this can be handled by adding "pci=lastbus=0"
> >> > as a guest kernel param to override its scanning of buses.  And couldn't
> >> > that be done by QEMU's microvm_fix_kernel_cmdline() to make it transparent
> >> > to the end user?
> >> 
> >> microvm_fix_kernel_cmdline() is a hack, not a solution.
> >> 
> >> Besides that, I doubt this has much of an effect on microvm because
> >> it doesn't support pcie in the first place.
> >
> > I am so confused.  Vitaly, can you clarify exactly what QEMU VM type this
> > series is intended to help?  If this is for microvm, then why is the guest
> > doing PCI scanning in the first place?  If it's for q35, why is the
> > justification based on microvm-like workloads?
> 
> I'm not exactly sure about the plans for particular machine types; the
> intention was to use this for pcie in QEMU in general, so whatever
> machine type uses pcie will benefit. 
> 
> Now, it seems that we have a more complicated landscape. The
> optimization only makes sense as a boot speedup, so all 'traditional'
> VM types with 'traditional' firmware are out of the
> question. 'Container-like' VMs seem to avoid PCI for now; I'm not sure
> if it's because they're in early stages of their development, because
> they can get away without PCI or, actually, because of slowness at boot
> (which we're trying to tackle with this feature). I'd definitely like to
> hear more about what people think about this.

I suspect microvms will need pci eventually. I would much rather KVM
had an exit-less discovery mechanism in place by then because,
learning from history, if it doesn't they will do some kind of
hack on the kernel command line, and everyone will be stuck
supporting that for years ...

> >
> > Either way, I think it makes sense explore other options before throwing
> > something into KVM, e.g. modifying guest command line, adding a KVM hint,
> > "fixing" QEMU, etc... 
> >
> 
> Initially, this feature looked like a small and straightforward
> (micro-)optimization to me: memory regions with 'PCI hole' semantics do
> exist and we can speed up access to them. Ideally, I'd like to find
> other 'constant memory' regions requiring fast access and come up with
> an interface to create them in KVM, but so far nothing interesting has
> come up...

True, me neither.
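
As a concrete illustration of those 'PCI hole' semantics, here is a minimal
sketch of how userspace might register such a slot with the flag this series
proposes.  The flag's numeric value below is a placeholder and the helper is
hypothetical; only KVM_SET_USER_MEMORY_REGION and its struct are existing
KVM UAPI:

    #include <linux/kvm.h>
    #include <sys/ioctl.h>

    #ifndef KVM_MEM_PCI_HOLE
    #define KVM_MEM_PCI_HOLE (1UL << 2)   /* placeholder value, for illustration only */
    #endif

    /* Mark a guest-physical range as a 'PCI hole': no host memory backs it,
     * reads are handled inside KVM and return 0xff, writes still exit to
     * userspace. */
    static int add_pci_hole_slot(int vm_fd, __u32 slot, __u64 gpa, __u64 size)
    {
        struct kvm_userspace_memory_region region = {
            .slot            = slot,
            .flags           = KVM_MEM_PCI_HOLE,
            .guest_phys_addr = gpa,    /* e.g. ECAM pages of absent devices */
            .memory_size     = size,
            .userspace_addr  = 0,      /* no backing memory needed */
        };

        return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
    }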

> -- 
> Vitaly


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory
  2020-09-07 11:32                 ` Michael S. Tsirkin
@ 2020-09-11 17:00                   ` Sean Christopherson
  2020-09-18 12:34                     ` Michael S. Tsirkin
  0 siblings, 1 reply; 31+ messages in thread
From: Sean Christopherson @ 2020-09-11 17:00 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Vitaly Kuznetsov, Peter Xu, kvm, Paolo Bonzini, Wanpeng Li,
	Jim Mattson, Julia Suvorova, Andy Lutomirski, Andrew Jones,
	linux-kernel, Gerd Hoffmann

On Mon, Sep 07, 2020 at 07:32:23AM -0400, Michael S. Tsirkin wrote:
> On Mon, Sep 07, 2020 at 10:37:39AM +0200, Vitaly Kuznetsov wrote:
> > Sean Christopherson <sean.j.christopherson@intel.com> writes:
> > 
> > > On Fri, Sep 04, 2020 at 09:29:05AM +0200, Gerd Hoffmann wrote:
> > >>   Hi,
> > >> 
> > >> > Unless I'm mistaken, microvm doesn't even support PCI, does it?
> > >> 
> > >> Correct, no pci support right now.
> > >> 
> > >> We could probably wire up ecam (arm/virt style) for pcie support, once
> > >> the acpi support for microvm has finally landed (we need acpi for that
> > >> because otherwise the kernel wouldn't find the pcie bus).
> > >> 
> > >> Question is whether there is a good reason to do so.  Why would someone
> > >> prefer microvm with pcie support over q35?
> > >> 
> > >> > If all of the above is true, this can be handled by adding "pci=lastbus=0"
> > >> > as a guest kernel param to override its scanning of buses.  And couldn't
> > >> > that be done by QEMU's microvm_fix_kernel_cmdline() to make it transparent
> > >> > to the end user?
> > >> 
> > >> microvm_fix_kernel_cmdline() is a hack, not a solution.
> > >> 
> > >> Besides that, I doubt this has much of an effect on microvm because
> > >> it doesn't support pcie in the first place.
> > >
> > > I am so confused.  Vitaly, can you clarify exactly what QEMU VM type this
> > > series is intended to help?  If this is for microvm, then why is the guest
> > > doing PCI scanning in the first place?  If it's for q35, why is the
> > > justification based on microvm-like workloads?
> > 
> > I'm not exactly sure about the plans for particular machine types; the
> > intention was to use this for pcie in QEMU in general, so whatever
> > machine type uses pcie will benefit. 
> > 
> > Now, it seems that we have a more complicated landscape. The
> > optimization only makes sense as a boot speedup, so all 'traditional'
> > VM types with 'traditional' firmware are out of the
> > question. 'Container-like' VMs seem to avoid PCI for now; I'm not sure
> > if it's because they're in early stages of their development, because
> > they can get away without PCI or, actually, because of slowness at boot
> > (which we're trying to tackle with this feature). I'd definitely like to
> > hear more about what people think about this.
> 
> I suspect microvms will need pci eventually. I would much rather KVM
> had an exit-less discovery mechanism in place by then because,
> learning from history, if it doesn't they will do some kind of
> hack on the kernel command line, and everyone will be stuck
> supporting that for years ...

Is it not an option for the VMM to "accurately" enumerate the number of buses?
E.g. if the VMM has devices on only bus 0, then enumerate that there is one
bus so that the guest doesn't try to probe devices that can't possibly exist.
Or is that completely nonsensical, and/or does it violate the PCIe spec?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory
  2020-09-04  6:12         ` Sean Christopherson
  2020-09-04  7:29           ` Gerd Hoffmann
  2020-09-07 10:54           ` Michael S. Tsirkin
@ 2020-09-18  8:49           ` Gerd Hoffmann
  2 siblings, 0 replies; 31+ messages in thread
From: Gerd Hoffmann @ 2020-09-18  8:49 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Vitaly Kuznetsov, Peter Xu, kvm, Paolo Bonzini, Wanpeng Li,
	Jim Mattson, Michael Tsirkin, Julia Suvorova, Andy Lutomirski,
	Andrew Jones, linux-kernel

  Hi,

> I see similar numbers: ~8k PCI hole reads with a -kernel boot w/ OVMF.  All but 60
> of those are from pcibios_fixup_peer_bridges(), and all are from the kernel.

pcibios_fixup_peer_bridges() looks at pcibios_last_bus, and that in turn
seems to be set according to the mmconfig size (in
arch/x86/pci/mmconfig-shared.c).

So, maybe we just need to declare a smaller mmconfig window in the acpi
tables, depending on the number of pci busses actually used ...

> If all of the above is true, this can be handled by adding "pci=lastbus=0"

... so we don't need manual quirks like this?
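
To make that idea concrete, a minimal sketch of an MCFG allocation entry sized
to the buses actually used (the struct layout follows the PCI Firmware spec;
the helper and its names are illustrative, not QEMU's actual table builder):

    #include <stdint.h>

    /* One MCFG 'configuration space base address allocation' entry. */
    struct mcfg_allocation {
        uint64_t base_address;       /* ECAM/MMCONFIG base */
        uint16_t pci_segment;        /* segment group, usually 0 */
        uint8_t  start_bus_number;   /* first bus decoded by this window */
        uint8_t  end_bus_number;     /* last bus decoded by this window */
        uint32_t reserved;
    } __attribute__((packed));

    /* Size the window to the buses the VMM actually populated, so that
     * pcibios_last_bus stays small and the guest never scans config space
     * for buses that cannot exist. */
    static struct mcfg_allocation make_mcfg_entry(uint64_t ecam_base,
                                                  uint8_t last_used_bus)
    {
        struct mcfg_allocation entry = {
            .base_address     = ecam_base,
            .pci_segment      = 0,
            .start_bus_number = 0,
            .end_bus_number   = last_used_bus,  /* 0 => only bus 0 */
        };

        return entry;
    }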

take care,
  Gerd


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory
  2020-09-07 10:52             ` Michael S. Tsirkin
@ 2020-09-18  9:33               ` Gerd Hoffmann
  0 siblings, 0 replies; 31+ messages in thread
From: Gerd Hoffmann @ 2020-09-18  9:33 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Sean Christopherson, Vitaly Kuznetsov, Peter Xu, kvm,
	Paolo Bonzini, Wanpeng Li, Jim Mattson, Julia Suvorova,
	Andy Lutomirski, Andrew Jones, linux-kernel

  Hi,

> > We could probably wire up ecam (arm/virt style) for pcie support, once
> > the acpi support for microvm has finally landed (we need acpi for that
> > because otherwise the kernel wouldn't find the pcie bus).
> > 
> > Question is whether there is a good reason to do so.  Why would someone
> > prefer microvm with pcie support over q35?
> 
> The usual reasons to use pcie apply to microvm just the same.
> E.g.: pass-through of pcie devices?

Playground:
  https://git.kraxel.org/cgit/qemu/log/?h=sirius/microvm-usb

Adds support for usb and pcie (use -machine microvm,usb=on,pcie=on
to enable).  Reuses the gpex used on arm/aarch64.  Seems to work ok
on a quick test.
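
For reference, a minimal invocation against that branch might look roughly
like the following (kernel and disk image paths are placeholders, and whether
a given pcie device works on the experimental branch is untested here; only
the -machine options come from the description above):

    qemu-system-x86_64 -machine microvm,usb=on,pcie=on \
        -enable-kvm -cpu host -m 512 \
        -kernel bzImage -append "console=ttyS0" \
        -drive file=rootfs.img,if=none,id=d0 \
        -device virtio-blk-pci,drive=d0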

Not fully sure how to deal correctly with ioports.  The gpex device
has an mmio window for the io address space.  Will that approach work
on x86 too?  Anyway, just not having an ioport range seems to be a
valid configuration, so I've just disabled them for now ...

take care,
  Gerd


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory
  2020-09-11 17:00                   ` Sean Christopherson
@ 2020-09-18 12:34                     ` Michael S. Tsirkin
  2020-09-21 17:21                       ` Sean Christopherson
  0 siblings, 1 reply; 31+ messages in thread
From: Michael S. Tsirkin @ 2020-09-18 12:34 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Vitaly Kuznetsov, Peter Xu, kvm, Paolo Bonzini, Wanpeng Li,
	Jim Mattson, Julia Suvorova, Andy Lutomirski, Andrew Jones,
	linux-kernel, Gerd Hoffmann

On Fri, Sep 11, 2020 at 10:00:31AM -0700, Sean Christopherson wrote:
> On Mon, Sep 07, 2020 at 07:32:23AM -0400, Michael S. Tsirkin wrote:
> > On Mon, Sep 07, 2020 at 10:37:39AM +0200, Vitaly Kuznetsov wrote:
> > > Sean Christopherson <sean.j.christopherson@intel.com> writes:
> > > 
> > > > On Fri, Sep 04, 2020 at 09:29:05AM +0200, Gerd Hoffmann wrote:
> > > >>   Hi,
> > > >> 
> > > >> > Unless I'm mistaken, microvm doesn't even support PCI, does it?
> > > >> 
> > > >> Correct, no pci support right now.
> > > >> 
> > > >> We could probably wire up ecam (arm/virt style) for pcie support, once
> > > >> the acpi support for microvm has finally landed (we need acpi for that
> > > >> because otherwise the kernel wouldn't find the pcie bus).
> > > >> 
> > > >> Question is whether there is a good reason to do so.  Why would someone
> > > >> prefer microvm with pcie support over q35?
> > > >> 
> > > >> > If all of the above is true, this can be handled by adding "pci=lastbus=0"
> > > >> > as a guest kernel param to override its scanning of buses.  And couldn't
> > > >> > that be done by QEMU's microvm_fix_kernel_cmdline() to make it transparent
> > > >> > to the end user?
> > > >> 
> > > >> microvm_fix_kernel_cmdline() is a hack, not a solution.
> > > >> 
> > > >> Besides that, I doubt this has much of an effect on microvm because
> > > >> it doesn't support pcie in the first place.
> > > >
> > > > I am so confused.  Vitaly, can you clarify exactly what QEMU VM type this
> > > > series is intended to help?  If this is for microvm, then why is the guest
> > > > doing PCI scanning in the first place?  If it's for q35, why is the
> > > > justification based on microvm-like workloads?
> > > 
> > > I'm not exactly sure about the plans for particular machine types; the
> > > intention was to use this for pcie in QEMU in general, so whatever
> > > machine type uses pcie will benefit. 
> > > 
> > > Now, it seems that we have a more complicated landscape. The
> > > optimization only makes sense as a boot speedup, so all 'traditional'
> > > VM types with 'traditional' firmware are out of the
> > > question. 'Container-like' VMs seem to avoid PCI for now; I'm not sure
> > > if it's because they're in early stages of their development, because
> > > they can get away without PCI or, actually, because of slowness at boot
> > > (which we're trying to tackle with this feature). I'd definitely like to
> > > hear more about what people think about this.
> > 
> > I suspect microvms will need pci eventually. I would much rather KVM
> > had an exit-less discovery mechanism in place by then because,
> > learning from history, if it doesn't they will do some kind of
> > hack on the kernel command line, and everyone will be stuck
> > supporting that for years ...
> 
> Is it not an option for the VMM to "accurately" enumerate the number of buses?
> E.g. if the VMM has devices on only bus 0, then enumerate that there is one
> bus so that the guest doesn't try to probe devices that can't possibly exist.
> Or is that completely nonsensical, and/or does it violate the PCIe spec?


There is some tension here, in that one way to make guest boot faster
is to defer hotplug of devices until after it has booted.
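
As an illustration of that trade-off, a VMM can start the guest with only the
devices needed to boot and attach the rest once it is up, e.g. via the QEMU
monitor (backend and id names here are placeholders):

    (qemu) netdev_add user,id=net0
    (qemu) device_add virtio-net-pci,netdev=net0,id=nic0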

-- 
MST


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory
  2020-09-18 12:34                     ` Michael S. Tsirkin
@ 2020-09-21 17:21                       ` Sean Christopherson
  0 siblings, 0 replies; 31+ messages in thread
From: Sean Christopherson @ 2020-09-21 17:21 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Vitaly Kuznetsov, Peter Xu, kvm, Paolo Bonzini, Wanpeng Li,
	Jim Mattson, Julia Suvorova, Andy Lutomirski, Andrew Jones,
	linux-kernel, Gerd Hoffmann

On Fri, Sep 18, 2020 at 08:34:37AM -0400, Michael S. Tsirkin wrote:
> On Fri, Sep 11, 2020 at 10:00:31AM -0700, Sean Christopherson wrote:
> > On Mon, Sep 07, 2020 at 07:32:23AM -0400, Michael S. Tsirkin wrote:
> > > I suspect microvms will need pci eventually. I would much rather KVM
> > > had an exit-less discovery mechanism in place by then because,
> > > learning from history, if it doesn't they will do some kind of
> > > hack on the kernel command line, and everyone will be stuck
> > > supporting that for years ...
> > 
> > Is it not an option for the VMM to "accurately" enumerate the number of buses?
> > E.g. if the VMM has devices on only bus 0, then enumerate that there is one
> > bus so that the guest doesn't try to probe devices that can't possibly exist.
> > Or is that completely nonsensical, and/or does it violate the PCIe spec?
> 
> 
> There is some tension here, in that one way to make guest boot faster
> is to defer hotplug of devices until after it booted.

Sorry, I didn't follow that, probably because my PCI knowledge is lacking.
What does device hotplug have to do with the number of buses enumerated to
the guest?

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2020-09-21 17:21 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-07 14:12 [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory Vitaly Kuznetsov
2020-08-07 14:12 ` [PATCH v2 1/3] KVM: x86: move kvm_vcpu_gfn_to_memslot() out of try_async_pf() Vitaly Kuznetsov
2020-08-14  1:40   ` Sean Christopherson
2020-09-01 14:15     ` Vitaly Kuznetsov
2020-09-04  3:47       ` Sean Christopherson
2020-08-07 14:12 ` [PATCH v2 2/3] KVM: x86: introduce KVM_MEM_PCI_HOLE memory Vitaly Kuznetsov
2020-08-14  2:31   ` Sean Christopherson
2020-08-14 14:30     ` Michael S. Tsirkin
2020-08-17 16:32       ` Sean Christopherson
2020-08-21  1:46         ` Michael S. Tsirkin
2020-08-22  3:19           ` Sean Christopherson
2020-09-01 14:39     ` Vitaly Kuznetsov
2020-09-03  9:35       ` Michael S. Tsirkin
2020-08-07 14:12 ` [PATCH v2 3/3] KVM: selftests: add KVM_MEM_PCI_HOLE test Vitaly Kuznetsov
2020-08-25 21:25 ` [PATCH v2 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory Peter Xu
2020-09-01 14:43   ` Vitaly Kuznetsov
2020-09-01 20:00     ` Peter Xu
2020-09-02  8:59       ` Vitaly Kuznetsov
2020-09-04  6:12         ` Sean Christopherson
2020-09-04  7:29           ` Gerd Hoffmann
2020-09-04 16:00             ` Sean Christopherson
2020-09-07  8:37               ` Vitaly Kuznetsov
2020-09-07 11:32                 ` Michael S. Tsirkin
2020-09-11 17:00                   ` Sean Christopherson
2020-09-18 12:34                     ` Michael S. Tsirkin
2020-09-21 17:21                       ` Sean Christopherson
2020-09-07 10:52             ` Michael S. Tsirkin
2020-09-18  9:33               ` Gerd Hoffmann
2020-09-07 10:54           ` Michael S. Tsirkin
2020-09-18  8:49           ` Gerd Hoffmann
2020-09-04  7:20   ` Gerd Hoffmann

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).