* [PATCH v6 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory
@ 2022-05-19 15:37 Chao Peng
  2022-05-19 15:37 ` [PATCH v6 1/8] mm: Introduce memfile_notifier Chao Peng
                   ` (8 more replies)
  0 siblings, 9 replies; 58+ messages in thread
From: Chao Peng @ 2022-05-19 15:37 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Chao Peng,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david,
	aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko

This is v6 of this series, which implements fd-based KVM guest private
memory. The patches are based on the latest kvm/queue branch commit:

  2764011106d0 (kvm/queue) KVM: VMX: Include MKTME KeyID bits in
shadow_zero_check
 
and Sean's below patch:

  KVM: x86/mmu: Add RET_PF_CONTINUE to eliminate bool+int* "returns"
  https://lkml.org/lkml/2022/4/22/1598

Introduction
------------
In general this patch series introduces fd-based memslots which provide
guest memory through a memory file descriptor fd[offset,size] instead of
hva/size. The fd can be created from a supported memory filesystem like
tmpfs/hugetlbfs etc., which we refer to as the memory backing store. KVM
and the memory backing store exchange callbacks when such a memslot gets
created. At runtime KVM will call into the callbacks provided by the
backing store to get the pfn for a given fd+offset. The memory backing
store will also call into KVM callbacks when userspace fallocates or
punches a hole in the fd, to notify KVM to map/unmap secondary MMU page
tables.

Compared to the existing hva-based memslots, this new type of memslot
allows guest memory to be unmapped from host userspace (e.g. QEMU) and
even the kernel itself, thereby reducing the attack surface and
preventing bugs.

Based on this fd-based memslot, we can build guest private memory that
is going to be used in confidential computing environments such as Intel
TDX and AMD SEV. When supported, the memory backing store can provide
more enforcement on the fd and KVM can use a single memslot to hold both
the private and shared part of the guest memory. 

mm extension
------------
Introduces a new MFD_INACCESSIBLE flag for memfd_create(); a file created
with this flag cannot be accessed from userspace via read(), write(),
mmap(), etc. The file content can only be used with the newly introduced
memfile_notifier extension.

The memfile_notifier extension provides two sets of callbacks for KVM to
interact with the memory backing store:
  - memfile_notifier_ops: callbacks for memory backing store to notify
    KVM when memory gets allocated/invalidated.
  - backing store callbacks: callbacks for KVM to call into memory backing
    store to request memory pages for guest private memory.

The memfile_notifier extension also provides APIs for memory backing
store to register/unregister itself and to trigger the notifier when the
bookmarked memory gets fallocated/invalidated.
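
For illustration, a minimal userspace sketch of creating such a file
(using the MFD_INACCESSIBLE value added later in this series; everything
else below is a placeholder, not part of the series):

  #include <sys/mman.h>
  #include <err.h>

  #ifndef MFD_INACCESSIBLE
  #define MFD_INACCESSIBLE 0x0008U  /* from this series, not yet upstream */
  #endif

  static int create_private_memfd(void)
  {
          int fd = memfd_create("guest-private-mem", MFD_INACCESSIBLE);

          if (fd < 0)
                  err(1, "memfd_create");
          /* read()/write()/mmap() on this fd now fail with -EPERM; its
           * pages can only be given to KVM via a private memslot. */
          return fd;
  }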

memslot extension
-----------------
Add the private fd and the fd offset to the existing 'shared' memslot so
that both private and shared guest memory can live in one single memslot.
A page in the memslot is either private or shared. A page is private only
when it's already allocated in the backing store fd; in all other cases
it's treated as shared, which includes pages already mapped as shared as
well as pages that have not been mapped yet. This means the memory
backing store is the source of truth for which page is private.
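
For illustration, a rough userspace sketch of registering such a memslot,
using the kvm_userspace_memory_region_ext layout and KVM_MEM_PRIVATE flag
added in this series (vm_fd, memfd and shared_hva are placeholders; the
flag is only accepted once the final patch enables it):

  struct kvm_userspace_memory_region_ext region = {
          .region = {
                  .slot            = 0,
                  .flags           = KVM_MEM_PRIVATE,
                  .guest_phys_addr = 0x0,
                  .memory_size     = 0x80000000,
                  /* hva backing the shared part of the slot */
                  .userspace_addr  = (__u64)(uintptr_t)shared_hva,
          },
          /* MFD_INACCESSIBLE memfd backing the private part */
          .private_fd     = memfd,
          .private_offset = 0,
  };

  if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region) < 0)
          err(1, "KVM_SET_USER_MEMORY_REGION");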

Private memory map/unmap and conversion
---------------------------------------
Userspace's map/unmap operations are done with the fallocate() syscall on
the backing store fd.
  - map: default fallocate() with mode=0.
  - unmap: fallocate() with FALLOC_FL_PUNCH_HOLE.
The map/unmap will trigger the above memfile_notifier_ops to let KVM
map/unmap the secondary MMU page tables.
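
For example, assuming 'memfd' is the backing store fd and offset/len are
page-aligned and fall within the slot's private range (placeholders, not
part of the series), userspace would do something like:

  /* shared -> private: allocate pages, KVM gets a populate notification */
  fallocate(memfd, 0, offset, len);

  /* private -> shared: punch a hole, KVM gets an invalidate notification */
  fallocate(memfd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, offset, len);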

Test
----
Testing the new functionality of this patchset requires the TDX patchset.
Since the TDX patchset has not been merged yet, I did two kinds of tests:

-  Selftest on a normal VM from Vishal
   https://lkml.org/lkml/2022/5/10/2045
   The selftest has been ported to this patchset and you can find it in
   the repo: https://github.com/chao-p/linux/tree/privmem-v6

-  Private memory functional test on the latest TDX code
   The patches are rebased onto the latest TDX code and the new
   functionality was tested. See the repos below:
   Linux: https://github.com/chao-p/linux/commits/privmem-v6-tdx
   QEMU: https://github.com/chao-p/qemu/tree/privmem-v6

An example QEMU command line for TDX test:
-object tdx-guest,id=tdx \
-object memory-backend-memfd-private,id=ram1,size=2G \
-machine q35,kvm-type=tdx,pic=no,kernel_irqchip=split,memory-encryption=tdx,memory-backend=ram1

What's missing
--------------
  - The accounting for long-term pinned memory in the backing store is
    not included since I haven't come up with a good solution yet.
  - Batched invalidation notification for shmem is not ready, as it is
    still a bit tricky to do cleanly.

Changelog
----------
v6:
  - Re-organized the patches for both the mm and KVM parts.
  - Added flags for memfile_notifier so its consumers can state their
    features and memory backing store can check against these flags.
  - Put a backing store reference in the memfile_notifier and move pfn_ops
    into backing store.
  - Only support boot-time backing store registration.
  - Overall KVM part improvement suggested by Sean and some others.
v5:
  - Removed userspace visible F_SEAL_INACCESSIBLE, instead using an
    in-kernel flag (SHM_F_INACCESSIBLE for shmem). Private fd can only
    be created by MFD_INACCESSIBLE.
  - Introduced new APIs for backing store to register itself to
    memfile_notifier instead of direct function call.
  - Added the accounting and restriction for MFD_INACCESSIBLE memory.
  - Added KVM API doc for new memslot extensions and man page for the new
    MFD_INACCESSIBLE flag.
  - Removed the overlap check for mapping the same file+offset into
    multiple gfns due to performance considerations; a warning was added
    to the documentation.
  - Addressed other comments in v4.
v4:
  - Decoupled the callbacks between KVM/mm from memfd and use new
    name 'memfile_notifier'.
  - Supported registering multiple memslots to the same backing store.
  - Added per-memslot pfn_ops instead of per-system.
  - Reworked the invalidation part.
  - Improved new KVM uAPIs (private memslot extension and memory
    error) per Sean's suggestions.
  - Addressed many other minor fixes for comments from v3.
v3:
  - Added locking protection when calling
    invalidate_page_range/fallocate callbacks.
  - Changed the memslot structure to keep using useraddr for shared memory.
  - Re-organized F_SEAL_INACCESSIBLE and MEMFD_OPS.
  - Added MFD_INACCESSIBLE flag to force F_SEAL_INACCESSIBLE.
  - Commit message improvement.
  - Many small fixes for comments from the last version.

Links to previous discussions
-----------------------------
[1] Original design proposal:
https://lkml.kernel.org/kvm/20210824005248.200037-1-seanjc@google.com/
[2] Updated proposal and RFC patch v1:
https://lkml.kernel.org/linux-fsdevel/20211111141352.26311-1-chao.p.peng@linux.intel.com/
[3] Patch v5: https://lkml.org/lkml/2022/3/10/457

Chao Peng (6):
  mm: Introduce memfile_notifier
  mm/memfd: Introduce MFD_INACCESSIBLE flag
  KVM: Extend the memslot to support fd-based private memory
  KVM: Add KVM_EXIT_MEMORY_FAULT exit
  KVM: Handle page fault for private memory
  KVM: Enable and expose KVM_MEM_PRIVATE

Kirill A. Shutemov (1):
  mm/shmem: Support memfile_notifier

 Documentation/virt/kvm/api.rst   |  60 ++++++++++--
 arch/mips/include/asm/kvm_host.h |   2 +-
 arch/x86/include/asm/kvm_host.h  |   2 +-
 arch/x86/kvm/Kconfig             |   2 +
 arch/x86/kvm/mmu.h               |   1 +
 arch/x86/kvm/mmu/mmu.c           |  70 +++++++++++++-
 arch/x86/kvm/mmu/mmu_internal.h  |  17 ++++
 arch/x86/kvm/mmu/mmutrace.h      |   1 +
 arch/x86/kvm/mmu/paging_tmpl.h   |   5 +-
 arch/x86/kvm/x86.c               |   2 +-
 include/linux/kvm_host.h         |  51 +++++++++--
 include/linux/memfile_notifier.h |  99 ++++++++++++++++++++
 include/linux/shmem_fs.h         |   2 +
 include/uapi/linux/kvm.h         |  33 +++++++
 include/uapi/linux/memfd.h       |   1 +
 mm/Kconfig                       |   4 +
 mm/Makefile                      |   1 +
 mm/memfd.c                       |  15 ++-
 mm/memfile_notifier.c            | 137 +++++++++++++++++++++++++++
 mm/shmem.c                       | 120 +++++++++++++++++++++++-
 virt/kvm/Kconfig                 |   3 +
 virt/kvm/kvm_main.c              | 153 +++++++++++++++++++++++++++++--
 22 files changed, 748 insertions(+), 33 deletions(-)
 create mode 100644 include/linux/memfile_notifier.h
 create mode 100644 mm/memfile_notifier.c

-- 
2.25.1



* [PATCH v6 1/8] mm: Introduce memfile_notifier
  2022-05-19 15:37 [PATCH v6 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
@ 2022-05-19 15:37 ` Chao Peng
  2022-05-19 15:37 ` [PATCH v6 2/8] mm/shmem: Support memfile_notifier Chao Peng
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 58+ messages in thread
From: Chao Peng @ 2022-05-19 15:37 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Chao Peng,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david,
	aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko

This patch introduces the memfile_notifier facility so that existing
memory file subsystems (e.g. tmpfs/hugetlbfs) can provide memory pages to
a third kernel component, allowing it to make use of memory bookmarked in
the memory file and get notified when pages in the memory file become
allocated/invalidated.

It will be used by KVM to use a file descriptor as the guest memory
backing store; KVM will use this memfile_notifier interface to interact
with the memory file subsystems. In the future there might be other
consumers (e.g. VFIO with encrypted device memory).

It consists of the following components:
 - memfile_backing_store: Each supported memory file subsystem can be
   implemented as a memory backing store which bookmarks memory and
   provides callbacks for other kernel systems (memfile_notifier
   consumers) to interact with.
 - memfile_notifier: memfile_notifier consumers define callbacks and
   associate them with a file using memfile_register_notifier().
 - memfile_node: A memfile_node is associated with the file (inode) from
   the backing store and includes feature flags and a list of registered
   memfile_notifiers to notify.

Userspace is in charge of the guest memory lifecycle: it first allocates
pages in the memory backing store, then passes the fd to KVM and lets KVM
register the memory slot with the memory backing store via
memfile_register_notifier().
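
As a rough sketch of the consumer side (KVM in this series; the my_* names
below are placeholders, only the memfile_* interfaces come from this
patch), a consumer defines its ops and registers against the file it
received from userspace:

  static void my_populate(struct memfile_notifier *notifier,
                          pgoff_t start, pgoff_t end)
  {
          /* map [start, end) of the file into the secondary MMU */
  }

  static void my_invalidate(struct memfile_notifier *notifier,
                            pgoff_t start, pgoff_t end)
  {
          /* unmap [start, end) of the file from the secondary MMU */
  }

  static struct memfile_notifier_ops my_ops = {
          .populate   = my_populate,
          .invalidate = my_invalidate,
  };

  static struct memfile_notifier my_notifier = { .ops = &my_ops };

  /* file: the struct file behind the fd passed in from userspace */
  r = memfile_register_notifier(file, MEMFILE_F_USER_INACCESSIBLE |
                                      MEMFILE_F_UNMOVABLE |
                                      MEMFILE_F_UNRECLAIMABLE,
                                &my_notifier);
  ...
  memfile_unregister_notifier(&my_notifier);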

Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 include/linux/memfile_notifier.h |  99 ++++++++++++++++++++++
 mm/Kconfig                       |   4 +
 mm/Makefile                      |   1 +
 mm/memfile_notifier.c            | 137 +++++++++++++++++++++++++++++++
 4 files changed, 241 insertions(+)
 create mode 100644 include/linux/memfile_notifier.h
 create mode 100644 mm/memfile_notifier.c

diff --git a/include/linux/memfile_notifier.h b/include/linux/memfile_notifier.h
new file mode 100644
index 000000000000..dcb3ee6ed626
--- /dev/null
+++ b/include/linux/memfile_notifier.h
@@ -0,0 +1,99 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MEMFILE_NOTIFIER_H
+#define _LINUX_MEMFILE_NOTIFIER_H
+
+#include <linux/pfn_t.h>
+#include <linux/rculist.h>
+#include <linux/spinlock.h>
+#include <linux/srcu.h>
+#include <linux/fs.h>
+
+
+#define MEMFILE_F_USER_INACCESSIBLE	BIT(0)	/* memory allocated in the file is inaccessible from userspace (e.g. read/write/mmap) */
+#define MEMFILE_F_UNMOVABLE		BIT(1)	/* memory allocated in the file is unmovable (e.g. via page migration) */
+#define MEMFILE_F_UNRECLAIMABLE		BIT(2)	/* memory allocated in the file is unreclaimable (e.g. via kswapd) */
+
+#define MEMFILE_F_ALLOWED_MASK		(MEMFILE_F_USER_INACCESSIBLE | \
+					MEMFILE_F_UNMOVABLE | \
+					MEMFILE_F_UNRECLAIMABLE)
+
+struct memfile_node {
+	struct list_head	notifiers;	/* registered memfile_notifier list on the file */
+	unsigned long		flags;		/* MEMFILE_F_* flags */
+};
+
+struct memfile_backing_store {
+	struct list_head list;
+	spinlock_t lock;
+	struct memfile_node* (*lookup_memfile_node)(struct file *file);
+	int (*get_lock_pfn)(struct file *file, pgoff_t offset, pfn_t *pfn,
+			    int *order);
+	void (*put_unlock_pfn)(pfn_t pfn);
+};
+
+struct memfile_notifier;
+struct memfile_notifier_ops {
+	void (*populate)(struct memfile_notifier *notifier,
+			 pgoff_t start, pgoff_t end);
+	void (*invalidate)(struct memfile_notifier *notifier,
+			   pgoff_t start, pgoff_t end);
+};
+
+struct memfile_notifier {
+	struct list_head list;
+	struct memfile_notifier_ops *ops;
+	struct memfile_backing_store *bs;
+};
+
+static inline void memfile_node_init(struct memfile_node *node)
+{
+	INIT_LIST_HEAD(&node->notifiers);
+	node->flags = 0;
+}
+
+#ifdef CONFIG_MEMFILE_NOTIFIER
+/* APIs for backing stores */
+extern void memfile_register_backing_store(struct memfile_backing_store *bs);
+extern int memfile_node_set_flags(struct file *file, unsigned long flags);
+extern void memfile_notifier_populate(struct memfile_node *node,
+				      pgoff_t start, pgoff_t end);
+extern void memfile_notifier_invalidate(struct memfile_node *node,
+					pgoff_t start, pgoff_t end);
+/*APIs for notifier consumers */
+extern int memfile_register_notifier(struct file *file, unsigned long flags,
+				     struct memfile_notifier *notifier);
+extern void memfile_unregister_notifier(struct memfile_notifier *notifier);
+
+#else /* !CONFIG_MEMFILE_NOTIFIER */
+static inline void memfile_register_backing_store(struct memfile_backing_store *bs)
+{
+}
+
+static inline int memfile_node_set_flags(struct file *file, unsigned long flags)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline void memfile_notifier_populate(struct memfile_node *node,
+				      pgoff_t start, pgoff_t end)
+{
+}
+
+static inline void memfile_notifier_invalidate(struct memfile_node *node,
+					pgoff_t start, pgoff_t end)
+{
+}
+
+static inline int memfile_register_notifier(struct file *file, unsigned long flags,
+				     struct memfile_notifier *notifier)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline void memfile_unregister_notifier(struct memfile_notifier *notifier)
+{
+}
+
+#endif /* CONFIG_MEMFILE_NOTIFIER */
+
+#endif /* _LINUX_MEMFILE_NOTIFIER_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 034d87953600..e551e99cd42a 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -909,6 +909,10 @@ config ANON_VMA_NAME
 	  area from being merged with adjacent virtual memory areas due to the
 	  difference in their name.
 
+config MEMFILE_NOTIFIER
+	bool
+	select SRCU
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 4cc13f3179a5..261a5cb315f9 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -133,3 +133,4 @@ obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
 obj-$(CONFIG_IO_MAPPING) += io-mapping.o
 obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o
 obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
+obj-$(CONFIG_MEMFILE_NOTIFIER) += memfile_notifier.o
diff --git a/mm/memfile_notifier.c b/mm/memfile_notifier.c
new file mode 100644
index 000000000000..ab9461cb874e
--- /dev/null
+++ b/mm/memfile_notifier.c
@@ -0,0 +1,137 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ *  linux/mm/memfile_notifier.c
+ *
+ *  Copyright (C) 2022  Intel Corporation.
+ *             Chao Peng <chao.p.peng@linux.intel.com>
+ */
+
+#include <linux/memfile_notifier.h>
+#include <linux/pagemap.h>
+#include <linux/srcu.h>
+
+DEFINE_STATIC_SRCU(memfile_srcu);
+static __ro_after_init LIST_HEAD(backing_store_list);
+
+void memfile_notifier_populate(struct memfile_node *node,
+			       pgoff_t start, pgoff_t end)
+{
+	struct memfile_notifier *notifier;
+	int id;
+
+	id = srcu_read_lock(&memfile_srcu);
+	list_for_each_entry_srcu(notifier, &node->notifiers, list,
+				 srcu_read_lock_held(&memfile_srcu)) {
+		if (notifier->ops->populate)
+			notifier->ops->populate(notifier, start, end);
+	}
+	srcu_read_unlock(&memfile_srcu, id);
+}
+
+void memfile_notifier_invalidate(struct memfile_node *node,
+				 pgoff_t start, pgoff_t end)
+{
+	struct memfile_notifier *notifier;
+	int id;
+
+	id = srcu_read_lock(&memfile_srcu);
+	list_for_each_entry_srcu(notifier, &node->notifiers, list,
+				 srcu_read_lock_held(&memfile_srcu)) {
+		if (notifier->ops->invalidate)
+			notifier->ops->invalidate(notifier, start, end);
+	}
+	srcu_read_unlock(&memfile_srcu, id);
+}
+
+void __init memfile_register_backing_store(struct memfile_backing_store *bs)
+{
+	spin_lock_init(&bs->lock);
+	list_add_tail(&bs->list, &backing_store_list);
+}
+
+static void memfile_node_update_flags(struct file *file, unsigned long flags)
+{
+	struct address_space *mapping = file_inode(file)->i_mapping;
+	gfp_t gfp;
+
+	gfp = mapping_gfp_mask(mapping);
+	if (flags & MEMFILE_F_UNMOVABLE)
+		gfp &= ~__GFP_MOVABLE;
+	else
+		gfp |= __GFP_MOVABLE;
+	mapping_set_gfp_mask(mapping, gfp);
+
+	if (flags & MEMFILE_F_UNRECLAIMABLE)
+		mapping_set_unevictable(mapping);
+	else
+		mapping_clear_unevictable(mapping);
+}
+
+int memfile_node_set_flags(struct file *file, unsigned long flags)
+{
+	struct memfile_backing_store *bs;
+	struct memfile_node *node;
+
+	if (flags & ~MEMFILE_F_ALLOWED_MASK)
+		return -EINVAL;
+
+	list_for_each_entry(bs, &backing_store_list, list) {
+		node = bs->lookup_memfile_node(file);
+		if (node) {
+			spin_lock(&bs->lock);
+			node->flags = flags;
+			spin_unlock(&bs->lock);
+			memfile_node_update_flags(file, flags);
+			return 0;
+		}
+	}
+
+	return -EOPNOTSUPP;
+}
+
+int memfile_register_notifier(struct file *file, unsigned long flags,
+			      struct memfile_notifier *notifier)
+{
+	struct memfile_backing_store *bs;
+	struct memfile_node *node;
+	struct list_head *list;
+
+	if (!file || !notifier || !notifier->ops)
+		return -EINVAL;
+	if (flags & ~MEMFILE_F_ALLOWED_MASK)
+		return -EINVAL;
+
+	list_for_each_entry(bs, &backing_store_list, list) {
+		node = bs->lookup_memfile_node(file);
+		if (node) {
+			list = &node->notifiers;
+			notifier->bs = bs;
+
+			spin_lock(&bs->lock);
+			if (list_empty(list))
+				node->flags = flags;
+			else if (node->flags ^ flags) {
+				spin_unlock(&bs->lock);
+				return -EINVAL;
+			}
+
+			list_add_rcu(&notifier->list, list);
+			spin_unlock(&bs->lock);
+			memfile_node_update_flags(file, flags);
+			return 0;
+		}
+	}
+
+	return -EOPNOTSUPP;
+}
+EXPORT_SYMBOL_GPL(memfile_register_notifier);
+
+void memfile_unregister_notifier(struct memfile_notifier *notifier)
+{
+	spin_lock(&notifier->bs->lock);
+	list_del_rcu(&notifier->list);
+	spin_unlock(&notifier->bs->lock);
+
+	synchronize_srcu(&memfile_srcu);
+}
+EXPORT_SYMBOL_GPL(memfile_unregister_notifier);
-- 
2.25.1



* [PATCH v6 2/8] mm/shmem: Support memfile_notifier
  2022-05-19 15:37 [PATCH v6 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
  2022-05-19 15:37 ` [PATCH v6 1/8] mm: Introduce memfile_notifier Chao Peng
@ 2022-05-19 15:37 ` Chao Peng
  2022-05-19 15:37 ` [PATCH v6 3/8] mm/memfd: Introduce MFD_INACCESSIBLE flag Chao Peng
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 58+ messages in thread
From: Chao Peng @ 2022-05-19 15:37 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Chao Peng,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david,
	aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Implement shmem as a memfile_notifier backing store. Essentially it
honors the memfile_notifier feature flags for userspace access, page
migration and page reclaim, and implements the necessary
memfile_backing_store callbacks.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 include/linux/shmem_fs.h |   2 +
 mm/shmem.c               | 120 ++++++++++++++++++++++++++++++++++++++-
 2 files changed, 121 insertions(+), 1 deletion(-)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index ab51d3cd39bd..a8e98bdd121e 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -9,6 +9,7 @@
 #include <linux/percpu_counter.h>
 #include <linux/xattr.h>
 #include <linux/fs_parser.h>
+#include <linux/memfile_notifier.h>
 
 /* inode in-kernel data */
 
@@ -25,6 +26,7 @@ struct shmem_inode_info {
 	struct simple_xattrs	xattrs;		/* list of xattrs */
 	atomic_t		stop_eviction;	/* hold when working on inode */
 	struct timespec64	i_crtime;	/* file creation time */
+	struct memfile_node	memfile_node;	/* memfile node */
 	struct inode		vfs_inode;
 };
 
diff --git a/mm/shmem.c b/mm/shmem.c
index 529c9ad3e926..f97ae328c87a 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -905,6 +905,24 @@ static struct folio *shmem_get_partial_folio(struct inode *inode, pgoff_t index)
 	return page ? page_folio(page) : NULL;
 }
 
+static void notify_populate(struct inode *inode, pgoff_t start, pgoff_t end)
+{
+	struct shmem_inode_info *info = SHMEM_I(inode);
+
+	memfile_notifier_populate(&info->memfile_node, start, end);
+}
+
+static void notify_invalidate(struct inode *inode, struct folio *folio,
+				   pgoff_t start, pgoff_t end)
+{
+	struct shmem_inode_info *info = SHMEM_I(inode);
+
+	start = max(start, folio->index);
+	end = min(end, folio->index + folio_nr_pages(folio));
+
+	memfile_notifier_invalidate(&info->memfile_node, start, end);
+}
+
 /*
  * Remove range of pages and swap entries from page cache, and free them.
  * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
@@ -948,6 +966,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			}
 			index += folio_nr_pages(folio) - 1;
 
+			notify_invalidate(inode, folio, start, end);
+
 			if (!unfalloc || !folio_test_uptodate(folio))
 				truncate_inode_folio(mapping, folio);
 			folio_unlock(folio);
@@ -1021,6 +1041,9 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 					index--;
 					break;
 				}
+
+				notify_invalidate(inode, folio, start, end);
+
 				VM_BUG_ON_FOLIO(folio_test_writeback(folio),
 						folio);
 				truncate_inode_folio(mapping, folio);
@@ -1092,6 +1115,13 @@ static int shmem_setattr(struct user_namespace *mnt_userns,
 		    (newsize > oldsize && (info->seals & F_SEAL_GROW)))
 			return -EPERM;
 
+		if (info->memfile_node.flags & MEMFILE_F_USER_INACCESSIBLE) {
+			if(oldsize)
+				return -EPERM;
+			if (!PAGE_ALIGNED(newsize))
+				return -EINVAL;
+		}
+
 		if (newsize != oldsize) {
 			error = shmem_reacct_size(SHMEM_I(inode)->flags,
 					oldsize, newsize);
@@ -1340,6 +1370,8 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 		goto redirty;
 	if (!total_swap_pages)
 		goto redirty;
+	if (info->memfile_node.flags & MEMFILE_F_UNRECLAIMABLE)
+		goto redirty;
 
 	/*
 	 * Our capabilities prevent regular writeback or sync from ever calling
@@ -2234,6 +2266,9 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
 	if (ret)
 		return ret;
 
+	if (info->memfile_node.flags & MEMFILE_F_USER_INACCESSIBLE)
+		return -EPERM;
+
 	/* arm64 - allow memory tagging on RAM-based files */
 	vma->vm_flags |= VM_MTE_ALLOWED;
 
@@ -2274,6 +2309,7 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
 		info->i_crtime = inode->i_mtime;
 		INIT_LIST_HEAD(&info->shrinklist);
 		INIT_LIST_HEAD(&info->swaplist);
+		memfile_node_init(&info->memfile_node);
 		simple_xattrs_init(&info->xattrs);
 		cache_no_acl(inode);
 		mapping_set_large_folios(inode->i_mapping);
@@ -2442,6 +2478,8 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
 		if ((info->seals & F_SEAL_GROW) && pos + len > inode->i_size)
 			return -EPERM;
 	}
+	if (unlikely(info->memfile_node.flags & MEMFILE_F_USER_INACCESSIBLE))
+		return -EPERM;
 
 	ret = shmem_getpage(inode, index, pagep, SGP_WRITE);
 
@@ -2518,6 +2556,13 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		end_index = i_size >> PAGE_SHIFT;
 		if (index > end_index)
 			break;
+
+		if (SHMEM_I(inode)->memfile_node.flags &
+				MEMFILE_F_USER_INACCESSIBLE) {
+			error = -EPERM;
+			break;
+		}
+
 		if (index == end_index) {
 			nr = i_size & ~PAGE_MASK;
 			if (nr <= offset)
@@ -2649,6 +2694,12 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 			goto out;
 		}
 
+		if ((info->memfile_node.flags & MEMFILE_F_USER_INACCESSIBLE) &&
+		    (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))) {
+			error = -EINVAL;
+			goto out;
+		}
+
 		shmem_falloc.waitq = &shmem_falloc_waitq;
 		shmem_falloc.start = (u64)unmap_start >> PAGE_SHIFT;
 		shmem_falloc.next = (unmap_end + 1) >> PAGE_SHIFT;
@@ -2768,6 +2819,7 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
 		i_size_write(inode, offset + len);
 	inode->i_ctime = current_time(inode);
+	notify_populate(inode, start, end);
 undone:
 	spin_lock(&inode->i_lock);
 	inode->i_private = NULL;
@@ -3754,6 +3806,20 @@ static int shmem_error_remove_page(struct address_space *mapping,
 	return 0;
 }
 
+#ifdef CONFIG_MIGRATION
+static int shmem_migrate_page(struct address_space *mapping,
+			      struct page *newpage, struct page *page,
+			      enum migrate_mode mode)
+{
+	struct inode *inode = mapping->host;
+	struct shmem_inode_info *info = SHMEM_I(inode);
+
+	if (info->memfile_node.flags & MEMFILE_F_UNMOVABLE)
+		return -ENOTSUPP;
+	return migrate_page(mapping, newpage, page, mode);
+}
+#endif
+
 const struct address_space_operations shmem_aops = {
 	.writepage	= shmem_writepage,
 	.dirty_folio	= noop_dirty_folio,
@@ -3762,7 +3828,7 @@ const struct address_space_operations shmem_aops = {
 	.write_end	= shmem_write_end,
 #endif
 #ifdef CONFIG_MIGRATION
-	.migratepage	= migrate_page,
+	.migratepage	= shmem_migrate_page,
 #endif
 	.error_remove_page = shmem_error_remove_page,
 };
@@ -3879,6 +3945,54 @@ static struct file_system_type shmem_fs_type = {
 	.fs_flags	= FS_USERNS_MOUNT,
 };
 
+#ifdef CONFIG_MEMFILE_NOTIFIER
+static struct memfile_node *shmem_lookup_memfile_node(struct file *file)
+{
+	struct inode *inode = file_inode(file);
+
+	if (!shmem_mapping(inode->i_mapping))
+		return NULL;
+
+	return  &SHMEM_I(inode)->memfile_node;
+}
+
+
+static int shmem_get_lock_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
+			      int *order)
+{
+	struct page *page;
+	int ret;
+
+	ret = shmem_getpage(file_inode(file), offset, &page, SGP_NOALLOC);
+	if (ret)
+		return ret;
+
+	*pfn = page_to_pfn_t(page);
+	*order = thp_order(compound_head(page));
+	return 0;
+}
+
+static void shmem_put_unlock_pfn(pfn_t pfn)
+{
+	struct page *page = pfn_t_to_page(pfn);
+
+	if (!page)
+		return;
+
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+
+	set_page_dirty(page);
+	unlock_page(page);
+	put_page(page);
+}
+
+static struct memfile_backing_store shmem_backing_store = {
+	.lookup_memfile_node = shmem_lookup_memfile_node,
+	.get_lock_pfn = shmem_get_lock_pfn,
+	.put_unlock_pfn = shmem_put_unlock_pfn,
+};
+#endif /* CONFIG_MEMFILE_NOTIFIER */
+
 int __init shmem_init(void)
 {
 	int error;
@@ -3904,6 +4018,10 @@ int __init shmem_init(void)
 	else
 		shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */
 #endif
+
+#ifdef CONFIG_MEMFILE_NOTIFIER
+	memfile_register_backing_store(&shmem_backing_store);
+#endif
 	return 0;
 
 out1:
-- 
2.25.1



* [PATCH v6 3/8] mm/memfd: Introduce MFD_INACCESSIBLE flag
  2022-05-19 15:37 [PATCH v6 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
  2022-05-19 15:37 ` [PATCH v6 1/8] mm: Introduce memfile_notifier Chao Peng
  2022-05-19 15:37 ` [PATCH v6 2/8] mm/shmem: Support memfile_notifier Chao Peng
@ 2022-05-19 15:37 ` Chao Peng
  2022-05-31 19:15   ` Vishal Annapurve
  2022-05-19 15:37 ` [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory Chao Peng
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 58+ messages in thread
From: Chao Peng @ 2022-05-19 15:37 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Chao Peng,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david,
	aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko

Introduce a new memfd_create() flag indicating that the content of the
created memfd is inaccessible from userspace through ordinary MMU access
(e.g., read/write/mmap). However, the file content can be accessed
indirectly via a different mechanism (e.g. the KVM MMU).

It provides the semantics required for KVM guest private memory support:
a file descriptor with this flag set is going to be used as the source of
guest memory in confidential computing environments such as Intel TDX and
AMD SEV, but may not be accessible from host userspace.

The flag cannot coexist with MFD_ALLOW_SEALING; further sealing is also
impossible for a memfd created with this flag.
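
For example (the flag value is the one defined in this patch; 'fd' is a
placeholder):

  /* fails with -EINVAL: sealing and MFD_INACCESSIBLE are mutually exclusive */
  fd = memfd_create("guest-mem", MFD_INACCESSIBLE | MFD_ALLOW_SEALING);

  /* succeeds; without MFD_ALLOW_SEALING no further seals can be added */
  fd = memfd_create("guest-mem", MFD_INACCESSIBLE);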

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 include/uapi/linux/memfd.h |  1 +
 mm/memfd.c                 | 15 ++++++++++++++-
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
index 7a8a26751c23..48750474b904 100644
--- a/include/uapi/linux/memfd.h
+++ b/include/uapi/linux/memfd.h
@@ -8,6 +8,7 @@
 #define MFD_CLOEXEC		0x0001U
 #define MFD_ALLOW_SEALING	0x0002U
 #define MFD_HUGETLB		0x0004U
+#define MFD_INACCESSIBLE	0x0008U
 
 /*
  * Huge page size encoding when MFD_HUGETLB is specified, and a huge page
diff --git a/mm/memfd.c b/mm/memfd.c
index 08f5f8304746..775541d53f1b 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -18,6 +18,7 @@
 #include <linux/hugetlb.h>
 #include <linux/shmem_fs.h>
 #include <linux/memfd.h>
+#include <linux/memfile_notifier.h>
 #include <uapi/linux/memfd.h>
 
 /*
@@ -261,7 +262,8 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
 #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
 #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
 
-#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB)
+#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \
+		       MFD_INACCESSIBLE)
 
 SYSCALL_DEFINE2(memfd_create,
 		const char __user *, uname,
@@ -283,6 +285,10 @@ SYSCALL_DEFINE2(memfd_create,
 			return -EINVAL;
 	}
 
+	/* Disallow sealing when MFD_INACCESSIBLE is set. */
+	if (flags & MFD_INACCESSIBLE && flags & MFD_ALLOW_SEALING)
+		return -EINVAL;
+
 	/* length includes terminating zero */
 	len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1);
 	if (len <= 0)
@@ -329,12 +335,19 @@ SYSCALL_DEFINE2(memfd_create,
 	if (flags & MFD_ALLOW_SEALING) {
 		file_seals = memfd_file_seals_ptr(file);
 		*file_seals &= ~F_SEAL_SEAL;
+	} else if (flags & MFD_INACCESSIBLE) {
+		error = memfile_node_set_flags(file,
+					       MEMFILE_F_USER_INACCESSIBLE);
+		if (error)
+			goto err_file;
 	}
 
 	fd_install(fd, file);
 	kfree(name);
 	return fd;
 
+err_file:
+	fput(file);
 err_fd:
 	put_unused_fd(fd);
 err_name:
-- 
2.25.1



* [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory
  2022-05-19 15:37 [PATCH v6 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
                   ` (2 preceding siblings ...)
  2022-05-19 15:37 ` [PATCH v6 3/8] mm/memfd: Introduce MFD_INACCESSIBLE flag Chao Peng
@ 2022-05-19 15:37 ` Chao Peng
  2022-05-20 17:57   ` Andy Lutomirski
  2022-06-17 20:52   ` Sean Christopherson
  2022-05-19 15:37 ` [PATCH v6 5/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit Chao Peng
                   ` (4 subsequent siblings)
  8 siblings, 2 replies; 58+ messages in thread
From: Chao Peng @ 2022-05-19 15:37 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Chao Peng,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david,
	aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko

Extend the memslot definition to provide guest private memory through a
file descriptor (fd) instead of userspace_addr (hva). Such guest private
memory (fd) may never be mapped into userspace, so no userspace_addr
(hva) can be used. Instead, add two new fields
(private_fd/private_offset) which, together with the existing
memory_size, represent the private memory range. Such a memslot can still
have the existing userspace_addr (hva). When in use, a single memslot can
maintain both private memory through the private fd
(private_fd/private_offset) and shared memory through the hva
(userspace_addr). A GPA is considered private by KVM if the memslot has a
private fd and the corresponding page in the private fd is populated;
otherwise, it's shared.

Since there is no userspace mapping for the private fd, we cannot rely on
get_user_pages() to get the pfn in KVM. Instead we add a new
memfile_notifier to the memslot and rely on it to get the pfn through the
callbacks the memory backing store provides for the fd/offset.

This new extension is indicated by a new flag, KVM_MEM_PRIVATE. At
compile time, a new config HAVE_KVM_PRIVATE_MEM is added; right now it is
selected on X86_64 for Intel TDX usage.

To keep KVM simple, internally we use a binary-compatible struct
kvm_user_mem_region to handle both the normal and the '_ext' variants.

Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 Documentation/virt/kvm/api.rst   | 38 ++++++++++++++++++++++++++------
 arch/mips/include/asm/kvm_host.h |  2 +-
 arch/x86/include/asm/kvm_host.h  |  2 +-
 arch/x86/kvm/Kconfig             |  2 ++
 arch/x86/kvm/x86.c               |  2 +-
 include/linux/kvm_host.h         | 19 +++++++++++-----
 include/uapi/linux/kvm.h         | 24 ++++++++++++++++++++
 virt/kvm/Kconfig                 |  3 +++
 virt/kvm/kvm_main.c              | 33 +++++++++++++++++++++------
 9 files changed, 103 insertions(+), 22 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 23baf7fce038..b959445b64cc 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1311,7 +1311,7 @@ yet and must be cleared on entry.
 :Capability: KVM_CAP_USER_MEMORY
 :Architectures: all
 :Type: vm ioctl
-:Parameters: struct kvm_userspace_memory_region (in)
+:Parameters: struct kvm_userspace_memory_region(_ext) (in)
 :Returns: 0 on success, -1 on error
 
 ::
@@ -1324,9 +1324,18 @@ yet and must be cleared on entry.
 	__u64 userspace_addr; /* start of the userspace allocated memory */
   };
 
+  struct kvm_userspace_memory_region_ext {
+	struct kvm_userspace_memory_region region;
+	__u64 private_offset;
+	__u32 private_fd;
+	__u32 pad1;
+	__u64 pad2[14];
+};
+
   /* for kvm_memory_region::flags */
   #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
   #define KVM_MEM_READONLY	(1UL << 1)
+  #define KVM_MEM_PRIVATE		(1UL << 2)
 
 This ioctl allows the user to create, modify or delete a guest physical
 memory slot.  Bits 0-15 of "slot" specify the slot id and this value
@@ -1357,12 +1366,27 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
 be identical.  This allows large pages in the guest to be backed by large
 pages in the host.
 
-The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
-KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
-writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
-use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
-to make a new slot read-only.  In this case, writes to this memory will be
-posted to userspace as KVM_EXIT_MMIO exits.
+kvm_userspace_memory_region_ext includes all the kvm_userspace_memory_region
+fields. It also includes additional fields for some specific features. See
+below description of flags field for more information. It's recommended to use
+kvm_userspace_memory_region_ext in new userspace code.
+
+The flags field supports below flags:
+
+- KVM_MEM_LOG_DIRTY_PAGES can be set to instruct KVM to keep track of writes to
+  memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to use it.
+
+- KVM_MEM_READONLY can be set, if KVM_CAP_READONLY_MEM capability allows it, to
+  make a new slot read-only.  In this case, writes to this memory will be posted
+  to userspace as KVM_EXIT_MMIO exits.
+
+- KVM_MEM_PRIVATE can be set to indicate a new slot has private memory backed by
+  a file descriptor (fd) and the content of the private memory is invisible to
+  userspace. In this case, userspace should use private_fd/private_offset in
+  kvm_userspace_memory_region_ext to instruct KVM to provide private memory to
+  guest. Userspace should guarantee not to map the same pfn indicated by
+  private_fd/private_offset to different gfns with multiple memslots. Failure
+  to do this may result in undefined behavior.
 
 When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
 the memory region are automatically reflected into the guest.  For example, an
diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
index 717716cc51c5..45a978c805bc 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -85,7 +85,7 @@
 
 #define KVM_MAX_VCPUS		16
 /* memory slots that does not exposed to userspace */
-#define KVM_PRIVATE_MEM_SLOTS	0
+#define KVM_INTERNAL_MEM_SLOTS	0
 
 #define KVM_HALT_POLL_NS_DEFAULT 500000
 
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c59fea4bdb6e..3f5e81ef77c8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -53,7 +53,7 @@
 #define KVM_MAX_VCPU_IDS (KVM_MAX_VCPUS * KVM_VCPU_ID_RATIO)
 
 /* memory slots that are not exposed to userspace */
-#define KVM_PRIVATE_MEM_SLOTS 3
+#define KVM_INTERNAL_MEM_SLOTS 3
 
 #define KVM_HALT_POLL_NS_DEFAULT 200000
 
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index e3cbd7706136..1f160801e2a7 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -48,6 +48,8 @@ config KVM
 	select SRCU
 	select INTERVAL_TREE
 	select HAVE_KVM_PM_NOTIFIER if PM
+	select HAVE_KVM_PRIVATE_MEM if X86_64
+	select MEMFILE_NOTIFIER if HAVE_KVM_PRIVATE_MEM
 	help
 	  Support hosting fully virtualized guest machines using hardware
 	  virtualization extensions.  You will need a fairly recent
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8ee8c91fa762..d873ae56b01a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11910,7 +11910,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
 	}
 
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
-		struct kvm_userspace_memory_region m;
+		struct kvm_user_mem_region m;
 
 		m.slot = id | (i << 16);
 		m.flags = 0;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f94f72bbd2d3..3fd168972ecd 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -44,6 +44,7 @@
 
 #include <asm/kvm_host.h>
 #include <linux/kvm_dirty_ring.h>
+#include <linux/memfile_notifier.h>
 
 #ifndef KVM_MAX_VCPU_IDS
 #define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS
@@ -573,8 +574,16 @@ struct kvm_memory_slot {
 	u32 flags;
 	short id;
 	u16 as_id;
+	struct file *private_file;
+	loff_t private_offset;
+	struct memfile_notifier notifier;
 };
 
+static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot)
+{
+	return slot && (slot->flags & KVM_MEM_PRIVATE);
+}
+
 static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
 {
 	return slot->flags & KVM_MEM_LOG_DIRTY_PAGES;
@@ -653,12 +662,12 @@ struct kvm_irq_routing_table {
 };
 #endif
 
-#ifndef KVM_PRIVATE_MEM_SLOTS
-#define KVM_PRIVATE_MEM_SLOTS 0
+#ifndef KVM_INTERNAL_MEM_SLOTS
+#define KVM_INTERNAL_MEM_SLOTS 0
 #endif
 
 #define KVM_MEM_SLOTS_NUM SHRT_MAX
-#define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_PRIVATE_MEM_SLOTS)
+#define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_INTERNAL_MEM_SLOTS)
 
 #ifndef __KVM_VCPU_MULTIPLE_ADDRESS_SPACE
 static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
@@ -1087,9 +1096,9 @@ enum kvm_mr_change {
 };
 
 int kvm_set_memory_region(struct kvm *kvm,
-			  const struct kvm_userspace_memory_region *mem);
+			  const struct kvm_user_mem_region *mem);
 int __kvm_set_memory_region(struct kvm *kvm,
-			    const struct kvm_userspace_memory_region *mem);
+			    const struct kvm_user_mem_region *mem);
 void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
 void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index e10d131edd80..28cacd3656d4 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -103,6 +103,29 @@ struct kvm_userspace_memory_region {
 	__u64 userspace_addr; /* start of the userspace allocated memory */
 };
 
+struct kvm_userspace_memory_region_ext {
+	struct kvm_userspace_memory_region region;
+	__u64 private_offset;
+	__u32 private_fd;
+	__u32 pad1;
+	__u64 pad2[14];
+};
+
+#ifdef __KERNEL__
+/* Internal helper, the layout must match above user visible structures */
+struct kvm_user_mem_region {
+	__u32 slot;
+	__u32 flags;
+	__u64 guest_phys_addr;
+	__u64 memory_size;
+	__u64 userspace_addr;
+	__u64 private_offset;
+	__u32 private_fd;
+	__u32 pad1;
+	__u64 pad2[14];
+};
+#endif
+
 /*
  * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
  * other bits are reserved for kvm internal use which are defined in
@@ -110,6 +133,7 @@ struct kvm_userspace_memory_region {
  */
 #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
 #define KVM_MEM_READONLY	(1UL << 1)
+#define KVM_MEM_PRIVATE		(1UL << 2)
 
 /* for KVM_IRQ_LINE */
 struct kvm_irq_level {
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index a8c5c9f06b3c..ccaff13cc5b8 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -72,3 +72,6 @@ config KVM_XFER_TO_GUEST_WORK
 
 config HAVE_KVM_PM_NOTIFIER
        bool
+
+config HAVE_KVM_PRIVATE_MEM
+       bool
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e089db822c12..db9d39a2d3a6 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1830,7 +1830,7 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
  * Must be called holding kvm->slots_lock for write.
  */
 int __kvm_set_memory_region(struct kvm *kvm,
-			    const struct kvm_userspace_memory_region *mem)
+			    const struct kvm_user_mem_region *mem)
 {
 	struct kvm_memory_slot *old, *new;
 	struct kvm_memslots *slots;
@@ -1934,7 +1934,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
 
 int kvm_set_memory_region(struct kvm *kvm,
-			  const struct kvm_userspace_memory_region *mem)
+			  const struct kvm_user_mem_region *mem)
 {
 	int r;
 
@@ -1946,7 +1946,7 @@ int kvm_set_memory_region(struct kvm *kvm,
 EXPORT_SYMBOL_GPL(kvm_set_memory_region);
 
 static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
-					  struct kvm_userspace_memory_region *mem)
+					  struct kvm_user_mem_region *mem)
 {
 	if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
 		return -EINVAL;
@@ -4500,14 +4500,33 @@ static long kvm_vm_ioctl(struct file *filp,
 		break;
 	}
 	case KVM_SET_USER_MEMORY_REGION: {
-		struct kvm_userspace_memory_region kvm_userspace_mem;
+		struct kvm_user_mem_region mem;
+		unsigned long size;
+		u32 flags;
+
+		memset(&mem, 0, sizeof(mem));
 
 		r = -EFAULT;
-		if (copy_from_user(&kvm_userspace_mem, argp,
-						sizeof(kvm_userspace_mem)))
+
+		if (get_user(flags,
+			(u32 __user *)(argp + offsetof(typeof(mem), flags))))
+			goto out;
+
+		if (flags & KVM_MEM_PRIVATE) {
+			r = -EINVAL;
+			goto out;
+		}
+
+		size = sizeof(struct kvm_userspace_memory_region);
+
+		if (copy_from_user(&mem, argp, size))
+			goto out;
+
+		r = -EINVAL;
+		if ((flags ^ mem.flags) & KVM_MEM_PRIVATE)
 			goto out;
 
-		r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
+		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
 		break;
 	}
 	case KVM_GET_DIRTY_LOG: {
-- 
2.25.1



* [PATCH v6 5/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit
  2022-05-19 15:37 [PATCH v6 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
                   ` (3 preceding siblings ...)
  2022-05-19 15:37 ` [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory Chao Peng
@ 2022-05-19 15:37 ` Chao Peng
  2022-05-19 15:37 ` [PATCH v6 6/8] KVM: Handle page fault for private memory Chao Peng
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 58+ messages in thread
From: Chao Peng @ 2022-05-19 15:37 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Chao Peng,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david,
	aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko

This new KVM exit allows userspace to handle memory-related errors. It
indicates that an error happened in KVM at the guest memory range
[gpa, gpa+size). The flags field includes additional information for
userspace to handle the error. Currently bit 0 is defined as 'private
memory', where '1' indicates the error happened due to a private memory
access and '0' indicates it happened due to a shared memory access.

After private memory is enabled, this new exit will be used for KVM to
exit to userspace for shared memory <-> private memory conversion in
memory encryption usage.

In such usage, typically there are two kinds of memory conversions:
  - explicit conversion: happens when the guest explicitly calls into KVM
    to map a range (as private or shared); KVM then exits to userspace to
    do the map/unmap operations.
  - implicit conversion: happens in the KVM page fault handler.
    * If the fault is due to a private memory access, it causes a
      userspace exit for a shared->private conversion request when the
      page has not been allocated in the private memory backend.
    * If the fault is due to a shared memory access, it causes a
      userspace exit for a private->shared conversion request when the
      page has already been allocated in the private memory backend.
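
A rough sketch of the userspace (VMM) side of such a conversion, assuming
'run' is the mmap()ed kvm_run structure, 'memfd' is the memslot's private
fd and gpa_to_fd_offset() is a placeholder helper that maps a gpa to an
offset in that fd (only the exit reason and flag come from this patch):

  case KVM_EXIT_MEMORY_FAULT:
          if (run->memory.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE)
                  /* guest wants the range private: allocate it in the fd */
                  fallocate(memfd, 0,
                            gpa_to_fd_offset(run->memory.gpa),
                            run->memory.size);
          else
                  /* guest wants the range shared: punch a hole in the fd */
                  fallocate(memfd,
                            FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                            gpa_to_fd_offset(run->memory.gpa),
                            run->memory.size);
          /* re-enter the guest (KVM_RUN) to retry the faulting access */
          break;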

Suggested-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 Documentation/virt/kvm/api.rst | 22 ++++++++++++++++++++++
 include/uapi/linux/kvm.h       |  9 +++++++++
 2 files changed, 31 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index b959445b64cc..2421c012278b 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6341,6 +6341,28 @@ array field represents return values. The userspace should update the return
 values of SBI call before resuming the VCPU. For more details on RISC-V SBI
 spec refer, https://github.com/riscv/riscv-sbi-doc.
 
+::
+
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+  #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
+			__u32 flags;
+			__u32 padding;
+			__u64 gpa;
+			__u64 size;
+		} memory;
+If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has
+encountered a memory error which is not handled by KVM kernel module and
+userspace may choose to handle it. The 'flags' field indicates the memory
+properties of the exit.
+
+ - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by
+   private memory access when the bit is set otherwise the memory error is
+   caused by shared memory access when the bit is clear.
+
+'gpa' and 'size' indicate the memory range the error occurs at. The userspace
+may handle the error and return to KVM to retry the previous memory access.
+
 ::
 
 		/* Fix the size of the union. */
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 28cacd3656d4..6ca864be258f 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -294,6 +294,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_X86_BUS_LOCK     33
 #define KVM_EXIT_XEN              34
 #define KVM_EXIT_RISCV_SBI        35
+#define KVM_EXIT_MEMORY_FAULT     36
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -518,6 +519,14 @@ struct kvm_run {
 			unsigned long args[6];
 			unsigned long ret[2];
 		} riscv_sbi;
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+#define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
+			__u32 flags;
+			__u32 padding;
+			__u64 gpa;
+			__u64 size;
+		} memory;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
-- 
2.25.1



* [PATCH v6 6/8] KVM: Handle page fault for private memory
  2022-05-19 15:37 [PATCH v6 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
                   ` (4 preceding siblings ...)
  2022-05-19 15:37 ` [PATCH v6 5/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit Chao Peng
@ 2022-05-19 15:37 ` Chao Peng
  2022-06-17 21:30   ` Sean Christopherson
  2022-06-24  3:58   ` Nikunj A. Dadhania
  2022-05-19 15:37 ` [PATCH v6 7/8] KVM: Enable and expose KVM_MEM_PRIVATE Chao Peng
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 58+ messages in thread
From: Chao Peng @ 2022-05-19 15:37 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Chao Peng,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david,
	aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko

A page fault can carry the information of whether the access is private
or not for a KVM_MEM_PRIVATE memslot; this can be filled in by
architecture code (like TDX code). To handle a page fault for such an
access, KVM maps the page only when this private property matches the
host's view of the page, which can be decided by checking whether the
corresponding page is populated in the private fd or not. A page is
considered private when it is populated in the private fd, otherwise it's
shared.

For a successful match, a private pfn is obtained with the
memfile_notifier callbacks from the private fd and a shared pfn is
obtained with the existing get_user_pages().

For a failed match, KVM causes a KVM_EXIT_MEMORY_FAULT exit to userspace.
Userspace can then convert the memory between private/shared from the
host's view and retry the access.
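
As a quick reference, the resulting decision is (page state means whether
the gfn's page is populated in the private fd):

  fault is private | page in private fd | action
  -----------------+--------------------+--------------------------------------
  yes              | yes                | map private pfn from backing store
  yes              | no                 | KVM_EXIT_MEMORY_FAULT (shared->private)
  no               | yes                | KVM_EXIT_MEMORY_FAULT (private->shared)
  no               | no                 | map shared pfn via get_user_pages()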

Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 arch/x86/kvm/mmu.h              |  1 +
 arch/x86/kvm/mmu/mmu.c          | 70 +++++++++++++++++++++++++++++++--
 arch/x86/kvm/mmu/mmu_internal.h | 17 ++++++++
 arch/x86/kvm/mmu/mmutrace.h     |  1 +
 arch/x86/kvm/mmu/paging_tmpl.h  |  5 ++-
 include/linux/kvm_host.h        | 22 +++++++++++
 6 files changed, 112 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 7e258cc94152..c84835762249 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -176,6 +176,7 @@ struct kvm_page_fault {
 
 	/* Derived from mmu and global state.  */
 	const bool is_tdp;
+	const bool is_private;
 	const bool nx_huge_page_workaround_enabled;
 
 	/*
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index afe18d70ece7..e18460e0d743 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2899,6 +2899,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
 	if (max_level == PG_LEVEL_4K)
 		return PG_LEVEL_4K;
 
+	if (kvm_slot_is_private(slot))
+		return max_level;
+
 	host_level = host_pfn_mapping_level(kvm, gfn, pfn, slot);
 	return min(host_level, max_level);
 }
@@ -3948,10 +3951,54 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 				  kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
 }
 
+static inline u8 order_to_level(int order)
+{
+	enum pg_level level;
+
+	for (level = KVM_MAX_HUGEPAGE_LEVEL; level > PG_LEVEL_4K; level--)
+		if (order >= page_level_shift(level) - PAGE_SHIFT)
+			return level;
+	return level;
+}
+
+static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
+				   struct kvm_page_fault *fault)
+{
+	int order;
+	struct kvm_memory_slot *slot = fault->slot;
+	bool private_exist = !kvm_private_mem_get_pfn(slot, fault->gfn,
+						      &fault->pfn, &order);
+
+	if (fault->is_private != private_exist) {
+		if (private_exist)
+			kvm_private_mem_put_pfn(slot, fault->pfn);
+
+		vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
+		if (fault->is_private)
+			vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
+		else
+			vcpu->run->memory.flags = 0;
+		vcpu->run->memory.padding = 0;
+		vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
+		vcpu->run->memory.size = PAGE_SIZE;
+		return RET_PF_USER;
+	}
+
+	if (fault->is_private) {
+		fault->max_level = min(order_to_level(order), fault->max_level);
+		fault->map_writable = !(slot->flags & KVM_MEM_READONLY);
+		return RET_PF_FIXED;
+	}
+
+	/* Fault is shared, fallthrough to the standard path. */
+	return RET_PF_CONTINUE;
+}
+
 static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
 	struct kvm_memory_slot *slot = fault->slot;
 	bool async;
+	int r;
 
 	/*
 	 * Retry the page fault if the gfn hit a memslot that is being deleted
@@ -3980,6 +4027,12 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 			return RET_PF_EMULATE;
 	}
 
+	if (kvm_slot_is_private(slot)) {
+		r = kvm_faultin_pfn_private(vcpu, fault);
+		if (r != RET_PF_CONTINUE)
+			return r == RET_PF_FIXED ? RET_PF_CONTINUE : r;
+	}
+
 	async = false;
 	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, &async,
 					  fault->write, &fault->map_writable,
@@ -4028,8 +4081,11 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
 	if (!sp && kvm_test_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu))
 		return true;
 
-	return fault->slot &&
-	       mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
+	if (fault->is_private)
+		return mmu_notifier_retry(vcpu->kvm, mmu_seq);
+	else
+		return fault->slot &&
+			mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
 }
 
 static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
@@ -4088,7 +4144,12 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 		read_unlock(&vcpu->kvm->mmu_lock);
 	else
 		write_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(fault->pfn);
+
+	if (fault->is_private)
+		kvm_private_mem_put_pfn(fault->slot, fault->pfn);
+	else
+		kvm_release_pfn_clean(fault->pfn);
+
 	return r;
 }
 
@@ -5372,6 +5433,9 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
 			return -EIO;
 	}
 
+	if (r == RET_PF_USER)
+		return 0;
+
 	if (r < 0)
 		return r;
 	if (r != RET_PF_EMULATE)
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index c0e502b17ef7..14932cf97655 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -147,6 +147,7 @@ unsigned int pte_list_count(struct kvm_rmap_head *rmap_head);
  * RET_PF_RETRY: let CPU fault again on the address.
  * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
  * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
+ * RET_PF_USER: need to exit to userspace to handle this fault.
  * RET_PF_FIXED: The faulting entry has been fixed.
  * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
  *
@@ -163,6 +164,7 @@ enum {
 	RET_PF_RETRY,
 	RET_PF_EMULATE,
 	RET_PF_INVALID,
+	RET_PF_USER,
 	RET_PF_FIXED,
 	RET_PF_SPURIOUS,
 };
@@ -178,4 +180,19 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 
+#ifndef CONFIG_HAVE_KVM_PRIVATE_MEM
+static inline int kvm_private_mem_get_pfn(struct kvm_memory_slot *slot,
+					  gfn_t gfn, kvm_pfn_t *pfn, int *order)
+{
+	WARN_ON_ONCE(1);
+	return -EOPNOTSUPP;
+}
+
+static inline void kvm_private_mem_put_pfn(struct kvm_memory_slot *slot,
+					   kvm_pfn_t pfn)
+{
+	WARN_ON_ONCE(1);
+}
+#endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */
+
 #endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
index ae86820cef69..2d7555381955 100644
--- a/arch/x86/kvm/mmu/mmutrace.h
+++ b/arch/x86/kvm/mmu/mmutrace.h
@@ -58,6 +58,7 @@ TRACE_DEFINE_ENUM(RET_PF_CONTINUE);
 TRACE_DEFINE_ENUM(RET_PF_RETRY);
 TRACE_DEFINE_ENUM(RET_PF_EMULATE);
 TRACE_DEFINE_ENUM(RET_PF_INVALID);
+TRACE_DEFINE_ENUM(RET_PF_USER);
 TRACE_DEFINE_ENUM(RET_PF_FIXED);
 TRACE_DEFINE_ENUM(RET_PF_SPURIOUS);
 
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 7f8f1c8dbed2..1d857919a947 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -878,7 +878,10 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 
 out_unlock:
 	write_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(fault->pfn);
+	if (fault->is_private)
+		kvm_private_mem_put_pfn(fault->slot, fault->pfn);
+	else
+		kvm_release_pfn_clean(fault->pfn);
 	return r;
 }
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3fd168972ecd..b0a7910505ed 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2241,4 +2241,26 @@ static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu)
 /* Max number of entries allowed for each kvm dirty ring */
 #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
 
+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
+static inline int kvm_private_mem_get_pfn(struct kvm_memory_slot *slot,
+					  gfn_t gfn, kvm_pfn_t *pfn, int *order)
+{
+	int ret;
+	pfn_t pfnt;
+	pgoff_t index = gfn - slot->base_gfn +
+			(slot->private_offset >> PAGE_SHIFT);
+
+	ret = slot->notifier.bs->get_lock_pfn(slot->private_file, index, &pfnt,
+						order);
+	*pfn = pfn_t_to_pfn(pfnt);
+	return ret;
+}
+
+static inline void kvm_private_mem_put_pfn(struct kvm_memory_slot *slot,
+					   kvm_pfn_t pfn)
+{
+	slot->notifier.bs->put_unlock_pfn(pfn_to_pfn_t(pfn));
+}
+#endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */
+
 #endif
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v6 7/8] KVM: Enable and expose KVM_MEM_PRIVATE
  2022-05-19 15:37 [PATCH v6 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
                   ` (5 preceding siblings ...)
  2022-05-19 15:37 ` [PATCH v6 6/8] KVM: Handle page fault for private memory Chao Peng
@ 2022-05-19 15:37 ` Chao Peng
  2022-06-23 22:07   ` Michael Roth
  2022-05-19 15:37 ` [PATCH v6 8/8] memfd_create.2: Describe MFD_INACCESSIBLE flag Chao Peng
  2022-06-06 20:09 ` [PATCH v6 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory Vishal Annapurve
  8 siblings, 1 reply; 58+ messages in thread
From: Chao Peng @ 2022-05-19 15:37 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Chao Peng,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david,
	aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko

Register the private memslot with the fd-based memory backing store and
handle the memfile notifiers to zap the existing mappings.

Currently registration happens at memslot creation time and the initial
support does not include page migration/swap.

KVM_MEM_PRIVATE is not exposed by default; architecture code can turn it
on by implementing kvm_arch_private_mem_supported().

A 'kvm' reference is added to the memslot structure since the
memfile_notifier callbacks can only obtain a memslot reference, while kvm
is needed to do the zapping. The zapping itself reuses code from the
existing mmu notifier handling.
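
For reference, a VMM would create such a memslot roughly as below. The
kvm_userspace_memory_region_ext layout comes from an earlier patch in this
series; the exact field names used here are an assumption for illustration:

	struct kvm_userspace_memory_region_ext region_ext = {
		.region = {
			.slot            = 0,
			.flags           = KVM_MEM_PRIVATE,
			.guest_phys_addr = gpa,
			.memory_size     = size,
			/* Shared part is still provided through the hva. */
			.userspace_addr  = (__u64)shared_hva,
		},
		/* Private part is provided through the inaccessible memfd. */
		.private_fd     = private_fd,
		.private_offset = 0,
	};

	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region_ext);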

Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 include/linux/kvm_host.h |  10 ++-
 virt/kvm/kvm_main.c      | 132 ++++++++++++++++++++++++++++++++++++---
 2 files changed, 131 insertions(+), 11 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b0a7910505ed..00efb4b96bc7 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -246,7 +246,7 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
 #endif
 
-#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
+#if defined(KVM_ARCH_WANT_MMU_NOTIFIER) || defined(CONFIG_MEMFILE_NOTIFIER)
 struct kvm_gfn_range {
 	struct kvm_memory_slot *slot;
 	gfn_t start;
@@ -577,6 +577,7 @@ struct kvm_memory_slot {
 	struct file *private_file;
 	loff_t private_offset;
 	struct memfile_notifier notifier;
+	struct kvm *kvm;
 };
 
 static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot)
@@ -769,9 +770,13 @@ struct kvm {
 	struct hlist_head irq_ack_notifier_list;
 #endif
 
+#if (defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)) ||\
+	defined(CONFIG_MEMFILE_NOTIFIER)
+	unsigned long mmu_notifier_seq;
+#endif
+
 #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
 	struct mmu_notifier mmu_notifier;
-	unsigned long mmu_notifier_seq;
 	long mmu_notifier_count;
 	unsigned long mmu_notifier_range_start;
 	unsigned long mmu_notifier_range_end;
@@ -1438,6 +1443,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
 int kvm_arch_post_init_vm(struct kvm *kvm);
 void kvm_arch_pre_destroy_vm(struct kvm *kvm);
 int kvm_arch_create_vm_debugfs(struct kvm *kvm);
+bool kvm_arch_private_mem_supported(struct kvm *kvm);
 
 #ifndef __KVM_HAVE_ARCH_VM_ALLOC
 /*
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index db9d39a2d3a6..f93ac7cdfb53 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -843,6 +843,73 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
 
 #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
 
+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
+static void kvm_private_mem_notifier_handler(struct memfile_notifier *notifier,
+					     pgoff_t start, pgoff_t end)
+{
+	int idx;
+	struct kvm_memory_slot *slot = container_of(notifier,
+						    struct kvm_memory_slot,
+						    notifier);
+	struct kvm_gfn_range gfn_range = {
+		.slot		= slot,
+		.start		= start - (slot->private_offset >> PAGE_SHIFT),
+		.end		= end - (slot->private_offset >> PAGE_SHIFT),
+		.may_block 	= true,
+	};
+	struct kvm *kvm = slot->kvm;
+
+	gfn_range.start = slot->base_gfn + gfn_range.start;
+	gfn_range.end = slot->base_gfn + min((unsigned long)gfn_range.end, slot->npages);
+
+	if (WARN_ON_ONCE(gfn_range.start >= gfn_range.end))
+		return;
+
+	idx = srcu_read_lock(&kvm->srcu);
+	KVM_MMU_LOCK(kvm);
+	if (kvm_unmap_gfn_range(kvm, &gfn_range))
+		kvm_flush_remote_tlbs(kvm);
+	kvm->mmu_notifier_seq++;
+	KVM_MMU_UNLOCK(kvm);
+	srcu_read_unlock(&kvm->srcu, idx);
+}
+
+static struct memfile_notifier_ops kvm_private_mem_notifier_ops = {
+	.populate = kvm_private_mem_notifier_handler,
+	.invalidate = kvm_private_mem_notifier_handler,
+};
+
+#define KVM_MEMFILE_FLAGS MEMFILE_F_USER_INACCESSIBLE | \
+			  MEMFILE_F_UNMOVABLE | \
+			  MEMFILE_F_UNRECLAIMABLE
+
+static inline int kvm_private_mem_register(struct kvm_memory_slot *slot)
+{
+	slot->notifier.ops = &kvm_private_mem_notifier_ops;
+	return memfile_register_notifier(slot->private_file, KVM_MEMFILE_FLAGS,
+					 &slot->notifier);
+}
+
+static inline void kvm_private_mem_unregister(struct kvm_memory_slot *slot)
+{
+	memfile_unregister_notifier(&slot->notifier);
+}
+
+#else /* !CONFIG_HAVE_KVM_PRIVATE_MEM */
+
+static inline int kvm_private_mem_register(struct kvm_memory_slot *slot)
+{
+	WARN_ON_ONCE(1);
+	return -EOPNOTSUPP;
+}
+
+static inline void kvm_private_mem_unregister(struct kvm_memory_slot *slot)
+{
+	WARN_ON_ONCE(1);
+}
+
+#endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */
+
 #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
 static int kvm_pm_notifier_call(struct notifier_block *bl,
 				unsigned long state,
@@ -887,6 +954,11 @@ static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
 /* This does not remove the slot from struct kvm_memslots data structures */
 static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
 {
+	if (slot->flags & KVM_MEM_PRIVATE) {
+		kvm_private_mem_unregister(slot);
+		fput(slot->private_file);
+	}
+
 	kvm_destroy_dirty_bitmap(slot);
 
 	kvm_arch_free_memslot(kvm, slot);
@@ -1437,10 +1509,21 @@ static void kvm_replace_memslot(struct kvm *kvm,
 	}
 }
 
-static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
+bool __weak kvm_arch_private_mem_supported(struct kvm *kvm)
+{
+	return false;
+}
+
+static int check_memory_region_flags(struct kvm *kvm,
+				     const struct kvm_user_mem_region *mem)
 {
 	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
 
+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
+	if (kvm_arch_private_mem_supported(kvm))
+		valid_flags |= KVM_MEM_PRIVATE;
+#endif
+
 #ifdef __KVM_HAVE_READONLY_MEM
 	valid_flags |= KVM_MEM_READONLY;
 #endif
@@ -1516,6 +1599,12 @@ static int kvm_prepare_memory_region(struct kvm *kvm,
 {
 	int r;
 
+	if (change == KVM_MR_CREATE && new->flags & KVM_MEM_PRIVATE) {
+		r = kvm_private_mem_register(new);
+		if (r)
+			return r;
+	}
+
 	/*
 	 * If dirty logging is disabled, nullify the bitmap; the old bitmap
 	 * will be freed on "commit".  If logging is enabled in both old and
@@ -1544,6 +1633,9 @@ static int kvm_prepare_memory_region(struct kvm *kvm,
 	if (r && new && new->dirty_bitmap && old && !old->dirty_bitmap)
 		kvm_destroy_dirty_bitmap(new);
 
+	if (r && change == KVM_MR_CREATE && new->flags & KVM_MEM_PRIVATE)
+	    kvm_private_mem_unregister(new);
+
 	return r;
 }
 
@@ -1840,7 +1932,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	int as_id, id;
 	int r;
 
-	r = check_memory_region_flags(mem);
+	r = check_memory_region_flags(kvm, mem);
 	if (r)
 		return r;
 
@@ -1859,6 +1951,10 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	     !access_ok((void __user *)(unsigned long)mem->userspace_addr,
 			mem->memory_size))
 		return -EINVAL;
+	if (mem->flags & KVM_MEM_PRIVATE &&
+		(mem->private_offset & (PAGE_SIZE - 1) ||
+		 mem->private_offset > U64_MAX - mem->memory_size))
+		return -EINVAL;
 	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
 		return -EINVAL;
 	if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
@@ -1897,6 +1993,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
 		if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages)
 			return -EINVAL;
 	} else { /* Modify an existing slot. */
+		/* Private memslots are immutable, they can only be deleted. */
+		if (mem->flags & KVM_MEM_PRIVATE)
+			return -EINVAL;
 		if ((mem->userspace_addr != old->userspace_addr) ||
 		    (npages != old->npages) ||
 		    ((mem->flags ^ old->flags) & KVM_MEM_READONLY))
@@ -1925,10 +2024,27 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	new->npages = npages;
 	new->flags = mem->flags;
 	new->userspace_addr = mem->userspace_addr;
+	if (mem->flags & KVM_MEM_PRIVATE) {
+		new->private_file = fget(mem->private_fd);
+		if (!new->private_file) {
+			r = -EINVAL;
+			goto out;
+		}
+		new->private_offset = mem->private_offset;
+	}
+
+	new->kvm = kvm;
 
 	r = kvm_set_memslot(kvm, old, new, change);
 	if (r)
-		kfree(new);
+		goto out;
+
+	return 0;
+
+out:
+	if (new->private_file)
+		fput(new->private_file);
+	kfree(new);
 	return r;
 }
 EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
@@ -4512,12 +4628,10 @@ static long kvm_vm_ioctl(struct file *filp,
 			(u32 __user *)(argp + offsetof(typeof(mem), flags))))
 			goto out;
 
-		if (flags & KVM_MEM_PRIVATE) {
-			r = -EINVAL;
-			goto out;
-		}
-
-		size = sizeof(struct kvm_userspace_memory_region);
+		if (flags & KVM_MEM_PRIVATE)
+			size = sizeof(struct kvm_userspace_memory_region_ext);
+		else
+			size = sizeof(struct kvm_userspace_memory_region);
 
 		if (copy_from_user(&mem, argp, size))
 			goto out;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v6 8/8] memfd_create.2: Describe MFD_INACCESSIBLE flag
  2022-05-19 15:37 [PATCH v6 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
                   ` (6 preceding siblings ...)
  2022-05-19 15:37 ` [PATCH v6 7/8] KVM: Enable and expose KVM_MEM_PRIVATE Chao Peng
@ 2022-05-19 15:37 ` Chao Peng
  2022-06-06 20:09 ` [PATCH v6 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory Vishal Annapurve
  8 siblings, 0 replies; 58+ messages in thread
From: Chao Peng @ 2022-05-19 15:37 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Chao Peng,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david,
	aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 man2/memfd_create.2 | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/man2/memfd_create.2 b/man2/memfd_create.2
index 89e9c4136..2698222ae 100644
--- a/man2/memfd_create.2
+++ b/man2/memfd_create.2
@@ -101,6 +101,19 @@ meaning that no other seals can be set on the file.
 .\" FIXME Why is the MFD_ALLOW_SEALING behavior not simply the default?
 .\" Is it worth adding some text explaining this?
 .TP
+.BR MFD_INACCESSIBLE
+Disallow userspace access through ordinary MMU accesses via
+.BR read (2),
+.BR write (2)
+and
+.BR mmap (2).
+The file size cannot be changed once initialized.
+This flag cannot coexist with
+.B MFD_ALLOW_SEALING
+and when this flag is set, the initial set of seals will be
+.B F_SEAL_SEAL,
+meaning that no other seals can be set on the file.
+.TP
 .BR MFD_HUGETLB " (since Linux 4.14)"
 .\" commit 749df87bd7bee5a79cef073f5d032ddb2b211de8
 The anonymous file will be created in the hugetlbfs filesystem using
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory
  2022-05-19 15:37 ` [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory Chao Peng
@ 2022-05-20 17:57   ` Andy Lutomirski
  2022-05-20 18:31     ` Sean Christopherson
  2022-06-17 20:52   ` Sean Christopherson
  1 sibling, 1 reply; 58+ messages in thread
From: Andy Lutomirski @ 2022-05-20 17:57 UTC (permalink / raw)
  To: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko

On 5/19/22 08:37, Chao Peng wrote:
> Extend the memslot definition to provide guest private memory through a
> file descriptor(fd) instead of userspace_addr(hva). Such guest private
> memory(fd) may never be mapped into userspace so no userspace_addr(hva)
> can be used. Instead add another two new fields
> (private_fd/private_offset), plus the existing memory_size to represent
> the private memory range. Such memslot can still have the existing
> userspace_addr(hva). When use, a single memslot can maintain both
> private memory through private fd(private_fd/private_offset) and shared
> memory through hva(userspace_addr). A GPA is considered private by KVM
> if the memslot has private fd and that corresponding page in the private
> fd is populated, otherwise, it's shared.
> 


So this is a strange API and, IMO, a layering violation.  I want to make 
sure that we're all actually on board with making this a permanent part 
of the Linux API.  Specifically, we end up with a multiplexing situation 
as you have described. For a given GPA, there are *two* possible host 
backings: an fd-backed one (from the fd, which is private for now but 
might end up potentially shared depending on future extensions) and a 
VMA-backed one.  The selection of which one backs the address is made 
internally by whatever backs the fd.

This is, IMO, a clear layering violation.  Normally, an fd has an 
associated address space, and pages in that address space can have 
contents, can be holes that appear to contain all zeros, or could have 
holes that are inaccessible.  If you try to access a hole, you get 
whatever is in the hole.

But now, with this patchset, the fd is more of an overlay and you get 
*something else* if you try to access through the hole.

This results in operations on the fd bubbling up to the KVM mapping in 
what is, IMO, a strange way.  If the user punches a hole, KVM has to 
modify its mappings such that the GPA goes to whatever VMA may be there. 
  (And update the RMP, the hypervisor's tables, or whatever else might 
actually control privateness.)  Conversely, if the user does fallocate 
to fill a hole, the guest mapping *to an unrelated page* has to be 
zapped so that the fd's page shows up.  And the RMP needs updating, etc.

I am lukewarm on this for a few reasons.

1. This is weird.  AFAIK nothing else works like this.  Obviously this 
is subjective, but "weird" and "layering violation" sometimes translate 
to "problematic locking".

2. fd-backed private memory can't have normal holes.  If I make a memfd, 
punch a hole in it, and mmap(MAP_SHARED) it, I end up with a page that 
reads as zero.  If I write to it, the page gets allocated.  But with 
this new mechanism, if I punch a hole and put it in a memslot, reads and 
writes go somewhere else.  So what if I actually wanted lazily allocated 
private zeros?

2b. For a hypothetical future extension in which an fd can also have 
shared pages (for conversion, for example, or simply because the fd 
backing might actually be more efficient than indirecting through VMAs 
and therefore get used for shared memory or entirely-non-confidential 
VMs), lazy fd-backed zeros sound genuinely useful.

3. TDX hardware capability is not fully exposed.  TDX can have a private 
page and a shared page at GPAs that differ only by the private bit. 
Sure, no one plans to use this today, but baking this into the user ABI 
throws away half the potential address space.

3b. Any software solution that works like TDX (which IMO seems like an 
eminently reasonable design to me) has the same issue.


The alternative would be to have some kind of separate table or bitmap 
(part of the memslot?) that tells KVM whether a GPA should map to the fd.

What do you all think?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory
  2022-05-20 17:57   ` Andy Lutomirski
@ 2022-05-20 18:31     ` Sean Christopherson
  2022-05-22  4:03       ` Andy Lutomirski
                         ` (2 more replies)
  0 siblings, 3 replies; 58+ messages in thread
From: Sean Christopherson @ 2022-05-20 18:31 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko

On Fri, May 20, 2022, Andy Lutomirski wrote:
> The alternative would be to have some kind of separate table or bitmap (part
> of the memslot?) that tells KVM whether a GPA should map to the fd.
> 
> What do you all think?

My original proposal was to have explicit shared vs. private memslots, and punch
holes in KVM's memslots on conversion, but due to the way KVM (and userspace)
handle memslot updates, conversions would be painfully slow.  That's how we ended
up with the current proposal.

But a dedicated KVM ioctl() to add/remove shared ranges would be easy to implement
and wouldn't necessarily even need to interact with the memslots.  It could be a
consumer of memslots, e.g. if we wanted to disallow registering regions without an
associated memslot, but I think we'd want to avoid even that because things will
get messy during memslot updates, e.g. if dirty logging is toggled or a shared
memory region is temporarily removed then we wouldn't want to destroy the tracking.

I don't think we'd want to use a bitmap, e.g. for a well-behaved guest, XArray
should be far more efficient.

One benefit to explicitly tracking this in KVM is that it might be useful for
software-only protected VMs, e.g. KVM could mark a region in the XArray as "pending"
based on guest hypercalls to share/unshare memory, and then complete the transaction
when userspace invokes the ioctl() to complete the share/unshare.
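
For a sense of what per-gfn tracking could look like (the field and value
names below are invented for illustration, they are not from this series):

	/* Mark a gfn as private in a per-VM xarray. */
	xa_store(&kvm->mem_attr_array, gfn,
		 xa_mk_value(KVM_MEMORY_ATTR_PRIVATE), GFP_KERNEL);

	/* On page fault, look the gfn up. */
	entry = xa_load(&kvm->mem_attr_array, gfn);
	private = entry && xa_to_value(entry) == KVM_MEMORY_ATTR_PRIVATE;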

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory
  2022-05-20 18:31     ` Sean Christopherson
@ 2022-05-22  4:03       ` Andy Lutomirski
  2022-05-23 13:21       ` Chao Peng
  2022-06-23 22:59       ` Michael Roth
  2 siblings, 0 replies; 58+ messages in thread
From: Andy Lutomirski @ 2022-05-22  4:03 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Chao Peng, kvm list, Linux Kernel Mailing List, linux-mm,
	linux-fsdevel, Linux API, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	the arch/x86 maintainers, H. Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A. Shutemov, Nakajima, Jun,
	Dave Hansen, Andi Kleen, David Hildenbrand, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, Michal Hocko



On Fri, May 20, 2022, at 11:31 AM, Sean Christopherson wrote:

> But a dedicated KVM ioctl() to add/remove shared ranges would be easy 
> to implement
> and wouldn't necessarily even need to interact with the memslots.  It 
> could be a
> consumer of memslots, e.g. if we wanted to disallow registering regions 
> without an
> associated memslot, but I think we'd want to avoid even that because 
> things will
> get messy during memslot updates, e.g. if dirty logging is toggled or a 
> shared
> memory region is temporarily removed then we wouldn't want to destroy 
> the tracking.
>
> I don't think we'd want to use a bitmap, e.g. for a well-behaved guest, XArray
> should be far more efficient.
>
> One benefit to explicitly tracking this in KVM is that it might be 
> useful for
> software-only protected VMs, e.g. KVM could mark a region in the XArray 
> as "pending"
> based on guest hypercalls to share/unshare memory, and then complete 
> the transaction
> when userspace invokes the ioctl() to complete the share/unshare.

That makes sense.

If KVM goes this route, perhaps the allowed states for a GPA should include private, shared, and also private-and-shared.  Then anyone who wanted to use the same masked GPA for shared and private on TDX could do so if they wanted to.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory
  2022-05-20 18:31     ` Sean Christopherson
  2022-05-22  4:03       ` Andy Lutomirski
@ 2022-05-23 13:21       ` Chao Peng
  2022-05-23 15:22         ` Sean Christopherson
  2022-06-23 22:59       ` Michael Roth
  2 siblings, 1 reply; 58+ messages in thread
From: Chao Peng @ 2022-05-23 13:21 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Andy Lutomirski, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko

On Fri, May 20, 2022 at 06:31:02PM +0000, Sean Christopherson wrote:
> On Fri, May 20, 2022, Andy Lutomirski wrote:
> > The alternative would be to have some kind of separate table or bitmap (part
> > of the memslot?) that tells KVM whether a GPA should map to the fd.
> > 
> > What do you all think?
> 
> My original proposal was to have expolicit shared vs. private memslots, and punch
> holes in KVM's memslots on conversion, but due to the way KVM (and userspace)
> handle memslot updates, conversions would be painfully slow.  That's how we ended
> up with the current propsoal.
> 
> But a dedicated KVM ioctl() to add/remove shared ranges would be easy to implement
> and wouldn't necessarily even need to interact with the memslots.  It could be a
> consumer of memslots, e.g. if we wanted to disallow registering regions without an
> associated memslot, but I think we'd want to avoid even that because things will
> get messy during memslot updates, e.g. if dirty logging is toggled or a shared
> memory region is temporarily removed then we wouldn't want to destroy the tracking.

Even if we don't tie that to memslots, that info can only be effective
for private memslots, right? Applying this ioctl to memory ranges defined
in traditional non-private memslots just makes no sense; I guess we can
note that in the API documentation.

> 
> I don't think we'd want to use a bitmap, e.g. for a well-behaved guest, XArray
> should be far more efficient.

What about a misbehaving guest? I don't want to design for the worst
case, but people may raise concerns about attacks from such a guest.

> 
> One benefit to explicitly tracking this in KVM is that it might be useful for
> software-only protected VMs, e.g. KVM could mark a region in the XArray as "pending"
> based on guest hypercalls to share/unshare memory, and then complete the transaction
> when userspace invokes the ioctl() to complete the share/unshare.

OK, then this can be another state/flag/attribute field. Let me dig into
a certain level of detail:

First, introduce below KVM ioctl

KVM_SET_MEMORY_ATTR

struct kvm_memory_attr {
	__u64 addr;	/* page aligned */
	__u64 size;	/* page aligned */
#define KVM_MEMORY_ATTR_SHARED		(1 << 0)
#define KVM_MEMORY_ATTR_PRIVATE		(1 << 1)
	__u64 flags;
}

Second, check the KVM-maintained guest memory attributes in the page fault
handler (instead of checking memory existence in the private fd).

Third, the memfile_notifier_ops (populate/invalidate) will be removed from
the current code; zapping the old mappings can be handled directly in this
new KVM ioctl().
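
From userspace this would be used roughly as below (illustrative only,
since the ioctl is merely proposed here):

	struct kvm_memory_attr attr = {
		.addr  = gpa,				/* page aligned */
		.size  = size,				/* page aligned */
		.flags = KVM_MEMORY_ATTR_PRIVATE,	/* or _SHARED */
	};

	ioctl(vm_fd, KVM_SET_MEMORY_ATTR, &attr);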
 
Thoughts?

This info would be stored in KVM, which I think is reasonable. But for
other potential memfile_notifier users like VFIO, some KVM-to-VFIO APIs
might be needed depending on the implementation.

It is also possible to maintain this info purely in userspace. The only
tricky bit is implicit conversion support, which has to be checked in the
KVM page fault handler and is on the fast path.

Thanks,
Chao

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory
  2022-05-23 13:21       ` Chao Peng
@ 2022-05-23 15:22         ` Sean Christopherson
  2022-05-30 13:26           ` Chao Peng
  0 siblings, 1 reply; 58+ messages in thread
From: Sean Christopherson @ 2022-05-23 15:22 UTC (permalink / raw)
  To: Chao Peng
  Cc: Andy Lutomirski, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko

On Mon, May 23, 2022, Chao Peng wrote:
> On Fri, May 20, 2022 at 06:31:02PM +0000, Sean Christopherson wrote:
> > On Fri, May 20, 2022, Andy Lutomirski wrote:
> > > The alternative would be to have some kind of separate table or bitmap (part
> > > of the memslot?) that tells KVM whether a GPA should map to the fd.
> > > 
> > > What do you all think?
> > 
> > My original proposal was to have expolicit shared vs. private memslots, and punch
> > holes in KVM's memslots on conversion, but due to the way KVM (and userspace)
> > handle memslot updates, conversions would be painfully slow.  That's how we ended
> > up with the current propsoal.
> > 
> > But a dedicated KVM ioctl() to add/remove shared ranges would be easy to implement
> > and wouldn't necessarily even need to interact with the memslots.  It could be a
> > consumer of memslots, e.g. if we wanted to disallow registering regions without an
> > associated memslot, but I think we'd want to avoid even that because things will
> > get messy during memslot updates, e.g. if dirty logging is toggled or a shared
> > memory region is temporarily removed then we wouldn't want to destroy the tracking.
> 
> Even we don't tight that to memslots, that info can only be effective
> for private memslot, right? Setting this ioctl to memory ranges defined
> in a traditional non-private memslots just makes no sense, I guess we can
> comment that in the API document.

Hrm, applying it universally would be funky, e.g. emulated MMIO would need to be
declared "shared".  But, applying it selectively would arguably be worse, e.g.
letting userspace map memory into the guest as shared for a region that's registered
as private...

One option to avoid that mess would be to make memory shared by default, so that
userspace must declare regions that are private.  Then there's no weirdness with
emulated MMIO or "legacy" memslots.

On page fault, KVM does a lookup to see if the GPA is shared or private.  If the
GPA is private, but there is no memslot or the memslot doesn't have a private fd,
KVM exits to userspace.  If there's a memslot with a private fd, the shared/private
flag is used to resolve the 

And to handle the ioctl(), KVM can use kvm_zap_gfn_range(), which will bump the
notifier sequence, i.e. force the page fault to retry if the GPA may have been
(un)registered between checking the type and acquiring mmu_lock.
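
Roughly, the fault-time check described above might look like the below
(the helpers other than kvm_slot_is_private() are made up for
illustration, they don't exist in this series):

	if (kvm_gpa_is_private(vcpu->kvm, fault->gfn)) {
		if (!fault->slot || !kvm_slot_is_private(fault->slot))
			return kvm_memory_fault_exit(vcpu, fault);
		fault->is_private = true;
	}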

> > I don't think we'd want to use a bitmap, e.g. for a well-behaved guest, XArray
> > should be far more efficient.
> 
> What about the mis-behaved guest? I don't want to design for the worst
> case, but people may raise concern on the attack from such guest.

That's why cgroups exist.  E.g. a malicious/broken L1 can similarly abuse nested
EPT/NPT to generate a large number of shadow page tables.

> > One benefit to explicitly tracking this in KVM is that it might be useful for
> > software-only protected VMs, e.g. KVM could mark a region in the XArray as "pending"
> > based on guest hypercalls to share/unshare memory, and then complete the transaction
> > when userspace invokes the ioctl() to complete the share/unshare.
> 
> OK, then this can be another field of states/flags/attributes. Let me
> dig up certain level of details:
> 
> First, introduce below KVM ioctl
> 
> KVM_SET_MEMORY_ATTR

Actually, if the semantics are that userspace declares memory as private, then we
can reuse KVM_MEMORY_ENCRYPT_REG_REGION and KVM_MEMORY_ENCRYPT_UNREG_REGION.  It'd
be a little gross because we'd need to slightly redefine the semantics for TDX, SNP,
and software-protected VM types, e.g. the ioctls() currently require a pre-existing
memslot.  But I think it'd work...
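
i.e. reusing the existing uAPI struct, roughly as below; how addr would be
interpreted for these VM types is exactly the part that would need
redefining:

	struct kvm_enc_region region = {
		.addr = addr,	/* an hva today for SEV */
		.size = size,
	};

	ioctl(vm_fd, KVM_MEMORY_ENCRYPT_REG_REGION, &region);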

I'll think more on this...

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory
  2022-05-23 15:22         ` Sean Christopherson
@ 2022-05-30 13:26           ` Chao Peng
  2022-06-10 16:14             ` Sean Christopherson
  0 siblings, 1 reply; 58+ messages in thread
From: Chao Peng @ 2022-05-30 13:26 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Andy Lutomirski, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko

On Mon, May 23, 2022 at 03:22:32PM +0000, Sean Christopherson wrote:
> On Mon, May 23, 2022, Chao Peng wrote:
> > On Fri, May 20, 2022 at 06:31:02PM +0000, Sean Christopherson wrote:
> > > On Fri, May 20, 2022, Andy Lutomirski wrote:
> > > > The alternative would be to have some kind of separate table or bitmap (part
> > > > of the memslot?) that tells KVM whether a GPA should map to the fd.
> > > > 
> > > > What do you all think?
> > > 
> > > My original proposal was to have expolicit shared vs. private memslots, and punch
> > > holes in KVM's memslots on conversion, but due to the way KVM (and userspace)
> > > handle memslot updates, conversions would be painfully slow.  That's how we ended
> > > up with the current propsoal.
> > > 
> > > But a dedicated KVM ioctl() to add/remove shared ranges would be easy to implement
> > > and wouldn't necessarily even need to interact with the memslots.  It could be a
> > > consumer of memslots, e.g. if we wanted to disallow registering regions without an
> > > associated memslot, but I think we'd want to avoid even that because things will
> > > get messy during memslot updates, e.g. if dirty logging is toggled or a shared
> > > memory region is temporarily removed then we wouldn't want to destroy the tracking.
> > 
> > Even we don't tight that to memslots, that info can only be effective
> > for private memslot, right? Setting this ioctl to memory ranges defined
> > in a traditional non-private memslots just makes no sense, I guess we can
> > comment that in the API document.
> 
> Hrm, applying it universally would be funky, e.g. emulated MMIO would need to be
> declared "shared".  But, applying it selectively would arguably be worse, e.g.
> letting userspace map memory into the guest as shared for a region that's registered
> as private...
> 
> On option to that mess would be to make memory shared by default, and so userspace
> must declare regions that are private.  Then there's no weirdness with emulated MMIO
> or "legacy" memslots.
> 
> On page fault, KVM does a lookup to see if the GPA is shared or private.  If the
> GPA is private, but there is no memslot or the memslot doesn't have a private fd,
> KVM exits to userspace.  If there's a memslot with a private fd, the shared/private
> flag is used to resolve the 
> 
> And to handle the ioctl(), KVM can use kvm_zap_gfn_range(), which will bump the
> notifier sequence, i.e. force the page fault to retry if the GPA may have been
> (un)registered between checking the type and acquiring mmu_lock.

Yeah, that makes sense.

> 
> > > I don't think we'd want to use a bitmap, e.g. for a well-behaved guest, XArray
> > > should be far more efficient.
> > 
> > What about the mis-behaved guest? I don't want to design for the worst
> > case, but people may raise concern on the attack from such guest.
> 
> That's why cgroups exist.  E.g. a malicious/broken L1 can similarly abuse nested
> EPT/NPT to generate a large number of shadow page tables.

I haven't seen that in KVM yet. Is there any plan/discussion to add it?

> 
> > > One benefit to explicitly tracking this in KVM is that it might be useful for
> > > software-only protected VMs, e.g. KVM could mark a region in the XArray as "pending"
> > > based on guest hypercalls to share/unshare memory, and then complete the transaction
> > > when userspace invokes the ioctl() to complete the share/unshare.
> > 
> > OK, then this can be another field of states/flags/attributes. Let me
> > dig up certain level of details:
> > 
> > First, introduce below KVM ioctl
> > 
> > KVM_SET_MEMORY_ATTR
> 
> Actually, if the semantics are that userspace declares memory as private, then we
> can reuse KVM_MEMORY_ENCRYPT_REG_REGION and KVM_MEMORY_ENCRYPT_UNREG_REGION.  It'd
> be a little gross because we'd need to slightly redefine the semantics for TDX, SNP,
> and software-protected VM types, e.g. the ioctls() currently require a pre-exisitng
> memslot.  But I think it'd work...

These existing ioctls look good for TDX and probably SNP as well. For
software-protected VM types, they may not be enough. Maybe for the first
step we can reuse them for all hardware-based solutions and invent a new
interface when a software-protected solution is actually supported.

There is a semantics difference for fd-based private memory. The above two
ioctls() currently use a userspace address (hva), while for fd-based memory
it should be fd+offset, and it's probably better to use the gpa in this
case. Then we would need to change the existing semantics and break
backward compatibility.

Chao

> 
> I'll think more on this...

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 3/8] mm/memfd: Introduce MFD_INACCESSIBLE flag
  2022-05-19 15:37 ` [PATCH v6 3/8] mm/memfd: Introduce MFD_INACCESSIBLE flag Chao Peng
@ 2022-05-31 19:15   ` Vishal Annapurve
  2022-06-01 10:17     ` Chao Peng
  0 siblings, 1 reply; 58+ messages in thread
From: Vishal Annapurve @ 2022-05-31 19:15 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Yu Zhang, Kirill A . Shutemov, Andy Lutomirski,
	Jun Nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko

On Thu, May 19, 2022 at 8:41 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> Introduce a new memfd_create() flag indicating the content of the
> created memfd is inaccessible from userspace through ordinary MMU
> access (e.g., read/write/mmap). However, the file content can be
> accessed via a different mechanism (e.g. KVM MMU) indirectly.
>

SEV, TDX, pKVM and software-only VMs seem to have use cases to set up
initial guest boot memory with the needed blobs.
TDX already supports a KVM IOCTL to transfer contents to private
memory using the TDX module, but the rest of the implementations will
need to invent a way to do this.

Is there a plan to support a common implementation for either allowing
initial write access from userspace to private fd or adding a KVM
IOCTL to transfer contents to such a file,
as part of this series through future revisions?

Regards,
Vishal

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 3/8] mm/memfd: Introduce MFD_INACCESSIBLE flag
  2022-05-31 19:15   ` Vishal Annapurve
@ 2022-06-01 10:17     ` Chao Peng
  2022-06-01 12:11       ` Gupta, Pankaj
  0 siblings, 1 reply; 58+ messages in thread
From: Chao Peng @ 2022-06-01 10:17 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Yu Zhang, Kirill A . Shutemov, Andy Lutomirski,
	Jun Nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko

On Tue, May 31, 2022 at 12:15:00PM -0700, Vishal Annapurve wrote:
> On Thu, May 19, 2022 at 8:41 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > Introduce a new memfd_create() flag indicating the content of the
> > created memfd is inaccessible from userspace through ordinary MMU
> > access (e.g., read/write/mmap). However, the file content can be
> > accessed via a different mechanism (e.g. KVM MMU) indirectly.
> >
> 
> SEV, TDX, pkvm and software-only VMs seem to have usecases to set up
> initial guest boot memory with the needed blobs.
> TDX already supports a KVM IOCTL to transfer contents to private
> memory using the TDX module but rest of the implementations will need
> to invent
> a way to do this.

There is already some discussion in https://lkml.org/lkml/2022/5/9/1292.
I somewhat agree with Sean. TDX is using a dedicated ioctl to copy guest
boot memory into the private fd, so the rest can do that similarly. The
concern is the performance (an extra memcpy), but it's trivial since the
initial guest payload is usually optimized in size.

> 
> Is there a plan to support a common implementation for either allowing
> initial write access from userspace to private fd or adding a KVM
> IOCTL to transfer contents to such a file,
> as part of this series through future revisions?

Indeed, adding pre-boot private memory population to the current design
isn't impossible, but there are still some open questions, e.g. how to
expose the private fd to userspace for access; pKVM and CC usages may have
different requirements. Until that's well studied I would tend not to add
it and instead use an ioctl to copy. Whether we need a generic ioctl or a
feature-specific ioctl, I don't have a strong opinion here. Current TDX
uses a feature-specific ioctl, so it's not covered in this series.

Chao
> 
> Regards,
> Vishal

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 3/8] mm/memfd: Introduce MFD_INACCESSIBLE flag
  2022-06-01 10:17     ` Chao Peng
@ 2022-06-01 12:11       ` Gupta, Pankaj
  2022-06-02 10:07         ` Chao Peng
  0 siblings, 1 reply; 58+ messages in thread
From: Gupta, Pankaj @ 2022-06-01 12:11 UTC (permalink / raw)
  To: Chao Peng, Vishal Annapurve
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Yu Zhang, Kirill A . Shutemov, Andy Lutomirski,
	Jun Nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko


>>> Introduce a new memfd_create() flag indicating the content of the
>>> created memfd is inaccessible from userspace through ordinary MMU
>>> access (e.g., read/write/mmap). However, the file content can be
>>> accessed via a different mechanism (e.g. KVM MMU) indirectly.
>>>
>>
>> SEV, TDX, pkvm and software-only VMs seem to have usecases to set up
>> initial guest boot memory with the needed blobs.
>> TDX already supports a KVM IOCTL to transfer contents to private
>> memory using the TDX module but rest of the implementations will need
>> to invent
>> a way to do this.
> 
> There are some discussions in https://lkml.org/lkml/2022/5/9/1292
> already. I somehow agree with Sean. TDX is using an dedicated ioctl to
> copy guest boot memory to private fd so the rest can do that similarly.
> The concern is the performance (extra memcpy) but it's trivial since the
> initial guest payload is usually optimized in size.
> 
>>
>> Is there a plan to support a common implementation for either allowing
>> initial write access from userspace to private fd or adding a KVM
>> IOCTL to transfer contents to such a file,
>> as part of this series through future revisions?
> 
> Indeed, adding pre-boot private memory populating on current design
> isn't impossible, but there are still some opens, e.g. how to expose
> private fd to userspace for access, pKVM and CC usages may have
> different requirements. Before that's well-studied I would tend to not
> add that and instead use an ioctl to copy. Whether we need a generic
> ioctl or feature-specific ioctl, I don't have strong opinion here.
> Current TDX uses a feature-specific ioctl so it's not covered in this
> series.

A common function or ioctl to populate pre-boot private memory actually
makes sense.

Sorry, I have not followed much of the TDX code yet. Is it possible to
factor out the current TDX-specific ioctl into a common function so that
it can be used by other technologies?

Thanks,
Pankaj


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 3/8] mm/memfd: Introduce MFD_INACCESSIBLE flag
  2022-06-01 12:11       ` Gupta, Pankaj
@ 2022-06-02 10:07         ` Chao Peng
  2022-06-14 20:23           ` Sean Christopherson
  0 siblings, 1 reply; 58+ messages in thread
From: Chao Peng @ 2022-06-02 10:07 UTC (permalink / raw)
  To: Gupta, Pankaj
  Cc: Vishal Annapurve, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Yu Zhang, Kirill A . Shutemov, Andy Lutomirski,
	Jun Nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko

On Wed, Jun 01, 2022 at 02:11:42PM +0200, Gupta, Pankaj wrote:
> 
> > > > Introduce a new memfd_create() flag indicating the content of the
> > > > created memfd is inaccessible from userspace through ordinary MMU
> > > > access (e.g., read/write/mmap). However, the file content can be
> > > > accessed via a different mechanism (e.g. KVM MMU) indirectly.
> > > > 
> > > 
> > > SEV, TDX, pkvm and software-only VMs seem to have usecases to set up
> > > initial guest boot memory with the needed blobs.
> > > TDX already supports a KVM IOCTL to transfer contents to private
> > > memory using the TDX module but rest of the implementations will need
> > > to invent
> > > a way to do this.
> > 
> > There are some discussions in https://lkml.org/lkml/2022/5/9/1292
> > already. I somehow agree with Sean. TDX is using an dedicated ioctl to
> > copy guest boot memory to private fd so the rest can do that similarly.
> > The concern is the performance (extra memcpy) but it's trivial since the
> > initial guest payload is usually optimized in size.
> > 
> > > 
> > > Is there a plan to support a common implementation for either allowing
> > > initial write access from userspace to private fd or adding a KVM
> > > IOCTL to transfer contents to such a file,
> > > as part of this series through future revisions?
> > 
> > Indeed, adding pre-boot private memory populating on current design
> > isn't impossible, but there are still some opens, e.g. how to expose
> > private fd to userspace for access, pKVM and CC usages may have
> > different requirements. Before that's well-studied I would tend to not
> > add that and instead use an ioctl to copy. Whether we need a generic
> > ioctl or feature-specific ioctl, I don't have strong opinion here.
> > Current TDX uses a feature-specific ioctl so it's not covered in this
> > series.
> 
> Common function or ioctl to populate preboot private memory actually makes
> sense.
> 
> Sorry, did not follow much of TDX code yet, Is it possible to filter out
> the current TDX specific ioctl to common function so that it can be used by
> other technologies?

TDX code is here:
https://patchwork.kernel.org/project/kvm/patch/70ed041fd47c1f7571aa259450b3f9244edda48d.1651774250.git.isaku.yamahata@intel.com/

AFAICS it might be possible to factor that out into a common function, but
I would like to hear Paolo's/Sean's opinion.

Chao
> 
> Thanks,
> Pankaj

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-05-19 15:37 [PATCH v6 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
                   ` (7 preceding siblings ...)
  2022-05-19 15:37 ` [PATCH v6 8/8] memfd_create.2: Describe MFD_INACCESSIBLE flag Chao Peng
@ 2022-06-06 20:09 ` Vishal Annapurve
  2022-06-07  6:57   ` Chao Peng
  8 siblings, 1 reply; 58+ messages in thread
From: Vishal Annapurve @ 2022-06-06 20:09 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Yu Zhang, Kirill A . Shutemov, Andy Lutomirski,
	Jun Nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko

>
> Private memory map/unmap and conversion
> ---------------------------------------
> Userspace's map/unmap operations are done by fallocate() ioctl on the
> backing store fd.
>   - map: default fallocate() with mode=0.
>   - unmap: fallocate() with FALLOC_FL_PUNCH_HOLE.
> The map/unmap will trigger above memfile_notifier_ops to let KVM map/unmap
> secondary MMU page tables.
>
....
>    QEMU: https://github.com/chao-p/qemu/tree/privmem-v6
>
> An example QEMU command line for TDX test:
> -object tdx-guest,id=tdx \
> -object memory-backend-memfd-private,id=ram1,size=2G \
> -machine q35,kvm-type=tdx,pic=no,kernel_irqchip=split,memory-encryption=tdx,memory-backend=ram1
>

There should be more discussion around double allocation scenarios
when using the private fd approach. A malicious guest or a buggy
userspace VMM can cause physical memory to be allocated for both the
shared (host-accessible) and private fds backing the guest memory.
The userspace VMM will need to unback the shared guest memory while
handling the conversion from shared to private in order to prevent
double allocation even with malicious guests or bugs in the userspace VMM.

Options to unback shared guest memory seem to be:
1) madvise(.., MADV_DONTNEED/MADV_REMOVE) - This option won't stop the
kernel from backing the shared memory on subsequent write accesses
2) fallocate(..., FALLOC_FL_PUNCH_HOLE...) - For file-backed shared
guest memory, this option is still similar to madvise since this would
still allow shared memory to get backed on write accesses
3) munmap - This would give away the contiguous virtual memory region
reservation with holes in the guest backing memory, which might make
guest memory management difficult.
4) mprotect(... PROT_NONE) - This would keep the virtual memory
address range backing the guest memory preserved
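
For example, options 1) and 2) would be roughly the below (the fd, hva,
offset and size names are only illustrative):

	/* Anonymous/VMA-backed shared memory. */
	madvise(shared_hva, size, MADV_DONTNEED);

	/* File-backed (e.g. memfd) shared memory. */
	fallocate(shared_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		  offset, size);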

ram_block_discard_range_fd from the reference implementation
(https://github.com/chao-p/qemu/tree/privmem-v6) seems to rely on
fallocate/madvise.

Any thoughts/suggestions around better ways to unback the shared
memory in order to avoid double allocation scenarios?

Regards,
Vishal

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-06-06 20:09 ` [PATCH v6 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory Vishal Annapurve
@ 2022-06-07  6:57   ` Chao Peng
  2022-06-08  0:55     ` Marc Orr
  0 siblings, 1 reply; 58+ messages in thread
From: Chao Peng @ 2022-06-07  6:57 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Yu Zhang, Kirill A . Shutemov, Andy Lutomirski,
	Jun Nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko

On Mon, Jun 06, 2022 at 01:09:50PM -0700, Vishal Annapurve wrote:
> >
> > Private memory map/unmap and conversion
> > ---------------------------------------
> > Userspace's map/unmap operations are done by fallocate() ioctl on the
> > backing store fd.
> >   - map: default fallocate() with mode=0.
> >   - unmap: fallocate() with FALLOC_FL_PUNCH_HOLE.
> > The map/unmap will trigger above memfile_notifier_ops to let KVM map/unmap
> > secondary MMU page tables.
> >
> ....
> >    QEMU: https://github.com/chao-p/qemu/tree/privmem-v6
> >
> > An example QEMU command line for TDX test:
> > -object tdx-guest,id=tdx \
> > -object memory-backend-memfd-private,id=ram1,size=2G \
> > -machine q35,kvm-type=tdx,pic=no,kernel_irqchip=split,memory-encryption=tdx,memory-backend=ram1
> >
> 
> There should be more discussion around double allocation scenarios
> when using the private fd approach. A malicious guest or buggy
> userspace VMM can cause physical memory getting allocated for both
> shared (memory accessible from host) and private fds backing the guest
> memory.
> Userspace VMM will need to unback the shared guest memory while
> handling the conversion from shared to private in order to prevent
> double allocation even with malicious guests or bugs in userspace VMM.

I don't see how a malicious guest can cause that. The initial design of
this series is to put the private/shared memory into two different
address spaces and give the userspace VMM the flexibility to convert
between the two. It can choose to respect the guest conversion request
or not.

It's possible for a userspace VMM to cause double allocation if it fails
to call the unback operation during the conversion; that may or may not
be a bug. Double allocation is not necessarily wrong, even conceptually.
At least TDX allows a guest to use half shared and half private memory,
meaning both the shared and private sides can be effective at the same
time. Unbacking the memory is just the current QEMU implementation
choice.
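
For reference, a minimal sketch of what that choice looks like on the
userspace side during a shared->private conversion. The helper name and
error handling are made up; only the two fallocate() calls mirror the
map/unmap description in the cover letter:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

static int convert_to_private(int shared_fd, int private_fd,
			      off_t offset, size_t len)
{
	/* Unback the shared side so the range is not double allocated. */
	if (fallocate(shared_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      offset, len))
		return -1;

	/*
	 * Allocate the private side; per the cover letter this triggers the
	 * memfile_notifier so KVM can map the private pages.
	 */
	return fallocate(private_fd, 0, offset, len);
}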

Chao
> 
> Options to unback shared guest memory seem to be:
> 1) madvise(.., MADV_DONTNEED/MADV_REMOVE) - This option won't stop
> kernel from backing the shared memory on subsequent write accesses
> 2) fallocate(..., FALLOC_FL_PUNCH_HOLE...) - For file backed shared
> guest memory, this option still is similar to madvice since this would
> still allow shared memory to get backed on write accesses
> 3) munmap - This would give away the contiguous virtual memory region
> reservation with holes in the guest backing memory, which might make
> guest memory management difficult.
> 4) mprotect(... PROT_NONE) - This would keep the virtual memory
> address range backing the guest memory preserved
> 
> ram_block_discard_range_fd from reference implementation:
> https://github.com/chao-p/qemu/tree/privmem-v6 seems to be relying on
> fallocate/madvise.
> 
> Any thoughts/suggestions around better ways to unback the shared
> memory in order to avoid double allocation scenarios?
> 
> Regards,
> Vishal

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-06-07  6:57   ` Chao Peng
@ 2022-06-08  0:55     ` Marc Orr
  2022-06-08  2:18       ` Chao Peng
  0 siblings, 1 reply; 58+ messages in thread
From: Marc Orr @ 2022-06-08  0:55 UTC (permalink / raw)
  To: Chao Peng
  Cc: Vishal Annapurve, kvm list, LKML, linux-mm, linux-fsdevel,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Yu Zhang, Kirill A . Shutemov, Andy Lutomirski,
	Jun Nakajima, Dave Hansen, Andi Kleen, David Hildenbrand,
	aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko

On Tue, Jun 7, 2022 at 12:01 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> On Mon, Jun 06, 2022 at 01:09:50PM -0700, Vishal Annapurve wrote:
> > >
> > > Private memory map/unmap and conversion
> > > ---------------------------------------
> > > Userspace's map/unmap operations are done by fallocate() ioctl on the
> > > backing store fd.
> > >   - map: default fallocate() with mode=0.
> > >   - unmap: fallocate() with FALLOC_FL_PUNCH_HOLE.
> > > The map/unmap will trigger above memfile_notifier_ops to let KVM map/unmap
> > > secondary MMU page tables.
> > >
> > ....
> > >    QEMU: https://github.com/chao-p/qemu/tree/privmem-v6
> > >
> > > An example QEMU command line for TDX test:
> > > -object tdx-guest,id=tdx \
> > > -object memory-backend-memfd-private,id=ram1,size=2G \
> > > -machine q35,kvm-type=tdx,pic=no,kernel_irqchip=split,memory-encryption=tdx,memory-backend=ram1
> > >
> >
> > There should be more discussion around double allocation scenarios
> > when using the private fd approach. A malicious guest or buggy
> > userspace VMM can cause physical memory getting allocated for both
> > shared (memory accessible from host) and private fds backing the guest
> > memory.
> > Userspace VMM will need to unback the shared guest memory while
> > handling the conversion from shared to private in order to prevent
> > double allocation even with malicious guests or bugs in userspace VMM.
>
> I don't know how malicious guest can cause that. The initial design of
> this serie is to put the private/shared memory into two different
> address spaces and gives usersapce VMM the flexibility to convert
> between the two. It can choose respect the guest conversion request or
> not.

For example, the guest could maliciously give a device driver a
private page so that a host-side virtual device will blindly write to the
private page.

> It's possible for a usrspace VMM to cause double allocation if it fails
> to call the unback operation during the conversion, this may be a bug
> or not. Double allocation may not be a wrong thing, even in conception.
> At least TDX allows you to use half shared half private in guest, means
> both shared/private can be effective. Unbacking the memory is just the
> current QEMU implementation choice.

Right. But the idea is that this patch series should accommodate all
of the CVM architectures. Or at least that's what I know was
envisioned last time we discussed this topic for SNP [*].

Regardless, it's important to ensure that the VM respects its memory
budget. For example, within Google, we run VMs inside of containers.
So if we double allocate we're going to OOM. This seems acceptable for
an early version of CVMs. But ultimately, I think we need a more
robust way to ensure that the VM operates within its memory container.
Otherwise, the OOM is going to be hard to diagnose and distinguish
from a real OOM.

[*] https://lore.kernel.org/all/20210820155918.7518-1-brijesh.singh@amd.com/

>
> Chao
> >
> > Options to unback shared guest memory seem to be:
> > 1) madvise(.., MADV_DONTNEED/MADV_REMOVE) - This option won't stop
> > kernel from backing the shared memory on subsequent write accesses
> > 2) fallocate(..., FALLOC_FL_PUNCH_HOLE...) - For file backed shared
> > guest memory, this option still is similar to madvice since this would
> > still allow shared memory to get backed on write accesses
> > 3) munmap - This would give away the contiguous virtual memory region
> > reservation with holes in the guest backing memory, which might make
> > guest memory management difficult.
> > 4) mprotect(... PROT_NONE) - This would keep the virtual memory
> > address range backing the guest memory preserved
> >
> > ram_block_discard_range_fd from reference implementation:
> > https://github.com/chao-p/qemu/tree/privmem-v6 seems to be relying on
> > fallocate/madvise.
> >
> > Any thoughts/suggestions around better ways to unback the shared
> > memory in order to avoid double allocation scenarios?

I agree with Vishal. I think this patch set is making great progress.
But the double allocation scenario seems like a high-level design
issue that warrants more discussion.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-06-08  0:55     ` Marc Orr
@ 2022-06-08  2:18       ` Chao Peng
  2022-06-08 19:37         ` Vishal Annapurve
  2022-06-10  0:11         ` Marc Orr
  0 siblings, 2 replies; 58+ messages in thread
From: Chao Peng @ 2022-06-08  2:18 UTC (permalink / raw)
  To: Marc Orr
  Cc: Vishal Annapurve, kvm list, LKML, linux-mm, linux-fsdevel,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Yu Zhang, Kirill A . Shutemov, Andy Lutomirski,
	Jun Nakajima, Dave Hansen, Andi Kleen, David Hildenbrand,
	aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko

On Tue, Jun 07, 2022 at 05:55:46PM -0700, Marc Orr wrote:
> On Tue, Jun 7, 2022 at 12:01 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > On Mon, Jun 06, 2022 at 01:09:50PM -0700, Vishal Annapurve wrote:
> > > >
> > > > Private memory map/unmap and conversion
> > > > ---------------------------------------
> > > > Userspace's map/unmap operations are done by fallocate() ioctl on the
> > > > backing store fd.
> > > >   - map: default fallocate() with mode=0.
> > > >   - unmap: fallocate() with FALLOC_FL_PUNCH_HOLE.
> > > > The map/unmap will trigger above memfile_notifier_ops to let KVM map/unmap
> > > > secondary MMU page tables.
> > > >
> > > ....
> > > >    QEMU: https://github.com/chao-p/qemu/tree/privmem-v6
> > > >
> > > > An example QEMU command line for TDX test:
> > > > -object tdx-guest,id=tdx \
> > > > -object memory-backend-memfd-private,id=ram1,size=2G \
> > > > -machine q35,kvm-type=tdx,pic=no,kernel_irqchip=split,memory-encryption=tdx,memory-backend=ram1
> > > >
> > >
> > > There should be more discussion around double allocation scenarios
> > > when using the private fd approach. A malicious guest or buggy
> > > userspace VMM can cause physical memory getting allocated for both
> > > shared (memory accessible from host) and private fds backing the guest
> > > memory.
> > > Userspace VMM will need to unback the shared guest memory while
> > > handling the conversion from shared to private in order to prevent
> > > double allocation even with malicious guests or bugs in userspace VMM.
> >
> > I don't know how malicious guest can cause that. The initial design of
> > this serie is to put the private/shared memory into two different
> > address spaces and gives usersapce VMM the flexibility to convert
> > between the two. It can choose respect the guest conversion request or
> > not.
> 
> For example, the guest could maliciously give a device driver a
> private page so that a host-side virtual device will blindly write the
> private page.

With this patch series, it's actually not even possible for the userspace
VMM to allocate a private page by a direct write; it's basically unmapped
from there. If it really wants to, it has to do something special, by
intention; that's basically the conversion, which we should allow.

> 
> > It's possible for a usrspace VMM to cause double allocation if it fails
> > to call the unback operation during the conversion, this may be a bug
> > or not. Double allocation may not be a wrong thing, even in conception.
> > At least TDX allows you to use half shared half private in guest, means
> > both shared/private can be effective. Unbacking the memory is just the
> > current QEMU implementation choice.
> 
> Right. But the idea is that this patch series should accommodate all
> of the CVM architectures. Or at least that's what I know was
> envisioned last time we discussed this topic for SNP [*].

AFAICS, this series should work for both TDX and SNP, as well as other
CVM architectures. I don't see where TDX can work but SNP cannot, or
have I missed something here?

> 
> Regardless, it's important to ensure that the VM respects its memory
> budget. For example, within Google, we run VMs inside of containers.
> So if we double allocate we're going to OOM. This seems acceptable for
> an early version of CVMs. But ultimately, I think we need a more
> robust way to ensure that the VM operates within its memory container.
> Otherwise, the OOM is going to be hard to diagnose and distinguish
> from a real OOM.

Thanks for bringing this up. But in my mind I still think the userspace
VMM can do this, and it's its responsibility to guarantee it, if that is
a hard requirement. By design, the userspace VMM is the decision-maker
for page conversion and has all the necessary information to know which
page is shared/private. It also has the necessary knobs to allocate/free
the physical pages for guest memory. Definitely, we should make the
userspace VMM more robust.

Chao
> 
> [*] https://lore.kernel.org/all/20210820155918.7518-1-brijesh.singh@amd.com/
> 
> >
> > Chao
> > >
> > > Options to unback shared guest memory seem to be:
> > > 1) madvise(.., MADV_DONTNEED/MADV_REMOVE) - This option won't stop
> > > kernel from backing the shared memory on subsequent write accesses
> > > 2) fallocate(..., FALLOC_FL_PUNCH_HOLE...) - For file backed shared
> > > guest memory, this option still is similar to madvice since this would
> > > still allow shared memory to get backed on write accesses
> > > 3) munmap - This would give away the contiguous virtual memory region
> > > reservation with holes in the guest backing memory, which might make
> > > guest memory management difficult.
> > > 4) mprotect(... PROT_NONE) - This would keep the virtual memory
> > > address range backing the guest memory preserved
> > >
> > > ram_block_discard_range_fd from reference implementation:
> > > https://github.com/chao-p/qemu/tree/privmem-v6 seems to be relying on
> > > fallocate/madvise.
> > >
> > > Any thoughts/suggestions around better ways to unback the shared
> > > memory in order to avoid double allocation scenarios?
> 
> I agree with Vishal. I think this patch set is making great progress.
> But the double allocation scenario seems like a high-level design
> issue that warrants more discussion.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-06-08  2:18       ` Chao Peng
@ 2022-06-08 19:37         ` Vishal Annapurve
  2022-06-09 20:29           ` Sean Christopherson
  2022-06-10  0:11         ` Marc Orr
  1 sibling, 1 reply; 58+ messages in thread
From: Vishal Annapurve @ 2022-06-08 19:37 UTC (permalink / raw)
  To: Chao Peng
  Cc: Marc Orr, kvm list, LKML, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Yu Zhang, Kirill A . Shutemov, Andy Lutomirski,
	Jun Nakajima, Dave Hansen, Andi Kleen, David Hildenbrand,
	aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko

...
> With this patch series, it's actually even not possible for userspace VMM
> to allocate private page by a direct write, it's basically unmapped from
> there. If it really wants to, it should so something special, by intention,
> that's basically the conversion, which we should allow.
>

A VM can pass a GPA backed by private pages to the userspace VMM, and
when the userspace VMM accesses the backing hva, pages will be allocated
to back the shared fd, causing two sets of pages to back the same guest
memory range.

> Thanks for bringing this up. But in my mind I still think userspace VMM
> can do and it's its responsibility to guarantee that, if that is hard
> required. By design, userspace VMM is the decision-maker for page
> conversion and has all the necessary information to know which page is
> shared/private. It also has the necessary knobs to allocate/free the
> physical pages for guest memory. Definitely, we should make userspace
> VMM more robust.

Making the userspace VMM robust enough to avoid double allocation can
get complex: it will have to keep track of all shared fd memory in use
by the userspace VMM to disallow conversion from shared to private, and
it will have to ensure that all guest-supplied addresses belong to
shared GPA ranges.
A coarser but simpler alternative could be to always allow shared-to-
private conversion, unbacking the memory from the shared fd, and exit
if the VMM runs into double allocation scenarios. In either case,
unbacking the shared fd memory should ideally prevent memory allocation
on subsequent write accesses so that double allocation scenarios are
caught early.

Regards,
Vishal

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-06-08 19:37         ` Vishal Annapurve
@ 2022-06-09 20:29           ` Sean Christopherson
  2022-06-14  7:28             ` Chao Peng
  0 siblings, 1 reply; 58+ messages in thread
From: Sean Christopherson @ 2022-06-09 20:29 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Chao Peng, Marc Orr, kvm list, LKML, linux-mm, linux-fsdevel,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Yu Zhang, Kirill A . Shutemov, Andy Lutomirski,
	Jun Nakajima, Dave Hansen, Andi Kleen, David Hildenbrand,
	aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko

On Wed, Jun 08, 2022, Vishal Annapurve wrote:
> ...
> > With this patch series, it's actually even not possible for userspace VMM
> > to allocate private page by a direct write, it's basically unmapped from
> > there. If it really wants to, it should so something special, by intention,
> > that's basically the conversion, which we should allow.
> >
> 
> A VM can pass GPA backed by private pages to userspace VMM and when
> Userspace VMM accesses the backing hva there will be pages allocated
> to back the shared fd causing 2 sets of pages backing the same guest
> memory range.
> 
> > Thanks for bringing this up. But in my mind I still think userspace VMM
> > can do and it's its responsibility to guarantee that, if that is hard
> > required.

That was my initial reaction too, but there are unfortunate side effects to punting
this to userspace. 

> By design, userspace VMM is the decision-maker for page
> > conversion and has all the necessary information to know which page is
> > shared/private. It also has the necessary knobs to allocate/free the
> > physical pages for guest memory. Definitely, we should make userspace
> > VMM more robust.
> 
> Making Userspace VMM more robust to avoid double allocation can get
> complex, it will have to keep track of all in-use (by Userspace VMM)
> shared fd memory to disallow conversion from shared to private and
> will have to ensure that all guest supplied addresses belong to shared
> GPA ranges.

IMO, the complexity argument isn't sufficient justification for introducing new
kernel functionality.  If multiple processes are accessing guest memory then there
already needs to be some amount of coordination, i.e. it can't be _that_ complex.

My concern with forcing userspace to fully handle unmapping shared memory is that
it may lead to additional performance overhead and/or noisy neighbor issues, even
if all guests are well-behaved.

Unmapping arbitrary ranges will fragment the virtual address space and consume
more memory for all the resulting VMAs.  The extra memory consumption isn't that big
of a deal, and it will be self-healing to some extent as VMAs will get merged when
the holes are filled back in (if the guest converts back to shared), but it's still
less than desirable.

More concerning is having to take mmap_lock for write for every conversion, which
is very problematic for configurations where a single userspace process maps memory
belonging to multiple VMs.  Unmapping and remapping on every conversion will create a
bottleneck, especially if a VM has sub-optimal behavior and is converting pages at
a high rate.

One argument is that userspace can simply rely on cgroups to detect misbehaving
guests, but (a) those types of OOMs will be a nightmare to debug and (b) an OOM
kill from the host is typically considered a _host_ issue and will be treated as
a missed SLO.

An idea for handling this in the kernel without too much complexity would be to
add F_SEAL_FAULT_ALLOCATIONS (terrible name) that would prevent page faults from
allocating pages, i.e. holes can only be filled by an explicit fallocate().  Minor
faults, e.g. due to NUMA balancing stupidity, and major faults due to swap would
still work, but writes to previously unreserved/unallocated memory would get a
SIGSEGV even though the memory is mapped.  That would allow the userspace VMM to prevent
unintentional allocations without having to coordinate unmapping/remapping across
multiple processes.
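
To make the idea concrete, a sketch of how a VMM might use such a seal.
The seal name and value are hypothetical -- nothing below exists in the
kernel today:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef F_SEAL_FAULT_ALLOCATIONS
#define F_SEAL_FAULT_ALLOCATIONS	0x0020	/* hypothetical seal value */
#endif

static int create_shared_backing(size_t size)
{
	int fd = memfd_create("guest-shared", MFD_ALLOW_SEALING);

	if (fd < 0 || ftruncate(fd, size) < 0)
		return -1;

	/*
	 * With the (hypothetical) seal applied, a write fault on a hole
	 * would SIGSEGV instead of silently allocating a page; holes could
	 * only be filled back in by an explicit fallocate() from the VMM.
	 */
	if (fcntl(fd, F_ADD_SEALS, F_SEAL_FAULT_ALLOCATIONS) < 0)
		return -1;

	return fd;
}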

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-06-08  2:18       ` Chao Peng
  2022-06-08 19:37         ` Vishal Annapurve
@ 2022-06-10  0:11         ` Marc Orr
  1 sibling, 0 replies; 58+ messages in thread
From: Marc Orr @ 2022-06-10  0:11 UTC (permalink / raw)
  To: Chao Peng
  Cc: Vishal Annapurve, kvm list, LKML, linux-mm, linux-fsdevel,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Yu Zhang, Kirill A . Shutemov, Andy Lutomirski,
	Jun Nakajima, Dave Hansen, Andi Kleen, David Hildenbrand,
	aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko

On Tue, Jun 7, 2022 at 7:22 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> On Tue, Jun 07, 2022 at 05:55:46PM -0700, Marc Orr wrote:
> > On Tue, Jun 7, 2022 at 12:01 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > >
> > > On Mon, Jun 06, 2022 at 01:09:50PM -0700, Vishal Annapurve wrote:
> > > > >
> > > > > Private memory map/unmap and conversion
> > > > > ---------------------------------------
> > > > > Userspace's map/unmap operations are done by fallocate() ioctl on the
> > > > > backing store fd.
> > > > >   - map: default fallocate() with mode=0.
> > > > >   - unmap: fallocate() with FALLOC_FL_PUNCH_HOLE.
> > > > > The map/unmap will trigger above memfile_notifier_ops to let KVM map/unmap
> > > > > secondary MMU page tables.
> > > > >
> > > > ....
> > > > >    QEMU: https://github.com/chao-p/qemu/tree/privmem-v6
> > > > >
> > > > > An example QEMU command line for TDX test:
> > > > > -object tdx-guest,id=tdx \
> > > > > -object memory-backend-memfd-private,id=ram1,size=2G \
> > > > > -machine q35,kvm-type=tdx,pic=no,kernel_irqchip=split,memory-encryption=tdx,memory-backend=ram1
> > > > >
> > > >
> > > > There should be more discussion around double allocation scenarios
> > > > when using the private fd approach. A malicious guest or buggy
> > > > userspace VMM can cause physical memory getting allocated for both
> > > > shared (memory accessible from host) and private fds backing the guest
> > > > memory.
> > > > Userspace VMM will need to unback the shared guest memory while
> > > > handling the conversion from shared to private in order to prevent
> > > > double allocation even with malicious guests or bugs in userspace VMM.
> > >
> > > I don't know how malicious guest can cause that. The initial design of
> > > this serie is to put the private/shared memory into two different
> > > address spaces and gives usersapce VMM the flexibility to convert
> > > between the two. It can choose respect the guest conversion request or
> > > not.
> >
> > For example, the guest could maliciously give a device driver a
> > private page so that a host-side virtual device will blindly write the
> > private page.
>
> With this patch series, it's actually even not possible for userspace VMM
> to allocate private page by a direct write, it's basically unmapped from
> there. If it really wants to, it should so something special, by intention,
> that's basically the conversion, which we should allow.

I think Vishal did a better job of explaining this scenario in his last
reply than I did.

> > > It's possible for a usrspace VMM to cause double allocation if it fails
> > > to call the unback operation during the conversion, this may be a bug
> > > or not. Double allocation may not be a wrong thing, even in conception.
> > > At least TDX allows you to use half shared half private in guest, means
> > > both shared/private can be effective. Unbacking the memory is just the
> > > current QEMU implementation choice.
> >
> > Right. But the idea is that this patch series should accommodate all
> > of the CVM architectures. Or at least that's what I know was
> > envisioned last time we discussed this topic for SNP [*].
>
> AFAICS, this series should work for both TDX and SNP, and other CVM
> architectures. I don't see where TDX can work but SNP cannot, or I
> missed something here?

Agreed. I was just responding to the "At least TDX..." bit. Sorry for
any confusion.

> >
> > Regardless, it's important to ensure that the VM respects its memory
> > budget. For example, within Google, we run VMs inside of containers.
> > So if we double allocate we're going to OOM. This seems acceptable for
> > an early version of CVMs. But ultimately, I think we need a more
> > robust way to ensure that the VM operates within its memory container.
> > Otherwise, the OOM is going to be hard to diagnose and distinguish
> > from a real OOM.
>
> Thanks for bringing this up. But in my mind I still think userspace VMM
> can do and it's its responsibility to guarantee that, if that is hard
> required. By design, userspace VMM is the decision-maker for page
> conversion and has all the necessary information to know which page is
> shared/private. It also has the necessary knobs to allocate/free the
> physical pages for guest memory. Definitely, we should make userspace
> VMM more robust.

Vishal and Sean did a better job of articulating the concern in their
most recent replies.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory
  2022-05-30 13:26           ` Chao Peng
@ 2022-06-10 16:14             ` Sean Christopherson
  2022-06-14  6:45               ` Chao Peng
  0 siblings, 1 reply; 58+ messages in thread
From: Sean Christopherson @ 2022-06-10 16:14 UTC (permalink / raw)
  To: Chao Peng
  Cc: Andy Lutomirski, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko

On Mon, May 30, 2022, Chao Peng wrote:
> On Mon, May 23, 2022 at 03:22:32PM +0000, Sean Christopherson wrote:
> > Actually, if the semantics are that userspace declares memory as private, then we
> > can reuse KVM_MEMORY_ENCRYPT_REG_REGION and KVM_MEMORY_ENCRYPT_UNREG_REGION.  It'd
> > be a little gross because we'd need to slightly redefine the semantics for TDX, SNP,
> > and software-protected VM types, e.g. the ioctls() currently require a pre-exisitng
> > memslot.  But I think it'd work...
> 
> These existing ioctls looks good for TDX and probably SNP as well. For
> softrware-protected VM types, it may not be enough. Maybe for the first
> step we can reuse this for all hardware based solutions and invent new
> interface when software-protected solution gets really supported.
> 
> There is semantics difference for fd-based private memory. Current above
> two ioctls() use userspace addreess(hva) while for fd-based it should be
> fd+offset, and probably it's better to use gpa in this case. Then we
> will need change existing semantics and break backward-compatibility.

My thought was to keep the existing semantics for VMs with type==0, i.e. SEV and
SEV-ES VMs.  It's a bit gross, but the pinning behavior is a dead end for SNP and
TDX, so it effectively needs to be deprecated anyways.  I'm definitely not opposed
to a new ioctl if Paolo or others think this is too awful, but burning an ioctl
for this seems wasteful.

Then generic KVM can do something like:

	case KVM_MEMORY_ENCRYPT_REG_REGION:
	case KVM_MEMORY_ENCRYPT_UNREG_REGION:
		struct kvm_enc_region region;

		if (!kvm_arch_vm_supports_private_memslots(kvm))
			goto arch_vm_ioctl;

		r = -EFAULT;
		if (copy_from_user(&region, argp, sizeof(region)))
			goto out;

		r = kvm_set_encrypted_region(ioctl, &region);
		break;
	default:
arch_vm_ioctl:
		r = kvm_arch_vm_ioctl(filp, ioctl, arg);


where common KVM provides

  __weak bool kvm_arch_vm_supports_private_memslots(struct kvm *kvm)
  {
	return false;
  }

and x86 overrides that to

  bool kvm_arch_vm_supports_private_memslots(struct kvm *kvm)
  {
  	/* I can't remember what we decided on calling type '0' VMs. */
	return !!kvm->vm_type;
  }

and if someone ever wants to enable private memslot for SEV/SEV-ES guests we can
always add a capability or even a new VM type.

pKVM on arm can then obviously implement kvm_arch_vm_supports_private_memslots()
to grab whatever identifies a pKVM VM.
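
For completeness, the userspace side might then look roughly like the
snippet below. struct kvm_enc_region and the two ioctls already exist in
the uapi; interpreting .addr as a GPA for fd-backed private memslots is
only the proposal above, not current behavior, and the helper name is
made up:

#include <stdbool.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int set_gpa_range_private(int vm_fd, __u64 gpa, __u64 size, bool priv)
{
	struct kvm_enc_region region = {
		.addr = gpa,	/* a GPA under the proposed semantics */
		.size = size,
	};

	return ioctl(vm_fd, priv ? KVM_MEMORY_ENCRYPT_REG_REGION :
				   KVM_MEMORY_ENCRYPT_UNREG_REGION, &region);
}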

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory
  2022-06-10 16:14             ` Sean Christopherson
@ 2022-06-14  6:45               ` Chao Peng
  0 siblings, 0 replies; 58+ messages in thread
From: Chao Peng @ 2022-06-14  6:45 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Andy Lutomirski, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko

On Fri, Jun 10, 2022 at 04:14:21PM +0000, Sean Christopherson wrote:
> On Mon, May 30, 2022, Chao Peng wrote:
> > On Mon, May 23, 2022 at 03:22:32PM +0000, Sean Christopherson wrote:
> > > Actually, if the semantics are that userspace declares memory as private, then we
> > > can reuse KVM_MEMORY_ENCRYPT_REG_REGION and KVM_MEMORY_ENCRYPT_UNREG_REGION.  It'd
> > > be a little gross because we'd need to slightly redefine the semantics for TDX, SNP,
> > > and software-protected VM types, e.g. the ioctls() currently require a pre-exisitng
> > > memslot.  But I think it'd work...
> > 
> > These existing ioctls looks good for TDX and probably SNP as well. For
> > softrware-protected VM types, it may not be enough. Maybe for the first
> > step we can reuse this for all hardware based solutions and invent new
> > interface when software-protected solution gets really supported.
> > 
> > There is semantics difference for fd-based private memory. Current above
> > two ioctls() use userspace addreess(hva) while for fd-based it should be
> > fd+offset, and probably it's better to use gpa in this case. Then we
> > will need change existing semantics and break backward-compatibility.
> 
> My thought was to keep the existing semantics for VMs with type==0, i.e. SEV and
> SEV-ES VMs.  It's a bit gross, but the pinning behavior is a dead end for SNP and
> TDX, so it effectively needs to be deprecated anyways. 

Yes agreed.

> I'm definitely not opposed
> to a new ioctl if Paolo or others think this is too awful, but burning an ioctl
> for this seems wasteful.

Yes, I'm also comfortable with reusing kvm_enc_region to pass a _gpa_
range for this new type, if that's acceptable.

> 
> Then generic KVM can do something like:
> 
> 	case KVM_MEMORY_ENCRYPT_REG_REGION:
> 	case KVM_MEMORY_ENCRYPT_UNREG_REGION:
> 		struct kvm_enc_region region;
> 
> 		if (!kvm_arch_vm_supports_private_memslots(kvm))
> 			goto arch_vm_ioctl;
> 
> 		r = -EFAULT;
> 		if (copy_from_user(&region, argp, sizeof(region)))
> 			goto out;
> 
> 		r = kvm_set_encrypted_region(ioctl, &region);
> 		break;
> 	default:
> arch_vm_ioctl:
> 		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
> 
> 
> where common KVM provides
> 
>   __weak void kvm_arch_vm_supports_private_memslots(struct kvm *kvm)
>   {
> 	return false;
>   }

I already introduced kvm_arch_private_mem_supported() in patch 07, so
that can be reused.

> 
> and x86 overrides that to
> 
>   bool kvm_arch_vm_supports_private_memslots(struct kvm *kvm)
>   {
>   	/* I can't remember what we decided on calling type '0' VMs. */
> 	return !!kvm->vm_type;
>   }
> 
> and if someone ever wants to enable private memslot for SEV/SEV-ES guests we can
> always add a capability or even a new VM type.
> 
> pKVM on arm can then obviously implement kvm_arch_vm_supports_private_memslots()
> to grab whatever identifies a pKVM VM.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-06-09 20:29           ` Sean Christopherson
@ 2022-06-14  7:28             ` Chao Peng
  2022-06-14 17:37               ` Andy Lutomirski
  0 siblings, 1 reply; 58+ messages in thread
From: Chao Peng @ 2022-06-14  7:28 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Vishal Annapurve, Marc Orr, kvm list, LKML, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Yu Zhang, Kirill A . Shutemov, Andy Lutomirski,
	Jun Nakajima, Dave Hansen, Andi Kleen, David Hildenbrand,
	aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko

On Thu, Jun 09, 2022 at 08:29:06PM +0000, Sean Christopherson wrote:
> On Wed, Jun 08, 2022, Vishal Annapurve wrote:
> > ...
> > > With this patch series, it's actually even not possible for userspace VMM
> > > to allocate private page by a direct write, it's basically unmapped from
> > > there. If it really wants to, it should so something special, by intention,
> > > that's basically the conversion, which we should allow.
> > >
> > 
> > A VM can pass GPA backed by private pages to userspace VMM and when
> > Userspace VMM accesses the backing hva there will be pages allocated
> > to back the shared fd causing 2 sets of pages backing the same guest
> > memory range.
> > 
> > > Thanks for bringing this up. But in my mind I still think userspace VMM
> > > can do and it's its responsibility to guarantee that, if that is hard
> > > required.
> 
> That was my initial reaction too, but there are unfortunate side effects to punting
> this to userspace. 
> 
> > By design, userspace VMM is the decision-maker for page
> > > conversion and has all the necessary information to know which page is
> > > shared/private. It also has the necessary knobs to allocate/free the
> > > physical pages for guest memory. Definitely, we should make userspace
> > > VMM more robust.
> > 
> > Making Userspace VMM more robust to avoid double allocation can get
> > complex, it will have to keep track of all in-use (by Userspace VMM)
> > shared fd memory to disallow conversion from shared to private and
> > will have to ensure that all guest supplied addresses belong to shared
> > GPA ranges.
> 
> IMO, the complexity argument isn't sufficient justfication for introducing new
> kernel functionality.  If multiple processes are accessing guest memory then there
> already needs to be some amount of coordination, i.e. it can't be _that_ complex.
> 
> My concern with forcing userspace to fully handle unmapping shared memory is that
> it may lead to additional performance overhead and/or noisy neighbor issues, even
> if all guests are well-behaved.
> 
> Unnmapping arbitrary ranges will fragment the virtual address space and consume
> more memory for all the result VMAs.  The extra memory consumption isn't that big
> of a deal, and it will be self-healing to some extent as VMAs will get merged when
> the holes are filled back in (if the guest converts back to shared), but it's still
> less than desirable.
> 
> More concerning is having to take mmap_lock for write for every conversion, which
> is very problematic for configurations where a single userspace process maps memory
> belong to multiple VMs.  Unmapping and remapping on every conversion will create a
> bottleneck, especially if a VM has sub-optimal behavior and is converting pages at
> a high rate.
> 
> One argument is that userspace can simply rely on cgroups to detect misbehaving
> guests, but (a) those types of OOMs will be a nightmare to debug and (b) an OOM
> kill from the host is typically considered a _host_ issue and will be treated as
> a missed SLO.
> 
> An idea for handling this in the kernel without too much complexity would be to
> add F_SEAL_FAULT_ALLOCATIONS (terrible name) that would prevent page faults from
> allocating pages, i.e. holes can only be filled by an explicit fallocate().  Minor
> faults, e.g. due to NUMA balancing stupidity, and major faults due to swap would
> still work, but writes to previously unreserved/unallocated memory would get a
> SIGSEGV on something it has mapped.  That would allow the userspace VMM to prevent
> unintentional allocations without having to coordinate unmapping/remapping across
> multiple processes.

Since this is mainly for shared memory and the motivation is catching
misbehaved accesses, can we use mprotect(PROT_NONE) for this? We can
mark those ranges backed by the private fd as PROT_NONE during the
conversion so that subsequent misbehaved accesses will be blocked
instead of silently causing double allocation.
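
A rough sketch of that idea, assuming the shared memory is file-backed
and mapped at shared_hva (the helper name and error handling are
illustrative only):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

static int convert_shared_to_private(int shared_fd, char *shared_hva,
				     off_t offset, size_t len)
{
	/* Unback the shared side... */
	if (fallocate(shared_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      offset, len))
		return -1;

	/*
	 * ...and revoke access so a stray access faults with SIGSEGV
	 * instead of silently re-allocating a page.
	 */
	return mprotect(shared_hva + offset, len, PROT_NONE);
}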

Chao

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-06-14  7:28             ` Chao Peng
@ 2022-06-14 17:37               ` Andy Lutomirski
  2022-06-14 19:08                 ` Sean Christopherson
  0 siblings, 1 reply; 58+ messages in thread
From: Andy Lutomirski @ 2022-06-14 17:37 UTC (permalink / raw)
  To: Chao Peng
  Cc: Sean Christopherson, Vishal Annapurve, Marc Orr, kvm list, LKML,
	linux-mm, linux-fsdevel, linux-api, linux-doc, qemu-devel,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Yu Zhang,
	Kirill A . Shutemov, Andy Lutomirski, Jun Nakajima, Dave Hansen,
	Andi Kleen, David Hildenbrand, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko

On Tue, Jun 14, 2022 at 12:32 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> On Thu, Jun 09, 2022 at 08:29:06PM +0000, Sean Christopherson wrote:
> > On Wed, Jun 08, 2022, Vishal Annapurve wrote:
> >
> > One argument is that userspace can simply rely on cgroups to detect misbehaving
> > guests, but (a) those types of OOMs will be a nightmare to debug and (b) an OOM
> > kill from the host is typically considered a _host_ issue and will be treated as
> > a missed SLO.
> >
> > An idea for handling this in the kernel without too much complexity would be to
> > add F_SEAL_FAULT_ALLOCATIONS (terrible name) that would prevent page faults from
> > allocating pages, i.e. holes can only be filled by an explicit fallocate().  Minor
> > faults, e.g. due to NUMA balancing stupidity, and major faults due to swap would
> > still work, but writes to previously unreserved/unallocated memory would get a
> > SIGSEGV on something it has mapped.  That would allow the userspace VMM to prevent
> > unintentional allocations without having to coordinate unmapping/remapping across
> > multiple processes.
>
> Since this is mainly for shared memory and the motivation is catching
> misbehaved access, can we use mprotect(PROT_NONE) for this? We can mark
> those range backed by private fd as PROT_NONE during the conversion so
> subsequence misbehaved accesses will be blocked instead of causing double
> allocation silently.

This patch series is fairly close to implementing a rather more
efficient solution.  I'm not familiar enough with hypervisor userspace
to really know if this would work, but:

What if shared guest memory could also be file-backed, either in the
same fd or with a second fd covering the shared portion of a memslot?
This would allow changes to the backing store (punching holes, etc) to
be done without mmap_lock or host-userspace TLB flushes?  Depending on
what the guest is doing with its shared memory, userspace might need
the memory mapped or it might not.

--Andy

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-06-14 17:37               ` Andy Lutomirski
@ 2022-06-14 19:08                 ` Sean Christopherson
  2022-06-14 20:59                   ` Andy Lutomirski
  0 siblings, 1 reply; 58+ messages in thread
From: Sean Christopherson @ 2022-06-14 19:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Chao Peng, Vishal Annapurve, Marc Orr, kvm list, LKML, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Yu Zhang, Kirill A . Shutemov, Jun Nakajima,
	Dave Hansen, Andi Kleen, David Hildenbrand, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko

On Tue, Jun 14, 2022, Andy Lutomirski wrote:
> On Tue, Jun 14, 2022 at 12:32 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > On Thu, Jun 09, 2022 at 08:29:06PM +0000, Sean Christopherson wrote:
> > > On Wed, Jun 08, 2022, Vishal Annapurve wrote:
> > >
> > > One argument is that userspace can simply rely on cgroups to detect misbehaving
> > > guests, but (a) those types of OOMs will be a nightmare to debug and (b) an OOM
> > > kill from the host is typically considered a _host_ issue and will be treated as
> > > a missed SLO.
> > >
> > > An idea for handling this in the kernel without too much complexity would be to
> > > add F_SEAL_FAULT_ALLOCATIONS (terrible name) that would prevent page faults from
> > > allocating pages, i.e. holes can only be filled by an explicit fallocate().  Minor
> > > faults, e.g. due to NUMA balancing stupidity, and major faults due to swap would
> > > still work, but writes to previously unreserved/unallocated memory would get a
> > > SIGSEGV on something it has mapped.  That would allow the userspace VMM to prevent
> > > unintentional allocations without having to coordinate unmapping/remapping across
> > > multiple processes.
> >
> > Since this is mainly for shared memory and the motivation is catching
> > misbehaved access, can we use mprotect(PROT_NONE) for this? We can mark
> > those range backed by private fd as PROT_NONE during the conversion so
> > subsequence misbehaved accesses will be blocked instead of causing double
> > allocation silently.

PROT_NONE, a.k.a. mprotect(), has the same vma downsides as munmap().
 
> This patch series is fairly close to implementing a rather more
> efficient solution.  I'm not familiar enough with hypervisor userspace
> to really know if this would work, but:
> 
> What if shared guest memory could also be file-backed, either in the
> same fd or with a second fd covering the shared portion of a memslot?
> This would allow changes to the backing store (punching holes, etc) to
> be some without mmap_lock or host-userspace TLB flushes?  Depending on
> what the guest is doing with its shared memory, userspace might need
> the memory mapped or it might not.

That's what I'm angling for with the F_SEAL_FAULT_ALLOCATIONS idea.  The issue,
unless I'm misreading code, is that punching a hole in the shared memory backing
store doesn't prevent reallocating that hole on fault, i.e. a helper process that
keeps a valid mapping of guest shared memory can silently fill the hole.

What we're hoping to achieve is a way to prevent allocating memory without a very
explicit action from userspace, e.g. fallocate().
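
For illustration, this is the behavior in question with today's APIs:
punching a hole frees the page, but any process that still has the memfd
mapped re-allocates it with a plain store (error handling omitted, names
are illustrative):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 4096;
	int fd = memfd_create("shared", MFD_CLOEXEC);
	char *p;

	ftruncate(fd, len);
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	p[0] = 1;					/* page allocated */
	fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, len);
							/* page freed     */
	p[0] = 1;	/* the fault silently allocates the page again */
	return 0;
}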

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 3/8] mm/memfd: Introduce MFD_INACCESSIBLE flag
  2022-06-02 10:07         ` Chao Peng
@ 2022-06-14 20:23           ` Sean Christopherson
  2022-06-15  8:53             ` Chao Peng
  0 siblings, 1 reply; 58+ messages in thread
From: Sean Christopherson @ 2022-06-14 20:23 UTC (permalink / raw)
  To: Chao Peng
  Cc: Gupta, Pankaj, Vishal Annapurve, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Yu Zhang, Kirill A . Shutemov, Andy Lutomirski,
	Jun Nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko

On Thu, Jun 02, 2022, Chao Peng wrote:
> On Wed, Jun 01, 2022 at 02:11:42PM +0200, Gupta, Pankaj wrote:
> > 
> > > > > Introduce a new memfd_create() flag indicating the content of the
> > > > > created memfd is inaccessible from userspace through ordinary MMU
> > > > > access (e.g., read/write/mmap). However, the file content can be
> > > > > accessed via a different mechanism (e.g. KVM MMU) indirectly.
> > > > > 
> > > > 
> > > > SEV, TDX, pkvm and software-only VMs seem to have usecases to set up
> > > > initial guest boot memory with the needed blobs.
> > > > TDX already supports a KVM IOCTL to transfer contents to private
> > > > memory using the TDX module but rest of the implementations will need
> > > > to invent
> > > > a way to do this.
> > > 
> > > There are some discussions in https://lkml.org/lkml/2022/5/9/1292
> > > already. I somehow agree with Sean. TDX is using an dedicated ioctl to
> > > copy guest boot memory to private fd so the rest can do that similarly.
> > > The concern is the performance (extra memcpy) but it's trivial since the
> > > initial guest payload is usually optimized in size.
> > > 
> > > > 
> > > > Is there a plan to support a common implementation for either allowing
> > > > initial write access from userspace to private fd or adding a KVM
> > > > IOCTL to transfer contents to such a file,
> > > > as part of this series through future revisions?
> > > 
> > > Indeed, adding pre-boot private memory populating on current design
> > > isn't impossible, but there are still some opens, e.g. how to expose
> > > private fd to userspace for access, pKVM and CC usages may have
> > > different requirements. Before that's well-studied I would tend to not
> > > add that and instead use an ioctl to copy. Whether we need a generic
> > > ioctl or feature-specific ioctl, I don't have strong opinion here.
> > > Current TDX uses a feature-specific ioctl so it's not covered in this
> > > series.
> > 
> > Common function or ioctl to populate preboot private memory actually makes
> > sense.
> > 
> > Sorry, did not follow much of TDX code yet, Is it possible to filter out
> > the current TDX specific ioctl to common function so that it can be used by
> > other technologies?
> 
> TDX code is here:
> https://patchwork.kernel.org/project/kvm/patch/70ed041fd47c1f7571aa259450b3f9244edda48d.1651774250.git.isaku.yamahata@intel.com/
> 
> AFAICS It might be possible to filter that out to a common function. But
> would like to hear from Paolo/Sean for their opinion.

Eh, I wouldn't put too much effort into creating a common helper, I would be very
surprised if TDX and SNP can share a meaningful amount of code that isn't already
shared, e.g. provided by MMU helpers.

The only part I truly care about sharing is whatever ioctl(s) get added, i.e. I
don't want to end up with two ioctls that do the same thing for TDX vs. SNP.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-06-14 19:08                 ` Sean Christopherson
@ 2022-06-14 20:59                   ` Andy Lutomirski
  2022-06-15  9:17                     ` Chao Peng
  0 siblings, 1 reply; 58+ messages in thread
From: Andy Lutomirski @ 2022-06-14 20:59 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Andy Lutomirski, Chao Peng, Vishal Annapurve, Marc Orr, kvm list,
	LKML, linux-mm, linux-fsdevel, linux-api, linux-doc, qemu-devel,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Yu Zhang,
	Kirill A . Shutemov, Jun Nakajima, Dave Hansen, Andi Kleen,
	David Hildenbrand, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko

On Tue, Jun 14, 2022 at 12:09 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Jun 14, 2022, Andy Lutomirski wrote:
> > On Tue, Jun 14, 2022 at 12:32 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > >
> > > On Thu, Jun 09, 2022 at 08:29:06PM +0000, Sean Christopherson wrote:
> > > > On Wed, Jun 08, 2022, Vishal Annapurve wrote:
> > > >
> > > > One argument is that userspace can simply rely on cgroups to detect misbehaving
> > > > guests, but (a) those types of OOMs will be a nightmare to debug and (b) an OOM
> > > > kill from the host is typically considered a _host_ issue and will be treated as
> > > > a missed SLO.
> > > >
> > > > An idea for handling this in the kernel without too much complexity would be to
> > > > add F_SEAL_FAULT_ALLOCATIONS (terrible name) that would prevent page faults from
> > > > allocating pages, i.e. holes can only be filled by an explicit fallocate().  Minor
> > > > faults, e.g. due to NUMA balancing stupidity, and major faults due to swap would
> > > > still work, but writes to previously unreserved/unallocated memory would get a
> > > > SIGSEGV on something it has mapped.  That would allow the userspace VMM to prevent
> > > > unintentional allocations without having to coordinate unmapping/remapping across
> > > > multiple processes.
> > >
> > > Since this is mainly for shared memory and the motivation is catching
> > > misbehaved access, can we use mprotect(PROT_NONE) for this? We can mark
> > > those range backed by private fd as PROT_NONE during the conversion so
> > > subsequence misbehaved accesses will be blocked instead of causing double
> > > allocation silently.
>
> PROT_NONE, a.k.a. mprotect(), has the same vma downsides as munmap().
>
> > This patch series is fairly close to implementing a rather more
> > efficient solution.  I'm not familiar enough with hypervisor userspace
> > to really know if this would work, but:
> >
> > What if shared guest memory could also be file-backed, either in the
> > same fd or with a second fd covering the shared portion of a memslot?
> > This would allow changes to the backing store (punching holes, etc) to
> > be some without mmap_lock or host-userspace TLB flushes?  Depending on
> > what the guest is doing with its shared memory, userspace might need
> > the memory mapped or it might not.
>
> That's what I'm angling for with the F_SEAL_FAULT_ALLOCATIONS idea.  The issue,
> unless I'm misreading code, is that punching a hole in the shared memory backing
> store doesn't prevent reallocating that hole on fault, i.e. a helper process that
> keeps a valid mapping of guest shared memory can silently fill the hole.
>
> What we're hoping to achieve is a way to prevent allocating memory without a very
> explicit action from userspace, e.g. fallocate().

Ah, I misunderstood.  I thought your goal was to mmap it and prevent
page faults from allocating.

It is indeed the case (and has been since before quite a few of us
were born) that a hole in a sparse file is logically just a bunch of
zeros.  A way to make a file for which a hole is an actual hole seems
like it would solve this problem nicely.  It could also be solved more
specifically for KVM by making sure that the private/shared mode that
userspace programs is strict enough to prevent accidental allocations
-- if a GPA is definitively private, shared, neither, or (potentially,
on TDX only) both, then a page that *isn't* shared will never be
accidentally allocated by KVM.  If the shared backing is not mmapped,
it also won't be accidentally allocated by host userspace on a stray
or careless write.


--Andy

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 3/8] mm/memfd: Introduce MFD_INACCESSIBLE flag
  2022-06-14 20:23           ` Sean Christopherson
@ 2022-06-15  8:53             ` Chao Peng
  0 siblings, 0 replies; 58+ messages in thread
From: Chao Peng @ 2022-06-15  8:53 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Gupta, Pankaj, Vishal Annapurve, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Yu Zhang, Kirill A . Shutemov, Andy Lutomirski,
	Jun Nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko

On Tue, Jun 14, 2022 at 08:23:46PM +0000, Sean Christopherson wrote:
> On Thu, Jun 02, 2022, Chao Peng wrote:
> > On Wed, Jun 01, 2022 at 02:11:42PM +0200, Gupta, Pankaj wrote:
> > > 
> > > > > > Introduce a new memfd_create() flag indicating the content of the
> > > > > > created memfd is inaccessible from userspace through ordinary MMU
> > > > > > access (e.g., read/write/mmap). However, the file content can be
> > > > > > accessed via a different mechanism (e.g. KVM MMU) indirectly.
> > > > > > 
> > > > > 
> > > > > SEV, TDX, pkvm and software-only VMs seem to have usecases to set up
> > > > > initial guest boot memory with the needed blobs.
> > > > > TDX already supports a KVM IOCTL to transfer contents to private
> > > > > memory using the TDX module but rest of the implementations will need
> > > > > to invent
> > > > > a way to do this.
> > > > 
> > > > There are some discussions in https://lkml.org/lkml/2022/5/9/1292
> > > > already. I somehow agree with Sean. TDX is using an dedicated ioctl to
> > > > copy guest boot memory to private fd so the rest can do that similarly.
> > > > The concern is the performance (extra memcpy) but it's trivial since the
> > > > initial guest payload is usually optimized in size.
> > > > 
> > > > > 
> > > > > Is there a plan to support a common implementation for either allowing
> > > > > initial write access from userspace to private fd or adding a KVM
> > > > > IOCTL to transfer contents to such a file,
> > > > > as part of this series through future revisions?
> > > > 
> > > > Indeed, adding pre-boot private memory populating on current design
> > > > isn't impossible, but there are still some opens, e.g. how to expose
> > > > private fd to userspace for access, pKVM and CC usages may have
> > > > different requirements. Before that's well-studied I would tend to not
> > > > add that and instead use an ioctl to copy. Whether we need a generic
> > > > ioctl or feature-specific ioctl, I don't have strong opinion here.
> > > > Current TDX uses a feature-specific ioctl so it's not covered in this
> > > > series.
> > > 
> > > Common function or ioctl to populate preboot private memory actually makes
> > > sense.
> > > 
> > > Sorry, did not follow much of TDX code yet, Is it possible to filter out
> > > the current TDX specific ioctl to common function so that it can be used by
> > > other technologies?
> > 
> > TDX code is here:
> > https://patchwork.kernel.org/project/kvm/patch/70ed041fd47c1f7571aa259450b3f9244edda48d.1651774250.git.isaku.yamahata@intel.com/
> > 
> > AFAICS it might be possible to factor that out into a common function, but
> > I would like to hear Paolo's and Sean's opinion.
> 
> Eh, I wouldn't put too much effort into creating a common helper, I would be very
> surprised if TDX and SNP can share a meaningful amount of code that isn't already
> shared, e.g. provided by MMU helpers.
> 
> The only part I truly care about sharing is whatever ioctl(s) get added, i.e. I
> don't want to end up with two ioctls that do the same thing for TDX vs. SNP.

OK, then that part would be better added in the TDX or SNP series.

Chao

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-06-14 20:59                   ` Andy Lutomirski
@ 2022-06-15  9:17                     ` Chao Peng
  2022-06-15 14:29                       ` Sean Christopherson
  0 siblings, 1 reply; 58+ messages in thread
From: Chao Peng @ 2022-06-15  9:17 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Sean Christopherson, Vishal Annapurve, Marc Orr, kvm list, LKML,
	linux-mm, linux-fsdevel, linux-api, linux-doc, qemu-devel,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Yu Zhang,
	Kirill A . Shutemov, Jun Nakajima, Dave Hansen, Andi Kleen,
	David Hildenbrand, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko

On Tue, Jun 14, 2022 at 01:59:41PM -0700, Andy Lutomirski wrote:
> On Tue, Jun 14, 2022 at 12:09 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Tue, Jun 14, 2022, Andy Lutomirski wrote:
> > > On Tue, Jun 14, 2022 at 12:32 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > > >
> > > > On Thu, Jun 09, 2022 at 08:29:06PM +0000, Sean Christopherson wrote:
> > > > > On Wed, Jun 08, 2022, Vishal Annapurve wrote:
> > > > >
> > > > > One argument is that userspace can simply rely on cgroups to detect misbehaving
> > > > > guests, but (a) those types of OOMs will be a nightmare to debug and (b) an OOM
> > > > > kill from the host is typically considered a _host_ issue and will be treated as
> > > > > a missed SLO.
> > > > >
> > > > > An idea for handling this in the kernel without too much complexity would be to
> > > > > add F_SEAL_FAULT_ALLOCATIONS (terrible name) that would prevent page faults from
> > > > > allocating pages, i.e. holes can only be filled by an explicit fallocate().  Minor
> > > > > faults, e.g. due to NUMA balancing stupidity, and major faults due to swap would
> > > > > still work, but writes to previously unreserved/unallocated memory would get a
> > > > > SIGSEGV on something it has mapped.  That would allow the userspace VMM to prevent
> > > > > unintentional allocations without having to coordinate unmapping/remapping across
> > > > > multiple processes.
> > > >
> > > > Since this is mainly for shared memory and the motivation is catching
> > > > misbehaved access, can we use mprotect(PROT_NONE) for this? We can mark
> > > > those ranges backed by the private fd as PROT_NONE during the conversion so
> > > > subsequent misbehaved accesses will be blocked instead of causing double
> > > > allocation silently.
> >
> > PROT_NONE, a.k.a. mprotect(), has the same vma downsides as munmap().

Yes, right.

> >
> > > This patch series is fairly close to implementing a rather more
> > > efficient solution.  I'm not familiar enough with hypervisor userspace
> > > to really know if this would work, but:
> > >
> > > What if shared guest memory could also be file-backed, either in the
> > > same fd or with a second fd covering the shared portion of a memslot?
> > > This would allow changes to the backing store (punching holes, etc) to
> > > be done without mmap_lock or host-userspace TLB flushes?  Depending on
> > > what the guest is doing with its shared memory, userspace might need
> > > the memory mapped or it might not.
> >
> > That's what I'm angling for with the F_SEAL_FAULT_ALLOCATIONS idea.  The issue,
> > unless I'm misreading code, is that punching a hole in the shared memory backing
> > store doesn't prevent reallocating that hole on fault, i.e. a helper process that
> > keeps a valid mapping of guest shared memory can silently fill the hole.
> >
> > What we're hoping to achieve is a way to prevent allocating memory without a very
> > explicit action from userspace, e.g. fallocate().
> 
> Ah, I misunderstood.  I thought your goal was to mmap it and prevent
> page faults from allocating.

I think we still need the mmap, but we want to prevent allocation when
userspace touches a previously mmap'ed area whose page has never been
filled. I don't have a clear answer on whether other operations like
read/write should also be prevented (probably yes). Only after an
explicit fallocate() allocates the page would these operations act
normally.
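
To make the intended flow concrete, here is a minimal userspace sketch.
F_SEAL_FAULT_ALLOCATIONS is only a placeholder for the seal proposed above;
the name, value and SIGBUS behavior are all assumptions, nothing below exists
today:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical seal -- made-up value, does not exist today. */
#define F_SEAL_FAULT_ALLOCATIONS	0x0040

#define SLOT_SIZE	(1UL << 20)
#define PG_SIZE		4096UL

static void shared_slot_example(void)
{
	int fd = memfd_create("guest-shared", MFD_ALLOW_SEALING);
	char *p;

	ftruncate(fd, SLOT_SIZE);
	fcntl(fd, F_ADD_SEALS, F_SEAL_FAULT_ALLOCATIONS);

	p = mmap(NULL, SLOT_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	/*
	 * Touching a never-allocated (or hole-punched) page would fail,
	 * e.g. with SIGBUS, instead of silently allocating backing memory:
	 *
	 *	p[0] = 1;
	 */

	/* Only an explicit fallocate() by the VMM fills the hole ... */
	fallocate(fd, 0, 0, PG_SIZE);

	/* ... after which the same access works normally. */
	p[0] = 1;
}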

> 
> It is indeed the case (and has been since before quite a few of us
> were born) that a hole in a sparse file is logically just a bunch of
> zeros.  A way to make a file for which a hole is an actual hole seems
> like it would solve this problem nicely.  It could also be solved more
> specifically for KVM by making sure that the private/shared mode that
> userspace programs is strict enough to prevent accidental allocations
> -- if a GPA is definitively private, shared, neither, or (potentially,
> on TDX only) both, then a page that *isn't* shared will never be
> accidentally allocated by KVM.

KVM is clever enough not to allocate since it knows whether a GPA is
shared or not. In this case it's host userspace that can cause the
allocation, and it's too complex to check on every access from the
guest.

> If the shared backing is not mmapped,
> it also won't be accidentally allocated by host userspace on a stray
> or careless write.

As said above, mmap is still preferred; otherwise too many changes are
needed in the userspace VMM.

Thanks,
Chao
> 
> 
> --Andy

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-06-15  9:17                     ` Chao Peng
@ 2022-06-15 14:29                       ` Sean Christopherson
  0 siblings, 0 replies; 58+ messages in thread
From: Sean Christopherson @ 2022-06-15 14:29 UTC (permalink / raw)
  To: Chao Peng
  Cc: Andy Lutomirski, Vishal Annapurve, Marc Orr, kvm list, LKML,
	linux-mm, linux-fsdevel, linux-api, linux-doc, qemu-devel,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Yu Zhang,
	Kirill A . Shutemov, Jun Nakajima, Dave Hansen, Andi Kleen,
	David Hildenbrand, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko

On Wed, Jun 15, 2022, Chao Peng wrote:
> On Tue, Jun 14, 2022 at 01:59:41PM -0700, Andy Lutomirski wrote:
> > On Tue, Jun 14, 2022 at 12:09 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Tue, Jun 14, 2022, Andy Lutomirski wrote:
> > > > This patch series is fairly close to implementing a rather more
> > > > efficient solution.  I'm not familiar enough with hypervisor userspace
> > > > to really know if this would work, but:
> > > >
> > > > What if shared guest memory could also be file-backed, either in the
> > > > same fd or with a second fd covering the shared portion of a memslot?
> > > > This would allow changes to the backing store (punching holes, etc) to
> > > > be done without mmap_lock or host-userspace TLB flushes?  Depending on
> > > > what the guest is doing with its shared memory, userspace might need
> > > > the memory mapped or it might not.
> > >
> > > That's what I'm angling for with the F_SEAL_FAULT_ALLOCATIONS idea.  The issue,
> > > unless I'm misreading code, is that punching a hole in the shared memory backing
> > > store doesn't prevent reallocating that hole on fault, i.e. a helper process that
> > > keeps a valid mapping of guest shared memory can silently fill the hole.
> > >
> > > What we're hoping to achieve is a way to prevent allocating memory without a very
> > > explicit action from userspace, e.g. fallocate().
> > 
> > Ah, I misunderstood.  I thought your goal was to mmap it and prevent
> > page faults from allocating.

I don't think you misunderstood, that's also one of the goals.  The use case is
that multiple processes in the host mmap() guest memory, and we'd like to be able
to punch a hole without having to rendezvous with all processes and also to prevent
an unintentional re-allocation.

> I think we still need the mmap, but want to prevent allocating when
> userspace touches previously mmaped area that has never filled the page.

Yes, or if a chunk was filled at some point but then was removed via PUNCH_HOLE.

> I don't have clear answer if other operations like read/write should be
> also prevented (probably yes). And only after an explicit fallocate() to
> allocate the page these operations would act normally.

I always forget about read/write.  I believe reads should be ok; the semantics of
holes are that they return zeros, i.e. the kernel can use ZERO_PAGE() and not
allocate a new backing page.  Not sure what to do about writes though.  Allocating
on direct writes might be ok for our use case, but that could also result in a
rather weird API.
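
For reference, a quick sketch of today's hole semantics on a memfd/tmpfs file,
which is what the seal would be changing for writes: a read of a hole returns
zeros without allocating anything, while a write instantiates a backing page
(visible via st_blocks):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char buf[4096] = { 0 };
	struct stat st;
	int fd = memfd_create("hole-demo", 0);

	ftruncate(fd, 1 << 20);			/* 1M sparse file, all hole */

	pread(fd, buf, sizeof(buf), 0);		/* hole reads back as zeros */
	fstat(fd, &st);
	printf("blocks after read:  %ld\n", (long)st.st_blocks);	/* 0 */

	pwrite(fd, buf, sizeof(buf), 0);	/* this allocates a backing page */
	fstat(fd, &st);
	printf("blocks after write: %ld\n", (long)st.st_blocks);	/* > 0 */

	close(fd);
	return 0;
}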

> > It is indeed the case (and has been since before quite a few of us
> > were born) that a hole in a sparse file is logically just a bunch of
> > zeros.  A way to make a file for which a hole is an actual hole seems
> > like it would solve this problem nicely.  It could also be solved more
> > specifically for KVM by making sure that the private/shared mode that
> > userspace programs is strict enough to prevent accidental allocations
> > -- if a GPA is definitively private, shared, neither, or (potentially,
> > on TDX only) both, then a page that *isn't* shared will never be
> > accidentally allocated by KVM.
> 
> KVM is clever enough to not allocate since it knows a GPA is shared or
> not. This case it's the host userspace that can cause the allocating and
> is too complex to check on every access from guest.

Yes, KVM is not in the picture at all.  KVM won't trigger allocation, but KVM also
is not in a position to prevent userspace from touching memory.

> > If the shared backing is not mmapped,
> > it also won't be accidentally allocated by host userspace on a stray
> > or careless write.
> 
> As said above, mmap is still prefered, otherwise too many changes are
> needed for usespace VMM.

Forcing userspace to change doesn't bother me too much; the biggest concern is
having to take mmap_lock for write in each host process.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory
  2022-05-19 15:37 ` [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory Chao Peng
  2022-05-20 17:57   ` Andy Lutomirski
@ 2022-06-17 20:52   ` Sean Christopherson
  2022-06-17 21:27     ` Sean Christopherson
  2022-06-20 14:08     ` Chao Peng
  1 sibling, 2 replies; 58+ messages in thread
From: Sean Christopherson @ 2022-06-17 20:52 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko

On Thu, May 19, 2022, Chao Peng wrote:
> @@ -653,12 +662,12 @@ struct kvm_irq_routing_table {
>  };
>  #endif
>  
> -#ifndef KVM_PRIVATE_MEM_SLOTS
> -#define KVM_PRIVATE_MEM_SLOTS 0
> +#ifndef KVM_INTERNAL_MEM_SLOTS
> +#define KVM_INTERNAL_MEM_SLOTS 0
>  #endif

This rename belongs in a separate patch.

>  #define KVM_MEM_SLOTS_NUM SHRT_MAX
> -#define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_PRIVATE_MEM_SLOTS)
> +#define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_INTERNAL_MEM_SLOTS)
>  
>  #ifndef __KVM_VCPU_MULTIPLE_ADDRESS_SPACE
>  static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
> @@ -1087,9 +1096,9 @@ enum kvm_mr_change {
>  };
>  
>  int kvm_set_memory_region(struct kvm *kvm,
> -			  const struct kvm_userspace_memory_region *mem);
> +			  const struct kvm_user_mem_region *mem);
>  int __kvm_set_memory_region(struct kvm *kvm,
> -			    const struct kvm_userspace_memory_region *mem);
> +			    const struct kvm_user_mem_region *mem);
>  void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
>  void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
>  int kvm_arch_prepare_memory_region(struct kvm *kvm,
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index e10d131edd80..28cacd3656d4 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -103,6 +103,29 @@ struct kvm_userspace_memory_region {
>  	__u64 userspace_addr; /* start of the userspace allocated memory */
>  };
>  
> +struct kvm_userspace_memory_region_ext {
> +	struct kvm_userspace_memory_region region;
> +	__u64 private_offset;
> +	__u32 private_fd;
> +	__u32 pad1;
> +	__u64 pad2[14];
> +};
> +
> +#ifdef __KERNEL__
> +/* Internal helper, the layout must match above user visible structures */

It's worth explicitly calling out which structures this aliases.  And rather than
add a comment about the layout needing to match, enforce it in code.  I personally
wouldn't bother with an explicit comment about the layout; IMO that's a fairly
obvious implication of aliasing.

/*
 * kvm_user_mem_region is a kernel-only alias of kvm_userspace_memory_region_ext
 * that "unpacks" kvm_userspace_memory_region so that KVM can directly access
 * all fields from the top-level "extended" region.
 */


And I think it's in this patch that you missed a conversion to the alias, in the
prototype for check_memory_region_flags() (looks like it gets fixed up later in
the series).

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0f81bf0407be..8765b334477d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1466,7 +1466,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
        }
 }

-static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
+static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
 {
        u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;

@@ -4514,6 +4514,33 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
        return fd;
 }

+#define SANITY_CHECK_MEM_REGION_FIELD(field)                                   \
+do {                                                                           \
+       BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=             \
+                    offsetof(struct kvm_userspace_memory_region, field));      \
+       BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=         \
+                    sizeof_field(struct kvm_userspace_memory_region, field));  \
+} while (0)
+
+#define SANITY_CHECK_MEM_REGION_EXT_FIELD(field)                                       \
+do {                                                                                   \
+       BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=                     \
+                    offsetof(struct kvm_userspace_memory_region_ext, field));          \
+       BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=                 \
+                    sizeof_field(struct kvm_userspace_memory_region_ext, field));      \
+} while (0)
+
+static void kvm_sanity_check_user_mem_region_alias(void)
+{
+       SANITY_CHECK_MEM_REGION_FIELD(slot);
+       SANITY_CHECK_MEM_REGION_FIELD(flags);
+       SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
+       SANITY_CHECK_MEM_REGION_FIELD(memory_size);
+       SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
+       SANITY_CHECK_MEM_REGION_EXT_FIELD(private_offset);
+       SANITY_CHECK_MEM_REGION_EXT_FIELD(private_fd);
+}
+
 static long kvm_vm_ioctl(struct file *filp,
                           unsigned int ioctl, unsigned long arg)
 {
@@ -4541,6 +4568,8 @@ static long kvm_vm_ioctl(struct file *filp,
                unsigned long size;
                u32 flags;

+               kvm_sanity_check_user_mem_region_alias();
+
                memset(&mem, 0, sizeof(mem));

                r = -EFAULT;

> +struct kvm_user_mem_region {
> +	__u32 slot;
> +	__u32 flags;
> +	__u64 guest_phys_addr;
> +	__u64 memory_size;
> +	__u64 userspace_addr;
> +	__u64 private_offset;
> +	__u32 private_fd;
> +	__u32 pad1;
> +	__u64 pad2[14];
> +};
> +#endif
> +
>  /*
>   * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
>   * other bits are reserved for kvm internal use which are defined in
> @@ -110,6 +133,7 @@ struct kvm_userspace_memory_region {
>   */
>  #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
>  #define KVM_MEM_READONLY	(1UL << 1)
> +#define KVM_MEM_PRIVATE		(1UL << 2)

Hmm, KVM_MEM_PRIVATE is technically wrong now that a "private" memslot maps private
and/or shared memory.  Strictly speaking, we don't actually need a new flag.  Valid
file descriptors must be >=0, so the logic for specifying a memslot that can be
converted between private and shared could be that "(int)private_fd < 0" means
"not convertible", i.e. derive the flag from private_fd.

And looking at the two KVM consumers of the flag, via kvm_slot_is_private(), they're
both wrong.  Both kvm_faultin_pfn() and kvm_mmu_max_mapping_level() should operate
on the _fault_, not the slot.  So it would actually be a positive to not have an easy
way to query if a slot supports conversion.
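
i.e., roughly the following (just restating the suggestion, not code from the
series):

/*
 * Sketch: any valid (non-negative) private_fd means the slot supports
 * private<->shared conversion; no separate flag bit required.
 */
static inline bool kvm_mem_region_is_convertible(const struct kvm_user_mem_region *mem)
{
	return (int)mem->private_fd >= 0;
}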

>  /* for KVM_IRQ_LINE */
>  struct kvm_irq_level {

...

> +		if (flags & KVM_MEM_PRIVATE) {

An added bonus of dropping KVM_MEM_PRIVATE is that these checks go away.

> +			r = -EINVAL;
> +			goto out;
> +		}
> +
> +		size = sizeof(struct kvm_userspace_memory_region);
> +
> +		if (copy_from_user(&mem, argp, size))
> +			goto out;
> +
> +		r = -EINVAL;
> +		if ((flags ^ mem.flags) & KVM_MEM_PRIVATE)
>  			goto out;
>  
> -		r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
> +		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
>  		break;
>  	}
>  	case KVM_GET_DIRTY_LOG: {
> -- 
> 2.25.1
> 

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory
  2022-06-17 20:52   ` Sean Christopherson
@ 2022-06-17 21:27     ` Sean Christopherson
  2022-06-20 14:09       ` Chao Peng
  2022-06-20 14:08     ` Chao Peng
  1 sibling, 1 reply; 58+ messages in thread
From: Sean Christopherson @ 2022-06-17 21:27 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko

On Fri, Jun 17, 2022, Sean Christopherson wrote:
> > @@ -110,6 +133,7 @@ struct kvm_userspace_memory_region {
> >   */
> >  #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
> >  #define KVM_MEM_READONLY	(1UL << 1)
> > +#define KVM_MEM_PRIVATE		(1UL << 2)
> 
> Hmm, KVM_MEM_PRIVATE is technically wrong now that a "private" memslot maps private
> and/or shared memory.  Strictly speaking, we don't actually need a new flag.  Valid
> file descriptors must be >=0, so the logic for specifying a memslot that can be
> converted between private and shared could be that "(int)private_fd < 0" means
> "not convertible", i.e. derive the flag from private_fd.
> 
> And looking at the two KVM consumers of the flag, via kvm_slot_is_private(), they're
> both wrong.  Both kvm_faultin_pfn() and kvm_mmu_max_mapping_level() should operate
> on the _fault_, not the slot.  So it would actually be a positive to not have an easy
> way to query if a slot supports conversion.

I take that back, the usage in kvm_faultin_pfn() is correct, but the name ends
up being confusing because it suggests that it always faults in a private pfn.

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b6d75016e48c..e1008f00609d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4045,7 +4045,7 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
                        return RET_PF_EMULATE;
        }

-       if (fault->is_private) {
+       if (kvm_slot_can_be_private(slot)) {
                r = kvm_faultin_pfn_private(vcpu, fault);
                if (r != RET_PF_CONTINUE)
                        return r == RET_PF_FIXED ? RET_PF_CONTINUE : r;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 31f704c83099..c5126190fb71 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -583,9 +583,9 @@ struct kvm_memory_slot {
        struct kvm *kvm;
 };

-static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot)
+static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
 {
-       return slot && (slot->flags & KVM_MEM_PRIVATE);
+       return slot && !!slot->private_file;
 }

 static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 6/8] KVM: Handle page fault for private memory
  2022-05-19 15:37 ` [PATCH v6 6/8] KVM: Handle page fault for private memory Chao Peng
@ 2022-06-17 21:30   ` Sean Christopherson
  2022-06-20 14:16     ` Chao Peng
  2022-08-19  0:40     ` Kirill A. Shutemov
  2022-06-24  3:58   ` Nikunj A. Dadhania
  1 sibling, 2 replies; 58+ messages in thread
From: Sean Christopherson @ 2022-06-17 21:30 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko

On Thu, May 19, 2022, Chao Peng wrote:
> @@ -4028,8 +4081,11 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
>  	if (!sp && kvm_test_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu))
>  		return true;
>  
> -	return fault->slot &&
> -	       mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
> +	if (fault->is_private)
> +		return mmu_notifier_retry(vcpu->kvm, mmu_seq);

Hmm, this is somewhat undesirable, because faulting in private pfns will be blocked
by unrelated mmu_notifier updates.  The issue is mitigated to some degree by bumping
the sequence count if and only if overlap with a memslot is detected, e.g. mapping
changes that affect only userspace won't block the guest.

It probably won't be an issue, but at the same time it's easy to solve, and I don't
like piggybacking mmu_notifier_seq as private mappings shouldn't be subject to the
mmu_notifier.

That would also fix a theoretical bug in this patch where mmu_notifier_retry()
wouldn't be defined if CONFIG_MEMFILE_NOTIFIER=y && CONFIG_MMU_NOTIFIER=n.

---
 arch/x86/kvm/mmu/mmu.c   | 11 ++++++-----
 include/linux/kvm_host.h | 16 +++++++++++-----
 virt/kvm/kvm_main.c      |  2 +-
 3 files changed, 18 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0b455c16ec64..a4cbd29433e7 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4100,10 +4100,10 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
 		return true;

 	if (fault->is_private)
-		return mmu_notifier_retry(vcpu->kvm, mmu_seq);
-	else
-		return fault->slot &&
-			mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
+		return memfile_notifier_retry(vcpu->kvm, mmu_seq);
+
+	return fault->slot &&
+	       mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
 }

 static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
@@ -4127,7 +4127,8 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	if (r)
 		return r;

-	mmu_seq = vcpu->kvm->mmu_notifier_seq;
+	mmu_seq = fault->is_private ? vcpu->kvm->memfile_notifier_seq :
+				      vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();

 	r = kvm_faultin_pfn(vcpu, fault);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 92afa5bddbc5..31f704c83099 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -773,16 +773,15 @@ struct kvm {
 	struct hlist_head irq_ack_notifier_list;
 #endif

-#if (defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)) ||\
-	defined(CONFIG_MEMFILE_NOTIFIER)
+#if (defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER))
 	unsigned long mmu_notifier_seq;
-#endif
-
-#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
 	struct mmu_notifier mmu_notifier;
 	long mmu_notifier_count;
 	unsigned long mmu_notifier_range_start;
 	unsigned long mmu_notifier_range_end;
+#endif
+#ifdef CONFIG_MEMFILE_NOTIFIER
+	unsigned long memfile_notifier_seq;
 #endif
 	struct list_head devices;
 	u64 manual_dirty_log_protect;
@@ -1964,6 +1963,13 @@ static inline int mmu_notifier_retry_hva(struct kvm *kvm,
 }
 #endif

+#ifdef CONFIG_MEMFILE_NOTIFIER
+static inline bool memfile_notifier_retry(struct kvm *kvm, unsigned long mmu_seq)
+{
+	return kvm->memfile_notifier_seq != mmu_seq;
+}
+#endif
+
 #ifdef CONFIG_HAVE_KVM_IRQ_ROUTING

 #define KVM_MAX_IRQ_ROUTES 4096 /* might need extension/rework in the future */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 2b416d3bd60e..e6d34c964d51 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -898,7 +898,7 @@ static void kvm_private_mem_notifier_handler(struct memfile_notifier *notifier,
 	KVM_MMU_LOCK(kvm);
 	if (kvm_unmap_gfn_range(kvm, &gfn_range))
 		kvm_flush_remote_tlbs(kvm);
-	kvm->mmu_notifier_seq++;
+	kvm->memfile_notifier_seq++;
 	KVM_MMU_UNLOCK(kvm);
 	srcu_read_unlock(&kvm->srcu, idx);
 }

base-commit: 333ef501c7f6c6d4ef2b7678905cad0f8ef3e271
--

> +	else
> +		return fault->slot &&
> +			mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
>  }
>  
>  static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> @@ -4088,7 +4144,12 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  		read_unlock(&vcpu->kvm->mmu_lock);
>  	else
>  		write_unlock(&vcpu->kvm->mmu_lock);
> -	kvm_release_pfn_clean(fault->pfn);
> +
> +	if (fault->is_private)
> +		kvm_private_mem_put_pfn(fault->slot, fault->pfn);

Why does the shmem path lock the page, and then unlock it here?

Same question for why this path marks it dirty?  The guest has the page mapped
so the dirty flag is immediately stale.

In other words, why does KVM need to do something different for private pfns?

> +	else
> +		kvm_release_pfn_clean(fault->pfn);
> +
>  	return r;
>  }
>  

...

> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 7f8f1c8dbed2..1d857919a947 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -878,7 +878,10 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  
>  out_unlock:
>  	write_unlock(&vcpu->kvm->mmu_lock);
> -	kvm_release_pfn_clean(fault->pfn);
> +	if (fault->is_private)

Indirect MMUs can't support private faults, i.e. this is unnecessary.

> +		kvm_private_mem_put_pfn(fault->slot, fault->pfn);
> +	else
> +		kvm_release_pfn_clean(fault->pfn);
>  	return r;
>  }
>  
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 3fd168972ecd..b0a7910505ed 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2241,4 +2241,26 @@ static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu)
>  /* Max number of entries allowed for each kvm dirty ring */
>  #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
>  
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> +static inline int kvm_private_mem_get_pfn(struct kvm_memory_slot *slot,
> +					  gfn_t gfn, kvm_pfn_t *pfn, int *order)
> +{
> +	int ret;
> +	pfn_t pfnt;
> +	pgoff_t index = gfn - slot->base_gfn +
> +			(slot->private_offset >> PAGE_SHIFT);
> +
> +	ret = slot->notifier.bs->get_lock_pfn(slot->private_file, index, &pfnt,
> +						order);
> +	*pfn = pfn_t_to_pfn(pfnt);
> +	return ret;
> +}
> +
> +static inline void kvm_private_mem_put_pfn(struct kvm_memory_slot *slot,
> +					   kvm_pfn_t pfn)
> +{
> +	slot->notifier.bs->put_unlock_pfn(pfn_to_pfn_t(pfn));
> +}
> +#endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */
> +
>  #endif
> -- 
> 2.25.1
> 

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory
  2022-06-17 20:52   ` Sean Christopherson
  2022-06-17 21:27     ` Sean Christopherson
@ 2022-06-20 14:08     ` Chao Peng
  1 sibling, 0 replies; 58+ messages in thread
From: Chao Peng @ 2022-06-20 14:08 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko

On Fri, Jun 17, 2022 at 08:52:15PM +0000, Sean Christopherson wrote:
> On Thu, May 19, 2022, Chao Peng wrote:
> > @@ -653,12 +662,12 @@ struct kvm_irq_routing_table {
> >  };
> >  #endif
> >  
> > -#ifndef KVM_PRIVATE_MEM_SLOTS
> > -#define KVM_PRIVATE_MEM_SLOTS 0
> > +#ifndef KVM_INTERNAL_MEM_SLOTS
> > +#define KVM_INTERNAL_MEM_SLOTS 0
> >  #endif
> 
> This rename belongs in a separate patch.

Will separate it out, thanks.

> 
> >  #define KVM_MEM_SLOTS_NUM SHRT_MAX
> > -#define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_PRIVATE_MEM_SLOTS)
> > +#define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_INTERNAL_MEM_SLOTS)
> >  
> >  #ifndef __KVM_VCPU_MULTIPLE_ADDRESS_SPACE
> >  static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
> > @@ -1087,9 +1096,9 @@ enum kvm_mr_change {
> >  };
> >  
> >  int kvm_set_memory_region(struct kvm *kvm,
> > -			  const struct kvm_userspace_memory_region *mem);
> > +			  const struct kvm_user_mem_region *mem);
> >  int __kvm_set_memory_region(struct kvm *kvm,
> > -			    const struct kvm_userspace_memory_region *mem);
> > +			    const struct kvm_user_mem_region *mem);
> >  void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
> >  void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
> >  int kvm_arch_prepare_memory_region(struct kvm *kvm,
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index e10d131edd80..28cacd3656d4 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -103,6 +103,29 @@ struct kvm_userspace_memory_region {
> >  	__u64 userspace_addr; /* start of the userspace allocated memory */
> >  };
> >  
> > +struct kvm_userspace_memory_region_ext {
> > +	struct kvm_userspace_memory_region region;
> > +	__u64 private_offset;
> > +	__u32 private_fd;
> > +	__u32 pad1;
> > +	__u64 pad2[14];
> > +};
> > +
> > +#ifdef __KERNEL__
> > +/* Internal helper, the layout must match above user visible structures */
> 
> It's worth explicitly calling out which structures this aliases.  And rather than
> add a comment about the layout needing to match, enforce it in code.  I personally
> wouldn't bother with an explicit comment about the layout; IMO that's a fairly
> obvious implication of aliasing.
> 
> /*
>  * kvm_user_mem_region is a kernel-only alias of kvm_userspace_memory_region_ext
>  * that "unpacks" kvm_userspace_memory_region so that KVM can directly access
>  * all fields from the top-level "extended" region.
>  */
> 

Thanks.

> 
> And I think it's in this patch that you missed a conversion to the alias, in the
> prototype for check_memory_region_flags() (looks like it gets fixed up later in
> the series).
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 0f81bf0407be..8765b334477d 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1466,7 +1466,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
>         }
>  }
> 
> -static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
> +static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
>  {
>         u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> 
> @@ -4514,6 +4514,33 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
>         return fd;
>  }
> 
> +#define SANITY_CHECK_MEM_REGION_FIELD(field)                                   \
> +do {                                                                           \
> +       BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=             \
> +                    offsetof(struct kvm_userspace_memory_region, field));      \
> +       BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=         \
> +                    sizeof_field(struct kvm_userspace_memory_region, field));  \
> +} while (0)
> +
> +#define SANITY_CHECK_MEM_REGION_EXT_FIELD(field)                                       \
> +do {                                                                                   \
> +       BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=                     \
> +                    offsetof(struct kvm_userspace_memory_region_ext, field));          \
> +       BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=                 \
> +                    sizeof_field(struct kvm_userspace_memory_region_ext, field));      \
> +} while (0)
> +
> +static void kvm_sanity_check_user_mem_region_alias(void)
> +{
> +       SANITY_CHECK_MEM_REGION_FIELD(slot);
> +       SANITY_CHECK_MEM_REGION_FIELD(flags);
> +       SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
> +       SANITY_CHECK_MEM_REGION_FIELD(memory_size);
> +       SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
> +       SANITY_CHECK_MEM_REGION_EXT_FIELD(private_offset);
> +       SANITY_CHECK_MEM_REGION_EXT_FIELD(private_fd);
> +}
> +
>  static long kvm_vm_ioctl(struct file *filp,
>                            unsigned int ioctl, unsigned long arg)
>  {
> @@ -4541,6 +4568,8 @@ static long kvm_vm_ioctl(struct file *filp,
>                 unsigned long size;
>                 u32 flags;
> 
> +               kvm_sanity_check_user_mem_region_alias();
> +
>                 memset(&mem, 0, sizeof(mem));
> 
>                 r = -EFAULT;
> 
> > +struct kvm_user_mem_region {
> > +	__u32 slot;
> > +	__u32 flags;
> > +	__u64 guest_phys_addr;
> > +	__u64 memory_size;
> > +	__u64 userspace_addr;
> > +	__u64 private_offset;
> > +	__u32 private_fd;
> > +	__u32 pad1;
> > +	__u64 pad2[14];
> > +};
> > +#endif
> > +
> >  /*
> >   * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
> >   * other bits are reserved for kvm internal use which are defined in
> > @@ -110,6 +133,7 @@ struct kvm_userspace_memory_region {
> >   */
> >  #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
> >  #define KVM_MEM_READONLY	(1UL << 1)
> > +#define KVM_MEM_PRIVATE		(1UL << 2)
> 
> Hmm, KVM_MEM_PRIVATE is technically wrong now that a "private" memslot maps private
> and/or shared memory.  Strictly speaking, we don't actually need a new flag.  Valid
> file descriptors must be >=0, so the logic for specifying a memslot that can be
> converted between private and shared could be that "(int)private_fd < 0" means
> "not convertible", i.e. derive the flag from private_fd.

I think a flag is still needed. The problem is that private_fd can be
safely accessed only when this flag is set; without the flag we can't
copy_from_user these new fields, since they don't exist for previous
kvm_userspace_memory_region callers.
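
Roughly, the dispatch then has to look like the sketch below (condensed, not
the exact code in this patch): peek at flags first, and only copy in the
larger _ext layout when KVM_MEM_PRIVATE is set.

	case KVM_SET_USER_MEMORY_REGION: {
		struct kvm_user_mem_region mem;
		unsigned long size;
		u32 flags;

		memset(&mem, 0, sizeof(mem));

		/* Peek at flags to decide how many bytes to copy in. */
		r = -EFAULT;
		if (get_user(flags, (u32 __user *)(argp +
				offsetof(struct kvm_user_mem_region, flags))))
			goto out;

		if (flags & KVM_MEM_PRIVATE)
			size = sizeof(struct kvm_userspace_memory_region_ext);
		else
			size = sizeof(struct kvm_userspace_memory_region);

		if (copy_from_user(&mem, argp, size))
			goto out;

		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
		break;
	}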

> 
> And looking at the two KVM consumers of the flag, via kvm_slot_is_private(), they're
> both wrong.  Both kvm_faultin_pfn() and kvm_mmu_max_mapping_level() should operate
> on the _fault_, not the slot.  So it would actually be a positive to not have an easy
> way to query if a slot supports conversion.
> 
> >  /* for KVM_IRQ_LINE */
> >  struct kvm_irq_level {
> 
> ...
> 
> > +		if (flags & KVM_MEM_PRIVATE) {
> 
> An added bonus of dropping KVM_MEM_PRIVATE is that these checks go away.
> 
> > +			r = -EINVAL;
> > +			goto out;
> > +		}
> > +
> > +		size = sizeof(struct kvm_userspace_memory_region);
> > +
> > +		if (copy_from_user(&mem, argp, size))
> > +			goto out;
> > +
> > +		r = -EINVAL;
> > +		if ((flags ^ mem.flags) & KVM_MEM_PRIVATE)
> >  			goto out;
> >  
> > -		r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
> > +		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> >  		break;
> >  	}
> >  	case KVM_GET_DIRTY_LOG: {
> > -- 
> > 2.25.1
> > 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory
  2022-06-17 21:27     ` Sean Christopherson
@ 2022-06-20 14:09       ` Chao Peng
  0 siblings, 0 replies; 58+ messages in thread
From: Chao Peng @ 2022-06-20 14:09 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko

On Fri, Jun 17, 2022 at 09:27:25PM +0000, Sean Christopherson wrote:
> On Fri, Jun 17, 2022, Sean Christopherson wrote:
> > > @@ -110,6 +133,7 @@ struct kvm_userspace_memory_region {
> > >   */
> > >  #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
> > >  #define KVM_MEM_READONLY	(1UL << 1)
> > > +#define KVM_MEM_PRIVATE		(1UL << 2)
> > 
> > Hmm, KVM_MEM_PRIVATE is technically wrong now that a "private" memslot maps private
> > and/or shared memory.  Strictly speaking, we don't actually need a new flag.  Valid
> > file descriptors must be >=0, so the logic for specifying a memslot that can be
> > converted between private and shared could be that "(int)private_fd < 0" means
> > "not convertible", i.e. derive the flag from private_fd.
> > 
> > And looking at the two KVM consumers of the flag, via kvm_slot_is_private(), they're
> > both wrong.  Both kvm_faultin_pfn() and kvm_mmu_max_mapping_level() should operate
> > on the _fault_, not the slot.  So it would actually be a positive to not have an easy
> > way to query if a slot supports conversion.
> 
> I take that back, the usage in kvm_faultin_pfn() is correct, but the name ends
> up being confusing because it suggests that it always faults in a private pfn.

Makes sense, will change the naming, thanks.

> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index b6d75016e48c..e1008f00609d 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4045,7 +4045,7 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>                         return RET_PF_EMULATE;
>         }
> 
> -       if (fault->is_private) {
> +       if (kvm_slot_can_be_private(slot)) {
>                 r = kvm_faultin_pfn_private(vcpu, fault);
>                 if (r != RET_PF_CONTINUE)
>                         return r == RET_PF_FIXED ? RET_PF_CONTINUE : r;
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 31f704c83099..c5126190fb71 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -583,9 +583,9 @@ struct kvm_memory_slot {
>         struct kvm *kvm;
>  };
> 
> -static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot)
> +static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
>  {
> -       return slot && (slot->flags & KVM_MEM_PRIVATE);
> +       return slot && !!slot->private_file;
>  }
> 
>  static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 6/8] KVM: Handle page fault for private memory
  2022-06-17 21:30   ` Sean Christopherson
@ 2022-06-20 14:16     ` Chao Peng
  2022-08-19  0:40     ` Kirill A. Shutemov
  1 sibling, 0 replies; 58+ messages in thread
From: Chao Peng @ 2022-06-20 14:16 UTC (permalink / raw)
  To: Sean Christopherson, kirill.shutemov
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko

On Fri, Jun 17, 2022 at 09:30:53PM +0000, Sean Christopherson wrote:
> On Thu, May 19, 2022, Chao Peng wrote:
> > @@ -4028,8 +4081,11 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
> >  	if (!sp && kvm_test_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu))
> >  		return true;
> >  
> > -	return fault->slot &&
> > -	       mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
> > +	if (fault->is_private)
> > +		return mmu_notifier_retry(vcpu->kvm, mmu_seq);
> 
> Hmm, this is somewhat undesirable, because faulting in private pfns will be blocked
> by unrelated mmu_notifier updates.  The issue is mitigated to some degree by bumping
> the sequence count if and only if overlap with a memslot is detected, e.g. mapping
> changes that affect only userspace won't block the guest.
> 
> It probably won't be an issue, but at the same time it's easy to solve, and I don't
> like piggybacking mmu_notifier_seq as private mappings shouldn't be subject to the
> mmu_notifier.
> 
> That would also fix a theoretical bug in this patch where mmu_notifier_retry()
> wouldn't be defined if CONFIG_MEMFILE_NOTIFIER=y && CONFIG_MMU_NOTIFIER=n.

Agreed, Thanks.

> 
> ---
>  arch/x86/kvm/mmu/mmu.c   | 11 ++++++-----
>  include/linux/kvm_host.h | 16 +++++++++++-----
>  virt/kvm/kvm_main.c      |  2 +-
>  3 files changed, 18 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 0b455c16ec64..a4cbd29433e7 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4100,10 +4100,10 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
>  		return true;
> 
>  	if (fault->is_private)
> -		return mmu_notifier_retry(vcpu->kvm, mmu_seq);
> -	else
> -		return fault->slot &&
> -			mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
> +		return memfile_notifier_retry(vcpu->kvm, mmu_seq);
> +
> +	return fault->slot &&
> +	       mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
>  }
> 
>  static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> @@ -4127,7 +4127,8 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  	if (r)
>  		return r;
> 
> -	mmu_seq = vcpu->kvm->mmu_notifier_seq;
> +	mmu_seq = fault->is_private ? vcpu->kvm->memfile_notifier_seq :
> +				      vcpu->kvm->mmu_notifier_seq;
>  	smp_rmb();
> 
>  	r = kvm_faultin_pfn(vcpu, fault);
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 92afa5bddbc5..31f704c83099 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -773,16 +773,15 @@ struct kvm {
>  	struct hlist_head irq_ack_notifier_list;
>  #endif
> 
> -#if (defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)) ||\
> -	defined(CONFIG_MEMFILE_NOTIFIER)
> +#if (defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER))
>  	unsigned long mmu_notifier_seq;
> -#endif
> -
> -#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
>  	struct mmu_notifier mmu_notifier;
>  	long mmu_notifier_count;
>  	unsigned long mmu_notifier_range_start;
>  	unsigned long mmu_notifier_range_end;
> +#endif
> +#ifdef CONFIG_MEMFILE_NOTIFIER
> +	unsigned long memfile_notifier_seq;
>  #endif
>  	struct list_head devices;
>  	u64 manual_dirty_log_protect;
> @@ -1964,6 +1963,13 @@ static inline int mmu_notifier_retry_hva(struct kvm *kvm,
>  }
>  #endif
> 
> +#ifdef CONFIG_MEMFILE_NOTIFIER
> +static inline bool memfile_notifier_retry(struct kvm *kvm, unsigned long mmu_seq)
> +{
> +	return kvm->memfile_notifier_seq != mmu_seq;
> +}
> +#endif
> +
>  #ifdef CONFIG_HAVE_KVM_IRQ_ROUTING
> 
>  #define KVM_MAX_IRQ_ROUTES 4096 /* might need extension/rework in the future */
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 2b416d3bd60e..e6d34c964d51 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -898,7 +898,7 @@ static void kvm_private_mem_notifier_handler(struct memfile_notifier *notifier,
>  	KVM_MMU_LOCK(kvm);
>  	if (kvm_unmap_gfn_range(kvm, &gfn_range))
>  		kvm_flush_remote_tlbs(kvm);
> -	kvm->mmu_notifier_seq++;
> +	kvm->memfile_notifier_seq++;
>  	KVM_MMU_UNLOCK(kvm);
>  	srcu_read_unlock(&kvm->srcu, idx);
>  }
> 
> base-commit: 333ef501c7f6c6d4ef2b7678905cad0f8ef3e271
> --
> 
> > +	else
> > +		return fault->slot &&
> > +			mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
> >  }
> >  
> >  static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > @@ -4088,7 +4144,12 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> >  		read_unlock(&vcpu->kvm->mmu_lock);
> >  	else
> >  		write_unlock(&vcpu->kvm->mmu_lock);
> > -	kvm_release_pfn_clean(fault->pfn);
> > +
> > +	if (fault->is_private)
> > +		kvm_private_mem_put_pfn(fault->slot, fault->pfn);
> 
> Why does the shmem path lock the page, and then unlock it here?

Initially this is to prevent a race between SLPT population and
truncate/punch on the fd. Without this, a gfn may become stale before
the page is populated into the SLPT. However, with the
memfile_notifier_retry mechanism, this no longer seems needed.

> 
> Same question for why this path marks it dirty?  The guest has the page mapped
> so the dirty flag is immediately stale.

I believe so.

> 
> In other words, why does KVM need to do something different for private pfns?

These two are inherited from Kirill's previous code; let's see if he has
any comments.

> 
> > +	else
> > +		kvm_release_pfn_clean(fault->pfn);
> > +
> >  	return r;
> >  }
> >  
> 
> ...
> 
> > diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> > index 7f8f1c8dbed2..1d857919a947 100644
> > --- a/arch/x86/kvm/mmu/paging_tmpl.h
> > +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> > @@ -878,7 +878,10 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> >  
> >  out_unlock:
> >  	write_unlock(&vcpu->kvm->mmu_lock);
> > -	kvm_release_pfn_clean(fault->pfn);
> > +	if (fault->is_private)
> 
> Indirect MMUs can't support private faults, i.e. this is unnecessary.

Okay.

> 
> > +		kvm_private_mem_put_pfn(fault->slot, fault->pfn);
> > +	else
> > +		kvm_release_pfn_clean(fault->pfn);
> >  	return r;
> >  }
> >  
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 3fd168972ecd..b0a7910505ed 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -2241,4 +2241,26 @@ static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu)
> >  /* Max number of entries allowed for each kvm dirty ring */
> >  #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
> >  
> > +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> > +static inline int kvm_private_mem_get_pfn(struct kvm_memory_slot *slot,
> > +					  gfn_t gfn, kvm_pfn_t *pfn, int *order)
> > +{
> > +	int ret;
> > +	pfn_t pfnt;
> > +	pgoff_t index = gfn - slot->base_gfn +
> > +			(slot->private_offset >> PAGE_SHIFT);
> > +
> > +	ret = slot->notifier.bs->get_lock_pfn(slot->private_file, index, &pfnt,
> > +						order);
> > +	*pfn = pfn_t_to_pfn(pfnt);
> > +	return ret;
> > +}
> > +
> > +static inline void kvm_private_mem_put_pfn(struct kvm_memory_slot *slot,
> > +					   kvm_pfn_t pfn)
> > +{
> > +	slot->notifier.bs->put_unlock_pfn(pfn_to_pfn_t(pfn));
> > +}
> > +#endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */
> > +
> >  #endif
> > -- 
> > 2.25.1
> > 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 7/8] KVM: Enable and expose KVM_MEM_PRIVATE
  2022-05-19 15:37 ` [PATCH v6 7/8] KVM: Enable and expose KVM_MEM_PRIVATE Chao Peng
@ 2022-06-23 22:07   ` Michael Roth
  2022-06-24  8:43     ` Chao Peng
  0 siblings, 1 reply; 58+ messages in thread
From: Michael Roth @ 2022-06-23 22:07 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, mhocko

On Thu, May 19, 2022 at 11:37:12PM +0800, Chao Peng wrote:
> Register private memslot to fd-based memory backing store and handle the
> memfile notifiers to zap the existing mappings.
> 
> Currently the registration happens at memslot creation time and the
> initial support does not include page migration/swap.
> 
> KVM_MEM_PRIVATE is not exposed by default; architecture code can turn
> it on by implementing kvm_arch_private_mem_supported().
> 
> A 'kvm' reference is added to the memslot structure since in the
> memfile_notifier callbacks we can only obtain a memslot reference,
> while kvm is needed to do the zapping. The zapping itself reuses code
> from the existing mmu notifier handling.
> 
> Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  include/linux/kvm_host.h |  10 ++-
>  virt/kvm/kvm_main.c      | 132 ++++++++++++++++++++++++++++++++++++---
>  2 files changed, 131 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index b0a7910505ed..00efb4b96bc7 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -246,7 +246,7 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
>  #endif
>  
> -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> +#if defined(KVM_ARCH_WANT_MMU_NOTIFIER) || defined(CONFIG_MEMFILE_NOTIFIER)
>  struct kvm_gfn_range {
>  	struct kvm_memory_slot *slot;
>  	gfn_t start;
> @@ -577,6 +577,7 @@ struct kvm_memory_slot {
>  	struct file *private_file;
>  	loff_t private_offset;
>  	struct memfile_notifier notifier;
> +	struct kvm *kvm;
>  };
>  
>  static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot)
> @@ -769,9 +770,13 @@ struct kvm {
>  	struct hlist_head irq_ack_notifier_list;
>  #endif
>  
> +#if (defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)) ||\
> +	defined(CONFIG_MEMFILE_NOTIFIER)
> +	unsigned long mmu_notifier_seq;
> +#endif
> +
>  #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
>  	struct mmu_notifier mmu_notifier;
> -	unsigned long mmu_notifier_seq;
>  	long mmu_notifier_count;
>  	unsigned long mmu_notifier_range_start;
>  	unsigned long mmu_notifier_range_end;
> @@ -1438,6 +1443,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
>  int kvm_arch_post_init_vm(struct kvm *kvm);
>  void kvm_arch_pre_destroy_vm(struct kvm *kvm);
>  int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> +bool kvm_arch_private_mem_supported(struct kvm *kvm);
>  
>  #ifndef __KVM_HAVE_ARCH_VM_ALLOC
>  /*
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index db9d39a2d3a6..f93ac7cdfb53 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -843,6 +843,73 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
>  
>  #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
>  
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> +static void kvm_private_mem_notifier_handler(struct memfile_notifier *notifier,
> +					     pgoff_t start, pgoff_t end)
> +{
> +	int idx;
> +	struct kvm_memory_slot *slot = container_of(notifier,
> +						    struct kvm_memory_slot,
> +						    notifier);
> +	struct kvm_gfn_range gfn_range = {
> +		.slot		= slot,
> +		.start		= start - (slot->private_offset >> PAGE_SHIFT),
> +		.end		= end - (slot->private_offset >> PAGE_SHIFT),

This code assumes that 'end' is greater than slot->private_offset, but
even if slot->private_offset is non-zero, nothing stops userspace from
allocating pages in the range of 0 through slot->private_offset, which
will still end up triggering this notifier. In that case gfn_range.end
will end up going negative, and the below code will limit that to
slot->npages and do a populate/invalidate for the entire range.

Not sure if this covers all the cases, but this fixes the issue for me:

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 903ffdb5f01c..4c744d8f7527 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -872,6 +872,19 @@ static void kvm_private_mem_notifier_handler(struct memfile_notifier *notifier,
                .may_block      = true,
        };

        struct kvm *kvm = slot->kvm;
+
+       if (slot->private_offset > end)
+               return;
+


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory
  2022-05-20 18:31     ` Sean Christopherson
  2022-05-22  4:03       ` Andy Lutomirski
  2022-05-23 13:21       ` Chao Peng
@ 2022-06-23 22:59       ` Michael Roth
  2022-06-24  8:54         ` Chao Peng
  2 siblings, 1 reply; 58+ messages in thread
From: Michael Roth @ 2022-06-23 22:59 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Andy Lutomirski, Chao Peng, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, mhocko, Nikunj A. Dadhania

On Fri, May 20, 2022 at 06:31:02PM +0000, Sean Christopherson wrote:
> On Fri, May 20, 2022, Andy Lutomirski wrote:
> > The alternative would be to have some kind of separate table or bitmap (part
> > of the memslot?) that tells KVM whether a GPA should map to the fd.
> > 
> > What do you all think?
> 
> My original proposal was to have explicit shared vs. private memslots, and punch
> holes in KVM's memslots on conversion, but due to the way KVM (and userspace)
> handle memslot updates, conversions would be painfully slow.  That's how we ended
> up with the current proposal.
> 
> But a dedicated KVM ioctl() to add/remove shared ranges would be easy to implement
> and wouldn't necessarily even need to interact with the memslots.  It could be a
> consumer of memslots, e.g. if we wanted to disallow registering regions without an
> associated memslot, but I think we'd want to avoid even that because things will
> get messy during memslot updates, e.g. if dirty logging is toggled or a shared
> memory region is temporarily removed then we wouldn't want to destroy the tracking.
> 
> I don't think we'd want to use a bitmap, e.g. for a well-behaved guest, XArray
> should be far more efficient.
> 
> One benefit to explicitly tracking this in KVM is that it might be useful for
> software-only protected VMs, e.g. KVM could mark a region in the XArray as "pending"
> based on guest hypercalls to share/unshare memory, and then complete the transaction
> when userspace invokes the ioctl() to complete the share/unshare.

Another upside to implementing a KVM ioctl is basically the reverse of the
discussion around avoiding double-allocations: *supporting* double-allocations.

One thing I noticed while testing SNP+UPM support is a fairly dramatic
slow-down with how it handles OVMF, which does some really nasty stuff
with DMA where it takes 1 or 2 pages and flips them between
shared/private on every transaction. Obviously that's not ideal and
should be fixed directly at some point, but it's something that exists in the
wild and might not be the only such instance where we need to deal with that
sort of usage pattern. 

With the current implementation, one option I had to address this was to
disable hole-punching in QEMU when doing shared->private conversions:

Boot time for a 1GB guest:
                               SNP:   32s
                           SNP+UPM: 1m43s
  SNP+UPM (disable shared discard): 1m08s

Of course, we don't have the option of disabling discard/hole-punching
for private memory to see if we get similar gains there, since that also
doubles as the interface for doing private->shared conversions. A separate
KVM ioctl to decouple these 2 things would allow for that, and allow for a
way for userspace to implement things like batched/lazy-discard of
previously-converted pages to deal with cases like these.

Another motivator for this separate ioctl is that, since we're considering
'out-of-band' interactions with private memfd where userspace might
erroneously/inadvertently do things like double allocations, another thing it
might do is pre-allocating pages in the private memfd prior to associating
the memfd with a private memslot. Since the notifiers aren't registered until
that point, any associated callbacks that would normally need to be done as
part of those fallocate() notifications would be missed unless we do something
like 'replay' all the notifications once the private memslot is registered and
associated with a memfile notifier. But that seems a bit ugly, and I'm not
sure how well that would work. This also seems to hint at this additional
'conversion' state being something that should be owned and managed directly
by KVM rather than hooking into the allocations.

It would also nicely solve the question of how to handle in-place
encryption, since unlike userspace, KVM is perfectly capable of copying
data from shared->private prior to conversion / guest start, and
disallowing such things afterward. Would just need an extra flag basically.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 6/8] KVM: Handle page fault for private memory
  2022-05-19 15:37 ` [PATCH v6 6/8] KVM: Handle page fault for private memory Chao Peng
  2022-06-17 21:30   ` Sean Christopherson
@ 2022-06-24  3:58   ` Nikunj A. Dadhania
  2022-06-24  9:02     ` Chao Peng
  1 sibling, 1 reply; 58+ messages in thread
From: Nikunj A. Dadhania @ 2022-06-24  3:58 UTC (permalink / raw)
  To: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko

On 5/19/2022 9:07 PM, Chao Peng wrote:
> A page fault can carry the information of whether the access is private
> or not for a KVM_MEM_PRIVATE memslot; this can be filled by architecture
> code (like TDX code). To handle a page fault for such access, KVM maps the
> page only when this private property matches the host's view of this page,
> which can be decided by checking whether the corresponding page is
> populated in the private fd or not. A page is considered private when
> the page is populated in the private fd; otherwise it's shared.
> 
> For a successful match, private pfn is obtained with memfile_notifier
> callbacks from private fd and shared pfn is obtained with existing
> get_user_pages.
> 
> For a failed match, KVM causes a KVM_EXIT_MEMORY_FAULT exit to
> userspace. Userspace then can convert memory between private/shared from
> host's view then retry the access.
> 
> Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  arch/x86/kvm/mmu.h              |  1 +
>  arch/x86/kvm/mmu/mmu.c          | 70 +++++++++++++++++++++++++++++++--
>  arch/x86/kvm/mmu/mmu_internal.h | 17 ++++++++
>  arch/x86/kvm/mmu/mmutrace.h     |  1 +
>  arch/x86/kvm/mmu/paging_tmpl.h  |  5 ++-
>  include/linux/kvm_host.h        | 22 +++++++++++
>  6 files changed, 112 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 7e258cc94152..c84835762249 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -176,6 +176,7 @@ struct kvm_page_fault {
>  
>  	/* Derived from mmu and global state.  */
>  	const bool is_tdp;
> +	const bool is_private;
>  	const bool nx_huge_page_workaround_enabled;
>  
>  	/*
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index afe18d70ece7..e18460e0d743 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2899,6 +2899,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
>  	if (max_level == PG_LEVEL_4K)
>  		return PG_LEVEL_4K;
>  
> +	if (kvm_slot_is_private(slot))
> +		return max_level;

Can you explain the rationale behind the above change? 
AFAIU, this overrides the transparent_hugepage=never setting for both 
shared and private mappings.

>  	host_level = host_pfn_mapping_level(kvm, gfn, pfn, slot);
>  	return min(host_level, max_level);
>  }

Regards
Nikunj

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 7/8] KVM: Enable and expose KVM_MEM_PRIVATE
  2022-06-23 22:07   ` Michael Roth
@ 2022-06-24  8:43     ` Chao Peng
  0 siblings, 0 replies; 58+ messages in thread
From: Chao Peng @ 2022-06-24  8:43 UTC (permalink / raw)
  To: Michael Roth
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, mhocko

On Thu, Jun 23, 2022 at 05:07:51PM -0500, Michael Roth wrote:
...
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index db9d39a2d3a6..f93ac7cdfb53 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -843,6 +843,73 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
> >  
> >  #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
> >  
> > +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> > +static void kvm_private_mem_notifier_handler(struct memfile_notifier *notifier,
> > +					     pgoff_t start, pgoff_t end)
> > +{
> > +	int idx;
> > +	struct kvm_memory_slot *slot = container_of(notifier,
> > +						    struct kvm_memory_slot,
> > +						    notifier);
> > +	struct kvm_gfn_range gfn_range = {
> > +		.slot		= slot,
> > +		.start		= start - (slot->private_offset >> PAGE_SHIFT),
> > +		.end		= end - (slot->private_offset >> PAGE_SHIFT),
> 
> This code assumes that 'end' is greater than slot->private_offset, but
> even if slot->private_offset is non-zero, nothing stops userspace from
> allocating pages in the range of 0 through slot->private_offset, which
> will still end up triggering this notifier. In that case gfn_range.end
> will end up going negative, and the below code will limit that to
> slot->npages and do a populate/invalidate for the entire range.
> 
> Not sure if this covers all the cases, but this fixes the issue for me:

Right, already noticed this issue, will fix in next version. Thanks.

> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 903ffdb5f01c..4c744d8f7527 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -872,6 +872,19 @@ static void kvm_private_mem_notifier_handler(struct memfile_notifier *notifier,
>                 .may_block      = true,
>         };
> 
>         struct kvm *kvm = slot->kvm;
> +
> +       if (slot->private_offset > end)
> +               return;
> +
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory
  2022-06-23 22:59       ` Michael Roth
@ 2022-06-24  8:54         ` Chao Peng
  2022-06-24 13:01           ` Michael Roth
  0 siblings, 1 reply; 58+ messages in thread
From: Chao Peng @ 2022-06-24  8:54 UTC (permalink / raw)
  To: Michael Roth
  Cc: Sean Christopherson, Andy Lutomirski, kvm, linux-kernel,
	linux-mm, linux-fsdevel, linux-api, linux-doc, qemu-devel,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, jun.nakajima, dave.hansen, ak,
	david, aarcange, ddutile, dhildenb, Quentin Perret, mhocko,
	Nikunj A. Dadhania

On Thu, Jun 23, 2022 at 05:59:49PM -0500, Michael Roth wrote:
> On Fri, May 20, 2022 at 06:31:02PM +0000, Sean Christopherson wrote:
> > On Fri, May 20, 2022, Andy Lutomirski wrote:
> > > The alternative would be to have some kind of separate table or bitmap (part
> > > of the memslot?) that tells KVM whether a GPA should map to the fd.
> > > 
> > > What do you all think?
> > 
> > My original proposal was to have explicit shared vs. private memslots, and punch
> > holes in KVM's memslots on conversion, but due to the way KVM (and userspace)
> > handle memslot updates, conversions would be painfully slow.  That's how we ended
> > up with the current proposal.
> > 
> > But a dedicated KVM ioctl() to add/remove shared ranges would be easy to implement
> > and wouldn't necessarily even need to interact with the memslots.  It could be a
> > consumer of memslots, e.g. if we wanted to disallow registering regions without an
> > associated memslot, but I think we'd want to avoid even that because things will
> > get messy during memslot updates, e.g. if dirty logging is toggled or a shared
> > memory region is temporarily removed then we wouldn't want to destroy the tracking.
> > 
> > I don't think we'd want to use a bitmap, e.g. for a well-behaved guest, XArray
> > should be far more efficient.
> > 
> > One benefit to explicitly tracking this in KVM is that it might be useful for
> > software-only protected VMs, e.g. KVM could mark a region in the XArray as "pending"
> > based on guest hypercalls to share/unshare memory, and then complete the transaction
> > when userspace invokes the ioctl() to complete the share/unshare.
> 
> Another upside to implementing a KVM ioctl is basically the reverse of the
> discussion around avoiding double-allocations: *supporting* double-allocations.
> 
> One thing I noticed while testing SNP+UPM support is a fairly dramatic
> slow-down with how it handles OVMF, which does some really nasty stuff
> with DMA where it takes 1 or 2 pages and flips them between
> shared/private on every transaction. Obviously that's not ideal and
> should be fixed directly at some point, but it's something that exists in the
> wild and might not be the only such instance where we need to deal with that
> sort of usage pattern. 
> 
> With the current implementation, one option I had to address this was to
> disable hole-punching in QEMU when doing shared->private conversions:
> 
> Boot time for a 1GB guest:
>                                SNP:   32s
>                            SNP+UPM: 1m43s
>   SNP+UPM (disable shared discard): 1m08s
> 
> Of course, we don't have the option of disabling discard/hole-punching
> for private memory to see if we get similar gains there, since that also
> doubles as the interface for doing private->shared conversions.

Private should be the same: minus the time consumed for private memory,
the data should be close to the SNP case. You can't try that in the
current version because we rely on the existence of the private page to
tell whether a page is private.

> A separate
> KVM ioctl to decouple these 2 things would allow for that, and allow for a
> way for userspace to implement things like batched/lazy-discard of
> previously-converted pages to deal with cases like these.

The planned ioctl includes two responsibilities:
  - Mark the range as private/shared
  - Zap the existing SLPT mapping for the range

Whether or not to do the hole-punching on the fd is unrelated to this
ioctl; userspace has the freedom to do that or not. Since we don't rely on
the fact that private memory should have been allocated, we can support
lazy faulting and don't need an explicit fallocate(). That means whether
the memory is discarded or not in the memory backing store is not
required by KVM, but is a userspace option.
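
Just to illustrate the shape of it (everything below is a hypothetical
sketch, not an ABI this series defines -- the struct, the flag and the
ioctl number are all made up):

struct kvm_private_mem_region {
	__u64 gpa;		/* start of the range, page aligned */
	__u64 size;		/* size in bytes, page aligned */
	__u64 flags;		/* e.g. bit 0 = private, clear = shared */
};

#define KVM_SET_PRIVATE_MEM_REGION	_IOW(KVMIO, 0xd0, struct kvm_private_mem_region)

On each call KVM would record the range in its own private/shared
tracking and zap the existing SLPT mappings for it; whether userspace
also punches a hole in the fd is then purely its own choice.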

> 
> Another motivator for this separate ioctl is that, since we're considering
> 'out-of-band' interactions with private memfd where userspace might
> erroneously/inadvertently do things like double allocations, another thing it
> might do is pre-allocating pages in the private memfd prior to associating
> the memfd with a private memslot. Since the notifiers aren't registered until
> that point, any associated callbacks that would normally need to be done as
> part of those fallocate() notifications would be missed unless we do something
> like 'replay' all the notifications once the private memslot is registered and
> associated with a memfile notifier. But that seems a bit ugly, and I'm not
> sure how well that would work. This also seems to hint at this additional
> 'conversion' state being something that should be owned and managed directly
> by KVM rather than hooking into the allocations.

Right, once we move the private/shared state into KVM then we don't rely
on those callbacks, so the 'replay' thing is unneeded. The fallocate()
notification is useless for sure; invalidate() is likely still needed,
just like the mmu_notifier invalidate, to bump the mmu_seq and do the
zap.

> 
> It would also nicely solve the question of how to handle in-place
> encryption, since unlike userspace, KVM is perfectly capable of copying
> data from shared->private prior to conversion / guest start, and
> disallowing such things afterward. Would just need an extra flag basically.

Agreed, it's possible to do an additional copy during the conversion, but I'm
not so confident this is urgent or the right API. Currently TDX does
not have this need. Maybe as the first step just add the conversion
itself. Adding an additional feature like this is always possible
once we are clear on the need.

Thanks,
Chao

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 6/8] KVM: Handle page fault for private memory
  2022-06-24  3:58   ` Nikunj A. Dadhania
@ 2022-06-24  9:02     ` Chao Peng
  2022-06-30 19:14       ` Vishal Annapurve
  0 siblings, 1 reply; 58+ messages in thread
From: Chao Peng @ 2022-06-24  9:02 UTC (permalink / raw)
  To: Nikunj A. Dadhania
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko

On Fri, Jun 24, 2022 at 09:28:23AM +0530, Nikunj A. Dadhania wrote:
> On 5/19/2022 9:07 PM, Chao Peng wrote:
> > A page fault can carry the information of whether the access is private
> > or not for a KVM_MEM_PRIVATE memslot; this can be filled by architecture
> > code (like TDX code). To handle a page fault for such access, KVM maps the
> > page only when this private property matches the host's view of this page,
> > which can be decided by checking whether the corresponding page is
> > populated in the private fd or not. A page is considered private when
> > the page is populated in the private fd; otherwise it's shared.
> > 
> > For a successful match, private pfn is obtained with memfile_notifier
> > callbacks from private fd and shared pfn is obtained with existing
> > get_user_pages.
> > 
> > For a failed match, KVM causes a KVM_EXIT_MEMORY_FAULT exit to
> > userspace. Userspace then can convert memory between private/shared from
> > host's view then retry the access.
> > 
> > Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> >  arch/x86/kvm/mmu.h              |  1 +
> >  arch/x86/kvm/mmu/mmu.c          | 70 +++++++++++++++++++++++++++++++--
> >  arch/x86/kvm/mmu/mmu_internal.h | 17 ++++++++
> >  arch/x86/kvm/mmu/mmutrace.h     |  1 +
> >  arch/x86/kvm/mmu/paging_tmpl.h  |  5 ++-
> >  include/linux/kvm_host.h        | 22 +++++++++++
> >  6 files changed, 112 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> > index 7e258cc94152..c84835762249 100644
> > --- a/arch/x86/kvm/mmu.h
> > +++ b/arch/x86/kvm/mmu.h
> > @@ -176,6 +176,7 @@ struct kvm_page_fault {
> >  
> >  	/* Derived from mmu and global state.  */
> >  	const bool is_tdp;
> > +	const bool is_private;
> >  	const bool nx_huge_page_workaround_enabled;
> >  
> >  	/*
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index afe18d70ece7..e18460e0d743 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -2899,6 +2899,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> >  	if (max_level == PG_LEVEL_4K)
> >  		return PG_LEVEL_4K;
> >  
> > +	if (kvm_slot_is_private(slot))
> > +		return max_level;
> 
> Can you explain the rationale behind the above change? 
> AFAIU, this overrides the transparent_hugepage=never setting for both 
> shared and private mappings.

As Sean pointed out, this should check against fault->is_private instead
of the slot. For a private fault, the level is retrieved and stored into
fault->max_level in kvm_faultin_pfn_private() instead of here.

For a shared fault, it will continue to query host_level below. For a
private fault, the host level has already been accounted for in
kvm_faultin_pfn_private().

Chao
> 
> >  	host_level = host_pfn_mapping_level(kvm, gfn, pfn, slot);
> >  	return min(host_level, max_level);
> >  }
> 
> Regards
> Nikunj

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory
  2022-06-24  8:54         ` Chao Peng
@ 2022-06-24 13:01           ` Michael Roth
  0 siblings, 0 replies; 58+ messages in thread
From: Michael Roth @ 2022-06-24 13:01 UTC (permalink / raw)
  To: Chao Peng
  Cc: Sean Christopherson, Andy Lutomirski, kvm, linux-kernel,
	linux-mm, linux-fsdevel, linux-api, linux-doc, qemu-devel,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, jun.nakajima, dave.hansen, ak,
	david, aarcange, ddutile, dhildenb, Quentin Perret, mhocko,
	Nikunj A. Dadhania

On Fri, Jun 24, 2022 at 04:54:26PM +0800, Chao Peng wrote:
> On Thu, Jun 23, 2022 at 05:59:49PM -0500, Michael Roth wrote:
> > On Fri, May 20, 2022 at 06:31:02PM +0000, Sean Christopherson wrote:
> > > On Fri, May 20, 2022, Andy Lutomirski wrote:
> > > > The alternative would be to have some kind of separate table or bitmap (part
> > > > of the memslot?) that tells KVM whether a GPA should map to the fd.
> > > > 
> > > > What do you all think?
> > > 
> > > My original proposal was to have explicit shared vs. private memslots, and punch
> > > holes in KVM's memslots on conversion, but due to the way KVM (and userspace)
> > > handle memslot updates, conversions would be painfully slow.  That's how we ended
> > > up with the current proposal.
> > > 
> > > But a dedicated KVM ioctl() to add/remove shared ranges would be easy to implement
> > > and wouldn't necessarily even need to interact with the memslots.  It could be a
> > > consumer of memslots, e.g. if we wanted to disallow registering regions without an
> > > associated memslot, but I think we'd want to avoid even that because things will
> > > get messy during memslot updates, e.g. if dirty logging is toggled or a shared
> > > memory region is temporarily removed then we wouldn't want to destroy the tracking.
> > > 
> > > I don't think we'd want to use a bitmap, e.g. for a well-behaved guest, XArray
> > > should be far more efficient.
> > > 
> > > One benefit to explicitly tracking this in KVM is that it might be useful for
> > > software-only protected VMs, e.g. KVM could mark a region in the XArray as "pending"
> > > based on guest hypercalls to share/unshare memory, and then complete the transaction
> > > when userspace invokes the ioctl() to complete the share/unshare.
> > 
> > Another upside to implementing a KVM ioctl is basically the reverse of the
> > discussion around avoiding double-allocations: *supporting* double-allocations.
> > 
> > One thing I noticed while testing SNP+UPM support is a fairly dramatic
> > slow-down with how it handles OVMF, which does some really nasty stuff
> > with DMA where it takes 1 or 2 pages and flips them between
> > shared/private on every transaction. Obviously that's not ideal and
> > should be fixed directly at some point, but it's something that exists in the
> > wild and might not be the only such instance where we need to deal with that
> > sort of usage pattern. 
> > 
> > With the current implementation, one option I had to address this was to
> > disable hole-punching in QEMU when doing shared->private conversions:
> > 
> > Boot time for a 1GB guest:
> >                                SNP:   32s
> >                            SNP+UPM: 1m43s
> >   SNP+UPM (disable shared discard): 1m08s
> > 
> > Of course, we don't have the option of disabling discard/hole-punching
> > for private memory to see if we get similar gains there, since that also
> > doubles as the interface for doing private->shared conversions.
> 
> Private should be the same: minus the time consumed for private memory,
> the data should be close to the SNP case. You can't try that in the
> current version because we rely on the existence of the private page to
> tell whether a page is private.
> 
> > A separate
> > KVM ioctl to decouple these 2 things would allow for that, and allow for a
> > way for userspace to implement things like batched/lazy-discard of
> > previously-converted pages to deal with cases like these.
> 
> The planned ioctl includes two responsibilities:
>   - Mark the range as private/shared
>   - Zap the existing SLPT mapping for the range
> 
> Whether or not to do the hole-punching on the fd is unrelated to this
> ioctl; userspace has the freedom to do that or not. Since we don't rely on
> the fact that private memory should have been allocated, we can support
> lazy faulting and don't need an explicit fallocate(). That means whether
> the memory is discarded or not in the memory backing store is not
> required by KVM, but is a userspace option.

Nice, that sounds promising.

> 
> > 
> > Another motivator for this separate ioctl is that, since we're considering
> > 'out-of-band' interactions with private memfd where userspace might
> > erroneously/inadvertently do things like double allocations, another thing it
> > might do is pre-allocating pages in the private memfd prior to associating
> > the memfd with a private memslot. Since the notifiers aren't registered until
> > that point, any associated callbacks that would normally need to be done as
> > part of those fallocate() notifications would be missed unless we do something
> > like 'replay' all the notifications once the private memslot is registered and
> > associated with a memfile notifier. But that seems a bit ugly, and I'm not
> > sure how well that would work. This also seems to hint at this additional
> > 'conversion' state being something that should be owned and managed directly
> > by KVM rather than hooking into the allocations.
> 
> Right, once we move the private/shared state into KVM then we don't rely
> on those callbacks, so the 'replay' thing is unneeded. The fallocate()
> notification is useless for sure; invalidate() is likely still needed,
> just like the mmu_notifier invalidate, to bump the mmu_seq and do the
> zap.

Ok, yah, makes sense that we'd still end up needing the invalidation hooks.

> 
> > 
> > It would also nicely solve the question of how to handle in-place
> > encryption, since unlike userspace, KVM is perfectly capable of copying
> > data from shared->private prior to conversion / guest start, and
> > disallowing such things afterward. Would just need an extra flag basically.
> 
> Agreed, it's possible to do an additional copy during the conversion, but I'm
> not so confident this is urgent or the right API. Currently TDX does
> not have this need. Maybe as the first step just add the conversion
> itself. Adding an additional feature like this is always possible
> once we are clear on the need.

That seems fair. In the meantime we can adopt the approach proposed by
Sean and Vishal[1] and handle it directly in the relevant SNP KVM ioctls.

If we end up keeping that approach we'll probably want to make sure these
KVM-driven 'implicit' conversions are documented in the KVM/SNP API so that
userspace can account for them in its view of what's private/shared. In this
case at least it's pretty obvious; I'm just thinking of when other archs and
VMMs start utilizing this more.

Thanks!

-Mike

[1] https://lore.kernel.org/kvm/20220524205646.1798325-4-vannapurve@google.com/T/#m1e9bb782b1bea66c36ae7c4c9f4f0c35c2d7e338

> 
> Thanks,
> Chao

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 6/8] KVM: Handle page fault for private memory
  2022-06-24  9:02     ` Chao Peng
@ 2022-06-30 19:14       ` Vishal Annapurve
  2022-06-30 22:21         ` Michael Roth
  0 siblings, 1 reply; 58+ messages in thread
From: Vishal Annapurve @ 2022-06-30 19:14 UTC (permalink / raw)
  To: Chao Peng
  Cc: Nikunj A. Dadhania, kvm list, LKML, linux-mm, linux-fsdevel,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Yu Zhang, Kirill A . Shutemov, Andy Lutomirski,
	Jun Nakajima, Dave Hansen, Andi Kleen, David Hildenbrand,
	aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko

...
> > >     /*
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index afe18d70ece7..e18460e0d743 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -2899,6 +2899,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> > >     if (max_level == PG_LEVEL_4K)
> > >             return PG_LEVEL_4K;
> > >
> > > +   if (kvm_slot_is_private(slot))
> > > +           return max_level;
> >
> > Can you explain the rationale behind the above change?
> > AFAIU, this overrides the transparent_hugepage=never setting for both
> > shared and private mappings.
>
> As Sean pointed out, this should check against fault->is_private instead
> of the slot. For a private fault, the level is retrieved and stored into
> fault->max_level in kvm_faultin_pfn_private() instead of here.
>
> For a shared fault, it will continue to query host_level below. For a
> private fault, the host level has already been accounted for in
> kvm_faultin_pfn_private().
>
> Chao
> >

With transparent_hugepages=always setting I see issues with the
current implementation.

Scenario:
1) Guest accesses a gfn range 0x800-0xa00 as private
2) Guest calls mapgpa to convert the range 0x84d-0x86e as shared
3) Guest tries to access recently converted memory as shared for the first time
Guest VM shutdown is observed after step 3 -> Guest is unable to
proceed further since somehow code section is not as expected

Corresponding KVM trace logs after step 3:
VCPU-0-61883   [078] ..... 72276.115679: kvm_page_fault: address
84d000 error_code 4
VCPU-0-61883   [078] ..... 72276.127005: kvm_mmu_spte_requested: gfn
84d pfn 100b4a4d level 2
VCPU-0-61883   [078] ..... 72276.127008: kvm_tdp_mmu_spte_changed: as
id 0 gfn 800 level 2 old_spte 100b1b16827 new_spte 100b4a00ea7
VCPU-0-61883   [078] ..... 72276.127009: kvm_mmu_prepare_zap_page: sp
gen 0 gfn 800 l1 8-byte q0 direct wux nxe ad root 0 sync
VCPU-0-61883   [078] ..... 72276.127009: kvm_tdp_mmu_spte_changed: as
id 0 gfn 800 level 1 old_spte 1003eb27e67 new_spte 5a0
VCPU-0-61883   [078] ..... 72276.127010: kvm_tdp_mmu_spte_changed: as
id 0 gfn 801 level 1 old_spte 10056cc8e67 new_spte 5a0
VCPU-0-61883   [078] ..... 72276.127010: kvm_tdp_mmu_spte_changed: as
id 0 gfn 802 level 1 old_spte 10056fa2e67 new_spte 5a0
VCPU-0-61883   [078] ..... 72276.127010: kvm_tdp_mmu_spte_changed: as
id 0 gfn 803 level 1 old_spte 0 new_spte 5a0
....
 VCPU-0-61883   [078] ..... 72276.127089: kvm_tdp_mmu_spte_changed: as
id 0 gfn 9ff level 1 old_spte 100a43f4e67 new_spte 5a0
 VCPU-0-61883   [078] ..... 72276.127090: kvm_mmu_set_spte: gfn 800
spte 100b4a00ea7 (rwxu) level 2 at 10052fa5020
 VCPU-0-61883   [078] ..... 72276.127091: kvm_fpu: unload

Looks like with transparent huge pages enabled kvm tried to handle the
shared memory fault on 0x84d gfn by coalescing nearby 4K pages
to form a contiguous 2MB page mapping at gfn 0x800, since level 2 was
requested in kvm_mmu_spte_requested.
This caused the private memory contents from regions 0x800-0x84c and
0x86e-0xa00 to get unmapped from the guest leading to guest vm
shutdown.

Does getting the mapping level as per the fault access type help
address the above issue? Any such coalescing should not cross between
private and shared memory regions.

> > >     host_level = host_pfn_mapping_level(kvm, gfn, pfn, slot);
> > >     return min(host_level, max_level);
> > >  }
> >

Regards,
Vishal

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 6/8] KVM: Handle page fault for private memory
  2022-06-30 19:14       ` Vishal Annapurve
@ 2022-06-30 22:21         ` Michael Roth
  2022-07-01  1:21           ` Xiaoyao Li
  0 siblings, 1 reply; 58+ messages in thread
From: Michael Roth @ 2022-06-30 22:21 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Chao Peng, Nikunj A. Dadhania, kvm list, LKML, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka, Yu Zhang,
	Kirill A . Shutemov, Andy Lutomirski, Jun Nakajima, Dave Hansen,
	Andi Kleen, David Hildenbrand, aarcange, ddutile, dhildenb,
	Quentin Perret, mhocko

On Thu, Jun 30, 2022 at 12:14:13PM -0700, Vishal Annapurve wrote:
> With transparent_hugepages=always setting I see issues with the
> current implementation.
> 
> Scenario:
> 1) Guest accesses a gfn range 0x800-0xa00 as private
> 2) Guest calls mapgpa to convert the range 0x84d-0x86e as shared
> 3) Guest tries to access recently converted memory as shared for the first time
> Guest VM shutdown is observed after step 3 -> Guest is unable to
> proceed further since somehow code section is not as expected
> 
> Corresponding KVM trace logs after step 3:
> VCPU-0-61883   [078] ..... 72276.115679: kvm_page_fault: address
> 84d000 error_code 4
> VCPU-0-61883   [078] ..... 72276.127005: kvm_mmu_spte_requested: gfn
> 84d pfn 100b4a4d level 2
> VCPU-0-61883   [078] ..... 72276.127008: kvm_tdp_mmu_spte_changed: as
> id 0 gfn 800 level 2 old_spte 100b1b16827 new_spte 100b4a00ea7
> VCPU-0-61883   [078] ..... 72276.127009: kvm_mmu_prepare_zap_page: sp
> gen 0 gfn 800 l1 8-byte q0 direct wux nxe ad root 0 sync
> VCPU-0-61883   [078] ..... 72276.127009: kvm_tdp_mmu_spte_changed: as
> id 0 gfn 800 level 1 old_spte 1003eb27e67 new_spte 5a0
> VCPU-0-61883   [078] ..... 72276.127010: kvm_tdp_mmu_spte_changed: as
> id 0 gfn 801 level 1 old_spte 10056cc8e67 new_spte 5a0
> VCPU-0-61883   [078] ..... 72276.127010: kvm_tdp_mmu_spte_changed: as
> id 0 gfn 802 level 1 old_spte 10056fa2e67 new_spte 5a0
> VCPU-0-61883   [078] ..... 72276.127010: kvm_tdp_mmu_spte_changed: as
> id 0 gfn 803 level 1 old_spte 0 new_spte 5a0
> ....
>  VCPU-0-61883   [078] ..... 72276.127089: kvm_tdp_mmu_spte_changed: as
> id 0 gfn 9ff level 1 old_spte 100a43f4e67 new_spte 5a0
>  VCPU-0-61883   [078] ..... 72276.127090: kvm_mmu_set_spte: gfn 800
> spte 100b4a00ea7 (rwxu) level 2 at 10052fa5020
>  VCPU-0-61883   [078] ..... 72276.127091: kvm_fpu: unload
> 
> Looks like with transparent huge pages enabled kvm tried to handle the
> shared memory fault on 0x84d gfn by coalescing nearby 4K pages
> to form a contiguous 2MB page mapping at gfn 0x800, since level 2 was
> requested in kvm_mmu_spte_requested.
> This caused the private memory contents from regions 0x800-0x84c and
> 0x86e-0xa00 to get unmapped from the guest leading to guest vm
> shutdown.

Interesting... seems like that wouldn't be an issue for non-UPM SEV, since
the private pages would still be mapped as part of that 2M mapping, and
it's completely up to the guest as to whether it wants to access as
private or shared. But for UPM it makes sense this would cause issues.

> 
> Does getting the mapping level as per the fault access type help
> address the above issue? Any such coalescing should not cross between
> private and shared memory regions.

Doesn't seem like changing the check to fault->is_private would help in
your particular case, since the subsequent host_pfn_mapping_level() call
only seems to limit the mapping level to whatever the mapping level is
for the HVA in the host page table.

Seems like with UPM we need some additional handling here that also
checks that the entire 2M HVA range is backed by non-private memory.

Non-UPM SNP hypervisor patches already have a similar hook added to
host_pfn_mapping_level() which implements such a check via RMP table, so
UPM might need something similar:

  https://github.com/AMDESE/linux/commit/ae4475bc740eb0b9d031a76412b0117339794139

-Mike

> 
> > > >     host_level = host_pfn_mapping_level(kvm, gfn, pfn, slot);
> > > >     return min(host_level, max_level);
> > > >  }
> > >
> 
> Regards,
> Vishal

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 6/8] KVM: Handle page fault for private memory
  2022-06-30 22:21         ` Michael Roth
@ 2022-07-01  1:21           ` Xiaoyao Li
  2022-07-07 20:08             ` Sean Christopherson
  0 siblings, 1 reply; 58+ messages in thread
From: Xiaoyao Li @ 2022-07-01  1:21 UTC (permalink / raw)
  To: Michael Roth, Vishal Annapurve
  Cc: Chao Peng, Nikunj A. Dadhania, kvm list, LKML, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka, Yu Zhang,
	Kirill A . Shutemov, Andy Lutomirski, Jun Nakajima, Dave Hansen,
	Andi Kleen, David Hildenbrand, aarcange, ddutile, dhildenb,
	Quentin Perret, mhocko

On 7/1/2022 6:21 AM, Michael Roth wrote:
> On Thu, Jun 30, 2022 at 12:14:13PM -0700, Vishal Annapurve wrote:
>> With transparent_hugepages=always setting I see issues with the
>> current implementation.
>>
>> Scenario:
>> 1) Guest accesses a gfn range 0x800-0xa00 as private
>> 2) Guest calls mapgpa to convert the range 0x84d-0x86e as shared
>> 3) Guest tries to access recently converted memory as shared for the first time
>> Guest VM shutdown is observed after step 3 -> Guest is unable to
>> proceed further since somehow code section is not as expected
>>
>> Corresponding KVM trace logs after step 3:
>> VCPU-0-61883   [078] ..... 72276.115679: kvm_page_fault: address
>> 84d000 error_code 4
>> VCPU-0-61883   [078] ..... 72276.127005: kvm_mmu_spte_requested: gfn
>> 84d pfn 100b4a4d level 2
>> VCPU-0-61883   [078] ..... 72276.127008: kvm_tdp_mmu_spte_changed: as
>> id 0 gfn 800 level 2 old_spte 100b1b16827 new_spte 100b4a00ea7
>> VCPU-0-61883   [078] ..... 72276.127009: kvm_mmu_prepare_zap_page: sp
>> gen 0 gfn 800 l1 8-byte q0 direct wux nxe ad root 0 sync
>> VCPU-0-61883   [078] ..... 72276.127009: kvm_tdp_mmu_spte_changed: as
>> id 0 gfn 800 level 1 old_spte 1003eb27e67 new_spte 5a0
>> VCPU-0-61883   [078] ..... 72276.127010: kvm_tdp_mmu_spte_changed: as
>> id 0 gfn 801 level 1 old_spte 10056cc8e67 new_spte 5a0
>> VCPU-0-61883   [078] ..... 72276.127010: kvm_tdp_mmu_spte_changed: as
>> id 0 gfn 802 level 1 old_spte 10056fa2e67 new_spte 5a0
>> VCPU-0-61883   [078] ..... 72276.127010: kvm_tdp_mmu_spte_changed: as
>> id 0 gfn 803 level 1 old_spte 0 new_spte 5a0
>> ....
>>   VCPU-0-61883   [078] ..... 72276.127089: kvm_tdp_mmu_spte_changed: as
>> id 0 gfn 9ff level 1 old_spte 100a43f4e67 new_spte 5a0
>>   VCPU-0-61883   [078] ..... 72276.127090: kvm_mmu_set_spte: gfn 800
>> spte 100b4a00ea7 (rwxu) level 2 at 10052fa5020
>>   VCPU-0-61883   [078] ..... 72276.127091: kvm_fpu: unload
>>
>> Looks like with transparent huge pages enabled kvm tried to handle the
>> shared memory fault on 0x84d gfn by coalescing nearby 4K pages
>> to form a contiguous 2MB page mapping at gfn 0x800, since level 2 was
>> requested in kvm_mmu_spte_requested.
>> This caused the private memory contents from regions 0x800-0x84c and
>> 0x86e-0xa00 to get unmapped from the guest leading to guest vm
>> shutdown.
> 
> Interesting... seems like that wouldn't be an issue for non-UPM SEV, since
> the private pages would still be mapped as part of that 2M mapping, and
> it's completely up to the guest as to whether it wants to access as
> private or shared. But for UPM it makes sense this would cause issues.
> 
>>
>> Does getting the mapping level as per the fault access type help
>> address the above issue? Any such coalescing should not cross between
>> private and shared memory regions.
> 
> Doesn't seem like changing the check to fault->is_private would help in
> your particular case, since the subsequent host_pfn_mapping_level() call
> only seems to limit the mapping level to whatever the mapping level is
> for the HVA in the host page table.
> 
> Seems like with UPM we need some additional handling here that also
> checks that the entire 2M HVA range is backed by non-private memory.
> 
> Non-UPM SNP hypervisor patches already have a similar hook added to
> host_pfn_mapping_level() which implements such a check via RMP table, so
> UPM might need something similar:
> 
>    https://github.com/AMDESE/linux/commit/ae4475bc740eb0b9d031a76412b0117339794139
> 
> -Mike
> 

For TDX, we try to track the page type (shared, private, mixed) of each
gfn at a given level. Only when the type is shared/private can it be
mapped at that level. When it's mixed, i.e., it contains both shared
pages and private pages at the given level, it has to go to the next smaller level.

https://github.com/intel/tdx/commit/ed97f4042eb69a210d9e972ccca6a84234028cad
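
The core of the idea looks roughly like the following (just a sketch to
illustrate; is_gfn_private() stands in for whatever per-gfn tracking is
used, and the real patch works on slot->arch.page_attr):

static int adjust_mapping_level(struct kvm_memory_slot *slot, gfn_t gfn,
				bool fault_is_private, int level)
{
	while (level > PG_LEVEL_4K) {
		gfn_t nr = KVM_PAGES_PER_HPAGE(level);
		gfn_t base = gfn & ~(nr - 1);
		bool mixed = false;
		gfn_t i;

		/* Linear scan for clarity; the real code keeps per-level state. */
		for (i = 0; i < nr; i++) {
			if (is_gfn_private(slot, base + i) != fault_is_private) {
				mixed = true;	/* block contains both types */
				break;
			}
		}

		if (!mixed)
			break;		/* uniform block, safe to map at this level */
		level--;		/* mixed, fall back to the next smaller level */
	}

	return level;
}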



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 6/8] KVM: Handle page fault for private memory
  2022-07-01  1:21           ` Xiaoyao Li
@ 2022-07-07 20:08             ` Sean Christopherson
  2022-07-08  3:29               ` Xiaoyao Li
  0 siblings, 1 reply; 58+ messages in thread
From: Sean Christopherson @ 2022-07-07 20:08 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Michael Roth, Vishal Annapurve, Chao Peng, Nikunj A. Dadhania,
	kvm list, LKML, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka, Yu Zhang,
	Kirill A . Shutemov, Andy Lutomirski, Jun Nakajima, Dave Hansen,
	Andi Kleen, David Hildenbrand, aarcange, ddutile, dhildenb,
	Quentin Perret, mhocko

On Fri, Jul 01, 2022, Xiaoyao Li wrote:
> On 7/1/2022 6:21 AM, Michael Roth wrote:
> > On Thu, Jun 30, 2022 at 12:14:13PM -0700, Vishal Annapurve wrote:
> > > With transparent_hugepages=always setting I see issues with the
> > > current implementation.

...

> > > Looks like with transparent huge pages enabled kvm tried to handle the
> > > shared memory fault on 0x84d gfn by coalescing nearby 4K pages
> > > to form a contiguous 2MB page mapping at gfn 0x800, since level 2 was
> > > requested in kvm_mmu_spte_requested.
> > > This caused the private memory contents from regions 0x800-0x84c and
> > > 0x86e-0xa00 to get unmapped from the guest leading to guest vm
> > > shutdown.
> > 
> > Interesting... seems like that wouldn't be an issue for non-UPM SEV, since
> > the private pages would still be mapped as part of that 2M mapping, and
> > it's completely up to the guest as to whether it wants to access as
> > private or shared. But for UPM it makes sense this would cause issues.
> > 
> > > 
> > > Does getting the mapping level as per the fault access type help
> > > address the above issue? Any such coalescing should not cross between
> > > private and shared memory regions.
> > 
> > Doesn't seem like changing the check to fault->is_private would help in
> > your particular case, since the subsequent host_pfn_mapping_level() call
> > only seems to limit the mapping level to whatever the mapping level is
> > for the HVA in the host page table.
> > 
> > Seems like with UPM we need some additional handling here that also
> > checks that the entire 2M HVA range is backed by non-private memory.
> > 
> > Non-UPM SNP hypervisor patches already have a similar hook added to
> > host_pfn_mapping_level() which implements such a check via RMP table, so
> > UPM might need something similar:
> > 
> >    https://github.com/AMDESE/linux/commit/ae4475bc740eb0b9d031a76412b0117339794139
> > 
> > -Mike
> > 
> 
> For TDX, we try to track the page type (shared, private, mixed) of each gfn
> at a given level. Only when the type is shared/private can it be mapped at
> that level. When it's mixed, i.e., it contains both shared pages and private
> pages at the given level, it has to go to the next smaller level.
> 
> https://github.com/intel/tdx/commit/ed97f4042eb69a210d9e972ccca6a84234028cad

Hmm, so a new slot->arch.page_attr array shouldn't be necessary, KVM can instead
update slot->arch.lpage_info on shared<->private conversions.  Detecting whether
a given range is partially mapped could get nasty if KVM defers tracking to the
backing store, but if KVM itself does the tracking as was previously suggested[*],
then updating lpage_info should be relatively straightforward, e.g. use
xa_for_each_range() to see if a given 2mb/1gb range is completely covered (fully
shared) or not covered at all (fully private).

[*] https://lore.kernel.org/all/YofeZps9YXgtP3f1@google.com
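
Roughly (sketch only, assuming KVM keeps one xarray entry per shared gfn;
the real code would batch this differently):

static bool range_is_mixed(struct xarray *shared_gfns, gfn_t start, gfn_t nr)
{
	unsigned long index, found = 0;
	void *entry;

	xa_for_each_range(shared_gfns, index, entry, start, start + nr - 1)
		found++;

	/* Fully shared or fully private is fine; anything else is mixed. */
	return found != 0 && found != nr;
}

A shared<->private conversion would then adjust the disallow_lpage counts
in slot->arch.lpage_info for each 2mb/1gb block whose mixed-ness changed.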

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 6/8] KVM: Handle page fault for private memory
  2022-07-07 20:08             ` Sean Christopherson
@ 2022-07-08  3:29               ` Xiaoyao Li
  2022-07-20 23:08                 ` Vishal Annapurve
  0 siblings, 1 reply; 58+ messages in thread
From: Xiaoyao Li @ 2022-07-08  3:29 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Michael Roth, Vishal Annapurve, Chao Peng, Nikunj A. Dadhania,
	kvm list, LKML, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka, Yu Zhang,
	Kirill A . Shutemov, Andy Lutomirski, Jun Nakajima, Dave Hansen,
	Andi Kleen, David Hildenbrand, aarcange, ddutile, dhildenb,
	Quentin Perret, mhocko

On 7/8/2022 4:08 AM, Sean Christopherson wrote:
> On Fri, Jul 01, 2022, Xiaoyao Li wrote:
>> On 7/1/2022 6:21 AM, Michael Roth wrote:
>>> On Thu, Jun 30, 2022 at 12:14:13PM -0700, Vishal Annapurve wrote:
>>>> With transparent_hugepages=always setting I see issues with the
>>>> current implementation.
> 
> ...
> 
>>>> Looks like with transparent huge pages enabled kvm tried to handle the
>>>> shared memory fault on 0x84d gfn by coalescing nearby 4K pages
>>>> to form a contiguous 2MB page mapping at gfn 0x800, since level 2 was
>>>> requested in kvm_mmu_spte_requested.
>>>> This caused the private memory contents from regions 0x800-0x84c and
>>>> 0x86e-0xa00 to get unmapped from the guest leading to guest vm
>>>> shutdown.
>>>
>>> Interesting... seems like that wouldn't be an issue for non-UPM SEV, since
>>> the private pages would still be mapped as part of that 2M mapping, and
>>> it's completely up to the guest as to whether it wants to access as
>>> private or shared. But for UPM it makes sense this would cause issues.
>>>
>>>>
>>>> Does getting the mapping level as per the fault access type help
>>>> address the above issue? Any such coalescing should not cross between
>>>> private and shared memory regions.
>>>
>>> Doesn't seem like changing the check to fault->is_private would help in
>>> your particular case, since the subsequent host_pfn_mapping_level() call
>>> only seems to limit the mapping level to whatever the mapping level is
>>> for the HVA in the host page table.
>>>
>>> Seems like with UPM we need some additional handling here that also
>>> checks that the entire 2M HVA range is backed by non-private memory.
>>>
>>> Non-UPM SNP hypervisor patches already have a similar hook added to
>>> host_pfn_mapping_level() which implements such a check via RMP table, so
>>> UPM might need something similar:
>>>
>>>     https://github.com/AMDESE/linux/commit/ae4475bc740eb0b9d031a76412b0117339794139
>>>
>>> -Mike
>>>
>>
>> For TDX, we try to track the page type (shared, private, mixed) of each gfn
>> at a given level. Only when the type is shared/private can it be mapped at
>> that level. When it's mixed, i.e., it contains both shared pages and private
>> pages at the given level, it has to go to the next smaller level.
>>
>> https://github.com/intel/tdx/commit/ed97f4042eb69a210d9e972ccca6a84234028cad
> 
> Hmm, so a new slot->arch.page_attr array shouldn't be necessary, KVM can instead
> update slot->arch.lpage_info on shared<->private conversions.  Detecting whether
> a given range is partially mapped could get nasty if KVM defers tracking to the
> backing store, but if KVM itself does the tracking as was previously suggested[*],
> then updating lpage_info should be relatively straightforward, e.g. use
> xa_for_each_range() to see if a given 2mb/1gb range is completely covered (fully
> shared) or not covered at all (fully private).
> 
> [*] https://lore.kernel.org/all/YofeZps9YXgtP3f1@google.com

Yes, slot->arch.page_attr was introduced to help identify whether a page
is completely shared/private at a given level. It seems XArray can serve
the same purpose, though I know nothing about it. Looking forward to
seeing the patch that uses XArray.

Yes, updating slot->arch.lpage_info is good to utilize the existing logic,
and Isaku has applied it to slot->arch.lpage_info for the 2MB support patches.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 6/8] KVM: Handle page fault for private memory
  2022-07-08  3:29               ` Xiaoyao Li
@ 2022-07-20 23:08                 ` Vishal Annapurve
  2022-07-21  9:45                   ` Chao Peng
  0 siblings, 1 reply; 58+ messages in thread
From: Vishal Annapurve @ 2022-07-20 23:08 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Sean Christopherson, Michael Roth, Chao Peng, Nikunj A. Dadhania,
	kvm list, LKML, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka, Yu Zhang,
	Kirill A . Shutemov, Andy Lutomirski, Jun Nakajima, Dave Hansen,
	Andi Kleen, David Hildenbrand, aarcange, ddutile, dhildenb,
	Quentin Perret, mhocko

> > Hmm, so a new slot->arch.page_attr array shouldn't be necessary, KVM can instead
> > update slot->arch.lpage_info on shared<->private conversions.  Detecting whether
> > a given range is partially mapped could get nasty if KVM defers tracking to the
> > backing store, but if KVM itself does the tracking as was previously suggested[*],
> > then updating lpage_info should be relatively straightforward, e.g. use
> > xa_for_each_range() to see if a given 2mb/1gb range is completely covered (fully
> > shared) or not covered at all (fully private).
> >
> > [*] https://lore.kernel.org/all/YofeZps9YXgtP3f1@google.com
>
> Yes, slot->arch.page_attr was introduced to help identify whether a page
> is completely shared/private at a given level. It seems XArray can serve
> the same purpose, though I know nothing about it. Looking forward to
> seeing the patch that uses XArray.
>
> Yes, updating slot->arch.lpage_info is good to utilize the existing logic,
> and Isaku has applied it to slot->arch.lpage_info for the 2MB support patches.

Chao, are you planning to implement these changes to ensure proper
handling of hugepages partially mapped as private/shared in subsequent
versions of this series?
Or is this something left to be handled by the architecture specific code?

Regards,
Vishal

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 6/8] KVM: Handle page fault for private memory
  2022-07-20 23:08                 ` Vishal Annapurve
@ 2022-07-21  9:45                   ` Chao Peng
  0 siblings, 0 replies; 58+ messages in thread
From: Chao Peng @ 2022-07-21  9:45 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Xiaoyao Li, Sean Christopherson, Michael Roth,
	Nikunj A. Dadhania, kvm list, LKML, linux-mm, linux-fsdevel,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Yu Zhang, Kirill A . Shutemov, Andy Lutomirski,
	Jun Nakajima, Dave Hansen, Andi Kleen, David Hildenbrand,
	aarcange, ddutile, dhildenb, Quentin Perret, mhocko

On Wed, Jul 20, 2022 at 04:08:10PM -0700, Vishal Annapurve wrote:
> > > Hmm, so a new slot->arch.page_attr array shouldn't be necessary, KVM can instead
> > > update slot->arch.lpage_info on shared<->private conversions.  Detecting whether
> > > a given range is partially mapped could get nasty if KVM defers tracking to the
> > > backing store, but if KVM itself does the tracking as was previously suggested[*],
> > > then updating lpage_info should be relatively straightforward, e.g. use
> > > xa_for_each_range() to see if a given 2mb/1gb range is completely covered (fully
> > > shared) or not covered at all (fully private).
> > >
> > > [*] https://lore.kernel.org/all/YofeZps9YXgtP3f1@google.com
> >
> > Yes, slot->arch.page_attr was introduced to help identify whether a page
> > is completely shared/private at given level. It seems XARRAY can serve
> > the same purpose, though I know nothing about it. Looking forward to
> > seeing the patch of using XARRAY.
> >
> > yes, update slot->arch.lpage_info is good to utilize the existing logic
> > and Isaku has applied it to slot->arch.lpage_info for 2MB support patches.
> 
> Chao, are you planning to implement these changes to ensure proper
> handling of hugepages partially mapped as private/shared in subsequent
> versions of this series?
> Or is this something left to be handled by the architecture specific code?

Ah, the topic got moved to a different place. I should give an update here.
There were more discussions under the TDX KVM patch series, and I actually
just sent out the draft code for this:

https://lkml.org/lkml/2022/7/20/610

That patch is based on UPM v7 here. If I can get more feedback there
then I will include an updated version in UPM v8.

If you have bandwidth, you can also play with that patch; any feedback
is welcome.

Chao
> 
> Regards,
> Vishal

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 6/8] KVM: Handle page fault for private memory
  2022-06-17 21:30   ` Sean Christopherson
  2022-06-20 14:16     ` Chao Peng
@ 2022-08-19  0:40     ` Kirill A. Shutemov
  2022-08-25 23:43       ` Sean Christopherson
  1 sibling, 1 reply; 58+ messages in thread
From: Kirill A. Shutemov @ 2022-08-19  0:40 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko

On Fri, Jun 17, 2022 at 09:30:53PM +0000, Sean Christopherson wrote:
> > @@ -4088,7 +4144,12 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> >  		read_unlock(&vcpu->kvm->mmu_lock);
> >  	else
> >  		write_unlock(&vcpu->kvm->mmu_lock);
> > -	kvm_release_pfn_clean(fault->pfn);
> > +
> > +	if (fault->is_private)
> > +		kvm_private_mem_put_pfn(fault->slot, fault->pfn);
> 
> Why does the shmem path lock the page, and then unlock it here?

The lock is required to avoid a race with truncate / punch hole. If truncate
happens after get_pfn() but before the page gets into the SEPT, we are screwed.

> Same question for why this path marks it dirty?  The guest has the page mapped
> so the dirty flag is immediately stale.

If the page is clean and its refcount is not elevated, vmscan is free to
drop the page from the page cache. I don't think we want that.

> In other words, why does KVM need to do something different for private pfns?

Because in the traditional KVM memslot scheme, core mm takes care of
this.

The changes in v7 are wrong. The page has to stay locked until it lands
in the SEPT, and it must be marked dirty before unlocking.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov
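
A rough sketch of the lock-and-dirty pairing described above: hold the
page lock and refcount from get_pfn() until the PFN is installed, then
dirty the page before releasing. The function names mirror the
memfile_notifier get_pfn/put_unlock_pfn callbacks, but the bodies are
illustrative rather than the actual shmem implementation:

  static long shmem_get_lock_pfn(struct inode *inode, pgoff_t offset, int *order)
  {
  	struct page *page;
  	int ret;

  	ret = shmem_getpage(inode, offset, &page, SGP_WRITE);
  	if (ret)
  		return ret;

  	/* The page comes back locked with an elevated refcount; keep both
  	 * until the PFN lands in the SEPT so truncate / punch hole cannot
  	 * free it underneath KVM. */
  	*order = thp_order(compound_head(page));
  	return page_to_pfn(page);
  }

  static void shmem_put_unlock_pfn(unsigned long pfn)
  {
  	struct page *page = pfn_to_page(pfn);

  	/* Dirty before unlocking and putting so vmscan will not reclaim a
  	 * clean page the guest is still using. */
  	set_page_dirty(page);
  	unlock_page(page);
  	put_page(page);
  }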

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v6 6/8] KVM: Handle page fault for private memory
  2022-08-19  0:40     ` Kirill A. Shutemov
@ 2022-08-25 23:43       ` Sean Christopherson
  0 siblings, 0 replies; 58+ messages in thread
From: Sean Christopherson @ 2022-08-25 23:43 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko

On Fri, Aug 19, 2022, Kirill A. Shutemov wrote:
> On Fri, Jun 17, 2022 at 09:30:53PM +0000, Sean Christopherson wrote:
> > > @@ -4088,7 +4144,12 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> > >  		read_unlock(&vcpu->kvm->mmu_lock);
> > >  	else
> > >  		write_unlock(&vcpu->kvm->mmu_lock);
> > > -	kvm_release_pfn_clean(fault->pfn);
> > > +
> > > +	if (fault->is_private)
> > > +		kvm_private_mem_put_pfn(fault->slot, fault->pfn);
> > 
> > Why does the shmem path lock the page, and then unlock it here?
> 
> The lock is required to avoid a race with truncate / punch hole. If a
> truncate happens after get_pfn() but before the page makes it into the
> SEPT, we are screwed.

Getting the PFN into the SPTE doesn't provide protection in and of itself.  The
protection against truncation and whatnot comes from KVM getting a notification
and either retrying the fault (notification acquires mmu_lock before
direct_page_fault()), or blocking the notification (truncate / punch hole) until
after KVM installs the SPTE.  I.e. KVM just needs to ensure it doesn't install a
SPTE _after_ getting notified.

If the API is similar to gup(), i.e. only elevates the refcount but doesn't lock
the page, then there's no need for a separate kvm_private_mem_put_pfn(), and in
fact no need for ->put_unlock_pfn(), because KVM can do set_page_dirty() and
put_page() directly as needed using all of its existing mechanisms.
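
Under that gup-like model the private release path would need no special
hook at all; a sketch of what that could look like, assuming the
fault->pfn plumbing from this series and the existing
kvm_release_pfn_dirty() helper (illustrative, not the final patch):

  static void private_fault_put_pfn(struct kvm_page_fault *fault)
  {
  	if (is_error_noslot_pfn(fault->pfn))
  		return;

  	/* Dirty the page and drop the gup-style reference in one step,
  	 * exactly as the existing shared-memory path does. */
  	kvm_release_pfn_dirty(fault->pfn);
  }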

^ permalink raw reply	[flat|nested] 58+ messages in thread

end of thread, other threads:[~2022-08-25 23:43 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-19 15:37 [PATCH v6 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
2022-05-19 15:37 ` [PATCH v6 1/8] mm: Introduce memfile_notifier Chao Peng
2022-05-19 15:37 ` [PATCH v6 2/8] mm/shmem: Support memfile_notifier Chao Peng
2022-05-19 15:37 ` [PATCH v6 3/8] mm/memfd: Introduce MFD_INACCESSIBLE flag Chao Peng
2022-05-31 19:15   ` Vishal Annapurve
2022-06-01 10:17     ` Chao Peng
2022-06-01 12:11       ` Gupta, Pankaj
2022-06-02 10:07         ` Chao Peng
2022-06-14 20:23           ` Sean Christopherson
2022-06-15  8:53             ` Chao Peng
2022-05-19 15:37 ` [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory Chao Peng
2022-05-20 17:57   ` Andy Lutomirski
2022-05-20 18:31     ` Sean Christopherson
2022-05-22  4:03       ` Andy Lutomirski
2022-05-23 13:21       ` Chao Peng
2022-05-23 15:22         ` Sean Christopherson
2022-05-30 13:26           ` Chao Peng
2022-06-10 16:14             ` Sean Christopherson
2022-06-14  6:45               ` Chao Peng
2022-06-23 22:59       ` Michael Roth
2022-06-24  8:54         ` Chao Peng
2022-06-24 13:01           ` Michael Roth
2022-06-17 20:52   ` Sean Christopherson
2022-06-17 21:27     ` Sean Christopherson
2022-06-20 14:09       ` Chao Peng
2022-06-20 14:08     ` Chao Peng
2022-05-19 15:37 ` [PATCH v6 5/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit Chao Peng
2022-05-19 15:37 ` [PATCH v6 6/8] KVM: Handle page fault for private memory Chao Peng
2022-06-17 21:30   ` Sean Christopherson
2022-06-20 14:16     ` Chao Peng
2022-08-19  0:40     ` Kirill A. Shutemov
2022-08-25 23:43       ` Sean Christopherson
2022-06-24  3:58   ` Nikunj A. Dadhania
2022-06-24  9:02     ` Chao Peng
2022-06-30 19:14       ` Vishal Annapurve
2022-06-30 22:21         ` Michael Roth
2022-07-01  1:21           ` Xiaoyao Li
2022-07-07 20:08             ` Sean Christopherson
2022-07-08  3:29               ` Xiaoyao Li
2022-07-20 23:08                 ` Vishal Annapurve
2022-07-21  9:45                   ` Chao Peng
2022-05-19 15:37 ` [PATCH v6 7/8] KVM: Enable and expose KVM_MEM_PRIVATE Chao Peng
2022-06-23 22:07   ` Michael Roth
2022-06-24  8:43     ` Chao Peng
2022-05-19 15:37 ` [PATCH v6 8/8] memfd_create.2: Describe MFD_INACCESSIBLE flag Chao Peng
2022-06-06 20:09 ` [PATCH v6 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory Vishal Annapurve
2022-06-07  6:57   ` Chao Peng
2022-06-08  0:55     ` Marc Orr
2022-06-08  2:18       ` Chao Peng
2022-06-08 19:37         ` Vishal Annapurve
2022-06-09 20:29           ` Sean Christopherson
2022-06-14  7:28             ` Chao Peng
2022-06-14 17:37               ` Andy Lutomirski
2022-06-14 19:08                 ` Sean Christopherson
2022-06-14 20:59                   ` Andy Lutomirski
2022-06-15  9:17                     ` Chao Peng
2022-06-15 14:29                       ` Sean Christopherson
2022-06-10  0:11         ` Marc Orr
